Decision Boundaries in the Sense of the Naive Bayesian Classifier for Multidimensional Cases (Naïve Decision Surface Network)

The Naive Bayesian classifier is a fundamental statistical method that assumes conditional independence of the feature values while minimizing the probability of error over the classes. In practice, the assumptions of the Naive Bayesian classifier are often violated, and it is not robust to noise in multidimensional cases. A useful way to describe a classifier is through discriminant functions, where the classifier assigns each feature vector to a class by dividing the feature space into decision regions separated by multidimensional boundaries. In this work, the Naïve Decision Surface Network is proposed; it is built on the quadratic discriminant functions obtained for multiclass, multi-feature problems. All of the covariance, variance and correlation possibilities are addressed. An example is presented to demonstrate the computational and analytical simplifications, and the results show a lower classification error rate.


Introduction
The Bayesian classifier is a statistical approach that quantifies the trade-offs between various decisions using probabilities. Bayesian strategies for pattern classification minimize the expected risk, in other words they make as few misclassifications as possible. The probability functions of each group outline decision regions whose boundaries carry the highest probability of misclassification, and each value of the data set is assigned to one of the available classes [1]. Such a procedure divides the input space into decision regions $R_k$, as shown in Fig. 1(b), such that every point that falls in region $R_1$ is assigned to a specific class. The boundaries between decision regions are called decision boundaries, and if regions $R_i$ and $R_j$ happen to be contiguous, they are separated by a decision surface in the multidimensional feature space, as shown in Fig. 1(a); otherwise a decision region may consist of a number of disjoint regions [2]. Since Bayes classification suffers from complexity in multidimensional problems, the Naive Bayes classifier approach arose, based on Bayes' theorem; it is particularly appropriate when the input dimension is high. Even for simple decision problems, the decision surfaces separating the optimal Bayes decision regions are more complicated than the decision surfaces of the Naive Bayes model [3]. Treating classification as a division of the input space into decision regions, and visualizing the decision surfaces, can be achieved by studying the relationship between classification models and decision boundaries. Users can see the distance between the data and their decision boundaries, and thus they may know whether the classifier will lead to over-fitting or not [4]. The role of the discriminant function is to maximize the posterior probability in order to minimize the error rate for both discrete and continuous problems; the only distinction is that probabilities are used in place of probability densities [5].
In this paper, we derive a general partitioning function for the Naive Bayesian decision surfaces of a three-feature, multiclass data set. In the following sections, an example on a three-class data set is illustrated. The conclusions drawn from the results are given in the last section.

Bayesian Discriminant Function
Bayes' theorem is essentially an expression of conditional probabilities, where a conditional probability represents the probability of an event occurring given evidence.
When estimating probability density functions (pdfs), it is important to define four general terms related to our subject. The first is the prior, or a priori, distribution, which expresses prior knowledge of how the system was formed; a uniform distribution or a Gaussian model can be used to model the prior probability. Secondly, the likelihood is simply the probability of the observed feature vector given a specific class. The third is the a posteriori probability, which states the probability of a class given the evidence, and this is specifically what results from Bayes' rule. Finally, the evidence in Bayes' theorem is usually considered a scaling term for a feature vector $x$ [6]:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)} \qquad (1)$$

where $P(\omega_j \mid x)$ is the posterior of class $\omega_j$ ($j = 1, \dots, k$ classes) and $p(x)$ is the evidence given the feature vector $x$. The likelihood function $p(x \mid \omega_j)$ of $\omega_j$ with respect to $x$ (i.e., the probability that $x$ belongs to $\omega_j$) and the prior probability $P(\omega_j)$ reflect knowledge of how frequently instances of class $\omega_j$ occur. The evidence is

$$p(x) = \sum_{j=1}^{k} p(x \mid \omega_j)\,P(\omega_j) \qquad (2)$$

The decision rule is: assign $x$ to $\omega_i$ if

$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \text{for all } j \neq i \qquad (3)$$

or, equivalently, using (1) and dropping the common evidence term,

$$p(x \mid \omega_i)\,P(\omega_i) > p(x \mid \omega_j)\,P(\omega_j) \quad \text{for all } j \neq i \qquad (4)$$

Hence, classification is done by estimating the membership of an observation in a class based on the observed features. However, we might not always be making the best decision, so there must be a function that tells us which action to take for every possible observation: the risk function [7]. The overall risk of each action $\alpha_i$ is computed as the sum of the associated losses over all the states, weighted by $P(\omega_j \mid x)$, the probability of occurrence of each state:

$$R(\alpha_i \mid x) = \sum_{j=1}^{k} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) \qquad (5)$$

The risk, and the corresponding error rate, is minimized by maximizing the posterior probability [8]. Thus, for the minimum error probability case, two contiguous decision regions are separated in the multidimensional feature space by the decision surface described by the equation

$$P(\omega_i \mid x) - P(\omega_j \mid x) = 0 \qquad (6)$$

From a mathematical point of view, there are equivalent functions that can serve as discriminant functions $g_i(x)$, one for each class, such that $x$ is assigned to $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$; any monotonically increasing function of the posterior leaves the decision tests in (3) and (4) unchanged.
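As a minimal numerical illustration of (1)-(4), the short Python sketch below computes the posteriors of three classes from assumed likelihood and prior values and applies the decision rule. The numbers are illustrative only and are not taken from the paper.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' rule (1): posterior = likelihood * prior / evidence."""
    joint = likelihoods * priors        # p(x | w_j) * P(w_j)
    evidence = joint.sum()              # p(x), equation (2)
    return joint / evidence

# Assumed class-conditional densities evaluated at one observation x,
# and the corresponding prior probabilities of the three classes.
likelihoods = np.array([0.30, 0.05, 0.10])
priors      = np.array([0.50, 0.25, 0.25])

post = posteriors(likelihoods, priors)
decision = int(np.argmax(post))         # decision rule (3)/(4): pick the largest posterior
print("posteriors:", post, "-> assign x to class", decision + 1)
```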

Decision Surface
Now we assume that the likelihood functions $p(x \mid \omega_i)$ of the classes $\omega_i$ in the $n$-dimensional feature space follow the general multivariate normal (Gaussian) density, because of its computational tractability and the fact that it models a large number of cases sufficiently well:

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i)\right) \qquad (7)$$

where $\mu_i = E[x]$ is the mean vector of class $\omega_i$, $\Sigma_i$ is the covariance matrix, and $n$ is the feature space dimension. However, because of the exponential form of the densities involved, it is preferable to work with the following (monotonic) logarithmic function [10]:

$$g_i(x) = \ln\big(p(x \mid \omega_i)\,P(\omega_i)\big) = \ln p(x \mid \omega_i) + \ln P(\omega_i) \qquad (8)$$

Therefore, and by substituting (1) into (6), we get

$$p(x \mid \omega_i)\,P(\omega_i) = p(x \mid \omega_j)\,P(\omega_j) \qquad (9)$$

where the evidence $p(x)$ is omitted since it appears on both sides. Using (7), (8) and (9), we get the quadratic discriminant surface

$$-\tfrac{1}{2}(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) + \tfrac{1}{2}(x - \mu_j)^{T}\Sigma_j^{-1}(x - \mu_j) + \tfrac{1}{2}\ln|\Sigma_j| - \ln P(\omega_j) = 0 \qquad (10)$$

The decision surfaces are in general hyperquadrics (i.e., ellipsoids, paraboloids, hyperboloids). However, for $d$-dimensional features and $k$ classes we get $N$ quadratic discriminant functions, one per pair of classes, where

$$N = \frac{k(k-1)}{2} \qquad (11)$$

Here we can decompose the likelihood into a product of terms over the feature vector, since the Naive Bayes theorem assumes the variables are statistically independent:

$$p(x \mid \omega_i) = \prod_{l=1}^{d} p(x_l \mid \omega_i) \qquad (12)$$

Although this assumption is not always accurate, it simplifies the classification task dramatically, because it allows the likelihood to be calculated separately for each variable. Furthermore, it turns out that the Naive Bayes classifier can be very robust to violations of its independence assumption; it has been reported to perform well on many real-world data sets and can be modeled with different density functions [12]. In this paper the normal Gaussian model is used for the reasons mentioned above.
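The quadratic discriminant surface (10) can be sketched in a few lines of Python. The class parameters below are hypothetical and only illustrate how the sign of $g_i(x) - g_j(x)$ assigns a point to one of two Gaussian classes with unequal covariance matrices.

```python
import numpy as np

def g(x, mu, sigma, prior):
    """Log-discriminant g_i(x) = ln p(x | w_i) + ln P(w_i) for a Gaussian class (8);
    the common term -n/2 * ln(2*pi) is dropped since it cancels in (10)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(sigma) @ d
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two assumed classes with different covariance matrices, so the separating
# surface g_1(x) - g_2(x) = 0 is a hyperquadric.
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.5]])

x = np.array([1.0, 0.5])
surface = g(x, mu1, S1, 0.5) - g(x, mu2, S2, 0.5)   # left-hand side of (10)
print("assign x to class", 1 if surface > 0 else 2)
```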

Naïve Decision Surface Network
The proposed scheme describes the multiclass, multi-feature data attributes as decision boundary surfaces arranged in a network form, obtained by applying the case of equality of the discriminant functions, where the pattern may be assigned to either of the two classes involved [9]. Thus the decision surface can equivalently be based on (10) for minimum-error-rate classification and written as

$$D_{i,j}(x) = g_i(x) - g_j(x) = 0$$

The middle layer consists of the discriminant functions employed in its nodes, and these nodes are connected to the output layer nodes through output weights. The weights are distributed to the output nodes such that each one is tuned according to the learning algorithm.
Note that the output nodes represent the classes.

The Naïve Decision Surface Network Algorithm
The algorithm of the proposed network has two phases, the learning phase and the testing phase, and the data sets of the two phases are different in order to avoid the over-fitting problem. The learning algorithm can be summarized as follows. The network is first trained on the learning data sets, where the weights can take only two values: active, denoted by 1 (highest Mahalanobis distance), and inactive, denoted by 0. At least two weights must be active for one class at the output nodes in order for that class to be chosen in the testing phase on the test data sets. Recall that, for a multiclass, multi-feature data set, the feature space is partitioned into regions, so the number of discriminant functions in the middle layer equals $N$ as in (11).
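A rough Python sketch of one possible reading of the network and its two phases is given below: the middle layer holds the $N = k(k-1)/2$ pairwise discriminant nodes, and each node sends an active weight (1) to the output node of the class it favours; the class collecting the most active weights (at least two when $k = 3$) wins. The exact weight-tuning rule is not fully specified in the text, so this voting mechanism and the synthetic data are our assumptions, not the authors' implementation.

```python
import numpy as np
from itertools import combinations

def fit_class(X):
    """'Learning' a class node: estimate the mean vector and covariance matrix."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def g(x, mu, sigma, prior):
    """Quadratic discriminant (10) for one Gaussian class."""
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(sigma) @ d
            - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

def classify(x, params, priors):
    """Testing phase: each pairwise node D_ij sends a 0/1 weight to an output node."""
    k = len(params)
    votes = np.zeros(k)
    for i, j in combinations(range(k), 2):          # N = k(k-1)/2 middle-layer nodes (11)
        (mi, Si), (mj, Sj) = params[i], params[j]
        Dij = g(x, mi, Si, priors[i]) - g(x, mj, Sj, priors[j])
        votes[i if Dij > 0 else j] += 1             # active weight to the winning class
    return int(np.argmax(votes))                    # needs >= 2 active weights when k = 3

# Example usage with three synthetic 3-feature Gaussian classes.
rng = np.random.default_rng(0)
data = [rng.normal(loc=c, scale=1.0, size=(50, 3)) for c in (0.0, 3.0, 6.0)]
params = [fit_class(X) for X in data]
priors = [1.0 / 3] * 3
print(classify(np.array([2.8, 3.1, 2.9]), params, priors))   # expected: class index 1
```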

Simulation Example
In our example, we choose the Iris flower data set (setosa, versicolor, and virginica) to illustrate the proposed classification scheme, because it has been applied as a benchmark in the MATLAB Statistics classification demos. For each of the species, 50 observations of four features (sepal length, sepal width, petal length, and petal width) are recorded (see Iris.dat in MATLAB). Since our feature attributes are assumed independent, the normal distribution pdf can be used for the likelihood of each feature $x_l$ in class $\omega_i$,

$$p(x_l \mid \omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{il}} \exp\!\left(-\frac{(x_l - \mu_{il})^2}{2\sigma_{il}^2}\right) \qquad (13)$$

so that the class likelihood is the product

$$p(x \mid \omega_i) = \prod_{l=1}^{d} p(x_l \mid \omega_i) \qquad (14)$$

By using (13) and (14), the discriminant function $D_{ji}$ becomes

$$D_{ji}(x) = \sum_{l=1}^{d}\big(\ln p(x_l \mid \omega_j) - \ln p(x_l \mid \omega_i)\big) + \ln\frac{P(\omega_j)}{P(\omega_i)} \qquad (15)$$

where $\sigma_{il}$ is the standard deviation of feature $l$ in class $\omega_i$ and $\mu_{il}$ is the corresponding mean. To visualize the decision boundary surfaces, a reduction of the feature dimension by one is applicable using the Mahalanobis distance between the class means,

$$D_M^2 = (\mu_i - \mu_j)^{T}\,\Sigma^{-1}\,(\mu_i - \mu_j)$$

Since there are only 4 possible triplets out of the 4 features, the Mahalanobis distance is calculated for all four, and the triplet that gives the highest distance is selected.
From the calculated distances, the selected triplet gives the highest Mahalanobis distance, which means that the combination of the corresponding x, y and z features is the best to discriminate between the two classes. However, for the sake of simplicity, we assume that the prior probabilities are equal. The mean vectors and the covariance matrices are obtained for setosa and versicolor, respectively.
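A minimal sketch of the triplet-selection step is shown below, assuming the copy of the Iris data bundled with scikit-learn and a pooled covariance estimate for the Mahalanobis distance between the setosa and versicolor means; the paper does not state which covariance estimate it uses, so the pooling is an assumption.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
A, B = X[y == 0], X[y == 1]                 # setosa and versicolor, 50 samples each

best_d2, best_triplet = -np.inf, None
for triplet in combinations(range(4), 3):   # only 4 possible triplets of the 4 features
    cols = list(triplet)
    a, b = A[:, cols], B[:, cols]
    pooled = 0.5 * (np.cov(a, rowvar=False) + np.cov(b, rowvar=False))
    diff = a.mean(axis=0) - b.mean(axis=0)
    d2 = diff @ np.linalg.inv(pooled) @ diff   # squared Mahalanobis distance
    if d2 > best_d2:
        best_d2, best_triplet = d2, triplet

print("selected feature triplet (indices):", best_triplet)
```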

Since the prior probabilities are equal, their terms cancel, and from (10) we get a quadratic equation for the decision boundary between setosa and versicolor, graphed in Fig. 3. Moreover, in our case we consider pairs of classes, since the discriminant functions should cover all the classes without duplication; as the number of classes is $k = 3$, the number of discriminant functions is $N = 3$, see (11).
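To make the construction concrete, the short sketch below estimates the mean vectors and covariance matrices of setosa and versicolor on a feature triplet and evaluates the left-hand side of (10) at a query point, whose zero level set is the quadratic boundary. The scikit-learn copy of the Iris data and the equal-prior assumption follow the text, while the triplet indices and the query point are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
cols = [1, 2, 3]                       # placeholder triplet; use the one selected above
A = X[y == 0][:, cols]                 # setosa
B = X[y == 1][:, cols]                 # versicolor

mu_a, S_a = A.mean(axis=0), np.cov(A, rowvar=False)
mu_b, S_b = B.mean(axis=0), np.cov(B, rowvar=False)

def boundary(x):
    """Left-hand side of (10) with equal priors; boundary(x) = 0 is the
    quadratic decision surface between the two classes."""
    da, db = x - mu_a, x - mu_b
    return (-0.5 * da @ np.linalg.inv(S_a) @ da - 0.5 * np.log(np.linalg.det(S_a))
            + 0.5 * db @ np.linalg.inv(S_b) @ db + 0.5 * np.log(np.linalg.det(S_b)))

x = np.array([3.0, 2.5, 0.8])          # an arbitrary query point in the triplet space
print("setosa" if boundary(x) > 0 else "versicolor")
```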
Here, the proposed Naïve Decision Surface Network (NDSN) deals with each feature individually for each class. If we denote the setosa set by A and versicolor by B, the per-feature discriminant function $D_{1,2}$ between them is

$$D_{1,2}(x_l) = \ln\frac{p(x_l \mid A)}{p(x_l \mid B)} + \ln\frac{P(A)}{P(B)}, \qquad l = 1, \dots, d \qquad (16)$$

In general, the standard deviations of the feature vectors of the two classes cannot be assumed to be equal. Fig. 4 shows the Naive decision surfaces for this case. Table 1 summarizes the test error on an independent Fisher iris flower data set. From Table 1, it is clear that the error rate of the proposed NDSN method is the best among the other well-known classification methods. Moreover, the confusion matrix shows that all 50 of the Iris-setosa data were correctly classified, 48 of the Iris-versicolor with 2 errors, and 46 of the Iris-virginica with 4 errors.
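The per-feature treatment of (16) can be sketched as below. The univariate Gaussian form, the equal priors, and the naive summation of the per-feature discriminants are assumptions following the text; the counts are computed on the training samples only, as a sanity check, rather than on the independent test split used for Table 1.

```python
import numpy as np
from sklearn.datasets import load_iris

def feature_discriminant(x, mu_a, sd_a, mu_b, sd_b):
    """D_{1,2} per feature (16): log-likelihood ratio of two 1-D Gaussians."""
    log_a = -0.5 * ((x - mu_a) / sd_a) ** 2 - np.log(sd_a)
    log_b = -0.5 * ((x - mu_b) / sd_b) ** 2 - np.log(sd_b)
    return log_a - log_b               # > 0 favours class A (equal priors assumed)

iris = load_iris()
X, y = iris.data, iris.target
A, B = X[y == 0], X[y == 1]            # A = setosa, B = versicolor

mu_a, sd_a = A.mean(axis=0), A.std(axis=0, ddof=1)
mu_b, sd_b = B.mean(axis=0), B.std(axis=0, ddof=1)

def predict(x):
    """Naive-Bayes combination: sum the per-feature discriminants, threshold at 0."""
    return 0 if feature_discriminant(x, mu_a, sd_a, mu_b, sd_b).sum() > 0 else 1

confusion = np.zeros((2, 2), dtype=int)
for xi, yi in zip(np.vstack([A, B]), [0] * 50 + [1] * 50):
    confusion[yi, predict(xi)] += 1
print(confusion)                       # rows: true class, columns: predicted class
```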

Conclusion
From the complexity standpoint, it is sometimes more helpful to use the decision boundary surface as the classifier instead of working directly with probabilities (or risk functions). Formal definitions of the decision boundary surfaces for Bayes and Naïve Bayes are presented in this paper. The visualization of decision boundaries gives a good understanding of the distances between the data sets and the decision boundaries among the decision regions. The proposed NDSN method is particularly convenient when the data sets involve pdfs that are complicated and whose estimation is not an easy task. Moreover, it is preferable to compute decision surfaces directly by means of alternative cost functions that give rise to discriminant functions and decision surfaces. The method is in general suboptimal with respect to the Bayesian classifier. An example is presented to illustrate the superiority of the proposed method over some well-known methods; the method can, however, be applied to other data sets for generalization purposes.
Under-fitting or over-fitting is still possible, but this procedure helps to give an overview of the classifier's behavior and to judge whether it will lead to over-fitting or not.