### 7.1 Artificial Neural Networks approach

The six different bacteria datasets were analyzed using three supervised ANN classifiers, namely the Multilayer Perceptron (MLP), Probabilistic Neural Network (PNN) and Radial Basis Function (RBF) network paradigms [2–4, 27]. Each neural network was trained on 80% of the whole data set; the remaining 20% was used for testing. These percentages were selected arbitrarily and applied to all data sets. The aim of this comparative study was to identify the most appropriate ANN paradigm, i.e. the one that can be trained to predict the "type of ENT infection", or in other words the "type of ENT bacteria", with the best accuracy.
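As an illustration, the 80%/20% partition described above can be sketched as follows. This is a minimal sketch with hypothetical random data standing in for the six-bacteria sensor data sets; the shapes (18 features per sample, 6 classes) follow the description in the text.

```python
import numpy as np

# Hypothetical stand-in for the bacteria sensor data:
# 100 samples, 18 extracted features each, 6 bacteria classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 18))
y = rng.integers(0, 6, size=100)

perm = rng.permutation(len(X))      # shuffle before splitting
n_train = int(0.8 * len(X))         # 80% of the whole data set for training
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]   # remaining 20% for testing
```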

#### 7.1.1 MLP classifier

The most common neural network model is the multilayer perceptron (MLP). This type of neural network is known as a supervised network because it requires a desired output in order to learn. The goal of this type of network is to create a model that correctly maps the inputs to the outputs using historical data, so that the model can then be used to produce outputs for cases where the desired output is unknown. A graphical representation of an MLP is shown below in Figure 6.
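The input-to-output mapping that an MLP computes can be sketched as a forward pass through one hidden layer. This is a hypothetical minimal sketch (random weights, sigmoid hidden units), not the network configuration used in the study.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP: sigmoid hidden units,
    linear output scores (one score per class)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden-layer activations
    return W2 @ h + b2                        # output-layer scores

# Toy dimensions matching the text: 18 sensor features in, 6 classes out.
rng = np.random.default_rng(1)
x = rng.normal(size=18)
W1, b1 = rng.normal(size=(10, 18)), np.zeros(10)
W2, b2 = rng.normal(size=(6, 10)), np.zeros(6)
scores = mlp_forward(x, W1, b1, W2, b2)
```

During supervised training the weights would be adjusted (e.g. by backpropagation) so that the class with the highest score matches the desired output.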

#### 7.1.2 RBF classifier

These networks have a static Gaussian function as the nonlinearity for the hidden-layer processing elements. The Gaussian function responds only to a small region of the input space around its centre. The key to a successful implementation of these networks is finding suitable centres for the Gaussian functions. This can be done with supervised learning, but an unsupervised approach usually produces better results. For this reason, NeuroSolutions implements RBF networks as a hybrid supervised-unsupervised topology.

The simulation starts with the training of an unsupervised layer. Its function is to derive the Gaussian centres and the widths from the input data. These centres are encoded within the weights of the unsupervised layer using competitive learning. During the unsupervised learning, the widths of the Gaussians are computed based on the centres of their neighbors. The output of this layer is derived from the input data weighted by a Gaussian mixture.

Once the unsupervised layer has completed its training, the centres of the Gaussian functions (encoded in the weights of the unsupervised layer) and the width (standard deviation) of each Gaussian are fixed. Any supervised topology (such as an MLP) may then be used for the classification of the weighted input.

The advantage of the radial basis function network is that it finds the input-to-output map using local approximators. Usually the supervised segment is simply a linear combination of the approximators. Since linear combiners have few weights, these networks train extremely fast and require fewer training samples. A graphical representation of an RBF network is shown below in Figure 7.
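The hybrid scheme above can be sketched as follows, with class means standing in for the competitively learned centres. This is a simplified illustration on hypothetical data, not the NeuroSolutions implementation.

```python
import numpy as np

def rbf_features(X, centers, width):
    # Each hidden unit responds only locally, via a Gaussian around its centre.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_linear_output(X, y_onehot, centers, width):
    # Supervised segment: a linear combination of the local approximators,
    # fit in one shot by least squares (hence the fast training).
    Phi = rbf_features(X, centers, width)
    W, *_ = np.linalg.lstsq(Phi, y_onehot, rcond=None)
    return W

# Toy two-class data; class means stand in for the unsupervised
# (competitive) stage that would normally derive the centres.
rng = np.random.default_rng(2)
X0 = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
X1 = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)
centers = np.vstack([X0.mean(axis=0), X1.mean(axis=0)])
W = fit_linear_output(X, np.eye(2)[y], centers, width=1.0)
pred = (rbf_features(X, centers, 1.0) @ W).argmax(axis=1)
```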

#### 7.1.3 PNN classifier

PNN networks are variants of the radial basis function (RBF) network. Unlike in the standard RBF, the weights of these networks can be calculated analytically. In this case, the number of cluster centres is by definition equal to the number of exemplars, and they are all set to the same variance.
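A minimal sketch of this idea on hypothetical data: every training exemplar becomes a Gaussian pattern unit with a shared variance, so no iterative training is required.

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Probabilistic neural network: one Gaussian pattern unit per training
    exemplar, all with the same variance; the class whose units respond
    most strongly on average wins."""
    k = np.exp(-((X_train - x) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    classes = np.unique(y_train)
    scores = np.array([k[y_train == c].mean() for c in classes])
    return int(classes[int(scores.argmax())])

# Toy exemplars from two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
label = pnn_predict(X_train, y_train, np.array([0.1, 0.0]))
```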

### 7.2 "Maximum Probability Rule" based classification

Patients are selected randomly from a large collection of patients having a specific disease, so that the collected samples can be treated as randomly drawn samples. We denote each sample as a unit u, representing one specific case of disease, or in other words one specific class of bacteria. The features extracted from the sensor response are treated as the observation vector $X_u$ of unit u. For each class, our knowledge consists of the observation vectors (of dimension 18 × 1) of randomly selected sample units from that class. We denote the number of units of class g in the knowledge base by $N_g$ [1, 28, 29].

#### 7.2.1 Basic idea: modeling of each class by a probability model

At the preliminary stage, we assign a probability model to each of the classes. We assume that random samples drawn from one class follow a fixed probability distribution that is specific to that class; in other words, the bacteria class is modeled by that probability distribution. So we assign a distribution to each of the classes. Assuming continuous probability models, we specify a probability density function f(X/g) for class g.

#### 7.2.3 Decision rule: maximum (Bayesian) probability rule

**Bayes' Rule**: The posterior and prior probabilities of class membership are related, through the typicality probability of the class, by Bayes' rule in the following manner,

$$P(g/X_u)=\frac{\pi_g\, P(X_u/g)}{\sum_{g'=1}^{k}\pi_{g'}\, P(X_u/g')}\qquad(3)$$

So, for the classification of a new observation unit u into any one of the classes, our **decision rule** becomes:

**Assign unit u to class g if $P(g/X_u) > P(h/X_u)$ for all h ≠ g**.

This is called the "**Maximum Probability Rule**".
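Numerically, the rule is an argmax over posterior probabilities. A small sketch with made-up priors and class-conditional values for k = 3 classes:

```python
import numpy as np

# Hypothetical priors pi_g and class-conditional values f(X_u/g)
# evaluated at one unit u, for k = 3 classes.
priors = np.array([0.2, 0.5, 0.3])
likelihoods = np.array([0.010, 0.002, 0.040])

posterior = priors * likelihoods
posterior /= posterior.sum()              # denominator of the Bayes formula
assigned_class = int(posterior.argmax())  # Maximum Probability Rule
```

Note that the argmax is unchanged if the common denominator is dropped, which is exactly the simplification used below.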

Again, k values of $P(X_u/g)$ need to be determined for each unit. Because the denominator in Eq. (3) is constant across classes, the rule can more simply be based on the k values of $\pi_g \cdot f(X_u/g)$, and Eq. (3) can be stated equivalently as,

$$P(g/X_u)=\frac{\pi_g\, f(X_u/g)}{\sum_{g'=1}^{k}\pi_{g'}\, f(X_u/g')}\qquad(4)$$

#### 7.2.4 Assignment of probability model

The "Maximum Probability Rule" involving posterior probability of class membership can be applied only if the probability density functions of the distributions of the classes are known [1, 29].

Each class may be modeled by assigning a density function f(X/g). Our only knowledge about class g is the $N_g$ observation units randomly drawn from class g. There are many commonly used techniques for estimating f(X/g) from this knowledge. These methods are commonly divided into two classes, parametric and non-parametric ones.

Parametric approach: Specify a theoretical probability distribution model $f_{\tilde{\theta}}(X/g)$, assume that the data at hand fit the model, estimate the model parameters $\tilde{\theta}$ using the data, and construct a rule using these estimates.

Non-parametric approach: Estimate the density values directly from the data with no prior model specification, and construct a rule using these estimates.

We applied one significant method from each of these two classes to the classification of our ENT data. As the parametric model, a multinomial model was used, while in the non-parametric approach our choice was the kernel density estimator. There are other very successful non-parametric methods, such as the k-NN classification rule. The k-NN method is indeed faster than the kernel method, but when minimization of the classification error is also important, the two methods become equivalent as the number of sensors grows large.

### 7.3 Non-parametric approach

In this approach we do not assume any pre-specified theoretical parametric form for the **pdf** of a distribution, so it is also known as a distribution-free method. There are four major types of non-parametric **pdf** estimators: the histogram, the kernel method, the k-nearest-neighbour method, and the series method. In this case we use only the kernel estimate of the pdf and a further development of this method, generally known as the adaptive kernel estimator.

#### 7.3.1 General kernel estimator based on Bayesian rule

In general the form of the kernel estimator is,

$$\hat{f}(X_u/g)=\frac{1}{N_g}\sum_{i=1}^{N_g}K(X_u-X_i)\qquad(5)$$

Imposing the conditions $K(z)\ge 0$ and $\int K(z)\,dz=1$ on K, it is easy to see that $\hat{f}$ also satisfies $\hat{f}(z)\ge 0$ and $\int\hat{f}(z)\,dz=1$, so that $\hat{f}$ is a legitimate pdf. Using the product form of the kernel estimate we get,

$$\hat{f}(X_u/g)=\frac{1}{N_g}\sum_{i=1}^{N_g}\prod_{j=1}^{p}\frac{1}{h_{jg}}K\!\left(\frac{X_{uj}-X_{ij}}{h_{jg}}\right)\qquad(6)$$

Here K is known as the kernel function and $\{h_{jg}\}_{j=1,\dots,p}$ are called the smoothing parameters for the g-th class.

In our practice we use the **normal kernel function** and take equal values for all $\{h_{jg}\}_{j=1,\dots,p}$ of the g-th class; our estimator then becomes

$$\hat{f}(X_u/g)=\frac{1}{N_g}\sum_{i=1}^{N_g}\prod_{j=1}^{p}\frac{1}{h_g\sqrt{2\pi}}\exp\!\left[-\frac{1}{2}\left(\frac{X_{uj}-X_{ij}}{h_g}\right)^{2}\right]\qquad(7)$$

Here we estimated the h value for each class g and denoted it as $h_g$.

#### 7.3.2 Adaptive kernel estimator as IBC

A practical drawback of the kernel method of density estimation is its inability to deal satisfactorily with the tails of distributions without over-smoothing the main part of the density. Our data contain a considerable number of outliers, and their densities are multimodal. As a result, the general method smooths the inner part of the density where it should not, because of the close overlap of the data. This pattern in the data calls for a density estimator that can sense small masses of probability and is robust at the same time. From the kernel density point of view, if the window function can be adapted locally depending upon the local data, we can obtain the optimum estimator. So here we used one of the popularly known adaptive approaches, the adaptive kernel method. In this method an additional local smoothing parameter is used alongside the global smoothing parameter. This local parameter is estimated from a pilot estimate of the density. Previous works show that this method handles the tail probabilities excellently [1, 28–30].

The adaptive kernel estimator is,

$$\hat{f}(X_u/g)=\frac{1}{N_g}\sum_{i=1}^{N_g}\frac{1}{h_{ig}^{\,p}}K\!\left(\frac{X_u-X_i}{h_{ig}}\right)\qquad(8)$$

where the $N_g$ smoothing parameters $h_{ig}$ ($i=1,\dots,N_g$) are based on some pilot estimate $\tilde{f}(X/g)$ of the density. The smoothing parameters $h_{ig}$ are specified as $h_g a_{ig}$, where $h_g$ is a global smoothing parameter for the class and the $a_{ig}$ are local smoothing parameters given by

$$a_{ig}=\left\{\tilde{f}(X_{ig}/g)/C_g\right\}^{-\alpha_g}\qquad(i=1,\dots,N_g)\qquad(9)$$

where $\log C_g=\frac{1}{N_g}\sum_{i=1}^{N_g}\log\tilde{f}(X_{ig}/g)$ and $\tilde{f}(X_{ig}/g)>0$ ($i=1,\dots,N_g$), and where $\alpha_g$ is the sensitivity parameter satisfying $0\le\alpha_g\le 1$. In our practical application we took $\alpha_g$ to be 0.5.

Here, in our case, we have taken the general kernel estimator as the pilot estimate $\tilde{f}(X/g)$ of the density.
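Given pilot density values evaluated at a class's own training points, the local factors of Eq. (9) and the per-point bandwidths $h_{ig}=h_g a_{ig}$ can be sketched as follows (with $\alpha_g=0.5$ as in the text; the pilot values here are hypothetical):

```python
import numpy as np

def adaptive_bandwidths(pilot_values, h_g, alpha_g=0.5):
    """Per-point bandwidths h_ig = h_g * a_ig from Eq. (9):
    a_ig = (pilot(X_ig)/C_g)^(-alpha_g), where log C_g is the mean
    log pilot density (so C_g is the geometric mean)."""
    pilot = np.asarray(pilot_values, dtype=float)
    C_g = np.exp(np.mean(np.log(pilot)))   # geometric mean of pilot values
    a = (pilot / C_g) ** (-alpha_g)        # low pilot density -> wider kernel
    return h_g * a

# Points in sparse regions (low pilot density) get larger bandwidths,
# which is how the tails are handled without over-smoothing the core.
h_ig = adaptive_bandwidths([0.1, 1.0, 1.0, 0.05], h_g=0.8)
```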

Now, plugging these estimates $\hat{f}(X_u/g)$ into Eq. (4), we get,

$$\hat{P}(g/X_u)=\frac{q_g\,\hat{f}(X_u/g)}{\sum_{g'=1}^{k}q_{g'}\,\hat{f}(X_u/g')}\qquad(10)$$

Then the decision rule becomes,

**Assign unit u to class g if $q_g\,\hat{f}(X_u/g) > q_h\,\hat{f}(X_u/h)$ for all h ≠ g**,

where $\hat{f}(X_u/g)$ differs between the two kernel methods.

In both of these kernel methods we have to estimate some parameters, namely the window width **h** for each class. To simplify the problem we assumed the same window width for every class; because in the kernel method these windows are smoothed again, a collective choice is not a bad idea. We selected the smoothing parameters simultaneously by minimizing the cross-validated estimate of the classification error of the Bayes rule, obtained by plugging these kernel estimates of the group-conditional densities from the knowledge base into the rule.
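The cross-validated choice of h can be sketched as a leave-one-out loop over a bandwidth grid. This is a simplified illustration on toy data, using a single shared bandwidth and the general (non-adaptive) kernel.

```python
import numpy as np

def _kde(Xg, x, h):
    # Shared-bandwidth product normal kernel, as in Eq. (7).
    z = (x - Xg) / h
    return np.prod(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi)), axis=1).mean()

def cv_error(X, y, h):
    """Leave-one-out cross-validated misclassification rate of the
    plug-in Bayes rule for one candidate bandwidth h."""
    classes = np.unique(y)
    idx = np.arange(len(X))
    errors = 0
    for i in idx:
        scores = []
        for c in classes:
            Xc = X[(idx != i) & (y == c)]   # class-c data without unit i
            q_c = len(Xc) / (len(X) - 1)    # sample prior
            scores.append(q_c * _kde(Xc, X[i], h))
        errors += int(classes[int(np.argmax(scores))] != y[i])
    return errors / len(X)

# Pick the bandwidth with the smallest cross-validated error.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
y = np.array([0, 0, 0, 1, 1, 1])
grid = [0.1, 0.5, 1.0, 2.0]
best_h = min(grid, key=lambda h: cv_error(X, y, h))
```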

It is worth noting that the maximum probability rule is just a basic concept for modeling and decision making; it depends heavily on the estimation step described above, which is the heart of the modeling process. As these estimation methods improve, the modeling becomes much more accurate and the classification technique becomes stronger. This was evident at the time of testing: the adaptive kernel method, with its superiority in estimating tail probabilities, gets an edge over the other methods.