In this approach we do not assume any pre-specified parametric form for the **pdf** of a distribution, so it is also known as a distribution-free method. There are four major types of non-parametric **pdf** estimators: the histogram, the kernel method, the k-nearest-neighbour method, and the series method. Here we use only the kernel estimate of the pdf and a further development of this method, generally known as the adaptive kernel estimator.

#### 7.3.1 General kernel estimator based on Bayesian rule

In general, the kernel estimator takes the form

$\widehat{f}({X}_{u}/g)=\frac{1}{{N}_{g}}{\displaystyle \sum _{i=1}^{{N}_{g}}K({X}_{u}-{X}_{i})}\left(5\right)$

Imposing the conditions $K(z) \ge 0$ and $\int K(z)\,dz = 1$ on $K$, it is easy to see that $\widehat{f}$ also satisfies $\widehat{f}(z) \ge 0$ and $\int \widehat{f}(z)\,dz = 1$, so that $\widehat{f}$ is a legitimate pdf. Using the product form of the kernel estimates we get,

$\widehat{f}({X}_{u}/g)=\frac{1}{{N}_{g}}{\displaystyle \sum _{i=1}^{{N}_{g}}{\displaystyle \prod _{j=1}^{p}\frac{1}{{h}_{jg}}K(\frac{{X}_{uj}-{X}_{ij}}{{h}_{jg}})}}\left(6\right)$

Here $K$ is known as the kernel function and ${\left\{{h}_{jg}\right\}}_{j=1,p}$ are called the smoothing parameters for the $g$-th class.

In our practice we use the **normal kernel function** and take equal values for all ${\left\{{h}_{jg}\right\}}_{j=1,p}$ of the $g$-th class; our estimator then becomes

$\widehat{f}({X}_{u}/g)=\frac{1}{{N}_{g}}{\displaystyle \sum _{i=1}^{{N}_{g}}{\displaystyle \prod _{j=1}^{p}\frac{1}{{h}_{g}\sqrt{2\pi }}\mathrm{exp}\left[-\frac{1}{2}{\left(\frac{{X}_{uj}-{X}_{ij}}{{h}_{g}}\right)}^{2}\right]}}\left(7\right)$

Here we estimated the $h$ value for each class $g$ and denoted it as $h_g$.
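As an illustration, the fixed-bandwidth estimator of equation (7) can be sketched in Python; this is our own minimal sketch (the function name `kernel_density` is an assumption, not from the original work):

```python
import numpy as np

def kernel_density(x_u, X_g, h_g):
    """Fixed-bandwidth product normal kernel estimate of f(X_u / g), eq. (7).

    x_u : point to evaluate, shape (p,)
    X_g : training sample of class g, shape (N_g, p)
    h_g : common smoothing parameter for class g
    """
    X_g = np.asarray(X_g, dtype=float)
    N_g, p = X_g.shape
    # standardized differences between x_u and every training point, (N_g, p)
    z = (np.asarray(x_u, dtype=float) - X_g) / h_g
    # product over the p coordinates of univariate normal kernels
    terms = np.prod(np.exp(-0.5 * z ** 2) / (h_g * np.sqrt(2.0 * np.pi)), axis=1)
    return terms.sum() / N_g
```

The estimate is an average of $N_g$ small normal bumps, one centred at each training point, so it is non-negative and integrates to one.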

#### 7.3.2 Adaptive kernel estimator as IBC

A practical drawback of the kernel method of density estimation is its inability to deal satisfactorily with the tails of a distribution without oversmoothing the main part of the density. Our data contain a fair number of outliers and their densities are multimodal, so the fixed-bandwidth method smooths the inner part of the density where it should not, owing to the close overlap of the data. This pattern of the data calls for a density estimator that can sense small masses of probability and is at the same time robust. From the kernel density point of view, if the window function can be adapted locally to the data, we can approach the optimum. So here we used one of the well-known adaptive approaches, the adaptive kernel method. In this method an additional local smoothing parameter is used along with the global smoothing parameter; the local one is estimated from a pilot estimate of the density. Previous work shows that this method handles the tail probabilities very well [1, 28–30].

The adaptive kernel estimators will be,

$\widehat{f}({X}_{u}/g)=\frac{1}{{N}_{g}}{\displaystyle \sum _{i=1}^{{N}_{g}}\frac{1}{{h}_{ig}^{p}}K\left(\frac{{X}_{u}-{X}_{i}}{{h}_{ig}}\right)}\left(8\right)$

where the $N_g$ smoothing parameters $h_{ig}$ $(i = 1,\ldots,N_g)$ are based on some pilot estimate of the density $\tilde{f}(X/g)$. The smoothing parameters $h_{ig}$ are specified as $h_g a_{ig}$, where $h_g$ is a global smoothing parameter of the class and the $a_{ig}$ are local smoothing parameters given by

$\begin{array}{cc}{a}_{ig}={\left\{\tilde{f}({X}_{ig}/g)/{C}_{g}\right\}}^{-{\alpha}_{g}}& (i=1,\ldots,{N}_{g})\end{array}\left(9\right)$

where $\mathrm{log}{C}_{g}=\frac{1}{{N}_{g}}{\displaystyle \sum _{i=1}^{{N}_{g}}\mathrm{log}(\tilde{f}({X}_{ig}/g))}$

and $\tilde{f}(X_{ig}/g) > 0$ $(i = 1,\ldots,N_g)$, and where $\alpha_g$ is the sensitivity parameter satisfying $0 \le \alpha_g \le 1$. In practical application we took $\alpha_g$ to be 0.5.
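A minimal sketch of the local factors of equation (9), using the fixed-bandwidth product normal kernel estimate as the pilot; the function and variable names here are our own illustration:

```python
import numpy as np

def adaptive_bandwidths(X_g, h_g, alpha_g=0.5):
    """Local bandwidths h_ig = h_g * a_ig, with a_ig from eq. (9).

    The pilot f~(X_ig / g) is the fixed-bandwidth product normal
    kernel estimate evaluated at each training point of the class.
    """
    X_g = np.asarray(X_g, dtype=float)
    N_g, p = X_g.shape
    # pilot density at each training point (leave-in, as a simple choice)
    pilot = np.empty(N_g)
    for i in range(N_g):
        z = (X_g[i] - X_g) / h_g
        pilot[i] = np.prod(np.exp(-0.5 * z ** 2) / (h_g * np.sqrt(2.0 * np.pi)),
                           axis=1).sum() / N_g
    log_C = np.log(pilot).mean()               # log C_g: mean of log pilot densities
    a = (pilot / np.exp(log_C)) ** (-alpha_g)  # a_ig = {f~(X_ig/g) / C_g}^(-alpha_g)
    return h_g * a                             # h_ig
```

Since $C_g$ is the geometric mean of the pilot densities, the $a_{ig}$ themselves have geometric mean one: points in sparse regions (low pilot density) get wider windows, points in dense regions narrower ones, while the overall amount of smoothing is preserved.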

Here in our case we have taken the general kernel estimator as the pilot estimate of the density $\tilde{f}(X/g)$.

Now, plugging these estimates $\widehat{f}(X_u/g)$ into equation 2, we get,

$\widehat{P}(g/{\text{X}}_{\text{u}})=\frac{{q}_{g}\cdot \widehat{f}({X}_{u}/g)}{{\displaystyle \sum _{{g}^{\prime}=1}^{k}{q}_{{g}^{\prime}}\cdot \widehat{f}({X}_{u}/{g}^{\prime})}}.\left(10\right)$

Then the decision rule becomes,

**Assign unit u to class g if** $q_g \cdot \widehat{f}(X_u/g) > q_h \cdot \widehat{f}(X_u/h)$ **for g ≠ h**,

where $\widehat{f}(X_u/g)$ differs between the two methods.
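The plug-in rule of equation (10) can be sketched as follows. This is our own illustration; `density` stands in for either the fixed or the adaptive kernel estimate, and the function names are assumptions:

```python
import numpy as np

def density(x_u, X_g, h_g):
    """Fixed-bandwidth product normal kernel estimate, as in eq. (7)."""
    X_g = np.asarray(X_g, dtype=float)
    z = (np.asarray(x_u, dtype=float) - X_g) / h_g
    return np.prod(np.exp(-0.5 * z ** 2) / (h_g * np.sqrt(2.0 * np.pi)),
                   axis=1).sum() / len(X_g)

def classify(x_u, class_data, priors, h):
    """Posterior P(g / x_u) from eq. (10) and the arg-max decision rule.

    class_data : {class label: training sample (N_g, p)}
    priors     : {class label: q_g}
    h          : {class label: smoothing parameter h_g}
    """
    scores = {g: priors[g] * density(x_u, X_g, h[g])
              for g, X_g in class_data.items()}
    total = sum(scores.values())
    posteriors = {g: s / total for g, s in scores.items()}  # eq. (10)
    return max(posteriors, key=posteriors.get), posteriors
```

Because the denominator of equation (10) is common to all classes, the arg-max over posteriors is the same as the arg-max over the products $q_g \cdot \widehat{f}(X_u/g)$, which is exactly the decision rule above.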

In both of these kernel methods we have to estimate some parameters, namely the window width **h** for each class. To simplify the problem we assumed the same window function for each class; since in the kernel method these windows are smoothed again, a collective choice is not unreasonable. We selected the smoothing parameters simultaneously by minimizing the cross-validated estimate of the classification error of the Bayes rule, plugging in these kernel estimates of the group-conditional densities.
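The cross-validated selection described above might be sketched as a leave-one-out search over a bandwidth grid. This is a simplified illustration under the assumption of a single shared `h`; the grid values and helper names are ours:

```python
import numpy as np

def cv_error(class_data, priors, h):
    """Leave-one-out cross-validated misclassification rate of the
    plug-in Bayes rule for a common smoothing parameter h."""
    errors, total = 0, 0
    for g, X_g in class_data.items():
        X_g = np.asarray(X_g, dtype=float)
        for i in range(len(X_g)):
            scores = {}
            for g2, X2 in class_data.items():
                X2 = np.asarray(X2, dtype=float)
                if g2 == g:
                    X2 = np.delete(X2, i, axis=0)  # leave the test point out
                z = (X_g[i] - X2) / h
                f_hat = np.prod(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi)),
                                axis=1).sum() / len(X2)
                scores[g2] = priors[g2] * f_hat
            errors += max(scores, key=scores.get) != g
            total += 1
    return errors / total

def select_bandwidth(class_data, priors, grid):
    """Pick the h on the grid minimizing the cross-validated error."""
    return min(grid, key=lambda h: cv_error(class_data, priors, h))
```

In practice the error curve is flat over ranges of `h` (it counts discrete misclassifications), so a coarse grid followed by a finer local search is a common choice.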

It is worth noting that the maximum-probability rule is just a basic concept of modeling and decision making; it depends heavily on the estimation step described above, which is the heart of the modeling process. As these estimation methods improve, the modeling becomes more accurate and the classification technique stronger. This was evident at testing time: the adaptive kernel method, with its superior estimation of tail probabilities, gains an edge over the other methods.