PCA analysis
PCA is an effective linear method for discriminating between e-nose responses to simple and complex odours [8]. The method expresses the response vectors as linear combinations of orthogonal vectors (the principal components), each of which accounts for a certain amount of the variance in the data. The results of the PCA are shown in Figure 3. Three principal components were kept, which together accounted for 99.87% of the variance (PC1, PC2 and PC3 representing 98.82%, 0.94% and 0.11%, respectively). Six categories or clusters appear to be evident. These six clusters match the six types of bacteria, so that the bacteria were separated in the principal component space, but the classification accuracy was only up to 74%. From the PCA, four classes of bacteria, namely Staphylococcus aureus (sar), Haemophilus influenzae (hai), Streptococcus pneumoniae (stp) and Moraxella catarrhalis (moc), were properly classified, whereas two other common classes of eye bacteria, Escherichia coli (eco) and Pseudomonas aeruginosa (psa), were not (see Figure 3). Most of the variance in the data is explained by the first principal component (PC1) alone, which implies that the sensor responses are highly correlated. Because PC1 accounts for most of the information in the data, PCA did not make the clusters any more evident; that is, linear PCA is not informative for this type of data. The objective of this analysis was to establish simple classes for the different bacteria species in order to examine whether the data clusters could be separated prior to the conventional pattern recognition stage.
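The variance percentages quoted above are the kind of quantity obtained from the singular values of the mean-centred response matrix. The following is a minimal sketch of that calculation, not the study's own code: the data are a synthetic stand-in for a 32-sensor response matrix, the function name is hypothetical, and mean-centring without further scaling is an assumption, since the preprocessing used before PCA is not stated.

```python
import numpy as np

def pca_explained_variance(X, n_components=3):
    """Return (scores, explained-variance ratios) for the rows of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-centre each sensor (assumption)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)              # fraction of variance per component
    scores = Xc @ Vt[:n_components].T            # projections onto the kept components
    return scores, ratio[:n_components]

# Synthetic stand-in data: 60 observations of 32 highly correlated "sensors"
rng = np.random.default_rng(0)
common = rng.normal(size=(60, 1))
X_demo = common @ np.ones((1, 32)) + 0.05 * rng.normal(size=(60, 32))

scores, ratio = pca_explained_variance(X_demo)
print(ratio)  # the first entry dominates when the columns are highly correlated
```

When the columns are strongly correlated, as in this synthetic example, almost all of the variance falls on the first component, which mirrors the situation described for PC1.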
Less correlated sensor selection
Previous experience with the Cyranose 320 system suggests that some of the sensors could be omitted from the data analysis, because the sensor responses are highly correlated. The information in the data is best represented by using the least correlated sensors.
We therefore calculated the correlation coefficients of the sensors by applying the MATLAB 6.1 function "corrcoef" [7] to the sensor response matrix of the whole data set, in which each row is an observation (the gathered response of the sensors) and each column is a variable (a sensor). The least correlated sensors were then selected by summing the resulting correlation coefficient matrix column-wise and sorting the sums to find the smallest values. The load values of all the sensors were also considered in the selection. From the correlation coefficient matrix, the three least correlated sensors were sensors 23, 24 and 26.
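The selection step described above (correlation matrix, column-wise summation, sorting) can be sketched as follows. This is an illustration in Python with NumPy rather than the MATLAB code used in the study; the function name and the synthetic data are hypothetical, and summing the signed (rather than absolute) coefficients is an assumption, since the text does not say which was used.

```python
import numpy as np

def least_correlated_sensors(X, k=3):
    """Return the indices of the k sensors with the smallest column sums
    of the correlation coefficient matrix, plus all column sums."""
    R = np.corrcoef(X, rowvar=False)   # rows = observations, columns = sensors
    col_sums = R.sum(axis=0)           # column-wise summation
    order = np.argsort(col_sums)       # smallest sums first
    return order[:k], col_sums

# Synthetic stand-in data: sensors 0-2 share a signal, sensors 3-4 do not
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
X_demo = np.hstack([
    shared + 0.1 * rng.normal(size=(200, 3)),  # strongly correlated sensors
    rng.normal(size=(200, 2)),                 # independent sensors
])

selected, col_sums = least_correlated_sensors(X_demo, k=2)
print(selected)  # zero-based indices of the least correlated sensors
```

Note that the indices returned here are zero-based, whereas the sensor numbers quoted in the text are the instrument's own labels.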
Combined SOM, FCM and 3D – Scatter plot analysis: A new approach
SOM and FCM were applied to the data set in order to investigate clustering using the responses from the 32 sensors. A SOM network is a non-linear Artificial Neural Network (ANN) paradigm that is able to accumulate statistical information about the data with no supplementary information other than that provided by the sensors [9]. Various SOM networks were created and trained with the entire data set. Subsequently, each sample was associated with one of the neurones, and the neurones were grouped together to form categories corresponding to each identified bacterium.
FCM is a fuzzy data clustering and partitioning algorithm in which each data point belongs to a cluster according to its degree of membership [10]. With FCM, an initial estimate of the number of clusters is needed so that the data set is split into C fuzzy groups. A cluster centre is found for each group by minimising a dissimilarity function. Fuzzy clustering essentially deals with the task of splitting a set of patterns into a number of more-or-less homogeneous classes (clusters) with respect to a suitable similarity measure such that the patterns belonging to any one of the clusters are similar and the patterns of different clusters are as dissimilar as possible. The similarity measure used has an important effect on the clustering results since it indicates which mathematical properties of the data set should be used in order to identify the clusters. Fuzzy clustering provides partitioning results with additional information supplied by the cluster membership values indicating different degrees of belongingness [10].
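As an illustration of the membership and cluster-centre updates, a minimal FCM sketch is given below. It assumes the common textbook formulation with Euclidean distance and a fuzzifier m = 2; neither is stated in the text, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means: returns (cluster centres, membership matrix)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)     # closer centre, higher membership
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centres, U

# Synthetic stand-in data: two groups of points in two dimensions
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
                    rng.normal(4.0, 0.3, size=(30, 2))])
centres, U = fuzzy_c_means(X_demo, n_clusters=2)
```

Each row of `U` holds the degrees of membership of one data point in each cluster, and a hard assignment can be read off with `U.argmax(axis=1)`.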
An innovative data clustering approach was investigated for these bacteria data by combining the 3-dimensional scatter plot, FCM and a SOM network, as depicted in Figure 4. The normalised data sets were represented in multisensor space using 3D scatter plots. With FCM, a cluster centre is found for each group by minimising a dissimilarity function [7], and these cluster centres were plotted in the same multisensor space. Combining the 3D scatter plot and FCM thus located the cluster centres both in multisensor space and within the data. Various SOM networks were created and trained with the entire data set; of these, a [6 × 1] and a [3 × 2] SOM network performed best. In Figure 4, the six neurones at the bottom indicate the initial weights of the SOM before training. After 5000 epochs the six nodes were clearly approaching the six cluster centres estimated using FCM (see Figure 4). Using these three data clustering methods together therefore gave a better 'classification' of the six classes of eye bacteria. A [6 × 1] SOM network gave 96% accuracy for bacteria classification, which was the best accuracy obtained with the SOM networks in combination with the FCM and 3D scatter plot methods. The objective of this analysis was to establish simple classes for the different bacteria species in order to examine whether the data clusters could be separated for the conventional pattern recognition stage.
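The way the nodes of a one-dimensional SOM move towards the data during training can be sketched as below. This is a generic online SOM with linearly decaying learning rate and neighbourhood width, not the configuration used in the study: the initialisation from random samples, the decay schedules, the parameter values and the function names are all assumptions made for illustration.

```python
import numpy as np

def train_som_1d(X, n_nodes=6, epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Train a [n_nodes x 1] SOM and return its weight vectors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initial weights: randomly chosen samples (assumption)
    W = X[rng.choice(len(X), n_nodes, replace=False)].copy()
    grid = np.arange(n_nodes)                      # node positions on the chain
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.01)   # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))       # best-matching node
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)         # pull nodes towards the sample
    return W

def assign(X, W):
    """Associate each sample with its nearest node."""
    d = np.linalg.norm(np.asarray(X)[:, None, :] - W[None, :, :], axis=2)
    return d.argmin(axis=1)

# Synthetic stand-in data: three groups in a 3-D "multisensor" space
rng = np.random.default_rng(1)
groups = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [10.0, 0.0, 5.0]])
X_demo = np.vstack([g + 0.3 * rng.normal(size=(20, 3)) for g in groups])
W = train_som_1d(X_demo, n_nodes=3, epochs=50)
labels = assign(X_demo, W)
```

The trained weight vectors in `W` can be plotted in the same 3-D space as the data and the FCM cluster centres to compare their positions.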
Evaluation of neural network-classification performance
Neural Networks
The data set for the six different bacteria was analysed using three supervised ANN classifiers, namely the Multi Layer Perceptron (MLP), Probabilistic Neural Network (PNN) and Radial Basis Function (RBF) network paradigms. The neural networks were trained with 40% of the whole data set, and the remaining 60% was used for testing. These percentages were selected arbitrarily and were applied to all data sets. The aim of this comparative study was to identify the most appropriate ANN paradigm, that is, the one that can be trained with the best accuracy to predict the "type of eye infection" or, in other words, the "type of eye bacteria".
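A 40%/60% partition of this kind can be sketched as follows. The text does not say how samples were assigned to the two parts, so the random shuffle, the fixed seed, the function name and the sample count are all assumptions made for illustration.

```python
import random

def split_indices(n_samples, train_fraction=0.4, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]

# 180 is a hypothetical sample count, not the size of the study's data set
train_idx, test_idx = split_indices(180)
print(len(train_idx), len(test_idx))
```

A stratified split, which keeps the proportion of each bacteria class equal in both parts, would be a reasonable alternative when class sizes differ.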
Performance of MLP, RBF and PNN
For MLP
An MLP network (with a learning rate of 0.2 and a momentum term of 0.3) with 3–32 inputs and 6 output neurons reached a classification success rate of 75%.
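For reference, a learning rate and a momentum term usually enter gradient-descent training through the weight update below. This is the textbook form and an assumption here, since the text does not state which training algorithm was used; the symbols $E$ (output error), $w$ (a network weight) and $t$ (iteration index) are introduced only for illustration.

```latex
\Delta w(t) = -\eta \, \frac{\partial E}{\partial w} + \alpha \, \Delta w(t-1),
\qquad \eta = 0.2, \quad \alpha = 0.3
```

Here $\eta$ is the learning rate and $\alpha$ the momentum term, which carries over a fraction of the previous weight change to smooth the updates.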
For RBF and PNN
Neurons were added to the network until the sum-squared error (SSE) fell beneath an error goal (0.000001) or a maximum number (40) of internal neurons was reached. It is important that the spread parameter be large enough that the radial basis neurons respond to overlapping regions of the input space, but not so large that all the neurons respond in essentially the same manner [7]. For both networks the spread parameter was set to 1.0.
The PNN correctly classified 94% of the response vectors, whereas the RBF network correctly classified up to 98%.
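The role of the spread parameter can be illustrated with a minimal PNN sketch. It is not the toolbox implementation used in the study: the function name and the toy data are hypothetical, and the Gaussian is scaled on the assumption that a neuron's output falls to 0.5 at a distance equal to the spread. The sketch also does not reproduce the incremental neuron-adding design described for the RBF network.

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, spread=1.0):
    """Classify X_test with one Gaussian neuron per training pattern."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    # Squared Euclidean distance from every test vector to every training pattern
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Radial basis activation: 0.5 when the distance equals `spread` (assumption)
    act = np.exp(-np.log(2.0) * d2 / spread ** 2)
    # Sum the activations per class and pick the class with the largest sum
    scores = np.stack([act[:, y_train == c].sum(axis=1) for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]

# Toy data for illustration only
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 1, 1]
predicted = pnn_predict(X_train, y_train, [[0.2, 0.4], [5.1, 5.4]], spread=1.0)
print(predicted)  # [0 1]
```

With a very small spread each neuron responds only to vectors almost identical to its own training pattern, and with a very large spread all neurons respond almost equally, which is the trade-off described above.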
T-test
A t-test was performed to assess whether the RBF and PNN networks performed significantly better than the MLP in terms of the total number of patterns correctly classified. The null hypothesis H0 stated that there was no significant difference between the mean number of patterns misclassified by the MLP and by the RBF or PNN network. The hypothesis H0 was rejected at the 4% significance level (t = 2.19 for RBF and t = 4.49 for PNN).
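A two-sample t statistic of this kind can be computed as sketched below. The text does not specify which form of the t-test was used, so the unequal-variance (Welch) form is an assumption, and the misclassification counts are hypothetical values chosen only to show the arithmetic; they are not the study's data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch form)."""
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))   # standard error of the difference
    return (mean(sample_a) - mean(sample_b)) / se

# HYPOTHETICAL misclassification counts over repeated runs, not the study's data
mlp_errors = [10, 12, 14]
rbf_errors = [4, 6, 8]
t_stat = welch_t(mlp_errors, rbf_errors)
print(round(t_stat, 3))  # 3.674
```

The statistic is then compared with the critical value of the t distribution at the chosen significance level to decide whether H0 is rejected.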