The optics, electronics, and signal analysis of the PVS have been reported in more detail previously [16–18]. The present work focuses on the use of artificial neural networks as an alternative to classical statistical methods. The goal of this study was to improve the classification algorithms where possible, and to validate the performance of the research instrument on the same group of young test subjects used in the previous study, with the addition of two more subjects. All subjects’ data were analysed with both the methods from the previous paper and the neural network method reported here. For more detail on the human subject data, please see the “Data” section below. The neural network performance is compared with the four statistical methods that were applied to the same dataset earlier, and the ability to separate patients with abnormalities from controls is investigated.
Artificial neural networks
Artificial neural networks have been widely used, and the related theory has matured over the last three decades [29–35]. Feedforward neural networks (FFNs) are static, i.e. networks with no feedback elements and no delays. They are widely used to solve complex problems in pattern classification, system modeling and identification, and non-linear signal processing, and in analyzing non-linear multivariate data. One of the characteristics of the FFN is its learning (or training) ability [36]: learning takes place in both the hidden and output layers. After training, FFNs can give correct answers not only for learned examples, but also for inputs similar to the learned examples, showing a strong associative and generalizing ability that makes them suitable for solving large, nonlinear, and complex classification and function approximation problems. The classical method for training FFNs is the backpropagation (BP) algorithm [31], which is based on the gradient descent optimization technique.
Many tools have been developed for creating and testing ANNs. Among them, probably the most significant one is the Neural Networks Toolbox for MATLAB from MathWorks, Inc. [37]. The author employed this toolbox for creating, training, and testing the network with both calibration and clinical data. Another widely used tool in the field is the Netlab simulation software, designed to provide the central tools necessary for the simulation of theoretically well-founded neural network algorithms for use in teaching, research, and application development. It consists of a library of MATLAB functions and scripts based on the approach and techniques described in the book Neural Networks for Pattern Recognition by Dr. Christopher Bishop [38] of the Department of Computer Science and Applied Mathematics at Aston University, Birmingham, UK.
Neural network architecture
For the present application, several ANN architectures were tested. To avoid overfitting (explained later under Generalization), a relatively simple architecture was selected (Fig. 1), consisting of one hidden layer and one output layer. This two-layer network has an input vector p with four elements (p1–p4), representing the normalized RBS spectral powers P2.5/P4.5, P3.5/P4.5, P5.5/P4.5, and P6.5/P4.5 that are generated during retinal birefringence scanning. The hidden layer contains four neurons. Each neuron is connected to each of the inputs through the input weight matrix IW:
$$\mathbf{IW} = \begin{bmatrix} iw_{1,1} & \cdots & iw_{1,4} \\ \vdots & \ddots & \vdots \\ iw_{4,1} & \cdots & iw_{4,4} \end{bmatrix}$$
(1)
The i-th neuron has a summer that gathers its weighted inputs $iw_{i,j}$ and a bias $b_{1,i}$ to form its scalar output $n_{i}$:
$$n_{i} = iw_{i,1}\,p_{1} + iw_{i,2}\,p_{2} + iw_{i,3}\,p_{3} + iw_{i,4}\,p_{4} + b_{1,i}$$
(2)
where each $n_{i}$ is the dot product (inner product) of the i-th row of IW with the input vector p; in matrix form:
$$\mathbf{n} = \mathbf{IW}\,\mathbf{p} + \mathbf{b}_{1}$$
(3)
where $\mathbf{b}_{1}$ is a four-element vector representing the four biases, one for each neuron.
Each $n_{i}$ is then processed by a sigmoid transfer function $f^{1}$ to deliver a neuron output $a_{i}$. The four-element output vector of the four neurons (and of the hidden layer as a whole) can be represented in matrix form as:
$$\mathbf{a} = f^{1}\left( \mathbf{IW}\,\mathbf{p} + \mathbf{b}_{1} \right)$$
(4)
The four neuron outputs are then fed to the output layer, which has a neuronal structure as well. Its scalar output y can be represented by the equation:
$$y = f^{2}\left( \mathbf{LW}\,\mathbf{a} + b_{2} \right) = f^{2}\left\{ \mathbf{LW}\, f^{1}\left( \mathbf{IW}\,\mathbf{p} + \mathbf{b}_{1} \right) + b_{2} \right\}$$
(5)
where LW is the output-layer weight matrix and the scalar $b_{2}$ is the output neuron’s bias:
$$\mathbf{LW} = \left[ \, lw_{1} \;\; lw_{2} \;\; lw_{3} \;\; lw_{4} \, \right]$$
(6)
The weights and biases were calculated during the training of the network, as explained later. The sigmoid transfer functions for the hidden layer, $f^{1}$, and for the output layer, $f^{2}$, were chosen to be the same, namely the log-sigmoid transfer function (logsig):
$$\mathrm{logsig}(n) = 1 / \left( 1 + \exp(-n) \right)$$
(7)
The function logsig generates outputs between 0 and 1 as the neuron’s net input goes from negative to positive infinity. As mentioned above, the neural network shown in Fig. 1 is an FFN. Feedforward networks consist of a series of layers: the first layer has a connection from the network input, each subsequent layer has a connection from the previous layer, and the final layer produces the network’s output. FFNs can be used for any kind of input-to-output mapping. A feedforward network with one hidden layer and enough neurons in that layer can fit any finite input–output mapping problem and can be used as a general function approximator: it can approximate, arbitrarily well, any function with a finite number of discontinuities, given a sufficient number of neurons in the hidden layer. Specialized versions of the feedforward network include fitting and pattern recognition networks. Pattern recognition networks are the ANN of choice for classification problems such as ours, in which we want a neural network to classify inputs into a set of target categories. They are FFNs that can be trained to classify inputs according to already verified target classes, in our case verified central fixation versus paracentral fixation.
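To make the forward computation of Eqs. (3)–(7) concrete, the following minimal MATLAB sketch evaluates them directly. It assumes the weights IW, LW and biases b1, b2 have already been obtained by training; the input vector shown is an arbitrary illustrative example:

% Forward pass through the two-layer network, Eqs. (3)-(7).
% IW (4x4), b1 (4x1), LW (1x4), b2 (scalar) are assumed given by training.
logsigf = @(n) 1 ./ (1 + exp(-n));    % log-sigmoid transfer function, Eq. (7)
p = [0.8; 1.1; 0.9; 1.0];             % hypothetical normalized spectral powers
a = logsigf(IW * p + b1);             % hidden layer output, Eq. (4)
y = logsigf(LW * a + b2);             % network output, Eq. (5)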
Creating the neural network
Following the above reasoning, our neural network was created (as a network object) using the MATLAB Toolbox’s pattern recognition network creation function patternnet:
$$\texttt{net = patternnet(hiddenLayerSize, trainFcn);}$$
(8)
The parameter hiddenLayerSize here is 4, corresponding to the four neurons in the hidden layer. One can change the number of neurons if the network does not perform well after training, and then retrain. The parameter trainFcn defines the training function, which in our case is ‘trainscg’, standing for the scaled conjugate gradient backpropagation method for updating weight and bias values during training [39]. It performed slightly better on our data than the popular and faster Levenberg–Marquardt (LM) training algorithm [40]. Backpropagation (explained below) is used to calculate derivatives of performance perf with respect to the weight and bias variables.
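Putting these choices together, the network creation step reduces to a few lines (a minimal sketch with the parameter values described above):

hiddenLayerSize = 4;                  % four neurons in the hidden layer
trainFcn = 'trainscg';                % scaled conjugate gradient backpropagation
net = patternnet(hiddenLayerSize, trainFcn);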
Data
To define a pattern recognition problem, data are generally arranged in a set of Q input vectors (measurements) as columns in a matrix. Then another set of Q target vectors is arranged, so that they indicate the classes to which the input vectors are assigned. Classification problems involving only two classes (as in our case) can be represented by target vectors consisting of scalar 1/0 elements, which is the format used in this study. Alternatively, the target can be represented by two-element vectors, with one element being 1 and the other 0. In the general case, the target data for pattern recognition networks should consist of vectors of all zero values except for a 1 in element i, where i is the class they represent.
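As a hypothetical illustration of the two equivalent encodings, for four measurements with arbitrarily chosen classes:

t_scalar = [1 0 0 1];                 % scalar 1/0 format used in this study: 1 = CF, 0 = para-CF
t_onehot = [1 0 0 1; 0 1 1 0];        % two-element column vectors: row 1 = CF, row 2 = para-CF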
The calibration data comprised the same data set that was used in an earlier study [18]. Briefly, with the pediatric vision screener [16–18] we recorded signals from five asymptomatic normal volunteers, ages 10, 18, 24, 29 and 39, two female and three male, of them three Caucasian, one African American, and one Asian, all properly consented. The subjects were asked to look first at the blinking target in the center of the scanning circle, for central fixation (CF). Twelve measurements of duration 1 s each were taken, in order to obtain representative data while also accounting for factors such as fixation instability and distractibility. The calculated FFT powers for each measurement were saved on disk. The same type of measurement was repeated with each subject looking at imaginary “targets” on the scanning circle (1.5° off center) at 12, 3, 6, and 9 o’clock. The spacing was chosen such that there would be sufficient distance between the targets, to avoid confusion in the test subject and to overcome the natural instability of fixation. More fixation points, or more than 12 measurements per target, proved to diminish the efficiency of data collection because of fatigue in the test subjects. The data, consisting of powers P2.5, P6.5, P3.5, P5.5 and P4.5 for each of the 12 measurements of each eye of all five test subjects, were bundled into two groups: a group for central fixation (120 “eyes,” the “CF set”) and a group for paracentral fixation (480 “eyes,” the “para-CF set”). Data from these two controlled groups were used to create and calibrate the ANN. The data were organized as an input matrix of 4 rows and Q columns, with Q = 600 (120 measurements with CF and 480 measurements with para-CF). The target vector was a vector of length Q = 600, each element of which was either 1 (CF) or 0 (para-CF). These inputs and targets were used for training, validating, and testing the network. One can reasonably argue that this number of subjects (5) and eyes (10) is insufficient for providing reliable calibration with regard to the two classes (CF versus para-CF). Yet the variability of the RBS signal waveforms (and, respectively, of the five derived frequency powers) depends to a much higher extent on the subject’s direction of gaze and ability to fixate than on the individual variability of the foveal and corneal birefringence. This invariability, especially with respect to corneal birefringence, was achieved with the new design, as reported in our previous work [15–17]. The birefringence of the fovea is largely constant; it is the corneal birefringence that affects the signals. The cornea, acting as a retarder of a certain retardance and azimuth, influences the orientation of the polarization cross. In the design of the PVS [18], the corneal birefringence was compensated for by means of a wave plate (retarder), achieving broad uniformity across the population studied. The wave plate was optimized by means of a dynamic computer model of the retinal birefringence scanning system (including the retina and the cornea as part of a train of optical retarders), based on data from a database of 300 eyes [15].
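A sketch of how the calibration inputs and targets could be assembled in MATLAB; the variable names cfPowers and paraCfPowers are hypothetical placeholders for the stored normalized FFT powers:

x = [cfPowers, paraCfPowers];         % 4 x 600 input matrix (120 CF + 480 para-CF columns)
t = [ones(1,120), zeros(1,480)];      % 1 x 600 target vector: 1 = CF, 0 = para-CF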
The clinical data were also obtained with the pediatric vision screener (following an institutionally approved IRB protocol), and were nearly identical to the data set used in [18], with the addition of just two more subjects, both of whom were independently verified by a pediatric ophthalmologist. Thus, the total was 39 test subjects: 19 properly consented patients with known abnormalities (9 male and 10 female, of which 12 Caucasian, 2 African American, and 5 Asian), ages 4–25 (only one above 18), and 20 control subjects with proven lack of vision issues (10 male and 10 female, of which 16 Caucasian, 1 African American, and 3 Asian), ages 2–37 (only 4 above 18), all properly consented. All were recruited from the patients of the Division of Pediatric Ophthalmology at the Wilmer Eye Institute, or the patients’ siblings. All subjects underwent a vision exam by an ophthalmologist, during which eye alignment and refraction were tested. Eye alignment was tested by means of the cover test. First, the unilateral cover test was performed: the patient is asked to focus on a distant object while the doctor covers each eye in turn. If the uncovered eye has to move to fixate on the object, this may be evidence of strabismus. The second part of the exam is the alternating cover test: the patient is asked to focus on an object while the eye cover is switched from one eye to the other. If the doctor detects eye movement after the eye cover is removed, this is an indication of phoria (a tendency of the eyes to deviate from normal alignment when fusional stimuli are absent). A significant amount of phoria can lead to eyestrain and/or double vision. In total, verified information was available for 78 eyes. Data were organized as an input matrix of 4 rows and Q columns (Q = 78 eyes). The target vector was a vector of length Q (Q = 78), each element of which was either 1 (CF) or 0 (para-CF). These inputs and targets, as well as the network outputs, were used for testing the performance of the ANN and comparing it to the statistical methods reported earlier [18].
Preprocessing and postprocessing
Neural network training can be made more efficient if one performs certain preprocessing steps on the network inputs and targets [35, 37]. The sigmoid transfer functions that are generally used in the hidden layers become essentially saturated when the net input is greater than three. If this happens at the beginning of the training process, the gradients will be very small, and the network training will be very slow. It is standard practice to normalize the inputs before applying them to the network. Generally, the normalization step is applied to both the input vectors and the target vectors in the data set. The input processing functions used here are removeconstantrows (removes the rows of the input vector that correspond to input elements that always have the same value, because these input elements are not providing any useful information to the network), and mapminmax (normalize inputs/targets to fall in the range [−1,1]). For outputs, the same processing functions (removeconstantrows and mapminmax) are used. Output processing functions are used to transform user-provided target vectors for network use. Then, network outputs are reverse-processed using the same functions to produce output data with the same characteristics as the original user-provided targets.
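These are the Toolbox defaults for patternnet; set explicitly, the configuration would look as follows:

net.inputs{1}.processFcns  = {'removeconstantrows', 'mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows', 'mapminmax'};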
Dividing the data
When training multilayer networks, the general practice is to first divide the data into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. This set is presented to the network during training, and the network is adjusted according to its error. The second subset is the validation set. It is used to measure network generalization, and to halt training when generalization stops improving. The error on the validation set is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. The network weights and biases are saved at the minimum of the validation set error. The test set has no effect on training and so provides an independent measure of network performance during and after training. Test set error is not used during training, but it is used to compare different models. It is also useful to plot the test set error during the training process. If the error on the test set reaches a minimum at a significantly different iteration number than the validation set error, this might indicate a poor division of the data set.
In the Neural Networks Toolbox for MATLAB [37], there are four functions provided for dividing data into training, validation, and test sets. They are dividerand (divide data randomly, the default), divideblock (divide into contiguous blocks), divideint (use interleaved selection), and divideind (divide by index). The data division is normally performed automatically when the network is trained. In this study, the dividerand function was used, with 70% of the data randomly assigned to training, 15% of the data randomly assigned to validation, and 15% of the data randomly assigned to the test set. This is the default partitioning in Neural Networks Toolbox. The appropriateness of this division is discussed in the “Discussion and limitations” below.
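Set explicitly, this default partitioning looks as follows:

net.divideFcn = 'dividerand';         % random division of the data
net.divideParam.trainRatio = 70/100;  % 70% for training
net.divideParam.valRatio   = 15/100;  % 15% for validation
net.divideParam.testRatio  = 15/100;  % 15% for testing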
Initializing weights (init)
Before training a feedforward network, one must initialize the weights and biases. The configure command automatically initializes the weights, but one might want to reinitialize them. This is done with the init command. This function takes a network object as input and returns a network object with all weights and biases initialized. Here is how a network is initialized (or reinitialized):
$$\texttt{net = init(net);}$$
(9)
Performance function
Once the network weights and biases are initialized, the network is ready for training. The training process requires a set of examples of proper network behavior: network inputs p and target outputs t. The process of training a neural network involves tuning the values of the weights and biases of the network to optimize network performance, as defined by the network performance function. The default performance function for feedforward networks is the mean square error (mse), the average squared error between the network outputs y and the target outputs t [37]. It is defined as follows:
$$F = mse = \frac{1}{N}\sum_{i = 1}^{N} \left( e_{i} \right)^{2} = \frac{1}{N}\sum_{i = 1}^{N} \left( t_{i} - y_{i} \right)^{2}$$
(10)
For a neural network classifier, training can use either the mean squared error or the cross-entropy error, with cross-entropy generally considered slightly better [41]. We tested both methods on a subset of the data and obtained slightly better results with cross-entropy, which is why network performance in this study was evaluated by means of the cross-entropy method:
$$\texttt{net.performFcn = 'crossentropy';}$$
(11)
The MATLAB performance function has the following format:
$$\texttt{perf = crossentropy(net, targets, outputs, perfWeights)}$$
(12)
It calculates a network performance given targets (t) and outputs (y), with optional performance weights and other parameters. The function returns a result that heavily penalizes outputs that are extremely inaccurate (y near 1−t), with very little penalty for fairly correct classifications (y near t). Minimizing cross-entropy leads to good classifiers. The cross-entropy for each pair of output-target elements is calculated as:
$$\texttt{ce = -t .* log(y)}$$
(13)
where .* denotes element-by-element multiplication. The aggregate cross-entropy performance is the mean of the individual values:
$$\texttt{perf = sum(ce(:)) / numel(ce)}$$
(14)
In the special case when the output consists of only one element y (our case, N = 1), the outputs and targets are interpreted as binary encodings. That is, there are two classes with targets of 0 and 1. The binary cross-entropy expression is:
$$\texttt{ce = -t .* log(y) - (1 - t) .* log(1 - y)}$$
(15)
where .* denotes element-by-element multiplication.
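As a small numeric illustration of Eqs. (14) and (15), with arbitrary hypothetical values:

t  = [1 0 1];                         % binary targets
y  = [0.9 0.2 0.6];                   % network outputs
ce = -t .* log(y) - (1 - t) .* log(1 - y);   % per-element cross-entropy, Eq. (15)
perf = sum(ce(:)) / numel(ce);        % aggregate performance, Eq. (14)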
Training the network
For training multilayer feedforward networks, any standard numerical optimization algorithm can be used to optimize the performance function, but there are a few key ones that have shown excellent performance for neural network training. These optimization methods use either the gradient of the network performance with respect to the network weights, or the Jacobian of the network errors with respect to the weights. The gradient and the Jacobian are calculated using a technique called the backpropagation algorithm, which involves performing computations backward through the network. The backpropagation computation is derived using the chain rule of calculus and is described in more detail in [33] and in [31]. As a note on terminology, the term “backpropagation” is sometimes used to refer specifically to the gradient descent algorithm, when applied to neural network training. That terminology is not used here, since the process of computing the gradient and Jacobian by performing calculations backward through the network is applied in all of the training functions offered by MATLAB’s Neural Networks Toolbox. It is clearer to use the name of the specific optimization algorithm that is being used (i.e. ‘trainscg’, ‘trainlm’, ‘trainbr’, etc.), rather than to use the term backpropagation alone.
Neural networks can be classified into static and dynamic categories. Static networks (which are essentially the FFNs) have no feedback elements and contain no delays; the output is calculated directly from the input through feedforward connections. In dynamic networks, the output depends not only on the current input to the network, but also on the current or previous inputs, outputs, or states of the network. These dynamic networks may be recurrent networks with feedback connections or feedforward networks with embedded tapped delay lines (or a hybrid of the two) [34]. For static networks, the backpropagation algorithm is usually used to compute the gradient of the error function with respect to the network weights, which is needed for gradient-based training algorithms [42].
The actual training was completed using the function from MATLAB’s Neural Networks Toolbox:
$$\texttt{[net, tr] = train(net, x, t)}$$
(16)
with x being the input matrix (600 column vectors of length 4, one per measurement), and t being the target vector of length 600 (the total number of observations in the calibration set).
Network performance was calculated using the perform function
$$\texttt{performance = perform(net, t, y)}$$
(17)
which takes the network object, the targets t and the outputs y and returns performance using the network’s performance function net.performFcn (crossentropy in our case). Note that training automatically stops when generalization stops improving, as indicated by an increase in the cross-entropy error of the validation samples.
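Taken together, training and evaluation reduce to a few lines (a sketch assuming x and t are as defined in the “Data” section; the training record tr holds, among other things, the indices of the held-out test samples):

[net, tr] = train(net, x, t);         % train with early stopping
y = net(x);                           % network outputs for all inputs
performance = perform(net, t, y);     % cross-entropy performance
testY = y(tr.testInd);                % outputs on the test set only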
Generalization
Neural networks are sensitive to the number of neurons in their hidden layers. Too few neurons can lead to underfitting; too many can contribute to overfitting, in which all training points are fitted well, but the fitting curve oscillates significantly between these points, as do the calculated coefficients. In ANN terms, the model does not generalize well. Testing with increasing complexity shows that as the number of connections in the network increases, so does the propensity to overfit the data, and the phenomenon becomes more pronounced as the network is made deeper (more complex).
In this study, the number of neurons in the hidden layer was chosen empirically. On the clinical data, fewer than four hidden neurons did not provide the accuracy achieved with 4–8 neurons in the hidden layer, most likely because of underfitting. At about 10 neurons and above, the accuracy started to decrease again, because of overfitting. The choice of four hidden neurons was made for two reasons: (a) to keep the network generalized (i.e. to avoid overfitting), and (b) to keep it simple and computationally fast. With respect to the number of hidden layers, no significant improvement was achieved with a two-hidden-layer structure, regardless of the number of neurons in each layer.
MathWorks suggests several ways to improve network generalization and avoid overfitting [37, 43]. One method is to use a network that is just large enough to provide an adequate fit: the larger the network, the more complex the functions it can create, whereas a sufficiently small network will not have enough power to overfit the data. One can check the Neural Network Design example nnd11gn in [33] to investigate how reducing the size of a network can prevent overfitting. Another approach is retraining. Typically, each backpropagation training session starts with different initial weights and biases and a different division of the data into training, validation, and test sets. These different conditions can lead to quite different solutions for the same problem. It is therefore a good idea to train several networks, in order to ensure that one with good generalization is found.
The default method for improving generalization is the so-called early stopping. This technique is automatically provided for all of the supervised network creation functions in the Neural Networks toolbox, including the backpropagation network creation functions such as feedforwardnet and patternnet. As explained before, in this technique the available data are divided into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. The second subset is the validation set. The error on the validation set is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. When the validation error increases for a specified number of iterations (net.trainParam.max_fail), the training is stopped, and the weights and biases at the minimum of the validation error are returned. The test set error is not used during training, but it is used to compare different models. It is also useful to plot the test set error during the training process. If the error in the test set reaches a minimum at a significantly different iteration number than the validation set error, this might indicate a poor division of the data set [43].
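The patience of the early-stopping mechanism is controlled by a single training parameter; 6 is the Toolbox default, shown explicitly here:

net.trainParam.max_fail = 6;          % stop after 6 consecutive validation-error increases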
There is yet another method for improving generalization, called regularization. It involves modifying the normally chosen performance function (mse, cross-entropy, or other). Using a modified performance function causes the network to have smaller weights and biases, forcing the network response to be smoother and less likely to overfit [43]. Regularization can be performed automatically with the Bayesian regularization training function trainbr, by setting net.trainFcn to ‘trainbr’. This will also automatically move any data in the validation set to the training set [37].
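Switching to Bayesian regularization requires only changing the training function (note that trainbr then moves any validation data into the training set, as mentioned above):

net.trainFcn = 'trainbr';             % Bayesian regularization backpropagation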