As stated above, the aim of this work is to assess the robustness of our system, focusing here our attention in the classifying of regions of interest. Our system consist in a two‐class pattern recognition problem (mass or normal tissue) based on feature extraction and classification of regions of interest (previously extracted from the DDSM database, in our experiments). As explained above, a conversion of the image to optical density is made first, as a preprocessing stage. Later, the regions of interest are resized to the appropriate size for the feature extractor, which is based in blind feature extraction using independent component analysis. Afterwards, the classification stage is carried out using neural network classifiers or support vector machine classifiers.
In this section, we shall provide a brief description of the mammogram database utilized, of the procedure implemented to build a set of mass and normal tissue prototypes (regions of interest, ROIs), and of the main characteristics of the selected image feature extractor. Finally, we shall briefly describe the classifiers used.
Data and prototype creation
The DDSM [2] is a resource made available to the mammographic image analysis research community. It contains a total of 2620 cases, each of which provides four screening views – mediolateral oblique (MLO) and craniocaudal (CC) projections of the left and right breasts. The database therefore has a total of 10 480 images. The cases are categorized into four major groups: normal, cancer, benign, and benign without callback. The experienced radiologists that participated in the elaboration of the DDSM provided BIRADS parameters (density, assessment and subtlety), the BIRADS abnormality description, and the proven pathology, for all the cases in the database. For each abnormality identified, the radiologists drew free‐form digital curves defining ground truth regions. We use these regions to define square regions‐of‐interest (ROIs) for use as mass prototypes. Each DDSM case includes additional information – patient age, date of study, and the make or brand name of the digitizer.
The DDSM database contains 2582 mass prototypes including both benign and malignant masses. Some are located on the border of the mammogram, and could not be used (see the following paragraph, dedicated to ROIs). Consequently, only 2324 prototypes were considered, namely, those which could be taken centred in a square without stretching. Some mass prototype examples are shown in Figure 1.
Regions of Interest
Ground truth regions are defined in the database by a chain code which generates a free‐hand closed curve. We use the chain code to determine the smallest square region of the mammogram that includes the manually defined region. Therefore, if the mass is located near one edge of the mammogram, this procedure may not be able to obtain a square region from the image, and the mass is then discarded from further consideration as a valid prototype. Figure 2 shows the ground truth regions coded by radiologists (solid red curve) in three different examples, and the area to be used as ROI (purple square box), in each case.
The DDSM mammograms were digitized with four different scanners of known optical density calibration and spatial resolution [2]. Three of the scanners provide a linear optical response, and the fourth a logarithmic response. To eliminate the dependence of the digitized mammograms on the origin, all the ROIs obtained were converted to optical densities using the referenced calibration parameters.
The regions generated are of different sizes. Therefore, since the chosen image feature extractor needs to operate on regions of the same size, the selected regions had to be reduced to a common size. Such reduction of the ROIs to a common size has been demonstrated to preserve the mass malignancy information [8]‐[10]. To determine the optimum region size, we re‐sized each ROI to two sizes: 32×32 and 64×64 pixels. We also tried other sizes such as 128 × 128 pixels, but the performance obtained with this size was not better than that obtained with the two smaller sizes, whereas the computation time was much greater. The re‐sizing was carried out using the bilinear interpolation algorithm provided by the OpenCV library [11].
Independent Component Analysis
Independent Component Analysis could be considered to be the next step beyond Principal Component Analysis (PCA) [12]. Its original motivation was to solve problems known as blind source separation (BSS). In particular, suppose that one has n signals. The objective is to expand the signals registered by the sensors (x_{
i
}) as a linear combination of n sources (s_{
j
}), in principle unknown [13].
{\mathbf{x}}_{i}=\sum _{j=1}^{n}{a}_{\mathit{\text{ij}}}{\mathbf{s}}_{j}
(1)
The goal of ICA is to estimate the mixture matrix A = (a_{
i
j
}), together with the sources s_{
j
}. The ICA model assumes that the observed signals are a linear transformation of some hidden sources: x = A · s. In general, the mixture matrix A is invertible, so that one has:
\mathbf{x}=\mathbf{A}\xb7\mathbf{s}\Rightarrow \mathbf{s}=\mathbf{W}\xb7\mathbf{x}\phantom{\rule{1em}{0ex}}\text{with}\phantom{\rule{1em}{0ex}}\mathbf{W}={\mathbf{A}}^{1}
(2)
It is important to remark that:

The key assumption of ICA estimation is that the hidden sources (s) are non‐Gaussian and statistically independent.

One cannot determine the variances (energies) of the independent components. Therefore, the magnitudes of the s_{
i
} can be freely normalized. And neither can one determine the order of the independent components.
This technique can be used for feature extraction since the components of x can be regarded as the characteristics representing the objects (patterns) [13]. We use the FastICA algorithm [14], proceeding as follows:

We start with N samples (N patches (vectors) of dimension p) forming the N × p patches matrix.

First, we centre the data by subtracting their means, i.e., from each element is subtracted the mean of its column.

To reduce the size of the input space we apply PCA, ordering the array of eigenvectors by its eigenvalues from highest to lowest and discarding those with lower eigenvalues, which will be those making a smaller contribution to the variance. Taking q (q < p) first components, we obtain the matrix K_{
PCA
} of dimension (q × p).

Now, taking as input this matrix and applying the ICA algorithm, we obtain the ICA transformation matrix, W, of dimension (q × q).

Finally, considering a new matrix (W_{
T
}), a product of the previous two ({\mathbf{W}}_{\mathbf{T}}={\mathbf{K}}_{\mathbf{\text{PCA}}}^{\mathbf{T}}\mathbf{\xb7}\mathbf{W} (p × q)), in which each row is a vector of the new base, we can extract q characteristics of each original input simply by multiplying the matrix W_{
T
} by each of these inputs:
\mathbf{s}=\mathbf{i}\mathbf{\xb7}{\mathbf{W}}_{\mathbf{T}}
(3)
As is also the case with many other transformations (wavelets, Gabor filters,... [15]), following this process we can express the image (or image patch) as a linear superposition of some basis functions (basis images in our case) a_{
i
}(x, y):
I(x,y)=\sum _{j=1}^{q}{a}_{j}(x,y){\mathbf{s}}_{j}
(4)
where the s_{
j
} are image‐dependent coefficients. This expression is similar to the ICA model, and the idea is illustrated with specific images in Figure 3. In particular, the figure shows how an image can be decomposed using the inverse of an ICA base (image basis) and the eigenimage coefficients obtained by applying that base. In this way, by estimating an image basis using ICA, one can obtain a base adapted to the data at hand.
Classification algorithm
In our system, the classification algorithm has the task of learning from data. An excessively complex model will usually lead to poorly generalizable results. It is advisable to use at least two independent sets of patterns in the learning process: one for training and another for testing. In the present work, we use three independent sets of patterns: one for training, one to avoid overtraining (validation set), and another for testing [16]. For the classification, we use NN and SVM classifiers [17].
Neural Networks
We implement the classical feed‐forward multilayer perceptron (BP) with a single hidden layer, and a variant of the Back‐Propagation algorithm termed Resilient Back‐Propagation (Rprop) [18] to adjust the weights. This last is a local adaptive learning scheme performing supervised batch‐learning in a multilayer perceptron which converges faster than the standard BP algorithm. The basic principle of Rprop is to eliminate the negative effect of the size of the partial derivative on the update process. As a consequence, only the sign of the derivative is considered in indicating the direction of the weight update [18]. The function library of the Stuttgart Neural Network Simulator environment [19] is used to generate and train the NN classifiers. To avoid local minima during the training process, each setting was repeated four times, changing the initial weights in the net at random. Furthermore, the number of neurons in the hidden layer was allowed to vary between 50 and 650 in steps of 50.
Support Vector Machines
The goal of using an SVM is to find a model (based on the training prototypes) which is able to predict the class membership of the test subset’s prototypes based on the value of their characteristics. Given a labeled training set of the form (x_{
i
}, y_{
i
}), i = 1, …, l where {\mathbf{x}}_{i}\in {\mathfrak{R}}^{n} and y ∈ {1, 1}^{l}, the SVM algorithm involves solving the following optimization problem:
\begin{array}{lc}\underset{\mathbf{w}\in {\mathfrak{R}}^{d},b,{\xi}_{i}\in {\mathfrak{R}}^{+}}{\text{min}}\hfill & {\u2225\mathbf{w}\u2225}^{2}+C\sum _{i=1}^{l}{\xi}_{i}\hfill \\ \hfill \text{subject to}\hfill & {y}_{i}\left({\mathbf{w}}^{T}\varphi \left({\mathbf{x}}_{i}\right)+b\right)\ge 1{\xi}_{i},\hfill \\ {\xi}_{i}\ge 0\hfill \end{array}
(5)
In this algorithm, the training vectors x_{
i
} are projected onto a higher‐dimensional space than the original. The final dimension of this space depends on the complexity of the input space. Then the SVM finds a linear separation in terms of a hyperplane with a maximal (and hence optimal) margin of separation between classes in this higher dimensional space. In the model, C (C > 0) is a regularization or penalty parameter to control the error, d is the final dimension of the projection space, W is the normal to the hyperplane (also known as the weights vector), and b is the bias. The parameter ξ is introduced to allows the algorithm a degree of flexibility in fitting the data, and K(x_{
i
}, x_{
j
}) ≡ ϕ(x_{
i
})^{T}ϕ(x_{
j
}) is a kernel function to project the input data onto to a higher dimensional space. We used the LibSVM [20] library with a radial basis function (RBF: K(x_{
i
}, x_{
j
}) = exp(γ ∥x_{
i
}  x_{
j
}∥^{2}), γ > 0) as kernel function. To find the optimal configuration of the parameters in the algorithm, γ, was allowed to vary between 5 and 20 in steps of 0.5, and the penalty parameter C between ‐5 and 10 also in steps of 0.5.
Outline of the process
In this section, we provide an overview of the structure of our system, describing the main steps required to configure the system to discriminate prototypes of masses from prototypes of normal tissue.
System description
The aim of this study is to evaluate the performance that our system can provide for detection of masses based on blind feature extraction using ICA and, using as classifiers, neural networks and SVM.
The main scheme that summarizes in a more graphical form all phases of this work is represented in Figure 4. In the first stage, the prototypes of masses are obtained as was explained in Section “Regions of Interest”, and those of normal tissue were selected at random from the normal mammograms. These normal tissue prototypes were initially captured with sizes ranging randomly from the smallest to the largest sizes found in the DDSM for masses. Then the FastICA algorithm [14, 21] is applied as described in Section “Independent Component Analysis” to obtain the ICA base (the ICA‐based feature extractor), with the log cosh function being used to approximate the neg‐entropy. These basis are generated with different configurations, different numbers of components, and using prototypes of different sizes. The second stage uses this generated basis to obtain the training sets and to train and test the classifiers. Finally, in the third stage, the test subset, which contains input vectors not used in the optimization of the classifiers, is used to provide performance results of our system.
System optimization
To determine the optimal configuration of the system, various ICA bases were generated to extract different numbers of features (from 10 to 65 in steps of 5) from the original patches, and operating on patches of the different sizes noted above (32 × 32 and 64 × 64 pixels).
The training process was performed two times – first training the NN classifiers, and then the SVM classifiers. The results thus obtained on the test subsets in a 10‐fold cross validation scheme are shown in Figure 5. This allowed us to find the optimal configuration of the feature extractor.
The study was done with a total of 5052 prototypes: 1197 of malignant masses, 1133 of benign masses, and 2722 of normal tissue.
We found that the optimal ICA‐based feature extractor configuration for an NN classifier was a feature extractor that operated on prototypes of 64 × 64 pixels, extracting 10 components (average success rate 86.33%), and for an SVM classifier was a feature extractor that also operated on prototypes of 64 × 64 pixels, extracting 15 components (average success rate 88.41%). The results to be presented in the following section were obtained using these optimal configurations.