Subjects
In this study, all the collected 2-D breast ultrasound (BUS) images were from female patients and each contained a breast tumor. For each volunteer participant, only the case corresponding to the maximum cut surface of the breast tumor was used to generate the datasets. This study included 531 cases of Category 3, 443 cases of Category 4A, 376 cases of Category 4B, 565 cases of Category 4C, and 323 cases of Category 5. Human-subject ethical approval was obtained from the relevant committee at West China Hospital of Sichuan University before the ultrasound images were collected, and each subject provided written consent prior to the research. A Philips iU22 ultrasound scanner (Philips Medical Systems, Bothell, WA) with a 5- to 12-MHz linear probe was used to acquire the data.
Method overview
The CNN architecture is a widely used deep learning technique for analyzing medical images [18, 30]. Typically, a CNN is constructed from several convolution layers [31, 32], max-pooling layers [33], and fully connected layers [34]. Commonly used activation functions in CNNs include the rectified linear unit (ReLU), sigmoid, and tanh [35].
Based on the CNN architecture, the schematic illustration of our breast tumor categorization system is shown in Fig. 1. First, all input images were scaled to a uniform size of 288 × 288. Second, the ROI-CNN was designed to automatically identify the rough localization of the breast tumor. Since the ROIs predicted by the ROI-CNN may include non-tumor regions and lose important texture or boundary information, a subsequent refinement procedure, consisting of area filtering and the Chan–Vese (C–V) level-set method [36], was introduced so that the identified ROI is better tailored to the real boundary of the breast tumor. Finally, the G-CNN model was applied to the refined ROIs to grade them into one of five scores, corresponding to the five categories of breast tumors involved in the classification.
CNN-based localization and grading models
Inherent speckle noise and low image contrast in US images can introduce unnecessary distraction during feature extraction, making automatic classification of breast US images difficult. To extract effective features for classification, the tumor identification network (ROI-CNN) and the refinement procedure were first applied to the whole BUS image to determine the effective ROI, so that the subsequent tumor grading network (G-CNN) could focus on extracting discriminative features for classifying tumors.
The identification model—ROI-CNN
To effectively reduce the influence of other tissues, such as Cooper's ligaments, identifying the tumor in the corresponding whole BUS image is the first and most important procedure in implementing the automatic grading system. Our ROI identification network (ROI-CNN) was developed based on fully convolutional networks (FCN) [37].
Considering that tumor size varies among patients, the identification network needs to be robust and effective for tumors of different sizes. To improve the robustness of the ROI-CNN to variable target sizes, we introduced a multi-scale architecture based on the typical FCN-16s network. First, a typical VGG network (refer to the blue dashed box in Fig. 2) was used to extract features. After four downsampling operations, the feature maps output by the VGG are 18 × 18; to extract higher-level features, the size of the feature maps would normally need to be compressed further.
However, feature maps that are too small cannot capture the detailed boundary information of the breast tumor. In this study, an atrous convolution layer (refer to the yellow block in Fig. 2) was therefore incorporated into the ROI-CNN; it effectively enlarges the receptive field of the filters and captures a larger context without increasing the number of parameters or the computational cost [38,39,40]. In the atrous convolution layer, the kernel size was set to 3 × 3 and the dilation rate was set to 2. In addition, feature maps from different depths were concatenated (refer to Fig. 2) so that features with two different receptive fields could be merged. Following the atrous convolution layer, an additional convolution layer was used as a transitional layer between the atrous convolution layer and the top convolution layer to provide a balanced number of features from the deep and shallow layers for the concatenation operation. In this transitional convolution layer, the kernel size was set to 1 × 1 and the number of filters was set to 512. In the output of the ROI-CNN, the predicted identification probability in the breast tumor region should be higher than in non-tumor regions.
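As a minimal, hypothetical Keras sketch of this multi-scale head (the simplified VGG-style backbone, the single input channel, and the bilinear upsampling back to the input resolution are our assumptions, not the paper's FCN-16s specification):

```python
# Hypothetical sketch of the ROI-CNN's atrous multi-scale head.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input((288, 288, 1))
x = inputs
# Simplified VGG-style feature extractor: four conv/pool stages, 288 -> 18.
for filters in (64, 128, 256, 512):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2, strides=2)(x)
vgg_feat = x                                            # 18 x 18 x 512

# Atrous convolution: 3x3 kernel with dilation rate 2 enlarges the receptive
# field without adding parameters relative to an ordinary 3x3 convolution.
atrous = layers.Conv2D(512, 3, dilation_rate=2, padding="same",
                       activation="relu")(vgg_feat)

# 1x1 transitional convolution with 512 filters balances the channel count
# before the features from the two receptive fields are concatenated.
transition = layers.Conv2D(512, 1, activation="relu")(atrous)
merged = layers.Concatenate()([vgg_feat, transition])

# Per-pixel tumor probability map, upsampled to the input resolution.
score = layers.Conv2D(1, 1, activation="sigmoid")(merged)
roi_map = layers.UpSampling2D(16, interpolation="bilinear")(score)
roi_cnn = Model(inputs, roi_map)
```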
The grading model—G-CNN
An effective classifier can enhance the ability to distinguish tumor features of different categories, thereby promoting accurate classification. Clinically, apart from the texture inside the breast tumor, the boundary information is also significant for classifying breast tumors into different grades [10, 14]. Therefore, the grading model needs to take both texture and boundary features into consideration to enhance the expression of the grading features.
Usually, texture and boundary information are well represented in the low-level convolution layers, while the essential semantic features are extracted by deeper convolution layers. In our proposed G-CNN, feature maps from different depths were therefore concatenated to make full use of both low-level and high-level information. Referring to Fig. 3, the G-CNN model consisted of nine blocks. The first four blocks (Block 1 to Block 4) formed the encode path; each block in the encode path shared the same structure, containing two convolution layers and one max-pooling layer. In the encode path, the number of feature channels of the convolution layers was doubled after each max-pooling layer. The encode path was followed by Block 5, which contained three convolution layers. The concatenate path then followed, with four blocks of the same structure (Block 6 to Block 9); each block in the concatenate path consisted of a concatenation layer and a convolution layer. In the G-CNN model, the feature maps from a lower layer were concatenated with the feature maps from a deeper layer, so that each block in the encode path provided low-level features for the corresponding block in the concatenate path; for example, Block 3 and Block 7 were concatenated together. Note that, to ensure that the two inputs to each concatenation layer had consistent sizes, the first three blocks in the encode path were each followed by a convolution layer and a max-pooling layer.
In total, the G-CNN network contained 18 convolution layers. The batch normalization strategy [41] was applied to the top convolution layer in each block and to the first two FC layers to regularize the model, and L2 regularization was used to reduce overfitting and enable better generalization at test time. The kernel size of all convolution layers was 3 × 3, and each layer was followed by a ReLU [34]. All max-pooling layers used a 2 × 2 window with a stride of 2. At the end of the G-CNN were three fully connected (FC) layers with 4096, 1024, and 5 neurons, respectively. A softmax layer followed the topmost FC layer with five neurons to produce the grading output.
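A hypothetical Keras sketch of this layout is given below; the channel counts, single input channel, L2 weight, and the pooling factors used to match the skip and deep feature sizes are our assumptions rather than the paper's values:

```python
# Hypothetical sketch of the G-CNN: encode path, Block 5, and concatenate path.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

l2 = regularizers.l2(1e-4)  # assumed L2 strength

def conv(x, filters, bn=False):
    """3x3 convolution + ReLU; the top convolution of a block is batch-normalized."""
    x = layers.Conv2D(filters, 3, padding="same", kernel_regularizer=l2)(x)
    if bn:
        x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def encode_block(x, filters):
    """Blocks 1-4: two convolutions followed by 2x2 max pooling."""
    x = conv(x, filters)
    x = conv(x, filters, bn=True)
    return layers.MaxPooling2D(2, strides=2)(x)

def concat_block(deep, skip, pool, filters):
    """Blocks 6-9: concatenate low-level features with deep features, then convolve."""
    if pool > 1:  # skips from Blocks 1-3 get an extra conv + pooling to match sizes
        skip = conv(skip, filters)
        skip = layers.MaxPooling2D(pool, strides=pool)(skip)
    x = layers.Concatenate()([deep, skip])
    return conv(x, filters, bn=True)

inputs = layers.Input((288, 288, 1))
e1 = encode_block(inputs, 64)    # Block 1 -> 144 x 144
e2 = encode_block(e1, 128)       # Block 2 -> 72 x 72
e3 = encode_block(e2, 256)       # Block 3 -> 36 x 36
e4 = encode_block(e3, 512)       # Block 4 -> 18 x 18

x = e4
for _ in range(3):               # Block 5: three convolutions
    x = conv(x, 512)

x = concat_block(x, e4, pool=1, filters=512)   # Block 6 (pairs with Block 4)
x = concat_block(x, e3, pool=2, filters=256)   # Block 7 (pairs with Block 3)
x = concat_block(x, e2, pool=4, filters=128)   # Block 8 (pairs with Block 2)
x = concat_block(x, e1, pool=8, filters=64)    # Block 9 (pairs with Block 1)

x = layers.Flatten()(x)
x = layers.BatchNormalization()(layers.Dense(4096, activation="relu", kernel_regularizer=l2)(x))
x = layers.BatchNormalization()(layers.Dense(1024, activation="relu", kernel_regularizer=l2)(x))
outputs = layers.Dense(5, activation="softmax")(x)   # five-category grading output
g_cnn = Model(inputs, outputs)
```

With this assumed layout, the encode path (8 convolutions), Block 5 (3), the three size-matching convolutions, and the four post-concatenation convolutions add up to the 18 convolution layers stated above.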
The refinement
Affected by speckle noise and other tissues in the BUS image, the prediction of the ROI-CNN may include non-tumor regions in addition to the tumor region. Moreover, the contour of the predicted region may deviate from the real tumor contour. Therefore, additional refinement is needed to ensure the effectiveness of the predicted ROI.
To ensure that only the lesion was exported to the subsequent grading system and to improve the accuracy of the final categorization, the rough ROI from the ROI-CNN, which enclosed the breast tumor region, was further refined by the following steps: (1) remove connected components whose area is smaller than 40% of the largest component, and keep the connected region closest to the image center; and (2) refine the boundary with the classical C–V level-set method [36], which minimizes the following energy functional:
$$E(C) = \mu_{1} \int\limits_{inside(C)} {\left| {I(x,y) - c_{1} } \right|}^{2} dx\,dy + \mu_{2} \int\limits_{outside(C)} {\left| {I(x,y) - c_{2} } \right|}^{2} dx\,dy + \alpha \kappa$$
(1)
where \(I\) is the image, \(C\) refers to the boundary of the segmented region, \(c_{1}\) and \(c_{2}\) are the respective averages of \(I\) inside and outside \(C\), \(\kappa\) is the curvature of \(C\), and \(\mu_{1}\), \(\mu_{2}\), and \(\alpha\) are weighting parameters.
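As a hedged illustration of the two refinement steps, the sketch below uses scikit-image's Chan–Vese implementation as a stand-in for the C–V level-set method of [36]; the level-set initialization from the filtered mask is our assumption, and the iteration cap (50 in this study) would be passed through the library's maximum-iteration argument, whose name depends on the scikit-image version.

```python
# Hypothetical refinement of the ROI-CNN output: area filtering followed by
# Chan-Vese boundary refinement (scikit-image used here for illustration only).
import numpy as np
from skimage import measure
from skimage.segmentation import chan_vese

def refine_roi(image, roi_mask, area_ratio=0.4):
    """image: 2-D grayscale BUS image; roi_mask: binary prediction of the ROI-CNN."""
    # Step 1: drop connected components smaller than 40% of the largest one
    # and keep the component whose centroid is closest to the image center.
    labels = measure.label(roi_mask)
    regions = measure.regionprops(labels)
    if not regions:
        return roi_mask
    max_area = max(r.area for r in regions)
    center = np.array(image.shape) / 2.0
    candidates = [r for r in regions if r.area >= area_ratio * max_area]
    best = min(candidates, key=lambda r: np.linalg.norm(np.array(r.centroid) - center))
    mask = labels == best.label

    # Step 2: refine the boundary with the Chan-Vese level-set method,
    # initialized from the filtered mask, with mu1 = mu2 = 1 as in Eq. (1).
    init = np.where(mask, 1.0, -1.0)
    refined = chan_vese(image.astype(float), lambda1=1.0, lambda2=1.0,
                        init_level_set=init)
    return refined
```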
Implementation details
a. Loss function
In the ROI-CNN, the Dice loss function can be expressed as follows,
$$L_{ROI} = 1 - \frac{{2\left| {A_{PRED} \cap A_{GT} } \right|}}{{\left| {A_{PRED} } \right| + \left| {A_{GT} } \right|}}$$
(2)
where PRED denotes the predicted ROI and GT the ground truth; \(A_{PRED}\) and \(A_{GT}\) refer to the predicted tumor area and the ground-truth tumor area, respectively.
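A minimal TensorFlow rendering of Eq. (2) might look as follows; the small smoothing constant is our addition for numerical stability and is not part of the paper's formulation.

```python
# Assumed TensorFlow version of the Dice loss in Eq. (2).
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    """y_true: ground-truth tumor mask; y_pred: predicted probability map."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
```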
In the G-CNN, multi-class cross entropy [42] was employed as the loss function,
$$J(\theta ) = - \frac{1}{m}\left[ {\sum\limits_{i = 1}^{m} {y^{(i)} \log f_{\theta } (I^{(i)} ) + (1 - y^{(i)} )\log (1 - f_{\theta } (I^{(i)} ))} } \right]$$
(3)
where \(m\) denotes the number of classes and \(y\) is the class label of each input, both ranging from one to five. \(\theta\) represents the parameters of the G-CNN, and \(f_{\theta}\) corresponds to the mapping from the input image \(I\) to the predicted output \(f_{\theta}(I)\).
In the G-CNN, each input generates an output vector of size 1 × m, and the category with the highest probability was taken as the predicted result.
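For illustration, the loss and prediction step could be implemented as below, assuming the five categories are one-hot encoded (the encoding scheme is an assumption):

```python
# Illustrative use of multi-class cross entropy and the argmax prediction.
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()

# Dummy batch: one-hot labels and softmax outputs of the G-CNN for two images.
y_true = tf.one_hot([0, 3], depth=5)
y_pred = tf.constant([[0.70, 0.10, 0.10, 0.05, 0.05],
                      [0.10, 0.10, 0.10, 0.60, 0.10]])
loss = loss_fn(y_true, y_pred)

# The predicted category is the one with the highest probability
# (indices 0-4 mapped to the five tumor grades).
predicted_class = tf.argmax(y_pred, axis=-1)
```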
b. Training process
Our proposed framework was implemented in TensorFlow, and all experiments were conducted on a workstation equipped with a 2.40-GHz Intel Xeon E5-2630 CPU and an NVIDIA GF100GL Quadro 4000 GPU.
During the training phase of the ROI-CNN, the layers in the blue dashed box (refer to Fig. 2) were initialized with a VGG model [43] pre-trained on the image classification dataset of the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC-2012 CLS). The other layers of the ROI-CNN were initialized with a Gaussian randomizer. The minibatch size was 16 images, and the SGD optimizer [44, 45] was used with a learning rate of 0.0001 and a momentum of 0.9 until convergence was attained.
In the training phase of the G-CNN, random initialization was employed to yield better performance and faster convergence. The minibatch size was 16 images, and the SGD optimizer was used with a learning rate of 0.001, which was gradually decayed by a factor of 0.9 until convergence was attained.
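A possible TensorFlow setup reflecting the reported hyper-parameters is sketched below; the decay interval (decay_steps) for the G-CNN learning-rate schedule is not stated in the paper and is assumed here.

```python
# Assumed optimizer setup matching the reported hyper-parameters.
import tensorflow as tf

roi_optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)

g_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
g_optimizer = tf.keras.optimizers.SGD(learning_rate=g_schedule, momentum=0.9)

# roi_cnn.compile(optimizer=roi_optimizer, loss=dice_loss)
# g_cnn.compile(optimizer=g_optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# Both models are then trained with a minibatch size of 16 until convergence.
```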
Performance evaluation
To validate the effectiveness of the grading scheme for breast tumors in US images, the localization and grading results were evaluated against the corresponding manual annotations and labels from three physicians. The experiments assessed our grading system from two aspects: the effect of different options in the tumor identification stage on the final grading results, and the discriminative capability for the different breast tumor categories.
The accuracy of the identified tumor
Three metrics were used to quantitatively evaluate the similarity between the predicted contour and the ground-truth contour: the Dice similarity coefficient (DSC) [46, 47], the Hausdorff distance between two boundaries (HDist) [47, 48], and the average distance between two boundaries (AvgDist) [47]. DSC was employed to examine the overlap between the two regions, while HDist and AvgDist measured the Euclidean distance between the computer-identified tumor boundary and the boundary determined by the physicians. A higher DSC, lower HDist, and lower AvgDist correspond to greater similarity between the two boundaries. Furthermore, AUC values and ROC curves were used to evaluate the performance of different experiments with a variable scope of ROIs.
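Assumed reference implementations of the three contour metrics are sketched below (the evaluation code actually used in the study is not shown in the paper):

```python
# Assumed reference implementations of DSC, HDist, and AvgDist.
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def dsc(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

def hdist(boundary_a, boundary_b):
    """Symmetric Hausdorff distance between two boundary point sets (N x 2)."""
    return max(directed_hausdorff(boundary_a, boundary_b)[0],
               directed_hausdorff(boundary_b, boundary_a)[0])

def avg_dist(boundary_a, boundary_b):
    """Mean of the two directed average boundary-to-boundary distances."""
    d = cdist(boundary_a, boundary_b)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```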
Experiment configurations
Image data involved in each stage of the categorization system
The input size of our two-stage grading system was set to 288 × 288, and fivefold cross-validation was employed to construct the training and testing datasets.
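As a sketch, the fivefold split could be built with scikit-learn, assuming NumPy arrays `images` and `labels`; whether the folds were stratified by category is not stated in the paper and is assumed here.

```python
# Sketch of the fivefold cross-validation split (scikit-learn, for illustration).
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(images, labels)):
    train_images, train_labels = images[train_idx], labels[train_idx]
    test_images, test_labels = images[test_idx], labels[test_idx]
    # ... train the ROI-CNN / G-CNN on the training fold and evaluate on the test fold
```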
Image annotation
Each image was scored by three physicians, each with more than 3 years of experience performing BUS examinations, based on the BI-RADS criteria. If the physicians differed in their annotations of the category, they discussed the case and reached a consensus on the final category of the breast tumor.
Data preprocessing and augmentation
Because the sample size of volunteer patients is limited, effective data preprocessing and augmentation are imperative for medical image datasets. The premise of the augmentation is that the ROI must remain within all augmented data regardless of the type of transformation applied to the dataset.
1. Data augmentation in the ROI-CNN model
In the ROI-CNN training stage, the number of augmentations of each input was set equal to the number of training epochs. This type of augmentation enhances the randomization of the input data and reduces the possibility of overfitting, thus improving the robustness of the trained ROI-CNN model. In each epoch, each input image underwent the following procedures: random brightness, random contrast, random translation, random flipping, and standardization. Each input therefore produced N augmented outputs over N epochs. In the testing phase, only standardization was applied to the input samples.
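A hedged tf.image-based version of this per-epoch augmentation is sketched below; the random translation is implemented by padding and random cropping, and the brightness, contrast, and shift ranges are our guesses.

```python
# Assumed tf.image version of the per-epoch ROI-CNN augmentation.
import tensorflow as tf

def augment_roi_cnn(image, mask):
    """image, mask: tensors of shape (288, 288, 1)."""
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    # Random translation as pad-then-random-crop, applied jointly to the image
    # and its mask so the tumor annotation stays aligned.
    combined = tf.concat([image, mask], axis=-1)
    combined = tf.image.resize_with_crop_or_pad(combined, 288 + 20, 288 + 20)
    combined = tf.image.random_crop(combined, [288, 288, 2])
    combined = tf.image.random_flip_left_right(combined)
    image, mask = combined[..., :1], combined[..., 1:]
    image = tf.image.per_image_standardization(image)
    return image, mask

def preprocess_test(image):
    """Only standardization is applied in the testing phase."""
    return tf.image.per_image_standardization(image)
```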
2. Data augmentation in the G-CNN model
In the G-CNN training stage, only geometric translation and flipping were used, in order to preserve the shape and texture of the breast tumors for the final classification. The original datasets were augmented fourfold with random translation, and two of the four augmented copies were additionally flipped horizontally.
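A possible offline implementation of this fourfold augmentation is shown below; the shift range is an assumption, as the paper does not state it.

```python
# Assumed offline fourfold augmentation for the G-CNN: random translations,
# with two of the four copies additionally flipped horizontally.
import numpy as np

def augment_g_cnn(roi, max_shift=10, seed=None):
    """roi: 2-D ROI image; returns four augmented copies (shift range assumed)."""
    rng = np.random.default_rng(seed)
    augmented = []
    for k in range(4):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(np.roll(roi, dy, axis=0), dx, axis=1)
        if k >= 2:  # two of the four copies are flipped horizontally
            shifted = np.fliplr(shifted)
        augmented.append(shifted)
    return augmented
```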
Effect of identification accuracy on the final grading
The coverage of the localization, i.e., the area of the ROI, theoretically affects the feature mapping and may influence the final grading. To investigate the effect of the accuracy of the identified breast tumor on the final categorization of the BUS images, three types of input to the G-CNN were compared in corresponding experiments, denoted "No ROI-CNN", "No Refined ROI-CNN", and "Refined ROI-CNN". In the "No ROI-CNN" experiment, the C–V level-set method was applied directly to the input US images, without the rough localization predicted by the ROI-CNN. In the "No Refined ROI-CNN" experiment, the output of the ROI-CNN was not refined and was exported directly to the G-CNN. In the "Refined ROI-CNN" experiment, the original US images underwent the complete processing procedure of our designed scheme.
The parameters of our designed method, experiment "Refined ROI-CNN", were set as follows: (1) μ1, μ2, and α in Eq. (1) were all set to 1; (2) the maximum number of contour evolution iterations was set to 50. The parameters μ1, μ2, and α in the "No ROI-CNN" experiment (pure C–V level sets) were the same as those in the "Refined ROI-CNN" experiment, but the maximum number of contour evolution iterations was set to 1000.
One-stage vs. two-stage categorization of BUS images
Making full use of effective features is likely to achieve better categorization of breast tumors. To investigate the advantage of our two-stage system for grading BUS images, the accuracy of the predicted tumor categorization was employed as the indicator, and the categorization of the two-stage grading system was compared with that of a one-stage grading architecture, which directly classified input breast US images into six classes: the background and the five breast tumor categories.
There were two types of two-stage methods: one with the refinement procedure and one without. For each type, we compared our G-CNN model with two other typical classification networks, the VGG network [43] and the ResNet50 network [49], giving six experiments in total for the two-stage methods. For the one-stage classification methods, three experiments were included: (1) "One-stage G-CNN", which directly classified the input into five categories with our proposed G-CNN architecture (refer to Fig. 3); (2) "One-stage VGG", which directly classified the input BUS image into five categories with the VGG architecture; and (3) "One-stage ResNet", which directly classified the input BUS image into five categories with the ResNet50 architecture.