Ultrasound data acquisition
In ultrasound imaging, the degree of lung involvement is reflected in several typical sonograms. The A-line is a horizontal reverberation artifact of the pleura caused by multiple reflections and represents a normal lung surface [32]. The B-line arises from the interlobular septum and appears as a discrete, laser-like vertical hyperechoic artifact that extends to the bottom of the screen; we denote it the B1-line [33]. The fusion B-line is a sign of pulmonary interstitial syndrome and appears as a large area of the intercostal space filled with B-lines; we denote it the B2-line [26]. Pulmonary consolidation is characterized by a liver-like echo texture of the lung parenchyma with a thickness of at least 15 mm [27], as shown in Fig. 7.
We used three datasets from four medical centers to build and evaluate the model: ultrasound images collected with the Stork ultrasound system (Stork Healthcare Co., Ltd., Chengdu, China) at Ruijin Hospital, with the Mindray ultrasound system (Mindray Medical International Limited, Shenzhen, China) at Shanghai Public Health Center, and with Philips ultrasound systems (Philips Medical Systems, Best, the Netherlands) at Wuhan Sixth People's Hospital and Hangzhou Infectious Disease Hospital. The Stork dataset was collected with an H35C (2–5 MHz) convex array transducer, the Mindray dataset with an SC5-1 (1–5 MHz) convex array transducer, and the Philips dataset with Epiq 5 and Epiq 7 systems using a C5-1 (1–5 MHz) convex array transducer.
Multimodal generation and fusion
According to doctors’ experience in recognizing sonograms, the parallel echo rays of the A-line, the beam-like echo rays of the B-line, and the accumulation of exudate in lung consolidation are used as markers for classification. The gradient field is highly sensitive to the parallel echo rays of the A-line, and K-means clustering can better highlight the beam-like echo rays of the B-line [28]. As shown in Fig. 8a, we therefore generated the gradient field and K-means clustering images as two new modalities for extracting shallow features.
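As a concrete illustration, the two auxiliary modalities could be generated from a grayscale frame as in the following minimal sketch using OpenCV; the Sobel-based gradient magnitude, the number of clusters k = 4, and the file name are illustrative assumptions rather than the exact operators used in the paper.

```python
import cv2
import numpy as np

def gradient_field(image: np.ndarray) -> np.ndarray:
    """Gradient-magnitude image; emphasizes the parallel echo rays of the A-line."""
    gx = cv2.Sobel(image, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(image, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = cv2.magnitude(gx, gy)
    return cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def kmeans_modality(image: np.ndarray, k: int = 4) -> np.ndarray:
    """K-means clustering of pixel intensities; highlights beam-like B-line echoes."""
    pixels = image.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    return centers[labels.flatten()].reshape(image.shape).astype(np.uint8)

# Example: build the three-modality stack for one frame (file name is hypothetical).
frame = cv2.imread("lus_frame.png", cv2.IMREAD_GRAYSCALE)
stack = np.stack([frame, gradient_field(frame), kmeans_modality(frame)], axis=-1)
```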
There are many methods to fuse multimodal inputs. Concatenation-based fusion is the most intuitive [34], but it is better suited to situations where every modality is equally important for classification. Extracting features per modality and then concatenating them is also a popular fusion method [35], but the number of parameters and the GPU memory it requires limit its application. In this paper, we proposed a new fusion network, shown in Fig. 8b, which uses a minimal number of network parameters to distribute the modality weights automatically, thus emphasizing the embedding of the two auxiliary modalities on the original image. Two 1 × 1 convolutions update the weights of the K-means modality and the gradient-field modality in the simplest possible way. After elementwise summation, the weighted embedding is multiplied elementwise with the original image to highlight it and is finally added to the original image to obtain the fused input.
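A minimal sketch of this fusion scheme in tf.keras is shown below, assuming each modality enters as a single-channel 128 × 128 tensor; the layer names and the bias-free 1 × 1 convolutions are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_input(height: int = 128, width: int = 128) -> tf.keras.Model:
    """Weight the K-means and gradient-field modalities with 1 x 1 convolutions,
    sum them, multiply the sum onto the original image, and add the result back
    to the original image (Fig. 8b)."""
    original = layers.Input((height, width, 1), name="original")
    kmeans_mod = layers.Input((height, width, 1), name="kmeans")
    gradient_mod = layers.Input((height, width, 1), name="gradient")

    # Learnable per-modality weighting with a minimal number of parameters.
    w_k = layers.Conv2D(1, kernel_size=1, use_bias=False)(kmeans_mod)
    w_g = layers.Conv2D(1, kernel_size=1, use_bias=False)(gradient_mod)

    embedded = layers.Add()([w_k, w_g])                    # elementwise summation
    highlighted = layers.Multiply()([original, embedded])  # highlight on the original image
    fused = layers.Add()([original, highlighted])          # final fusion input

    return tf.keras.Model([original, kmeans_mod, gradient_mod], fused, name="fusion")
```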
ResNeXt with CRF attention block for classification
Shallow features were extracted by the traditional methods described in the “Multimodal generation and fusion” section. To extract deep features more effectively, we chose the deep and wide ResNeXt as the backbone network for classification. ResNeXt [22] combines ideas from ResNet [21] and Inception [36]; in addition to width and depth, the cardinality (the number of parallel paths) of each block is a measurable dimension. It inherits ResNet's strategy of repeating layers but increases the number of paths and applies the split-transform-merge strategy in a simple and scalable manner. ResNeXt with the CRF attention building block is shown in Fig. 8d: the whole network replaces each building block in ResNeXt with our CRF attention (CRFA) building block. In detail, the network consists of one first layer and three residual layers; the first layer contains one CRFA building block, and each residual layer contains three.
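The overall layer arrangement could be sketched as follows, with `crfa_block` standing in as a placeholder for the CRF attention building block detailed below; the channel widths, pooling-based downsampling, and the omitted ResNeXt shortcut and cardinality details are assumptions made to keep the sketch short.

```python
import tensorflow as tf
from tensorflow.keras import layers

def crfa_block(x: tf.Tensor, filters: int) -> tf.Tensor:
    """Placeholder for the CRF attention building block (see the sketch after Eq. 10);
    a single 3 x 3 convolution keeps this layout sketch runnable."""
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_backbone(height: int = 128, width: int = 128, num_classes: int = 6) -> tf.keras.Model:
    """Layer arrangement only: one first layer with one CRFA block, followed by
    three residual layers with three CRFA blocks each."""
    inputs = layers.Input((height, width, 1))
    x = crfa_block(inputs, 64)                  # first layer: 1 CRFA block
    for filters in (128, 256, 512):             # three residual layers
        x = layers.MaxPooling2D()(x)
        for _ in range(3):                      # 3 CRFA blocks per residual layer
            x = crfa_block(x, filters)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="mcrfnet_backbone")
```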
The CRF attention module comprises a channelwise attention module and a receptive field attention module, denoted CA and RFA, respectively (Fig. 8c). The CA module assists the learning of layer-specific features and explores channelwise dependencies to select useful features. Specifically, given an intermediate input feature map \(U \in {R^{H \times W \times C}}\), a squeeze operation \(F_{sq} (U)\), i.e., global average pooling (GAP), encodes the entire spatial feature on each channel as a global feature:
$$F_{sq} (U) = \frac{1}{W \times H}\sum\limits_{i = 1}^{W} \sum\limits_{j = 1}^{H} U(i,j), \quad U \in R^{W \times H \times C}$$
(5)
The squeeze operation obtains the global description feature, and another operation is required to capture the relationship between the channels, namely, the excitation operation \(F_{ex} (F_{sq} (U))\):
$$F_{ex} (F_{sq} (U)) = \sigma \left( W_{2} \, \delta \left( W_{1} F_{sq} (U) \right) \right), \quad \delta (x) = \max (0,x), \quad \sigma (x) = \frac{1}{1 + e^{ - x}}$$
(6)
where \(\delta ( \cdot )\) and \(\sigma ( \cdot )\) are the ReLU activation and the sigmoid function, respectively. ReLU is a ramp function with gradient one for positive inputs and zero for negative inputs, and the sigmoid function maps its input to the range (0, 1). \(W_{1} \in R^{{\frac{C}{16} \times C}}\) and \(W_{2} \in R^{{C \times \frac{C}{16}}}\) are the learned weights of the two fully connected layers. The excitation operation learns the nonlinear relationships between channels. Finally, the learned activation value of each channel (sigmoid activation) is multiplied by the original feature map \(U\):
$$\hat{U}_{C} = F(U,F_{ex} (F_{sq} (U))) = U \cdot F_{ex} (F_{sq} (U)),U \in R^{W \times H \times C}$$
(7)
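A minimal tf.keras sketch of the CA branch (Eqs. 5–7) is given below, using the reduction ratio of 16 implied by the weight dimensions above; the helper name is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(u: tf.Tensor, reduction: int = 16) -> tf.Tensor:
    """CA module: squeeze (GAP), excitation (FC-ReLU-FC-sigmoid), channelwise rescaling."""
    channels = u.shape[-1]
    z = layers.GlobalAveragePooling2D()(u)                          # Eq. 5: F_sq(U)
    s = layers.Dense(channels // reduction, activation="relu")(z)   # W1 followed by ReLU
    s = layers.Dense(channels, activation="sigmoid")(s)             # W2 followed by sigmoid (Eq. 6)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([u, s])                                # Eq. 7: rescaled U
```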
For the RFA module, given the same input feature map \(U \in R^{H \times W \times C}\), we first conduct two transformations \(\tilde{F}:U \to \tilde{U} \in R^{H \times W \times C}\) and \(\hat{F}:U \to \hat{U} \in R^{H \times W \times C}\) with kernel sizes of 3 and 5, respectively. The results of the two branches are then combined by elementwise summation:
$$U = \tilde{U} + \hat{U}$$
(8)
For the branch outputs \(\tilde{U}\) and \(\hat{U}\), squeeze and excitation are performed as in Eqs. 5 and 6. Additionally, we use soft attention across channels, guided by the compact feature descriptor \(z\), to select information at different spatial scales:
$$\hat{U}_{RF} = \frac{{e^{{A_{c} z}} }}{{e^{{A_{c} z}} + e^{{B_{c} z}} }}F_{ex} (F_{sq} (\tilde{U})) + (1 - \frac{{e^{{A_{c} z}} }}{{e^{{A_{c} z}} + e^{{B_{c} z}} }})F_{ex} (F_{sq} (\hat{U})),U \in R^{W \times H \times C}$$
(9)
where \(A,B \in R^{{\frac{C}{16} \times C}}\). The final feature map of RFA is obtained through the attention weights on various kernels as in the above equation.
Finally, the CA and RFA results are integrated by an elementwise add operation, as shown in Fig. 8c:
$$\overline{U} = \hat{U}_{C} + \hat{U}_{RF}$$
(10)
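The RFA branch and the final CA + RFA combination (Eqs. 8–10) could be sketched as follows (TF 2.x assumed); the ReLU activations on the two convolution branches, the bottleneck used to form the descriptor z, and the helper names are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def _excite(x: tf.Tensor, reduction: int = 16) -> tf.Tensor:
    """Squeeze-and-excitation rescaling (Eqs. 5-7) applied to one branch."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

def receptive_field_attention(u: tf.Tensor, reduction: int = 16) -> tf.Tensor:
    """RFA module: 3x3 and 5x5 branches combined by a two-way softmax (Eqs. 8-9)."""
    channels = u.shape[-1]
    u3 = layers.Conv2D(channels, 3, padding="same", activation="relu")(u)  # U_tilde
    u5 = layers.Conv2D(channels, 5, padding="same", activation="relu")(u)  # U_hat
    fused = layers.Add()([u3, u5])                                         # Eq. 8

    # Compact descriptor z from the fused map (GAP plus a bottleneck FC layer).
    z = layers.GlobalAveragePooling2D()(fused)
    z = layers.Dense(channels // reduction, activation="relu")(z)

    # Two-way softmax over A_c z and B_c z (Eq. 9); the two weights sum to 1 per channel.
    a_logits = layers.Dense(channels)(z)
    b_logits = layers.Dense(channels)(z)
    weights = layers.Softmax(axis=1)(tf.stack([a_logits, b_logits], axis=1))
    a = layers.Reshape((1, 1, channels))(weights[:, 0, :])
    b = layers.Reshape((1, 1, channels))(weights[:, 1, :])

    # Weighted sum of the per-branch squeeze-and-excitation outputs.
    return layers.Add()([
        layers.Multiply()([a, _excite(u3, reduction)]),
        layers.Multiply()([b, _excite(u5, reduction)]),
    ])

def crf_attention(u: tf.Tensor) -> tf.Tensor:
    """CRFA module (Eq. 10): elementwise sum of the CA output (squeeze-and-excitation
    on U, as in the channel_attention sketch above) and the RFA output."""
    return layers.Add()([_excite(u), receptive_field_attention(u)])
```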
The detailed procedures are as follows: (1) randomly extract the six most common sonogram categories shown in Fig. 5 from the training set in equal proportions to avoid sample imbalance and to ensure that every category can be learned; (2) augment the data by rotation and normalize the image intensities; (3) select the classifier with the best performance and evaluate it on the test set to obtain the corresponding prediction results.
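A sketch of steps (1)–(2), assuming the training images are already grouped per class in a dictionary; the rotation range of ±15° and the per-image min–max normalization are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def balanced_sample(class_to_images: dict, per_class: int, seed: int = 0):
    """Step (1): draw the same number of images from every class to avoid imbalance."""
    rng = np.random.default_rng(seed)
    images, labels = [], []
    for label, imgs in class_to_images.items():
        idx = rng.choice(len(imgs), size=per_class, replace=False)
        images.extend(imgs[i] for i in idx)
        labels.extend([label] * per_class)
    return images, labels

def augment_and_normalize(image: np.ndarray, seed: int = 0) -> np.ndarray:
    """Step (2): random rotation followed by intensity normalization to [0, 1]."""
    rng = np.random.default_rng(seed)
    angle = rng.uniform(-15, 15)  # rotation range is an illustrative assumption
    rotated = rotate(image, angle, reshape=False, mode="nearest").astype(np.float32)
    return (rotated - rotated.min()) / (rotated.max() - rotated.min() + 1e-8)
```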
Establishment of scoring standards
We applied the trained MCRFNet to each patient's per-region ultrasound videos from multiple examinations and classified and scored the sonograms according to [37]. An A-line indicates that the patient is normally ventilated (score 0); an A&B-line indicates mild loss of lung ventilation (score 1); a B1-line indicates moderate loss of lung ventilation (score 2); a B1&B2-line indicates severe loss of lung ventilation (score 2.5); a B2-line indicates very severe loss of lung ventilation (score 3); and consolidation indicates solid lung changes characterized by a dynamic air bronchogram sign (score 4). After the classification results are quantified, the sum of the frame scores is divided by the number of frames to obtain the final lung function severity score, which ranges from 0 to 4.
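A minimal sketch of this scoring rule: each predicted frame class is mapped to its score, and the per-video severity is the mean over all frames, which stays within the 0–4 range.

```python
# Frame-level scores from the aeration scoring scheme described above.
CLASS_SCORES = {
    "A-line": 0.0,      # normal ventilation
    "A&B-line": 1.0,    # mild loss of ventilation
    "B1-line": 2.0,     # moderate loss of ventilation
    "B1&B2-line": 2.5,  # severe loss of ventilation
    "B2-line": 3.0,     # very severe loss of ventilation
    "consolidation": 4.0,
}

def severity_score(frame_predictions):
    """Mean of the per-frame scores over all frames of the examination video (0-4)."""
    return sum(CLASS_SCORES[c] for c in frame_predictions) / len(frame_predictions)

# Example: (2 + 2 + 3 + 4) / 4 = 2.75
print(severity_score(["B1-line", "B1-line", "B2-line", "consolidation"]))
```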
Training strategy
For the Stork, Mindray, Stork & Mindray, and Stork & Mindray & Philips datasets, we used independent test sets to verify the performance of the classifier. All images were resized to 128 × 128, and each training batch consisted of 8 randomly selected images. We regularized the model with dropout during training, and the network parameters were trained with the momentum optimizer at an initial learning rate of 0.1 that was divided by 10 every 30 epochs, stochastically minimizing the cross-entropy between the annotated labels and the predictions (equivalently, maximizing the log-likelihood). Our experiments were implemented in TensorFlow on a PC with an Intel Xeon E5 CPU, 64 GB of RAM, and an Nvidia TITAN Xp GPU with 12 GB of memory.
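In tf.keras terms, the training setup could look like the following sketch; the momentum value of 0.9, the dropout rate, the stand-in classifier, and the total number of epochs are assumptions not stated above.

```python
import tensorflow as tf

def lr_schedule(epoch: int, current_lr: float) -> float:
    """Divide the initial learning rate of 0.1 by 10 every 30 epochs."""
    return 0.1 * (0.1 ** (epoch // 30))

# Stand-in classifier; the real model is MCRFNet built from the fusion and
# backbone sketches above.  Dropout regularizes training as described in the text.
model = tf.keras.Sequential([
    tf.keras.layers.Input((128, 128, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),              # dropout rate assumed
    tf.keras.layers.Dense(6, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),  # momentum assumed
    loss="sparse_categorical_crossentropy",    # cross-entropy between labels and predictions
    metrics=["accuracy"],
)

callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# model.fit(train_ds, epochs=90, callbacks=callbacks)  # train_ds: batches of 8 images, 128 x 128
```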