Sharp loss: a new loss function for radiotherapy dose prediction based on fully convolutional networks

Background Neural-network methods have been widely used for the prediction of dose distributions in radiotherapy. However, the prediction accuracy of existing methods may be degraded by the problem of dose imbalance. In this work, a new loss function is proposed to alleviate the dose imbalance and achieve more accurate prediction results. The U-Net architecture was employed to build a prediction model. Our study involved a total of 110 patients with left-breast cancer, who were previously treated by volumetric-modulated arc radiotherapy. The patient dataset was divided into training and test subsets of 100 and 10 cases, respectively. We proposed a novel ‘sharp loss’ function, and a parameter γ was used to adjust the loss properties. The mean square error (MSE) loss and the sharp loss with different γ values were tested and compared using the Wilcoxon signed-rank test. Results The sharp loss achieved superior dose prediction results compared to those of the MSE loss. The best performance with the MSE loss and the sharp loss was obtained when the parameter γ was set to 100. Specifically, the mean absolute difference values for the planning target volume were 318.87 ± 30.23 for the MSE loss versus 144.15 ± 16.27 for the sharp loss with γ = 100 (p < 0.05). The corresponding values for the ipsilateral lung, the heart, the contralateral lung, and the spinal cord were 278.99 ± 51.68 versus 198.75 ± 61.38 (p < 0.05), 216.99 ± 44.13 versus 144.86 ± 43.98 (p < 0.05), 125.96 ± 66.76 versus 111.86 ± 47.19 (p > 0.05), and 194.30 ± 14.51 versus 168.58 ± 25.97 (p < 0.05), respectively. Conclusions The sharp loss function could significantly improve the accuracy of radiotherapy dose prediction.

In precision radiotherapy, personalized schemes should be established for evaluating the planning quality. Automated radiotherapy planning and quality control have been commonly based on integrating and summarizing historical data of expert-level treatment plans as well as building models to predict reasonable and achievable dosimetric indicators for new cases [1,2]. Numerous models have been developed for predicting achievable OAR constraints [2][3][4][5][6][7][8][9] and dose-volume histograms (DVH) [10][11][12].
Because a dose distribution is generally a better descriptor of radio-therapeutic prognosis than a single DVH index, more recently, dose distribution prediction from computed tomography (CT) images and regions of interest (ROI) has been the focus of several studies [13][14][15][16][17][18][19][20][21][22][23][24]. Also, deep learning methods have been recently proposed for predicting 3D dose distributions. Prominent examples of these methods utilize enhanced variants of the traditional convolutional neural networks (CNN), namely, the fully convolutional networks (FCN). A FCN is obtained through replacing the last fully connected layer in a CNN by a convolutional layer. The FCN architecture has been already used for medical image segmentation [25][26][27][28], and it has demonstrated outstanding performance in radiotherapy dose prediction for a variety of cancer types, including nasopharyngeal cancer, lung cancer, prostate cancer and breast cancer [16,[19][20][21][22][23].
In radiotherapy, the region of clinical concern usually accounts for only a small part of the whole imaged region. This data imbalance can lead to a decrease in the accuracy of a dose prediction model, because the model would then tend to make dose predictions closely matching the dosage values for the clinically irrelevant regions. The problem of data imbalance in deep learning has been previously addressed and successfully alleviated through special loss functions such as the focal loss designed for dense object detection and the dice loss employed in natural language processing tasks [29,30]. Further studies have been conducted to boost the dose prediction accuracy in radiotherapy by network structure adjustment, network parameter optimization and adding beam influence factors to network inputs [16,[19][20][21]. However, loss function design for dose prediction in radiotherapy has been generally overlooked. Inspired by earlier performance improvements based on the focal and dice loss functions, we propose a novel 'sharp loss' function for better handling of data imbalance problems, and incorporate this loss in an FCN framework for dose prediction in radiotherapy. The sharp loss is a dynamically scaled variant of the classical mean-squared-error (MSE) loss, where the scaling factor decreases in low-dose regions. The effectiveness of this proposed loss function is demonstrated using a U-Net architecture for dose distribution prediction in radiotherapy of left-sided breast cancer. Figure 1 shows the training and validation loss curves for the MSE loss and the sharp loss with different values of γ . The U-Net architecture equipped with the sharp loss achieved smaller loss than the same architecture with the MSE loss. In the first 30 epochs, the validation loss with the sharp loss function and factors of γ = 50 and γ = 100 show more variations between the cross-validation folds than those of the other loss function variants. However, all the training and validation loss curves tended to be flat after 32 epochs. This indicates that the U-Net model has good generalization for dose prediction. Table 1 compares the mean absolute difference (MAD) values of different dose regions in the test data for the MSE loss and the sharp loss with different values of γ. Clearly, dose prediction results based on the MSE loss achieve small MAD values in all regions, but the low-error voxels are concentrated outside the body. Inside the body, the MAD values are large for high ground-truth dose values. The standard deviation (std) explain the performance of the prediction model in different dose area. The smaller the std, the better the generalization performance of the model in different dose regions. The std of the prediction error inside the body was 312.57 cGy with the MSE loss. This standard deviation becomes smaller when the sharp loss is used with γ = 1, 25, 50 or 100. In particular, the standard deviation has the smallest value for γ = 100. For γ < 100, the  prediction precision inside the body increases as the value of γ becomes larger, and the sharp loss prediction results were better than those of the MSE loss. However, when γ > 100, the precision deteriorates again as the value of γ increased. Table 2 shows the MAD values of different ROI types for the MSE and sharp loss functions. The sharp loss function with γ = 100 shows the best performance for the planning target volumes (PTV), the ipsilateral lung, the heart and the contralateral lung. For the spinal cord, similar prediction errors are obtained using the sharp loss function with γ = 1, 25, 50 and 100. These errors are lower than those resulting from other loss variants. Table 3 shows the MAD values for different ROI types of the test data, along with statistical significance measures of the differences between the results obtained with the MSE loss and the sharp loss with γ = 100. Significant statistical differences were found for all ROI types, except for the case of the contralateral lung. The results demonstrated that using the sharp loss (γ = 100) leads to superior dose prediction results in comparison to the MSE loss. Figure 2 shows boxplots of the MAD values obtained for each ROI type of the test data. This figure indicates that the prediction results obtained with the sharp loss at γ = 100 are more accurate and stable than the results obtained by other loss variants.

Discussion
This study demonstrated the effectiveness of the proposed sharp loss function for dose prediction in radiotherapy. As Fig. 3 shows, the sharp loss function can be divided into two stages. A ground-truth value D 0 was defined as the demarcation point of these two stages, i.e.,   When D i [0, D 0 ) , the value of the sharp loss decreases with the decrease in dose. When D i [D 0 , 6000] , the value of the sharp loss is approximately equal to the MSE loss. The γ factor affects the loss function in two ways. First, this factor influences the value of D 0 . Second, the γ factor affects the slope in the range [0, D 0 ) . As the γ value increases, the loss of prediction error decreases in low-dose regions.
The γ factor was experimentally fine-tuned and its best value was found to be 100. When γ = 100, D 0 = 455.71 cGy, which means that the sharp loss function would enable the U-Net model to pay more attention to the regions where the dose is larger than 455.71 cGy. With γ = 100 and a ground-truth dose Dp i = 0 , the sharp loss = 0.047 the MSE loss. When γ < 100, the D 0 value becomes larger than 455.71 cGy, and hence imprecise prediction results may be obtained in the region where the ground-truth dose values fall between 455.71 cGy and D 0 . When γ > 100, loss values close to 0 would decrease quickly, and this can lead to imprecise prediction results in the region where the ground-truth dose values are between 0 and 455.71 cGy. This phenomenon can be observed in Table 1. The increase of γ reduces the dose prediction accuracy outside the body, but improves the model performance inside the body, especially in the region where the dose is larger than 3000 cGy. However, this effect is not obvious when γ is larger than 100. Since the whole region inside the body is of clinical concern when making a treatment plan and deserves the same level of attention for dose prediction, the standard deviation of the prediction error inside the body was calculated for each loss. The standard deviation has the lowest value when γ = 100. This means that paying sufficient attention to various dose regions inside the body is warranted to ensure balance in the U-Net optimization process. Table 2 further illustrates the advantages of the proposed sharp loss with γ = 100. For most of the considered organs, this loss achieved the best dose prediction performance. Even though the sharp loss function with γ = 100 shows the lowest standard deviation of the MAD prediction error, the prediction results still vary across different regions. The largest MAD values occur in the regions where the dose values are between 2000 and 3000 cGy, while the smallest MAD values are observed in the regions with dose values smaller than 500 cGy or larger than 5000 cGy. These results can be ascribed to two reasons. First, the predicted results are prone to higher errors in regions with high dose gradients. Those regions usually have doses between 2000 and 5000 cGy. The average dose gradient is shown in the last column of Table 2. Second, a patient case can be associated with different expertise dose distributions. Therefore, even when the predicted dose values satisfy the optimal solution, there could still be some differences between the predicted and ground-truth dose values. Figure 4 presents a sample comparison of the clinical ground-truth dose distribution and the predicted dose distributions for the MSE and sharp loss functions, with the corresponding DVH. In this case, the prediction result using MSE loss shows a large error in the PTV region. There is a dose deficiency at the margin of PTV, and the predicted doses of some PTV voxels adjacent to the air are close to zero. Obviously, (1) when D i = D 0 , sharp loss (D i ) = 0.99 × MSE loss. the sharp loss improves the dose distribution prediction in the ROI of the PTV. As discussed above, the sharp loss has a better correction in the areas with large dose gradients, while tumors near the body surface has a great dose drop at the air boundary. The sharp loss may bring greater improvement to the cases that the PTV near the body surface.
To provide a more realistic and intuitive estimate of the dosimetry, clinically relevant metrics are shown in Table 4. The homogeneity index (HI) is defined as (D 1 − D 99 )/5000cGy , and the conformation index (CI) is defined as The clinical metrics of prediction results obtained with the sharp loss at γ = 100 are more accurate than the traditional MSE loss. The FCN is widely used in radiotherapy dose prediction at present. Kajikawa used CNN-based methods to predict radiotherapy dose of prostate cancer [22]. Chen found that ResNet101 performed better than VGGNet16 in nasopharyngeal cancer [19]. Several studies used the U-Net, which is characterized by U-shaped architecture and upsampling and down-sampling operations, to train small size medical dataset efficiently [20,23]. Babier developed a 3D generative adversarial network for dose prediction and demonstrated superior performance on oropharyngeal cancer cases than previous approaches [31]. The above studies enhance the model performance by improving the network structure. Besides, Barragán-Montero improved the prediction results by adding beam configuration into the input data [21]. However, the above methods are not targeted to solve the problems caused by data imbalance.
Data imbalance is a key obstacle that deteriorates performance in tasks of dense object detection or semantic segmentation. Small organ segmentation in medical imaging is particularly quite challenging due to the large imbalance between the object and background classes [32,33]. Such imbalance also emerges in dose prediction tasks in radiotherapy. Among all voxels of the 10 test cases of this study, about 74.52% of the voxels were outside the body and the corresponding dose values were just 0. Table 1 shows that the MAD value outside the body was much lower than that inside the body (32.93 versus 186.55 cGy) when the MSE loss function was used. This indicates that a large number of the voxels outside the body overwhelms the region inside the body. Even inside the patient body, the voxels whose dose values are lower than 500 cGy accounted for 77.30% of the total voxels. As a result, the MAD value for the region with the dose range 0-500 cGy is the lowest among all regions inside the body, and the high-dose region was ignored. Data augmentation is a common method to alleviate the effects of data imbalance in deep learning. An augmentation algorithm applies multiple types of transformations to the training data in order to increase the proportion of the samples for the weakly represented class. However, data augmentation is only suitable for cases where the input images and the associated ground-truth data are scale-independent, but dose distributions in radiotherapy are dependent on the X-ray penetration depth. Loss function redesign is another way to deal with the problem of data imbalance, and this has been validated by previous studies [29,30]. In this study, the sharp loss was proposed to solve the data imbalance problem, and this loss has indeed demonstrated improved prediction performance. To the best of the authors' knowledge, this is the first study that uses such novel loss function for improving the 3D dose prediction performance. Nevertheless, this study has several limitations. First, the proposed sharp loss function was applied in conjunction with the U-Net architecture only and the data were limited to breast cancer cases. The next phase of this work will involve more types of deep networks as well as cancer types associated with other sites. Second, the recommended sharp loss factor, γ = 100, was selected based only on a limited validation dataset. With Table 4 The clinical metrics of ground-truth and predicted dose distribution using MSE loss and sharp loss (γ = 100) different networks or datasets, alternate suitable values of γ could be sought. Because of the different characteristics of radiotherapy planning for different human organs, the universality of the proposed sharp loss function will be investigated and scrutinized in future work. Third, the model was tested on 10 cases, which limited the evaluation of trained model generalization. The dataset will be expanded in the future. Furthermore, since one of the important applications of dose prediction is automatic planning, the related automatic plan research would be conducted in future work.

Conclusions
This study proposes a novel loss function, named sharp loss, for FCN-based dose distribution prediction in radiotherapy. This loss function was first used to improve the outcomes of dose prediction modeling. The sharp loss function has achieved better prediction accuracy than that of the traditional MSE loss. With different values of the nonnegative parameter γ greater than or equal to one, the value γ = 100 exhibited prediction results that are superior to those obtained with other values. In addition, the same γ value reduced the imbalance-induced prediction errors in regions of different doses. The sharp loss function could enable the prediction model to focus on the regions of more clinical significance. The improved predictions of dose distributions provide an effective foundation for quality assurance, automation, and efficiency in planning radiotherapy.

Dataset description
In this study, 110 patients in the early stages (Stage I and Stage II) of left-sided breast cancer were involved retrospectively. These patients received radiotherapy in the Zhejiang Cancer Hospital from January 2017 to December 2020. Breast scanning was performed using a GE LightSpeed-RT CT simulator or a Philips large-aperture CT simulator. The CT layer thickness was 5 mm and the pixel matrix had a size of 512 × 512. After scanning, the PTV, heart, ipsilateral lung, contralateral lung, and spinal cord were delineated by senior physicians with 10 years of expertise. No regions of interest overlapped with each other. Each patient received breast-conserving radiotherapy on the left side, where the internal breast region and cervical lymph nodes were not subjected to radiation. Actual treatment was carried out using volumetric-modulated arc therapy (VMAT) techniques with 6-MV double-arc X-ray. Also, the start and stop angles of the X-ray gantry were tangent to the irradiated breast. The prescribed dose was 5000 cGy spread over 25 fractions, and scaled to cover 95% of the PTV for all cases. The ROI delineation and treatment planning were completed using the RayStation radiotherapy planning system (Version 4.5, RaySearch Laboratories AB, Sweden). All plans were optimized further using a trial-and-error process to conform to the as-low-as-reasonably achievable principle for normal tissues. Moreover, all plans were reviewed by two experienced dosimetrists and one senior oncologist.

Data preprocessing
For each patient in the training set, the ROI structures and the corresponding dose distribution were converted from the Digital Imaging and Communications in Medicine (DICOM) format to matrices in Python. In this conversion process, each voxel was assigned a specific value based on the ROI type. Specifically, the voxels in the ROIs of the PTV, the ipsilateral lung, the heart, the contralateral lung, the spinal cord, and other tissues were assigned values of 1.000, 0.833, 0.667, 0.500, 0.333, and 0.167, respectively, whereas voxels outside of the body contour were assigned a value of 0. For each case, every ROI matrix retained 64 slices in the head-foot direction, and all PTV was contained within the 64 slices. In the transverse section, each slice matrix was bilinearly interpolated into a size of 256 × 256 to reduce the computational cost. The dose distribution matrices were derived from the DICOM data, and each dose value was scaled to be in the range [0, 1] through division by 6000 cGy. For each case, the dose matrix was bilinearly interpolated into a size of 64 × 256 × 256, which corresponds to the ROI image matrix size.

Fully convolutional networks
The U-Net FCN architecture was used to predict dose distributions in breast cancer. This choice is motivated by earlier successful applications of this architecture in medical image segmentation as well as dose prediction for head and neck cancer, prostate cancer and breast cancer [20,22,23]. The U-Net model used in this study has a total of 10 layers, with a network input of 3-dimensional 64 × 256 × 256 ROI matrices with 1 channel. Each of the first 5 layers is a down-sampling layer that contains two 3 × 3 × 3 convolution modules with valid padding, followed by a rectified linear unit (ReLU), one batch normalization module, and one 2 × 2 × 2 maximum pooling operation. The 6th, 7th, 8th, and 9th layers are up-sampling layers. Each of these layers has one concatenation module, two 3 × 3 × 3 convolution modules with valid padding, followed by a ReLU module, and one batch normalization module. Finally, the 10th layer has a 1 × 1 × 1 convolution module followed by a linear activation function for output, and the output of this layer is the dose distribution matrix. The overall architecture of the proposed network is shown in Fig. 5.

Sharp loss
The sharp loss function is introduced in this work to better address the problem of data imbalance for dose distribution prediction in radiotherapy. Indeed, training data for this problem typically show an extreme imbalance between the high-dose volume (> 500 cGy) and the low-dose volume (< 500 cGy). The sharp loss function is derived from the MSE loss: where n was the total number of voxels, Dp i is the predicted dose for voxel i, and D i is the ground-truth dose for voxel i. Figure 3 shows a plot of the MSE loss (in blue) as a function of the ground-truth dose value. One remarkable property of this loss is that even a voxel dose that is close to 0 incurs a loss with a non-trivial magnitude. When small loss values are summed over a large number of low-dose voxels, the clinically more relevant high-dose voxels will be overwhelmed.
The following experiments stress that remark: the large number of low-dose voxels encountered during training of the U-Net architecture dominates the MSE loss. In fact, the low-dose voxels, especially those outside the body, comprise the majority of the MSE loss and hence dominate the loss function gradient. In order to redirect the network optimization process to focus on the more relevant higher-dose voxels, the sigmoid function is employed to reduce the weight of the low-dose voxels in the network loss. The sigmoid function is mathematically defined as:   Figure 6 shows a plot of this function (in blue). Because the sigmoid function is smooth and easily differentiable, it is widely used in deep learning to map real numbers into the unit interval [0, 1]. In this work, the definition of the sigmoid function was modified in terms of two aspects. First, since the predicted dose values were scaled to be in the range [0, 1], the γ factor was allowed to exceed the unit value in order to increase the loss gradient in the dose value range, especially for low-dose values. A modified sigmoid function with γ = 25 is shown as the orange curve in Fig. 6. A larger γ value resulted in a steeper function behavior in the low-dose region. Second, the loss curve was moved toward the positive direction of the x axis by subtracting 0.03 from x to keep the groundtruth dose values positive. This function modification can lead to more efficient exploitation of the gradient regions of the sigmoid function. This modified function is shown in Fig. 6 as the green curve.
As an alternative to the MSE loss, we propose in this study to reshape the loss function to assign lower weights to low-dose voxels, and thus focus the network training on the highdose voxels. The modified sigmoid function introduced above was proposed as a modulating factor for the MSE loss. We name the modulated loss function the 'sharp loss' , which is defined as: where γ ≥ 0 is a tunable parameter. The factor 1/ 1 + e −(D i −0.03)×γ was derived from the sigmoid function as explained above. Figure 3 shows plots of the sharp loss function for several γ values. For this proposed loss function, when a ground-truth dose value is close to 0, the loss for this voxel is assigned a low weight and approaches 0. Since most of the zero-dose voxels are outside the body, the prediction precision of the low-dose voxels will not cause large errors inside the body. In the following experiments, we analyzed the dose prediction performance with the sharp loss function for γ = 1, 25, 50, 100, 250, and 500.

Network model training
Ten breast cancer cases were randomly selected from the 110 available cases to form the test set, while the remaining 100 cases comprised the training set. The training data included a total of 100 three-dimensional ROI arrays and their corresponding dose matrices. The ROI arrays were used as the input data while the corresponding dose matrices were employed as the output data for prediction model training. A tenfold cross-validation scheme was employed. The network construction was implemented in the Python-based application programming interface of Keras (https:// keras. io/). The Adam algorithm was used for loss function minimization. The batch size was set to 2, and the initial learning rate was set to 0.001. Also, when the validation loss did not decrease in 5 iterations, the learning rate was attenuated at a rate of 0.1 until it reached 10 −7 . The training was carried out using two Intel Xeon Gold 5118 CPUs @ 2.30 GHz, and one NVIDIA TITAN RTX PC with a 192-GB RAM. After network training, the ROI arrays of the test set were fed into the trained model, and the corresponding dose prediction matrices were obtained.