Inter-site harmonization based on dual generative adversarial networks for diffusion tensor imaging: application to neonatal white matter development

Background Site-specific variations are challenges for pooling analyses in multi-center studies. This work aims to propose an inter-site harmonization method based on dual generative adversarial networks (GANs) for diffusion tensor imaging (DTI) derived metrics on neonatal brains. Results DTI-derived metrics (fractional anisotropy, FA; mean diffusivity, MD) are obtained on age-matched neonates without magnetic resonance imaging (MRI) abnormalities: 42 neonates from site 1 and 42 neonates from site 2. Significant inter-site differences of FA can be observed. The proposed harmonization approach and three conventional methods (the global-wise scaling, the voxel-wise scaling, and the ComBat) are performed on DTI-derived metrics from two sites. During the tract-based spatial statistics, inter-site differences can be removed by the proposed dual GANs method, the voxel-wise scaling, and the ComBat. Among these methods, the proposed method holds the lowest median values in absolute errors and root mean square errors. During the pooling analysis of two sites, Pearson correlation coefficients between FA and the postmenstrual age after harmonization are larger than those before harmonization. The effect sizes (Cohen’s d between males and females) are also maintained by the harmonization procedure. Conclusions The proposed dual GANs-based harmonization method is effective to harmonize neonatal DTI-derived metrics from different sites. Results in this study further suggest that the GANs-based harmonization is a feasible pre-processing method for pooling analyses in multi-center studies.


Background
Diffusion tensor imaging (DTI) has been widely used to assess structural alterations associated with the brain development or lesions on neonates [1,2]. However, the sample size is always limited due to the difficulty of the neonatal data acquisition [3]. To improve the statistical power, the multi-center/multi-scanner study is a common strategy [4]. DTI-derived metrics are reproductive when magnetic resonance imaging (MRI) scanners and acquisition protocols are equivalent [5]. However, differences related to the variety of scanners, magnetic fields, coils, and/or acquisition protocols usually exist in multi-center studies [6,7]. Such site-specific effects will introduce measurement variability, which hinder the ability to obtain 'truly' quantitative measures, which may lead to false findings [8]. Therefore, the site-specific variations have to be removed prior to integrating datasets. The inter-site (or inter-scanner) harmonization is the essential step in multi-center studies.
Recently, several harmonization methods have been proposed based on the phantom or directly based on human brain datasets. The phantom-based harmonization is a simple and feasible approach to correct systematic differences across sites [4,9]. However, it is not suitable for retrospective studies to monitor real-time states of MRI scanners by phantoms. Furthermore, it may be not adequate to capture tissue-specific differences [7]. The harmonization based on human brain datasets may overcome these problems related to the phantom-based harmonization [7,[10][11][12][13]. There are three categories of approaches based on human brain datasets [7,11]: the model-free approach based on the rotation-invariant spherical harmonic (RISH) features, the meta-analysis, and the statistical covariates methods. For the RISH-based method, regional complexities of biological properties in the brain have been considered during the harmonization [10]. It is able to capture tissue-specific differences. However, the accuracy of the transformation from diffusion-weighted images to the representation of spherical harmonic basis depends on the gradient direction number [14]. The demand of the high angular resolution during the data acquisition will limit its clinical applications. Approaches based on DTI-derived metrics, such as fractional anisotropy (FA) and mean diffusivity (MD), may be relatively feasible in clinical applications. As a comparison, the harmonization based on the meta-analysis can be performed on DTI-derived metrics. The meta-analysis strategy [15,16] harmonizes data through calculating z-scores, with the hypothesis of Gaussian distribution of the metric. However, the distribution is non-Gaussian because of the limited sample size in most cases [11]. Similarly, harmonization methods based on statistical covariates can also be performed on DTI-derived metrics. Among the methods based on statistical covariates, the linear scaling based on the whole brain or the target region of interest is the easiest to be implemented [4,17]. Compared with the scaling, ComBat demonstrates better performances by estimating additive and multiplicative factors in each voxel [17,18]. However, the parametric distribution is difficult to be determined for various imaging models [7]. It is also difficult to determine whether assumed parameters in the ComBat are enough to reflect scanner-related or site-related effects. Moreover, the feasibility of the previous harmonization methods proposed on datasets of adults remains to be investigated on datasets of neonates.
To solve the problem related to the parameter selection and capture characteristics of the data distribution, the framework of generative adversarial networks (GANs) is an effective approach [19,20]. Moreover, deep learning-based algorithms can reliably capture the nonlinear mapping relationship between different sites or scanners [13]. GANs employ two neural networks, the generator and the discriminator, to yield high-quality synthetic images. By setting opposite objectives, the generator and the discriminator are adversarial to each other. As the training goes on, each network will be improved. Finally, the generator can yield sharp vivid images. Based on the power of GANs to generate synthetic images, the dual learning architecture has been adapted to achieve the unsupervised image-to-image translation [21][22][23]. This motivates us to consider the same demand in the harmonization task.
In this work, we try to propose an inter-site harmonization approach by using the dual GANs ( Fig. 1) with the Markovian discriminator (Fig. 2). Based on DTI-derived metrics (FA and MD) of age-matched neonates from different sites, this study introduces the GANs to find the complex nonlinear mapping relationship between two different domains. Performances of the proposed approach are compared with three conventional methods: the scaling based on the whole brain white matter (global-wise scaling), the scaling in voxels (voxel-wise scaling), and ComBat [17]. The highlight of this work is to propose a dual GANs-based harmonization method and evaluate its performance on neonatal datasets.

DTI metrics affected by sites
To demonstrate site effects on FA, Fig. 3  weight of the enrolled neonates. However, site 2 holds significantly (P < 0.05) higher FA than site 1 (as shown in Fig. 3). Moreover, Pearson correlation coefficient between the inter-site difference and the averaged FA is 0.1771 (as shown in Figure S1, Additional file 1). This suggests that the correlation between FA values and site effects is weak. By using tract-based spatial statistics (TBSS), it can be found that the original data without harmonization have significant inter-site differences (P < 0.05) in nearly the whole white matter region (as shown in Fig. 4). Similar site effects can also be found on MD ( Figure  S2A and B, Additional file 2). Though the inter-site differences in the distribution of MD is less obvious than that of FA, TBSS shows significant differences in MD between sites. This is in agreement with the previous results on subject from 8 to 19 years [17]. Therefore, it is necessary to harmonize the data prior to the pooling analysis.

Dual GANs reduce site-related effects on DTI-derived metrics
As shown in Fig. 4, the proposed method achieves comparable results as the voxel-wise scaling and the ComBat methods. They can eliminate the inter-site significant differences in the white matter region. However, quantities of voxels with inter-site significant differences still exist for the global-wise scaling method. This suggests that methods performed at the voxel level are more efficient than that performed at the global level. To quantify differences between sites, Figs. 5 and 6 show the absolute error and the root mean square error (RMSE) before and after harmonization. The voxel-wise scaling, the ComBat, and the proposed dual GANs methods can reduce the absolute error and RMSE, compared with those of the original data. Furthermore, the proposed method holds the lowest median values in the absolute error and RMSE, according to the results of the sixfold cross-validation.
Note that, the results in Figs. 4, 5, and 6 are based on the dual GANs with the 2-dimensional (2D) kernel in the axial orientation. Compared with the sagittal or the coronal orientation, the harmonization based on dual GANs with 2D-kernel in the axial orientation holds lower RMSE (as shown in Table S1, Additional file 3). It can achieve comparable results with the 3-dimensional (3D) kernel (as shown in Table S1, Additional file 3 and Figure S3, Additional file 4). Moreover, the harmonization based on dual GANs with the  2D-kernel in the axial orientation can remove site effects on MD as well (as shown in Figure S2, Additional file 2). As in the case of 3 sites, one of the sites can be selected as the reference. As shown in Figure S4, Additional file 5, site 2 is considered as the reference. The dual-GAN method can remove the differences across sites ( Figure S4B and D, Additional file 5).

Application in the pooling analysis of white matter development
To reveal age-related changes during the pooling analysis, this study performs Pearson correlation between postmenstrual age and FA along splenium of the corpus callosum (SCC), left and right corticospinal tract (CST) on neonates without MRI abnormalities.

Fig. 4
Inter-site differences in fractional anisotropy (FA) values before and after harmonization using different methods. Red represents that site 2 holds significantly (P < 0.05) higher FA than site 1. Blue represents that site 2 holds significantly (P < 0.05) lower FA than site 1. Green represents regions without significant difference between two sites. Note: dual generative adversarial networks are performed with the 2-dimensional kernel on axial slices. For inter-site comparisons after harmonization, differences with P < 0.05 in any run of the sixfold cross-validation are overlaid together on the template map Positive correlation between the postmenstrual age and FA can be found (as shown in Fig. 7). The dual GANs harmonization method increases the number of locations with significant correlation (P < 0.05), compared with the correlation before harmonization. Moreover, correlation coefficients after harmonization averaged over different runs of the cross-validation are also larger than those before harmonization. Furthermore, inter-gender differences (Cohen's d between males and females) can be maintained by the harmonization procedure (as shown in Figure S5

Discussion
This study proposes a dual GANs-based harmonization method for neonatal DTIderived metrics from different sites. With smaller errors than conventional methods, the proposed method effectively removes site-related effects during the TBSS analysis. Moreover, the method can preserve the age-related and the gender-related variations of FA during the harmonization procedure.

Dual GANs vs. scaling and ComBat methods
Differences between sites or scanners, specifically including differences in magnetic fields, coils, and acquisition parameters, always cause nonlinear changes in MRI signals [7]. Together with potential unknown factors, the above differences make the relationship between two sites complex. To remove effects associated with sites or scanners, various studies have proposed several methods to harmonize the DTI data [4,17]. Consistent with the previous finding [17], the global-wise scaling does not work well to harmonize neonatal datasets from different sites in this study. This is due to the spatial heterogeneity of site-related effects throughout the white matter [10]. As comparisons, the scaling and the ComBat at the voxel-wise level can overcome the problem of the global-wise scaling and perform well on neonatal datasets. However, it is not able to fully capture the nonlinear inter-site relationship by using a prior assumed model with several observing significant factors. Different from conventional methods like the scaling and the ComBat, the proposed method introduces dual GANs to map the complex nonlinear relationship between different sites. This relatively complex mapping procedure takes longer time than the scaling and ComBat methods. On the same local computer,  the scaling and ComBat methods take about 2 × 10 −5 s/slice, while the proposed method takes 2 × 10 −2 s/slice during the inference procedure. Fortunately, the proposed method demonstrates advantages in reducing inter-site differences with smaller absolute errors and RMSE. The age-related and the gender-related variations of FA can be preserved during the harmonization procedure. Furthermore, the dual GANs-based harmonization can increase the correlation coefficient between FA and the postmenstrual age during the pooling analysis. In the case of multiple sites (more than 2 sites), one of the sites should be selected as the reference during the harmonization procedure. Results in this study suggest that dual GANs can be an alternative method for the data harmonization in multi-center studies.

Dual GANs vs. other deep learning-based methods
Besides the assumed model-based harmonization methods, several deep learning approaches have also been reported [12,13]. Similar to the motivation of those deep learning-based methods, this work tries to map the complex relationship between sites through the convolutional neural network. Different from previous methods, the proposed approach is based on DTI-derived metrics instead of RISH features, considering the limited gradient direction number [14]. Meanwhile, this work uses the Markovian discriminator in the harmonization framework. This can improve the ability of GANs to capture the local information [22,23], though the local information may be still not enough to achieve details in site effects. Furthermore, the dual GANs-based harmonization can work well on unpaired datasets during the training, which will improve the flexibility during applications.
As for the selection of kernel styles of dual GANs, the 2D-kernel in the axial orientation shows great performances, compared with the 3D-kernel and 2D-kernels in the coronal and the sagittal orientations. This may be associated with the acquisition mode. In this current work, the axial acquisition mode is used. The intra-slice resolution is 1.41 × 1.41 mm 2 , while the slice thickness is 2.5 mm or 4 mm. Therefore, the harmonization depending on axial slices may be more suitable for these datasets. Considering the efficiency of training (2D vs. 3D: 0.07 vs. 0.70 s/slice) and the RMSE, this work focuses on the GANs with the 2D-kernel. During applications, the appropriate orientation should be selected according to the acquisition mode.

Limitations
Despite the promising results, this study also has some limitations. Firstly, the proposed method should be used with caution during applications, though it works well on the enrolled neonates in this study. Model parameters in dual GANs are mainly dependent on the target and the reference images. Therefore, dual GANs should be trained again when they are performed on new datasets to capture the specific information. Secondly, the mapping between sites in this work may be not perfect. This work harmonizes DTIderived metrics of age-matched neonates from two sites. The strategy based on the same travelers across different sites may overcome this limitation. However, it is not practical to acquire data from different sites on the same neonates during the same period. Thirdly, the harmonization method is performed on the DTI-derived metrics (FA and MD) instead of the raw diffusion-weighted data. This will increase the cost of training for various metrics.

Conclusions
In conclusion, the proposed dual GANs-based harmonization method is effective to harmonize neonatal DTI-derived metrics from different sites. Results in this study further suggest that the GANs-based harmonization is a feasible pre-processing method for pooling analyses in multi-center studies. Our future work will focus on the harmonization approach for the raw diffusion-weighted data and try to improve the computational efficiency.

Methods
This study is approved by the local institutional review board. Informed written consents have been obtained from parents of neonates.

GANs
GANs have achieved impressive results in the image generation, the image editing and the image translation tasks [19,20]. The main idea of the adversarial training is introducing an auxiliary discriminator to handle the difficulty of evaluating the quality of generated images. The discriminator can be regarded as a binary classifier to distinguish synthetic images from real images. During the training process, the generator tries to generate high-quality synthetic images to satisfy the discriminator, while the discriminator tries to discriminate those synthetic images. In practice, the generator and the discriminator are usually implemented as two independent neural networks. To form the adversarial relationship, GANs use the cross-entropy to define the objective of both networks: where G and D denote the generator and the discriminator separately. x ∼ p data (x) represents the real data. z ∼ p z (z) is the random noise taken by the generator. The min and max denote that two networks are going to optimize this objective function in opposite directions. The capacity of the generator and the discriminator will be improved during the training process. Once the generator is well trained, it can generate high-quality images, hard to be distinguished from real images. Based on dual GANs, the harmonization pipeline is designed as shown in Fig. 1. Details about the dual learning architecture, the objective, and the network configuration are introduced as follows.

Dual learning architecture
The dual learning architecture is firstly proposed to reduce the requirement on labeled pairs in the machine translation [24]. The main idea of such architecture is to avoid the need of the paired training data. Moreover, the dual learning architecture can also help constraining a one-to-one mapping between the source and the target domains [21]. As shown in Fig. 1, the workflow of the dual GANs-based harmonization can be briefly summarized as follows: given the metric image (such as the FA) x ∈ p x∼site1 (x) , generator G A : site 1 → site 2 is employed to generate the harmonized output x ′ = G A (x) . Discriminator D A is then trained to distinguish the harmonized result xʹ from the DTI-derived metric y ∈ site 2 . To guarantee a meaningful mapping, generator G B : site 2 → site 1 is used to generate the corresponding reconstructed metric x ′′ = G B x ′ = G B (G A (x)) of the original input x. And a reconstruction loss �x − G B (G A (x))� is employed to force the reconstructed result x" to obey the original distribution. Notice that, the generator and the discriminator are trained simultaneously in this dual task. Similarly, the metric y ∈ p x∼site2 y is used to generate the harmonized result y ′ = G B y . And then the reconstructed metric is generated by: y ′′ = G A y ′ = G A G B y . The reconstruction error is defined as y − G A (G B y ) , the distance between the original and the reconstructed metrics.
A previous study [21] showed that the conventional model cannot guarantee the oneto-one mapping, since the ability of the generator is theoretically infinite without the dual learning architecture [19,20]. In other words, there are quantities of mappings between two domains. Though generators can always find a mapping without any constraint, the mapping is not one-to-one. Such harmonization could not bring us the meaningful relationship. Therefore, the dual learning architecture is essential for the GANs-based harmonization.

Objective
The key to the great performance of GANs is the use of the adversarial loss between the generator and the discriminator. However, it is difficult to achieve the balance between the generator and the discriminator. As observed in the previous work [25], the failure of GANs' training is associated with the traditional format loss function based on the optimization toward the Kullback-Leibler divergence between the real and the generated probability. When there is little or no overlap between them, especially at the early training stage, the gradient from the discriminator will vanish and the training will stall. The Wasserstein distance is continuous and provides a usable gradient, which makes the training process more stable. Thus, we employ the loss function based on the Wasserstein distance. The corresponding adversarial loss function is defined as: where G A denotes the generator, D A denotes the discriminator, x ∈ p x∼site1 (x) and y ∈ p x∼site2 y denote the input metrics from site 1 and site 2, separately.
Different from approximating the Lipschitz continuity based on weighting clips [22], the gradient penalty approach [26] is employed in this work. In practice, the gradient penalty approach can speed up the training process. Thus the adversarial loss becomes: In the above equation, x is sampled uniformly along a straight line between a pair of real and generated images. gp is a constant used to balance function D's outputs and gradient-influenced factors. In our experiments, we set gp to 10 according to the previous work [26]. The reconstruction loss is also introduced to force translated samples to obey the domain distribution. It has been proved that L 2 distance usually causes blurry results during the image generation [27]. Thus, the reconstruction loss is defined by L 1 distance.
where G B (G A (x)) and G x G y y represent the reconstructed metrics. These reconstructed metrics will be similar to original metrics x and y, when L recon converges to the minimum.
According to experimental results of the previous study [23], removing the adversarial loss substantially degrades the image quality, as does removing the reconstruction loss. Decreases in quantitative measures, such as FCN-scores and the classification performance, also suggest that both the adversarial loss and the reconstruction loss are important to improve the translation quality. Thus, the final loss function is defined as: where is a constant used to balance loss functions, because both the adversarial and the reconstruction losses are important to generate high-quality harmonized results. In our experiments, we set to 20 according to the previous work [22].
Reconstruction loss is designed to preserve the global information. To introduce more local details, the Markovian discriminator is used in this current work (as shown in Fig. 2). With the employment of Markovian discriminators, the feedback from discriminators encourages generators to concentrate on the local information. Thus, the adversarial loss and the reconstruction loss are complementary to each other.

Network configuration
The network architecture is shown in Fig. 2. This work uses the identical network architecture for both GANs. U-net is used as the backbone [28]. Generators are configured with the equal number of convolutional and transposed convolutional layers. For networks with 2D-convolutional kernels, the encoder part is composed of convolutional layers with a kernel size of 5 × 5 and stride-2 in the width and the height orientations, followed by a Leaky rectified linear unit (ReLU) function and Batch normalization layers. For networks with 3D-convolutional kernels, the encoder part is composed of convolutional layers with a kernel size of 5 × 5 × 5, followed by a Leaky ReLU function and Batch normalization layers. To combine the low-level information, feature maps from convolutional layers are passed by skip connections, and concatenated with those calculated in corresponding transposed convolutional layers with the identical output size. The combination of information from front layers can help generators to reserve more low-level features. As for the discriminator, we follow the recommendation given in a previous research [26]. All the Batch normalization layers in the discriminator are removed. Thus, discriminators are configured with fully convolutional networks using modules of form convolution layers followed by a Leaky ReLU function. Harmonization methods based on global-wise parameters are tended to lose local details. The mapping relationships are not identical across different regions of white matter [10]. Thus, we consider the local information through employing a Markovian discriminator. The Markovian discriminator tends to consider local features, compared with the traditional discriminator [22,23]. It discriminates input images at the patch level rather than the whole image. Differences in structures between the Markovian discriminator and the traditional discriminator are shown in Fig. 2. With a relatively smaller receptive field, such discriminator will concentrate on local details. The overall discriminative output possibility is computed by averaging all responses. Consequently, receiving the feedback from discriminators, generators can be induced to concentrate on the local information. In this study, the patch size is fixed at 30 × 30 for the discriminator with 2D-convolutional kernels and 30 × 30 × 8 for the discriminator with 3D-convolutional kernels, considering the matrix size of DTI-derived metric images. This is different from the size of 70 × 70 employed in previous studies [22,23].

Participants
This study enrolls 84 term neonates without any MRI abnormalities or evidences of any clinical episodes that might cause cerebral damages. As shown in Table 1, the data include DTI on 42 neonates (28 males and 14 females, gestational age range: 37.43-42.00 weeks, median = 40.00 weeks) from site 1 and 42 neonates (28 males and 14 females, gestational age range: 37.14-41.71 weeks, median = 39.71 weeks) from site 2.

MRI acquisition
The acquisition parameters of DTI are listed in Table 2. DTI is performed on two sites by using the same scanner version (General Electric, 3.0 T, Signa HDXT, WI, USA) with the eight-channel head coil. The single-shot spin echo planar imaging sequence is used for the DTI acquisition. DTI protocol in site 1 is carried out with the following parameters: 35 gradient directions; b values = 0 and 1000 s/mm 2 ; repetition time/echo time = 5500/95 ms; slice thickness = 4 mm without gap; field of view = 180 × 180 mm 2 ; and matrix size = 128 × 128. As for the protocol in site 2, DTI is carried out with parameters: 30 gradient directions; repetition time/echo time = 11,000/69.5 ms; b values = 0 and 600 s/mm 2 ; slice thickness = 2.5 mm; while the field of view and the matrix size are the same with site 1.

Table 1 Demographic information of neonates without magnetic resonance imaging abnormalities from two sites
Mann-Whitney U test is used to test the inter-site differences in the gestational age, the postmenstrual age, and the birth weight. Differences in the gender ratio are tested by using the Chi square test

Data processing
The eddy current correction is performed initially by using the tool in the FMRIB Software Library (FSL) [29]. Brain regions are then extracted by using the Brain Extraction Tool in FSL. Artifact-corrupted images are excluded automatically prior to the tensor estimation [30]. FA and MD maps are calculated by using the FMRIB diffusion toolbox in FSL. The image registration is performed by using an optimized pipeline [31]. Firstly, the target FA in the native space is selected from subjects in this study. Secondly, images of all the subjects are registered to the target FA by using the combination of the linear and the nonlinear registration. Finally, all individual FA images are normalized to the neonatal FA template [32]. Other DTI-derived metrics (such as MD) are also normalized to the neonatal template space by using the transformation parameter of FA.
To extract DTI-derived metric values along white matter tracts, the tract probabilistic map (cmrm.med.jhmi.edu) is used to determine regions of the left and right CST and the SCC, vulnerable tracts associated with punctuate white matter lesions [33]. FA values are measured at 100 equivalent levels on each tract defined on the atlas [34]. Firstly, images of all subjects are normalized to the neonatal template. Secondly, measurement planes are equally spaced on the tract probabilistic map corresponding to the neonatal template. Measurements are then averaged on each plane. Finally, metrics are measured at 100 equivalent levels. These 100 planes are described as "locations" along the white matter tract in the results section.

Training procedure of the dual GANs-based method
In this study, the sixfold cross-validation is used for the model training and validation. During the training, we firstly train discriminators and then generators. This work employs the mini-batch Stochastic Gradient Descent and the RMSProp solver. The training process is performed by looping over each training sample until the convergence. In this work, we train the model by 300 epochs to get the loss function converged ( Figure  S6, Additional file 7). The training takes about 8 h by using a single Nvidia Geforce GTX 1080Ti GPU. To choose the suitable kernel style and the orientation, this study performs the dual GANs by using the 3D-kernel (5 × 5 × 5) and the 2D-kernel (5 × 5) in the axial, coronal and sagittal orientations. To evaluate the feasibility of the dual-GAN method in case of multiple sites, the ComBat method [17] is used to generate the simulated data in a third domain different from site 1 or site 2. In this work, FA maps of the site 3 are the simulated data by transforming the data of site 1 to the third domain. As shown in (Additional file 5: Figure S4A and C), there are significant differences between site 1 and site 3, as well as site 2 and site 3. Data of site 2 is selected as the reference. The training is performed between the other site and the reference.

Conventional methods
This study compares the proposed harmonization method with three conventional methods: the global-wise scaling, the voxel-wise scaling, and the ComBat. This work performs these methods based on the description in the previous study [17]. For the global-wise scaling, metric values are averaged in the whole white matter region. Then these averaged values are used to calculate the scaling factor. As a comparison, the voxelwise scaling calculates the scaling factor at the voxel level. For the ComBat, the harmonization is performed at the voxel level by using the code from https ://githu b.com/Jfort in1/ComBa tHarm oniza tion.

Statistical analysis
Mann-Whitney U test is used to test inter-site differences in the gestational age, the postmenstrual age, and the birth weight. Differences in the gender ratio are tested by using the Chi square test. Tests are considered statistically significant at P < 0.05. To reveal distribution differences in FA of the two different sites, this study calculates the histogram of FA values in the white matter for each subject. The comparison in averaged FA values between sites is performed by using the Mann-Whitney U test. To investigate the correlation between FA values and site effects, Pearson correlation is performed between the inter-site difference and the averaged FA. Inter-site differences are also tested by using the general linear model in TBSS [31]. The permutation number is set at 10,000. Tests in TBSS are considered significant at P < 0.05 after the family-wise error rate correction and the threshold-free cluster enhancement. For inter-site comparisons after harmonization, differences with P < 0.05 in any run of the cross-validation are overlaid together on the template map. To quantify differences between sites, absolute errors and the RMSE of metric values are also calculated on the white matter before and after harmonization by using different methods.
To investigate age-related alterations before and after harmonization, Pearson correlation is performed between FA along white matter tracts (left CST, right CST and SCC) and the postmenstrual age. To evaluate whether inter-gender differences could be preserved during the harmonization procedure, the effect size (Cohen's d between males and females) is calculated before and after harmonization [7]: