In this section, we present the general framework for adversarial training of our hippocampal subfield segmentation models. Figure 2 describes the proposed model in detail. The model consists of two major parts. The first part is a generative network based on a modified U-net, which is trained to perform 2D segmentation of slices extracted from brain magnetic resonance images. The second part, an adversarial network built on a convolutional neural network, is employed to discriminate between the expert annotations and the segmentation maps produced by the generative network.
Generative adversarial network
The generative adversarial network, which consists of two components, a generator G and a discriminator D, is a deep learning framework that trains the generative model and the discriminative model alternately. The general idea is an adversarial process in which the two models are pitted against each other to improve their performance: the generator counterfeits sample images to deceive the discriminator, while the discriminative model determines whether the images are fake or not. Sufficient competition and confrontation between the models improves both of them, so that the generated images eventually become indistinguishable from the real sample images.
The optimization objective of the generative adversarial network training process is formulated as follows:
$$ \min_{G}\max_{D} V\left(G,D\right) = \mathbb{E}_{x\sim P_{data}\left(x\right)}\left[\log D\left(x\right)\right] + \mathbb{E}_{z\sim P_{z}\left(z\right)}\left[\log\left(1 - D\left(G\left(z\right)\right)\right)\right] $$
(1)
where x is a real data sample and z is a noise vector sampled from the prior distribution P_z(z). GANs have achieved state-of-the-art performance in several generation tasks, such as image synthesis and video generation. The conditional GAN has been proposed to solve ill-posed problems such as text-to-image translation, image-to-image translation, and image super-resolution; it receives an additional input as a condition to guide the generation of the images.
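To make the alternating optimization of Eq. (1) concrete, the following is a minimal PyTorch sketch of one training step, assuming generic `generator` and `discriminator` modules, their optimizers, and a batch of real samples; the module interfaces and the noise dimension are illustrative assumptions, not the exact training code of this work.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_x, z_dim=100):
    """One alternating update of Eq. (1): ascend on D, then descend on G.

    `generator`, `discriminator`, and `z_dim` are illustrative assumptions.
    The discriminator is assumed to output a probability of shape (B, 1).
    """
    batch_size = real_x.size(0)
    real_label = torch.ones(batch_size, 1)
    fake_label = torch.zeros(batch_size, 1)

    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))]
    z = torch.randn(batch_size, z_dim)
    fake_x = generator(z).detach()  # block gradients into the generator
    d_loss = F.binary_cross_entropy(discriminator(real_x), real_label) + \
             F.binary_cross_entropy(discriminator(fake_x), fake_label)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: minimize E[log(1 - D(G(z)))]
    # (implemented in the common non-saturating form, maximizing log D(G(z)))
    z = torch.randn(batch_size, z_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), real_label)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```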
As shown in Eq. (2), a combined loss function consisting of two terms is used to optimize the models. The first term is a multi-class cross-entropy term that encourages the generative model to produce highly accurate segmentations. We use g(x) to denote the class probability map over M classes, of size H × W × M, produced by the generative model given an input image x of size H × W. The second term, which is based on the adversarial model, becomes large if the adversarial model can distinguish the generated segmentations from the expert manual segmentations. Because the adversarial model has a field-of-view covering the entire image, this term penalizes mismatches in the higher-order label statistics. We use a(x, y) ∈ [0, 1] to denote the probability predicted by the adversarial model that y is the expert manual segmentation label map of x rather than a label map produced by the generative model g(·). For a dataset of N training images xn with corresponding label maps yn, the loss function is defined as follows
$$ \ell\left(\theta_{g},\theta_{a}\right) = \sum\limits_{n = 1}^{N} \ell_{mce}\left(g\left(x_{n}\right), y_{n}\right) - \left[\ell_{bce}\left(a\left(x_{n}, y_{n}\right), 1\right) + \ell_{bce}\left(a\left(x_{n}, g\left(x_{n}\right)\right), 0\right)\right] $$
(2)
where θg and θa are the parameters of the generative model and of the adversarial model, respectively. In Eq. (2), \( \ell_{mce}\left(\hat{y}, y\right) = -\sum\nolimits_{i = 1}^{H \times W} \sum\nolimits_{m = 1}^{M} y_{im} \ln \hat{y}_{im} \) denotes the multi-class cross-entropy loss of the predictions \( \hat{y} \), and \( \ell_{bce}\left(\hat{a}, a\right) = -\left[a\ln \hat{a} + \left(1 - a\right)\ln\left(1 - \hat{a}\right)\right] \) denotes the binary cross-entropy loss. When training the generative model, we minimize the loss with respect to θg, whereas we maximize it with respect to θa when training the adversarial model.
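As an illustration of Eq. (2), the following is a minimal PyTorch-style sketch of the combined loss for a single training image, assuming the generative model outputs a per-pixel class probability map g(x) of size H × W × M and the adversarial model outputs scalar probabilities; the function and argument names are hypothetical, not the paper's exact implementation.

```python
import torch

def combined_loss(g_probs, y_onehot, a_real, a_fake, eps=1e-8):
    """Loss of Eq. (2) for one training image.

    g_probs : H x W x M class probability map g(x) from the generative model
    y_onehot: H x W x M one-hot expert label map y
    a_real  : scalar a(x, y), adversarial prediction on the expert labels
    a_fake  : scalar a(x, g(x)), adversarial prediction on the generated labels
    Names and shapes are illustrative assumptions.
    """
    # Multi-class cross-entropy term l_mce(g(x), y)
    l_mce = -(y_onehot * torch.log(g_probs + eps)).sum()

    # Binary cross-entropy terms l_bce(a(x, y), 1) and l_bce(a(x, g(x)), 0)
    l_bce_real = -torch.log(a_real + eps)        # target label 1
    l_bce_fake = -torch.log(1.0 - a_fake + eps)  # target label 0

    # Minimized w.r.t. theta_g (generator), maximized w.r.t. theta_a (adversary)
    return l_mce - (l_bce_real + l_bce_fake)
```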
The generative model
The generative model, named UG-net, is based on the U-net and adopts the encoder-decoder architecture shown in Fig. 2. The encoder consists of several conventional "convolution + pooling" layers and extracts high-level features, whereas the decoder reconstructs the segmentation ground truth label maps through upsampling layers. Highly condensed features, which are very effective for image segmentation, are extracted by the convolution layers, but some important local information is lost in this process. To retain the local information, the feature maps extracted by the convolution layers are concatenated with the corresponding output feature maps of the upsampling layers. Several modifications to the original U-net architecture and training strategies are made to suit our task and dataset. In this paper, instead of using two consecutive convolution layers before each pooling layer as in U-net, we remove one of the convolution layers to reduce the number of model parameters. This modification aims to prevent overfitting, since U-net with more parameters is prone to overfitting in our task. In addition, zero-padding convolution layers are adopted to maintain the spatial dimensions of the images and feature maps through each convolution layer. Data augmentation is also used, and dropout is applied on several upsampling layers, to combat overfitting. In the last convolution layer, we use only one 3 × 3 filter to reduce the number of output feature channels to one, matching the single channel of the expert segmentations.
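A minimal PyTorch sketch of the modifications described above (a single zero-padded convolution per encoder stage, skip connections by concatenation, dropout on the decoder path, and a final single 3 × 3 convolution producing one output channel) is given below; the depth and channel widths are illustrative assumptions rather than the exact UG-net configuration.

```python
import torch
import torch.nn as nn

class UGNet(nn.Module):
    """Simplified U-net-style generator: one zero-padded conv per stage,
    concatenation skip connections, dropout on the upsampling path.
    Depth and channel widths are illustrative assumptions."""

    def __init__(self, in_ch=1, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, padding=1), nn.ReLU(inplace=True))
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4, base * 2, 3, padding=1),
                                  nn.ReLU(inplace=True), nn.Dropout2d(0.5))
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1),
                                  nn.ReLU(inplace=True), nn.Dropout2d(0.5))
        self.out = nn.Conv2d(base, 1, 3, padding=1)  # single 3x3 filter -> one output channel

    def forward(self, x):
        e1 = self.enc1(x)                   # skip features, full resolution
        e2 = self.enc2(self.pool(e1))       # skip features, 1/2 resolution
        b = self.bottom(self.pool(e2))      # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))  # single-channel probability map
```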
The adversarial model
The adversarial model architecture is illustrated in Fig. 3. The model takes the hippocampal image and the corresponding label map as input; the label map is either the expert segmentation or the one produced by the generative model. At the beginning, the image and the label map are processed by two separate branches, allowing different levels of representation for the two different signals. Following the observation of Pinheiro et al. [23], the same number of channels is set for each input signal to prevent one of them from dominating the other when fed to the subsequent layers. The two inputs are represented with 64 channels after the two signal branches are concatenated. The signals are then passed through a series of convolution and max-pooling layers, producing a binary class probability that determines whether the label map is the expert segmentation or not.
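A minimal PyTorch sketch of this two-branch design is given below, assuming single-channel image and label-map inputs and illustrative layer widths; each branch is mapped to the same number of channels so that the concatenation has 64 channels (32 per branch in this sketch), and the final pooling and classifier are assumptions rather than the exact architecture of Fig. 3.

```python
import torch
import torch.nn as nn

class AdversarialNet(nn.Module):
    """Two-branch discriminator: the image and the label map are first
    processed separately, concatenated into 64 channels, then passed through
    convolution + max-pooling layers to a binary probability.
    Layer widths, depth, and the final pooling are illustrative assumptions."""

    def __init__(self, img_ch=1, label_ch=1, branch_ch=32):
        super().__init__()
        # Separate branches with the same number of output channels per signal
        self.image_branch = nn.Sequential(nn.Conv2d(img_ch, branch_ch, 3, padding=1),
                                          nn.ReLU(inplace=True))
        self.label_branch = nn.Sequential(nn.Conv2d(label_ch, branch_ch, 3, padding=1),
                                          nn.ReLU(inplace=True))
        # Joint convolution + max-pooling stack on the 64-channel concatenation
        self.trunk = nn.Sequential(
            nn.Conv2d(branch_ch * 2, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, image, label_map):
        feats = torch.cat([self.image_branch(image), self.label_branch(label_map)], dim=1)
        # Probability that the label map is the expert segmentation
        return self.classifier(self.trunk(feats))
```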