From: Application of visual transformer in renal image analysis
Algorithms | Datasets | Evaluation indicators/results | Main views and contributions | Limitations |
---|---|---|---|---|
TransUNet [29] | Synapse 2015/ACDC | Synapse (DSC: 77.48%; Kidney (R): 81.87%; Kidney (L): 77.02%; HD: 31.69 mm)/ACDC (DSC: 89.71%) | TransUNet is the first successful attempt to introduce a Transformer into medical image segmentation, combining a CNN and a Transformer in the encoder | The Transformer leads to a dramatic increase in the number of model parameters |
IB-TransUNet [68] | Synapse 2015 | DSC: Kidney (R): 79.87%; Kidney (L): 83.89% | Using the UNet model to combine the information bottleneck (IB) with the Transformer | More advantages in learning small organ features |
Swin-Unet [32] | Synapse 2015 | DSC: 79.13%; HD: 21.55 mm | The information bottleneck block was innovatively introduced in the encoding; a hierarchical Swin Transformer with shifted windows is used as the encoder to extract contextual features. A symmetric Swin Transformer decoder with a patch expanding layer is designed to perform the upsampling operation | Higher dependency on large and diverse datasets; large number of parameters and high complexity |
AgDenseU-Net 2.5D [60] | KiTS 2021 | DSC: Kidney: 95% Tumor: 87.8% Cyst: 74.6% | Combining the features of AggRes (which enhances feature representation by aggregating residual connectivity and attention mechanisms) and DenseU-Net (which efficiently performs multi-scale feature fusion) | Higher computation and memory consumption, longer training time |
LeViT-UNet [69] | Synapse/ACDC | Synapse (DSC: 78.53%; Kidney (R): 80.25%; Kidney (L): 84.61%; HD: 16.84 mm)/ACDC (DSC: 90.32%) | Using LeViT as the encoder of LeViT-UNet, combining the LeViT Transformer with U-Net | Some metrics do not reach SOTA, and segmentation performance is compromised to some extent in order to reduce computational complexity |
ViTBIS [70] | Synapse 2015 | DSC: 80.45% | Adding the Concat operator for merging features | The dataset is more homogeneous, with fewer baselines for comparison |
TransClaw U-Net [33] | Synapse 2015 | Synapse (DSC: 78.09%; HD: 26.38 mm) | Combines Claw U-Net with a Transformer; dual-path decoder design | Relatively homogeneous datasets |
After-Unet [71] | Thorax-85/BCV/SegTHOR thorax | Thorax-85 (DSC: 92.32%)/BCV (DSC: 81.02%)/SegTHOR thorax (DSC: 92.10%) | Both intra- and inter-slice long-range cues are considered to guide segmentation | Axial information is naturally available mainly for 3D volumes |
TransBTSV2 [19] | KiTS 2019/BraTS 2019/BraTS 2020/LiTS 2017 | KiTS 2019 (DSC: Kidney: 97.37%; Tumor: 83.69%; Composite: 90.53%) | Not limited to brain tumor segmentation (BTS) but focuses on general medical image segmentation, providing a powerful and efficient 3D baseline for the volumetric segmentation of medical images | Mainly for 3D medical image segmentation tasks |
UNETR [31] | BTCV/MSD | BTCV (AVG: 89.1%)/MSD (DSC: 71.1%; HD95: 8.822 mm) | The Transformer encoder operates on embedded 3D volumes to capture long-range dependencies efficiently; the skip-connected decoder combines the extracted representations at different resolutions and predicts the segmentation output | Mainly for 3D medical image segmentation |
DBT-UNETR [72] | BTCV | AVG: 80.3% | An improved SwinUNETR is proposed based on UNETR with Swin Transformer as an alternative to Transformer | No significant improvement in performance compared to UNETR |
nnFormer [37] | Synapse 2015/ACDC | Synapse (DSC: 87.40%)/ACDC (DSC: 91.78%) | Utilizing a combination of interleaved convolution and self-attention operations | Little performance gain on the ACDC dataset |
HiFormer [73] | Synapse 2015 | DSC: 80.69% | Two multi-scale representations were designed based on the Swin Transformer module and a CNN encoder, and the Double-Level Fusion (DLF) module was designed to finely fuse the global and local features of the two representations | Single dataset |
MPSHT [74] | Synapse 2015/ACDC | Synapse (DSC: 79.76%; Kidney: 80.77%; HD: 21.55 mm)/ACDC (DSC: 91.80%) | Based on a hybrid CNN-Transformer model, to which a progressive sampling module is added | Segmentation accuracy needs to be improved |
DSGA-Net [75] | Synapse 2015/BraTS 2020/ACDC | Synapse (DSC: 81.24%)/BraTS 2020 (DSC: 85.82%)/ACDC (DSC: 91.34%) | Adds a Depth Separable Gated Visual Transformer (DSG-ViT) module to the encoder and proposes a Mixed Three-Branch Attention (MTA) module | Considerable computational burden; consumes large amounts of GPU memory |
MedNeXt [76] | BTCV/AMOS22/KiTS19/BraTS21 | BTCV (DSC: 88.76%)/AMOS22 (DSC: 91.77%)/KiTS19 (DSC: 91.02%)/BraTS21 (DSC: 91.49%)/AVG (DSC: 88.01%) | Uses 3D ConvNeXt blocks and extends them to the upsampling and downsampling layers, yielding a modern deep architecture for medical image segmentation | Deep networks dedicated to medical image segmentation |
MESTrans [77] | COVID-DS36/GlaS/Synapse/I2CVB | COVID-DS36 (DSC: 81.23%)/GlaS (DSC: 89.95%; IoU: 82.39)/Synapse (DSC: 77.48%; HD: 31.69 mm)/I2CVB (DSC: 92.3%; IoU: 85.8) | Proposes a Multi-scale Embedding (MEB) and a Multi-layer Spatial Attention Transformer structure (SATrans) to adjust the receptive field, and a Feature Fusion Module (FFM) for global learning between shallow and deep features | The performance of small organ segmentation needs to be improved |
ST-Unet [78] | Synapse 2015/ISIC 2018 | Synapse 2015 (DSC: 78.86%; HD: 20.37 mm)/ISIC 2018 (F1: 90.94%; mIoU: 85.26) | Proposing a new Cross-Layer Feature Enhancement (CLFE) module for cross-layer feature learning with spatial and channel squeezing and excitation modules to highlight the saliency of specific regions | The accuracy of segmentation needs to be improved |
COTRNet [79] | KiTS 2021 | DSC: Kidney: 92.28%; Tumor: 55.28%; Cyst: 50.52% | Utilizing a pre-trained ResNet to develop the encoder, in addition to adding deep supervision | The accuracy of segmentation for masses and tumors needs to be improved |
CS-Unet [80] | Synapse 2015 | DSC: 82.21%; Kidney (R): 79.52%; Kidney (L): 85.28% | Design of a convolutional Swin-Transformer (CST) module that merges convolution with multi-head self-attention and feed-forward networks | Facing the challenge of dealing with long-range dependencies |
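
The table reports results mainly as the Dice similarity coefficient (DSC) and the Hausdorff distance (HD, or its 95th-percentile variant HD95). As a rough illustration of what these numbers measure, the sketch below computes both for a pair of binary masks using NumPy. It is not the evaluation code of any cited work: the function names `dice_score` and `hausdorff_distance`, the `spacing` and `percentile` parameters, and the choice to measure distances over all foreground voxels (rather than extracted surfaces) are assumptions made here for illustration, and published HD95 implementations differ in how they aggregate the two directed distances.

```python
import numpy as np


def dice_score(pred, target):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def hausdorff_distance(pred, target, spacing=1.0, percentile=100.0):
    """Symmetric Hausdorff distance between the foreground voxel sets.

    `spacing` is the voxel size in mm (scalar or one value per axis).
    `percentile=95.0` gives one common HD95 variant (an assumption here).
    Uses a dense pairwise distance matrix, so it only suits small masks.
    """
    p = np.argwhere(np.asarray(pred, dtype=bool)) * spacing
    t = np.argwhere(np.asarray(target, dtype=bool)) * spacing
    if len(p) == 0 or len(t) == 0:
        return float("nan")  # undefined when either mask is empty
    dists = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    pred_to_target = dists.min(axis=1)  # nearest target voxel for each pred voxel
    target_to_pred = dists.min(axis=0)  # nearest pred voxel for each target voxel
    return float(max(np.percentile(pred_to_target, percentile),
                     np.percentile(target_to_pred, percentile)))


# Toy example: two 2x2 squares offset by one column.
a = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True
b = np.zeros((4, 4), dtype=bool)
b[0:2, 1:3] = True
print(dice_score(a, b))          # 0.5 (2 shared pixels out of 4 + 4)
print(hausdorff_distance(a, b))  # 1.0 (farthest pixel is one step away)
```

A higher DSC indicates better overlap, whereas a lower HD indicates better boundary agreement, which is why the two columns in the table can rank methods differently.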