Table 1 Comparison of kidney image segmentation algorithm performance

From: Application of visual transformer in renal image analysis



Evaluation indicators/results

Main views and contributions


TransUNet [29]

Synapse 2015/ACDC

Synapse (DSC: 77.48%; Kidney (R): 81.87%; Kidney (L): 77.02%; HD: 31.69 mm)/ACDC (DSC: 89.71%)

TransUNet is the first successful attempt to introduce a Transformer into medical image segmentation; it combines a CNN and a Transformer in the encoder

The Transformer dramatically increases the number of model parameters

IB-TransUNet [68]

Synapse 2015

DSC: Kidney (R): 79.87%; Kidney (L): 83.89%

Using the UNet model to combine the information bottleneck (IB) with the Transformer

More advantages in learning small organ features

Swin-Unet [32]

Synapse 2015

DSC: 79.13%; HD: 21.55 mm

The information bottleneck block was innovatively introduced in the encoder; a hierarchical Swin Transformer with shifted windows is used as the encoder to extract contextual features. A symmetric Swin Transformer-based decoder with a patch expanding layer is designed to perform the upsampling operation

Depends heavily on large and diverse datasets; large number of parameters and high complexity

AgDenseU-Net 2.5D [60]

KiTS 2021


Kidney: 95%; Tumor: 87.8%; Cyst: 74.6%

Combining the features of AggRes (which enhances feature representation by aggregating residual connectivity and attention mechanisms) and DenseU-Net (which efficiently performs multi-scale feature fusion)

Higher computation and memory consumption, longer training time

LeViT-UNet [69]


Synapse (DSC: 78.53%, Kidney (R): 80.25%, Kidney (L): 84.61%, HD: 16.84 mm)/ACDC (DSC: 90.32%)

Using LeViT as the encoder of LeViT-UNet, combining LeViT Transformer with U-Net

Some metrics do not reach SOTA, and segmentation performance is compromised to some extent in exchange for lower computational complexity

ViTBIS [70]

Synapse 2015

DSC: 80.45%

Adding the Concat operator for merging features

The dataset is relatively homogeneous, and few baselines are used for comparison


U-Net [33]

Synapse 2015

Synapse (DSC: 78.09%, HD: 26.38 mm)

Claw U-Net with Transformer

Combined/decoder dual-path design

Relatively homogeneous datasets

After-Unet [71]

Thorax-85/BCV/SegTHOR thorax

Thorax-85 (DSC: 92.32%)/BCV (DSC: 81.02%)/SegTHOR thorax (DSC: 92.10%)

Both intra- and inter-slice long-distance cues were considered to guide segmentation

Axial information is naturally available mainly in 3D volumes

TransBTSV2 [19]

KiTS 2019/



LiTS 2017

KiTS 2019 (DSC: Kidney: 97.37%, Tumor: 83.69%, Composite: 90.53%)

Not limited to brain tumor segmentation (BTS) but focuses on general medical image segmentation, providing a powerful and efficient 3D baseline for the volumetric segmentation of medical images

Mainly for 3D medical image segmentation tasks

UNETR [31]


BTCV (AVG: 89.1%)/MSD (DSC: 71.1%, HD95: 8.822 mm)

The Transformer encoder uses embedded 3D volumes to capture long-range dependencies efficiently; the skip-connected decoder combines the extracted representations at different resolutions and predicts the segmentation output

Mainly for 3D medical image segmentation




Swin UNETR is proposed as an improvement on UNETR, with a Swin Transformer replacing the original Transformer encoder

No significant improvement in performance compared to UNETR

nnFormer [37]

Synapse 2015/


Synapse (DSC: 87.40%)/ACDC (DSC: 91.78%)

Uses a combination of interleaved convolution and self-attention operations

Little performance gain on the ACDC dataset

HiFormer [73]

Synapse 2015


Two multi-scale representations were designed based on the Swin transformer module and CNN encoder, and the Double-Level Fusion (DLF) module was designed to finely fuse the global and local features of the two representations

Single dataset

MPSHT [74]

Synapse 2015/


Synapse (DSC: 79.76%, Kidney: 80.77%, HD: 21.55 mm)/ACDC (DSC: 91.80%)

Based on a hybrid CNN-Transformer model, to which a progressive sampling module is added

Segmentation accuracy needs to be improved

DSGA-Net [75]

Synapse 2015/

BraTS 2020/


Synapse (DSC: 81.24%)/BraTS 2020 (DSC: 85.82%)/ACDC (DSC: 91.34%)

Adds a Depth Separable Gated Visual Transformer (DSG-ViT) module to the encoder and proposes a Mixed Three-branch Attention (MTA) module

Considerable computational burden; consumes large amounts of GPU memory

MedNeXt [76]


BTCV (DSC: 88.76%)/AMOS22 (DSC: 91.77%)/KiTS19 (DSC: 91.02%)/BraTS21 (DSC: 91.49%)/AVG (DSC: 88.01%)

The use of ConvNeXt 3D and the extension of ConvNeXt blocks to upsampling and downsampling layers represents a modern deep architecture for medical image segmentation

Deep network dedicated to medical image segmentation

MESTrans [77]


COVID-DS36 (DSC: 81.23%)/GlaS (DSC: 89.95%, IoU: 82.39)/Synapse (DSC: 77.48%, HD: 31.69 mm)/I2CVB (DSC: 92.3%, IoU: 85.8)

Proposes a Multi-scale Embedding Block (MEB) and a Multi-layer Spatial Attention Transformer structure (SATrans) to adjust the receptive field, and a Feature Fusion Module (FFM) for global learning between shallow and deep features

The performance of small organ segmentation needs to be improved

ST-Unet [78]

Synapse 2015/ISIC 2018

Synapse 2015 (DSC: 78.86%, HD: 20.37 mm)/ISIC 2018 (F1: 90.94%, mIoU: 85.26)

Proposes a new Cross-Layer Feature Enhancement (CLFE) module for cross-layer feature learning, with spatial and channel squeeze-and-excitation modules to highlight the saliency of specific regions

The accuracy of segmentation needs to be improved

COTRNet [79]

KiTS 2021





Uses a pre-trained ResNet as the encoder and adds deep supervision

The accuracy of segmentation for masses and tumors needs to be improved

CS-Unet [80]

Synapse 2015




Design of convolutional Swin-Transformer (CST) module that merges convolution with multi-head self-attention and feed-forward networks

Handling long-range dependencies remains a challenge
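
The indicators reported in the table (DSC, IoU, HD) can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers. It assumes 2D binary masks stored as nested lists, isotropic pixel spacing, and a Hausdorff distance computed over all foreground pixels rather than over extracted surfaces (published results often use surface-based HD or its 95th-percentile variant, HD95). The function names `foreground`, `dice`, `iou`, and `hausdorff` are hypothetical helpers introduced only for illustration.

```python
from math import hypot


def foreground(mask):
    """Return the set of (row, col) coordinates whose value is nonzero."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred, truth):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    a, b = foreground(pred), foreground(truth)
    if not a and not b:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2 * len(a & b) / (len(a) + len(b))


def iou(pred, truth):
    """Intersection over union: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    a, b = foreground(pred), foreground(truth)
    if not a and not b:
        return 1.0  # same assumption as in dice()
    return len(a & b) / len(a | b)


def hausdorff(pred, truth, spacing=1.0):
    """Symmetric Hausdorff distance between foreground pixel sets.

    `spacing` is an assumed isotropic pixel size (e.g. in mm).
    """
    a, b = foreground(pred), foreground(truth)
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(hypot(p[0] - q[0], p[1] - q[1]) for q in dst) for p in src)

    return spacing * max(directed(a, b), directed(b, a))


if __name__ == "__main__":
    pred = [[0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
    truth = [[0, 1, 1, 0],
             [0, 1, 0, 0],
             [0, 1, 0, 0]]
    print(f"DSC: {100 * dice(pred, truth):.2f}%")
    print(f"IoU: {100 * iou(pred, truth):.2f}%")
    print(f"HD:  {hausdorff(pred, truth):.2f} px")
```

The table reports DSC and IoU as percentages, so the values returned here would be multiplied by 100. The brute-force Hausdorff loop is quadratic in the number of foreground pixels and is meant for small examples only.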