Table 1 Comparison of kidney image segmentation algorithm performance

From: Application of visual transformer in renal image analysis

Algorithms

Datasets

Evaluation indicators/results

Main views and contributions

Limitations

TransUNet [29]

Synapse 2015/ACDC

Synapse (DSC: 77.48%; Kidney (R): 81.87%; Kidney (L): 77.02%; HD: 31.69 mm)/ACDC (DSC: 89.71%)

TransUNet is the first successful attempt to introduce a Transformer into medical image segmentation; it combines a CNN and a Transformer in the encoder

Transformer leads to a dramatic increase in the number of model parameters

IB-TransUNet [68]

Synapse 2015

DSC: Kidney (R): 79.87%; Kidney (L): 83.89%

Using the UNet model to combine the information bottleneck (IB) with the Transformer

More advantages in learning small organ features

Swin-Unet [32]

Synapse 2015

DSC: 79.13%; HD: 21.55 mm

The information bottleneck block was innovatively introduced in the encoder; a hierarchical Swin Transformer with shifted windows is used as the encoder to extract contextual features. A symmetric Swin Transformer-based decoder with a patch expanding layer is designed to perform the upsampling operation

Higher dependence on large and diverse datasets; large number of parameters and high complexity

AgDenseU-Net 2.5D [60]

KiTS 2021

DSC: Kidney: 95%; Tumor: 87.8%; Cyst: 74.6%

Combining the features of AggRes (which enhances feature representation by aggregating residual connections and attention mechanisms) and DenseU-Net (which efficiently performs multi-scale feature fusion)

Higher computation and memory consumption, longer training time

LeViT-UNet [69]

Synapse/ACDC

Synapse (DSC: 78.53%, Kidney (R): 80.25%, Kidney (L): 84.61%, HD: 16.84 mm)/ACDC (DSC: 90.32%)

Using LeViT as the encoder of LeViT-UNet, combining LeViT Transformer with U-Net

Some metrics do not reach SOTA, and segmentation performance is sacrificed to some extent to reduce computational complexity

ViTBIS [70]

Synapse 2015

DSC: 80.45%

Adding the Concat operator for merging features

The dataset is more homogeneous, with fewer baselines for comparison

TransClaw U-Net [33]

Synapse 2015

Synapse (DSC: 78.09%, HD: 26.38 mm)

Claw U-Net with Transformer; combined/decoder dual-path design

Relatively homogeneous datasets

AFTer-UNet [71]

Thorax-85/BCV/SegTHOR thorax

Thorax-85 (DSC: 92.32%)/BCV (DSC: 81.02%)/SegTHOR thorax (DSC: 92.10%)

Both intra- and inter-slice long-distance cues were considered to guide segmentation

Relies on axial information, which is naturally available mainly in 3D volumes

TransBTSV2 [19]

KiTS 2019/BraTS 2019/BraTS 2020/LiTS 2017

KiTS 2019 (DSC: Kidney: 97.37%, Tumor: 83.69%, Composite: 90.53%)

Not limited to brain tumor segmentation (BTS) but focuses on general medical image segmentation, providing a powerful and efficient 3D baseline for the volumetric segmentation of medical images

Mainly for 3D medical image segmentation tasks

UNETR [31]

BTCV/MSD

BTCV (AVG: 89.1%)/MSD (DSC: 71.1%, HD95: 8.822 mm)

The Transformer encoder uses embedded 3D volumes to capture long-range dependencies efficiently; the skip-connected decoder combines the extracted representations at different resolutions and predicts the segmentation output

Mainly for 3D medical image segmentation

DBT-UNETR [72]

BTCV

AVG: 80.3%

An improved SwinUNETR is proposed based on UNETR with Swin Transformer as an alternative to Transformer

No significant improvement in performance compared to UNETR

nnFormer [37]

Synapse 2015/ACDC

Synapse (DSC: 87.40%)/ACDC (DSC: 91.78%)

Utilizing a combination of interleaved convolution and self-attention operations

Little performance gain on the ACDC dataset

HiFormer [73]

Synapse 2015

DSC: 80.69%

Two multi-scale representations were designed based on the Swin transformer module and CNN encoder, and the Double-Level Fusion (DLF) module was designed to finely fuse the global and local features of the two representations

Single dataset

MPSHT [74]

Synapse 2015/ACDC

Synapse (DSC: 79.76%, Kidney: 80.77%, HD: 21.55 mm)/ACDC (DSC: 91.80%)

Based on a hybrid CNN-Transformer model, to which a progressive sampling module is added

Segmentation accuracy needs to be improved

DSGA-Net [75]

Synapse 2015/BraTS 2020/ACDC

Synapse (DSC: 81.24%)/BraTS 2020 (DSC: 85.82%)/ACDC (DSC: 91.34%)

Adds a Depth Separable Gated Visual Transformer (DSG-ViT) module to the encoder and proposes a Mixed Three-Branch Attention (MTA) module

Considerable computational burden; consumes large amounts of GPU memory

MedNeXt [76]

BTCV/AMOS22/KiTS19/BraTS21/AVG

BTCV (DSC: 88.76%)/AMOS22 (DSC: 91.77%)/KiTS19 (DSC: 91.02%)/BraTS21 (DSC: 91.49%)/AVG (DSC: 88.01%)

The use of ConvNeXt 3D and the extension of ConvNeXt blocks to upsampling and downsampling layers represents a modern deep architecture for medical image segmentation

Deep networks dedicated to medical image segmentation

MESTrans [77]

COVID-DS36/GlaS/Synapse/I2CVB

COVID-DS36 (DSC: 81.23%)/GlaS (DSC: 89.95%, IoU: 82.39%)/Synapse (DSC: 77.48%, HD: 31.69 mm)/I2CVB (DSC: 92.3%, IoU: 85.8%)

Proposes a Multi-scale Embedding (MEB) and a Multi-layer Spatial Attention Transformer structure (SATrans) to adjust the receptive field, and a Feature Fusion Module (FFM) for global learning between shallow and deep features

The performance of small organ segmentation needs to be improved

ST-Unet [78]

Synapse 2015/ISIC 2018

Synapse 2015 (DSC: 78.86%, HD: 20.37 mm)/ISIC 2018 (F1: 90.94%, mIoU: 85.26%)

Proposes a new Cross-Layer Feature Enhancement (CLFE) module for cross-layer feature learning, with spatial and channel squeeze-and-excitation modules to highlight the saliency of specific regions

The accuracy of segmentation needs to be improved

COTRNet [79]

KiTS 2021

DSC: Kidney: 92.28%; Tumor: 55.28%; Cyst: 50.52%

Utilizing a pre-trained ResNet to build the encoder, and adding deep supervision

The accuracy of segmentation for masses and tumors needs to be improved

CS-Unet [80]

Synapse 2015

DSC: 82.21%; Kidney (R): 79.52%; Kidney (L): 85.28%

Design of convolutional Swin-Transformer (CST) module that merges convolution with multi-head self-attention and feed-forward networks

Still faces challenges in handling long-range dependencies
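
The evaluation indicators reported in the table above (DSC, IoU, and HD) can be illustrated with a minimal sketch. It assumes 2D binary masks stored as nested lists, and it computes the Hausdorff distance over all foreground pixels rather than over extracted organ surfaces, which is a simplification. The function names and the `spacing` parameter are hypothetical and are not taken from any of the cited papers, whose exact evaluation protocols (for example HD versus HD95) may differ.

```python
from math import dist


def _flatten(mask):
    """Flatten a 2D nested-list mask into a single list of values."""
    return [v for row in mask for v in row]


def dice_coefficient(pred, truth):
    """Dice similarity coefficient: 2 * |P ∩ T| / (|P| + |T|)."""
    p, t = _flatten(pred), _flatten(truth)
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def iou(pred, truth):
    """Intersection over union: |P ∩ T| / |P ∪ T|."""
    p, t = _flatten(pred), _flatten(truth)
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    union = sum(1 for a, b in zip(p, t) if a or b)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union


def hausdorff_distance(pred, truth, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between foreground pixels.

    `spacing` is the (row, column) pixel size, so the result is in the
    same physical unit (e.g. mm). Brute force, O(n * m) in pixel counts.
    """
    def points(mask):
        return [(r * spacing[0], c * spacing[1])
                for r, row in enumerate(mask)
                for c, v in enumerate(row) if v]

    a, b = points(pred), points(truth)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Largest distance from any source point to its nearest target point.
        return max(min(dist(s, d) for d in dst) for s in src)

    return max(directed(a, b), directed(b, a))


# Toy example: a 3x4 prediction and ground truth that share 3 of 4 pixels.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
truth = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0]]

print(dice_coefficient(pred, truth))    # 0.75
print(iou(pred, truth))                 # 0.6
print(hausdorff_distance(pred, truth))  # 1.0
```

DSC and IoU reward overlap, so higher is better, while HD measures the worst-case boundary disagreement, so lower is better. This is why a method in the table can rank well on DSC yet poorly on HD.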