
Table 1 Summary of studies that have used independent validation or compared the model against human experts

From: Artificial intelligence in glaucoma: opportunities, challenges, and future directions

Each entry below lists: reference; task; output label; data modality; no. of samples for model development; no. of samples for independent validation; no. of human experts; model used; key results; strengths/limitations. Cells left empty in the original table are shown as "not reported".

Nawaz et al. [166]
Task: Glaucoma detection
Output label: Healthy, glaucoma
Data modality: Fundus
Model development: ORIGA (650 images)
Independent validation: HRF (30 images), RIM-ONE (485 images)
Human experts: not reported
Model: Bi-directional Feature Pyramid Network (BiFPN) EfficientDet-D0 (EfficientNet-B0 backbone)
Key results: Trained on ORIGA, test accuracy: HRF 98.21%, RIM-ONE 97.96%; trained on RIM-ONE, test accuracy: ORIGA 97.83%, HRF 98.19%.
Strengths/limitations: Pros: high performance. Cons: validation dataset is small.

Gong et al. [249]
Task: Glaucoma diagnosis
Output label: Normal, glaucoma
Data modality: Fundus
Model development: not reported
Independent validation: 1000 images
Human experts: 4 doctors
Model: Hierarchical deep learning system (HDLS) + SVM
Key results: Doctors' performance improved with AI assistance. Without AI (round 1): sensitivity 65%, specificity 78%, accuracy 71.5%; with AI (round 2): sensitivity 91%, specificity 88%, accuracy 89.5%.
Strengths/limitations: Pros: AI and human comparison. Cons: small number of tested samples (200 each round) and of experts; a self-learning effect exists.

Yugha et al. [239]
Task: Glaucoma detection
Output label: Healthy, glaucoma
Data modality: Fundus
Model development: ORIGA (650 images)
Independent validation: RIM-1-DL (485 images); HRF (45 images)
Human experts: not reported
Model: Bi-directional Feature Pyramid Network (BiFPN) modules of EfficientDet-D0 (EfficientNet-B0 backbone)
Key results: Accuracy: HRF 97.89%; RIM-1-DL 97.64%.
Strengths/limitations: Pros: high performance. Cons: validation dataset is small.

Ko et al. [287]
Task: Glaucoma detection
Output label: Non-glaucoma, glaucoma
Data modality: Fundus
Model development: TVGH (944 images)
Independent validation: CHGH (158 images); DRISHTI-GS1 (101 images); RIM-ONE r2 (455 images)
Human experts: not reported
Model: EfficientNet-B3
Key results: CHGH: AUC 0.910 (0.798–1.000), accuracy 80%, sensitivity 65%, specificity 95%; RIM-ONE r2: AUC 0.624 (0.501–0.748), accuracy 52.5%, sensitivity 15%, specificity 90%; DRISHTI-GS1: AUC 0.770 (0.558–0.982), accuracy 55%, sensitivity 10%, specificity 100%.
Strengths/limitations: Pros: validated on multiple datasets. Cons: performance did not generalize to RIM-ONE r2 and DRISHTI-GS1.

Xue et al. [228]
Task: Glaucoma detection; severity classification
Output label: Normal, mild, moderate, severe
Data modality: IOP, fundus, VF
Model development: 6131 samples
Independent validation: 240 samples
Human experts: 8 juniors, 3 seniors, 3 experts
Model: Multi-feature deep learning (MFDL) (DetectionNet, ClassificationNet; ResNet backbone)
Key results: MFDL achieved a higher accuracy of 0.842 (95% CI, 0.795–0.888) than direct four-class deep learning (DFC-DL, 0.513 [0.449–0.576]), CFP-based single-feature deep learning (CFP-DL, 0.483 [0.420–0.547]), and VF-based single-feature deep learning (VF-DL, 0.725 [0.668–0.782]). It outperformed 8 juniors, 3 seniors, and 1 expert, and was comparable to 2 glaucoma experts.
Strengths/limitations: Pros: compared with human experts. Cons: validation dataset is small and single-center.

Wu et al. [47]
Task: Glaucoma screening; subtyping; early diagnosis
Output label: Screening (glaucoma, healthy); subtyping (POAG, PACG); early POAG
Data modality: Tear metabolic fingerprinting (TMF)
Model development: 266 samples
Independent validation: 54 samples
Human experts: not reported
Model: Ridge regression (RR)
Key results: Identified metabolic biomarkers (Lac, Thr, Mer, Sul, Bar, or DPAE) for glaucoma characterization. Glaucoma screening AUC: 0.856 (95% CI: 0.757–0.954).
Strengths/limitations: Pros: biomarkers were identified; simple model with good performance. Cons: a mass spectrometer is required for data acquisition; a larger dataset is needed for further validation.

Singh et al. [173]
Task: Glaucoma diagnosis
Output label: Normal, glaucoma
Data modality: Fundus
Model development: ACRIMA (705 images), ORIGA (650 images), HRF (30 images)
Independent validation: DRISHTI-GS (101 images); PRIVATE (33 images)
Human experts: not reported
Model: InceptionResNet-V2
Key results: AUC 0.9042; accuracy 90%; sensitivity 86.748%; specificity 94.11%; F1-score 91.13%.
Strengths/limitations: Pros: compared multiple models. Cons: small training and validation datasets.

Noury et al. [175]
Task: Glaucoma diagnosis; severity classification
Output label: Normal, glaucoma; mild, moderate, severe
Data modality: SD-OCT ONH scans
Model development: 2461 OCT scan volumes
Independent validation: Hong Kong (HK): 1625 scans; India: 672 scans; Nepal: 380 scans
Human experts: 1
Model: DiagFind (3D CNN)
Key results: AUC: HK 0.80 (95% CI, 0.78–0.82), India 0.94 (95% CI, 0.93–0.96), Nepal 0.87 (95% CI, 0.85–0.90); sensitivity: HK 0.73 (0.67–0.79), India 0.93 (0.88–0.99), Nepal 0.79 (0.68–0.90); specificity: HK 0.73 (0.61–0.85), India 0.71 (0.51–0.91), Nepal 0.79 (0.66–0.92); F1-score: HK 0.76 (0.75–0.77), India 0.91 (0.90–0.92), Nepal 0.80 (0.78–0.83). Stanford testing set (100 cases): AUC 0.92 (95% CI, 0.90–0.93) vs human grader 0.91.
Strengths/limitations: Pros: validated on real-world datasets from multiple sites. Cons: excluded cases that lacked consensus labels or were difficult for skilled clinicians to diagnose.

Fan et al. [176]
Task: Glaucoma diagnosis
Output label: Healthy, glaucoma
Data modality: Fundus
Model development: OHTS (66,715 images)
Independent validation: ACRIMA (705 images); LAG (4854 images); DIGS (9473 images)
Human experts: not reported
Model: ResNet-50
Key results: AUC: DIGS 0.74 (0.69–0.79), ACRIMA 0.74 (0.70–0.77), LAG 0.79 (0.78–0.81); sensitivity at 85% specificity: DIGS 0.52, ACRIMA 0.46, LAG 0.59; sensitivity at 95% specificity: DIGS 0.30, ACRIMA 0.29, LAG 0.42.
Strengths/limitations: Pros: validated on multiple external datasets. Cons: the model did not generalize well to the external datasets.

Fan et al. [222]
Task: Glaucoma diagnosis
Output label: Healthy, glaucoma
Data modality: Fundus
Model development: OHTS (66,715 images)
Independent validation: DIGS (10,473 images), ACRIMA (705 images), LAG (4854 images), RIM-ONE (455 images), ORIGA (650 images)
Human experts: not reported
Model: Data-efficient image Transformer (DeiT)
Key results: AUC: OHTS 0.91 (0.87–0.93), DIGS 0.77 (0.71–0.82), ACRIMA 0.74 (0.70–0.77), LAG 0.88 (0.87–0.89), RIM-ONE 0.91 (0.88–0.94), ORIGA 0.73 (0.68–0.77); sensitivity at 85% specificity: OHTS 0.79, DIGS 0.57, ACRIMA 0.46, LAG 0.77, RIM-ONE 0.83, ORIGA 0.40; sensitivity at 95% specificity: OHTS 0.56, DIGS 0.34, ACRIMA 0.31, LAG 0.59, RIM-ONE 0.73, ORIGA 0.21.
Strengths/limitations: Pros: validated on multiple external datasets; Vision Transformers have the potential to improve generalizability. Cons: cropped images may lose information.
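
Several entries above report sensitivity at a fixed specificity (85% or 95%) alongside AUC. As a minimal illustration of how such numbers are read off a model's ROC curve (the labels and scores below are synthetic, not from any of the cited studies):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def sensitivity_at_specificity(y_true, y_score, specificity=0.85):
    # Specificity = 1 - FPR, so take the best TPR among thresholds
    # whose FPR stays at or below 1 - target specificity.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = fpr <= (1.0 - specificity)
    return tpr[ok].max() if ok.any() else 0.0

# Synthetic labels and scores standing in for an external test set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

print("AUC:", round(roc_auc_score(y_true, y_score), 3))
for spec in (0.85, 0.95):
    print(f"Sensitivity at {spec:.0%} specificity:",
          round(sensitivity_at_specificity(y_true, y_score, spec), 3))
```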

Huang et al. [231]
Task: Glaucoma diagnosis
Output label: Normal, glaucoma
Data modality: Fundus, VF
Model development: 1655 samples
Independent validation: 196 samples
Human experts: not reported
Model: Probabilistic deep learning model (EfficientNet-B4 backbone)
Key results: AUC 0.98 (0.98–0.99); accuracy 93% (92–95%); sensitivity 91% (87–95%); specificity 95% (94–96%).
Strengths/limitations: Pros: quantifies the uncertainty of the model. Cons: dataset came from a single COMPASS instrument; more external validation is needed.
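
The study's exact probabilistic formulation is not detailed here; as one common, minimal way to attach an uncertainty estimate to a deep classifier's prediction, Monte Carlo dropout runs several stochastic forward passes and reports their spread. This is an illustrative sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Toy binary classifier over 128-dim features; the dropout layer is the
# source of the stochastic forward passes.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def mc_dropout_predict(x, n_samples=30):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)  # prediction, uncertainty

mean_p, std_p = mc_dropout_predict(torch.randn(4, 128))
print(mean_p.squeeze(), std_p.squeeze())
```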

Huang et al. [146]
Task: Glaucoma VF grading
Output label: Clear, mild, moderate, severe, diffuse
Data modality: VF (HFA and Octopus)
Model development: 3805 VFs (Octopus); 13,231 VFs (HFA)
Independent validation: 150 VFs (HFA)
Human experts: 2 ophthalmic clinicians, 6 medical students
Model: Fine-grained grading deep learning system (FGGDL: FGG-O, FGG-H); interactive interface
Key results: AI outperformed the human experts, and their performance improved with AI assistance. AUC: FGGDL 0.893 (0.862–0.923); clinician 1: 0.838 (0.801–0.874); clinician 2: 0.833 (0.796–0.869); all medical students performed below 0.80.
Strengths/limitations: Pros: externally validated and compared with human experts. Cons: high test–retest variability was not considered.

Li et al. [182]
Task: OD/OC segmentation, glaucoma screening
Output label: Glaucoma, non-glaucoma
Data modality: Fundus
Model development: In-house (2440 images)
Independent validation: DRISHTI-GS (101 images); RIM-ONE v3 (159 images)
Human experts: 4 ophthalmologists
Model: R-DCNN (DAC-ResNet34)
Key results: Segmentation on the in-house testing set was comparable to human experts: OD: DC 98.51%, JC 97.07%; OC: DC 97.63%, JC 95.39%. Segmentation on DRISHTI-GS and RIM-ONE v3 exceeded existing studies: DRISHTI-GS: OD: DC 97.23%, JC 94.17%; OC: DC 94.56%, JC 89.92%; RIM-ONE v3: OD: DC 96.89%, JC 91.32%; OC: DC 88.94%, JC 78.21%. Glaucoma screening AUC: DRISHTI-GS 0.968; RIM-ONE v3 0.941.
Strengths/limitations: Pros: compared with human experts and existing studies. Cons: small training and validation datasets.
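
The DC/JC values above are the Dice and Jaccard overlap scores between a predicted mask and the reference annotation; a minimal sketch with toy masks (not real optic disc/cup segmentations):

```python
import numpy as np

def dice_jaccard(pred: np.ndarray, gt: np.ndarray):
    """Overlap between a predicted and a reference binary mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dc = 2.0 * inter / (pred.sum() + gt.sum())  # Dice coefficient (DC)
    jc = inter / union                          # Jaccard coefficient (JC)
    return dc, jc

# Toy 4x4 masks for illustration only.
pred = np.array([[0,1,1,0],[0,1,1,0],[0,0,1,0],[0,0,0,0]])
gt   = np.array([[0,1,1,0],[0,1,1,1],[0,0,0,0],[0,0,0,0]])
print(dice_jaccard(pred, gt))  # (0.8, 0.666...)
```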

Li et al. [164]
Task: Glaucoma diagnosis; glaucoma incidence/progression prediction
Output label: Diagnosis (glaucoma, non-glaucoma); incidence prediction (with/without glaucoma development); progression prediction (with/without progression)
Data modality: Fundus
Model development: diagnosis: 24,054 eyes; incidence prediction: 11,548 eyes; progression prediction: 3425 eyes
Independent validation: diagnosis: external test 1: 6162 images (eyes), external test 2: 824 images (eyes); incidence: external test 1: 955 images, external test 2: 719 images; progression: external test 1: 337 images, external test 2: 513 images
Human experts: not reported
Model: DiagnoseNet, PredictNet
Key results: Diagnosis: AUC: test 1: 0.94 (0.93–0.94), test 2: 0.91 (0.89–0.93); sensitivity: test 1: 0.89 (0.87–0.90), test 2: 0.92 (0.88–0.96); specificity: test 1: 0.83 (0.81–0.84), test 2: 0.71 (0.67–0.74). Incidence prediction: AUC: test 1: 0.89 (0.83–0.95), test 2: 0.88 (0.79–0.97); sensitivity: test 1: 0.84 (0.81–0.86), test 2: 0.84 (0.81–0.86); specificity: test 1: 0.68 (0.43–0.87), test 2: 0.80 (0.44–0.97). Progression prediction: AUC: test 1: 0.87 (0.81–0.92), test 2: 0.88 (0.83–0.94); sensitivity: test 1: 0.82 (0.78–0.87), test 2: 0.81 (0.77–0.84); specificity: test 1: 0.59 (0.39–0.76), test 2: 0.74 (0.55–0.88).
Strengths/limitations: Pros: performed multiple tasks (diagnosis, incidence and progression prediction) with multiple external validations. Cons: excluded low-quality images; validation datasets were Chinese only.

Mehta et al. [149]
Task: Glaucoma detection
Output label: Healthy, glaucoma, PTG (progress to glaucoma)
Data modality: Demographic, systemic and ocular data, color fundus, OCT
Model development: UK Biobank (2574 eyes: 1193 glaucoma, 1283 healthy, 98 PTG)
Independent validation: 200 eyes
Human experts: 5 glaucoma-fellowship-trained ophthalmologists
Model: InceptionResnetV4
Key results: The best model, with OCT, color fundus, and systemic and ocular data as input, achieved AUC 0.967 (95% CI 0.93–1.0); human experts (diagnosing from color fundus only): AUC 0.79–0.84.
Strengths/limitations: Pros: used multiple modalities and several methods to interpret the DL model (SHAP, saliency maps). Cons: poor fundus image quality may have contributed to the low AUC.

Hemelings et al. [151]
Task: Glaucoma detection, VCDR regression
Output label: Glaucoma, non-glaucoma
Data modality: Fundus
Model development: UZL (13,551 images)
Independent validation: REFUGE (1200 images)
Human experts: not reported
Model: ResNet50
Key results: AUC: original fundus: 0.87 (95% CI 0.83–0.91); 60% ONH cropping: 0.80 (95% CI 0.76–0.84).
Strengths/limitations: Pros: reports an explainability analysis based on cropping the ONH area. Cons: fixed-size masks may produce small variations in visible features across fundus images because ONH size varies across the study population.

Thakoor et al. [279]
Task: Glaucoma detection
Output label: Glaucoma, non-glaucoma
Data modality: OCT images (RNFL probability maps)
Model development: 737 eyes
Independent validation: 135 eyes
Human experts: 2 expert OCT readers
Model: InceptionV3 + FC, with testing with concept activation vectors (TCAV)
Key results: TCAV scores were consistent with the features used by human experts, based on eye fixations. AUC: 0.911.
Strengths/limitations: Pros: applied testing with concept activation vectors (TCAV) for model interpretability. Cons: external validation did not include multi-center datasets.

Thakoor et al. [281]
Task: Glaucoma detection
Output label: Glaucoma, non-glaucoma
Data modality: OCT B-scans, RNFL probability maps
Model development: RNFL maps: 737 eyes; B-scans: 771 eyes
Independent validation: RNFL maps: 135 eyes; B-scans: 125 eyes
Human experts: not reported
Model: CNN A (ResNet18 + RF) with RNFL map as input
Key results: CNN generalizability can be improved with data augmentation, multiple input image modalities, and training on images with confident ratings; choosing a thorough and consistent RS for training and testing improves generalization to new datasets. Best result: RNFL-map input with data augmentation on CNN A: AUC 0.918 (95% CI 0.866–0.970), accuracy 85.9%.
Strengths/limitations: Pros: improved generalizability through several techniques (multiple modalities, consistent labels, data augmentation). Cons: independent validation could be improved with a larger dataset.

Natarajan et al. [143]
Task: Glaucoma detection, OD segmentation
Output label: Glaucoma, normal
Data modality: Fundus
Model development: RIGA (750 images), RIM-ONE v2 (455 images)
Independent validation: ACRIMA (705 images), Drishti-GS1 (101 images), RIM-ONE v1 (169 images)
Human experts: not reported
Model: UNet-Snet (SqueezeNet)
Key results: Glaucoma detection: AUC: ACRIMA 100%, Drishti-GS1 99.90%, RIM-ONE v1 100%; accuracy: ACRIMA 99.86%, Drishti-GS1 97.05%, RIM-ONE v1 100%; sensitivity: ACRIMA 100%, Drishti-GS1 100%, RIM-ONE v1 100%; specificity: ACRIMA 99.75%, Drishti-GS1 90.32%, RIM-ONE v1 100%.
Strengths/limitations: Pros: achieved high performance. Cons: small datasets for training and independent validation.

Kenichi et al. [129]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 3132 images (from an ordinary camera)
Independent validation: 162 images (from an ordinary camera and a smartphone)
Human experts: not reported
Model: ResNet34
Key results: AUC: camera 98.9%, smartphone 84.2%; advanced glaucoma: AUC: camera 99.3%, smartphone 90.0%.
Strengths/limitations: Pros: compared performance between fundus photographs from an ordinary camera and a smartphone. Cons: training images were all from an ordinary camera; smartphone fundus images were of poor quality.

Xu et al. [242]
Task: Glaucoma diagnosis
Output label: Referable glaucomatous optic neuropathy (GON), unlikely GON
Data modality: Fundus
Model development: 1791 images
Independent validation: dataset 1: 6301 images; dataset 2: 1964 images; dataset 3: 400 images
Human experts: 12 (4 senior ophthalmologists, 4 junior ophthalmologists, and 4 technicians)
Model: Hierarchical deep learning system (HDLS) (segmentation–classification, Inception-v3 backbone)
Key results: The reliable region had higher sensitivity and specificity than the suspicious region. Dataset 1: AUC 0.981 (95% CI, 0.978–0.985); reliable region: sensitivity 97.7% (95% CI, 97.0–98.3%), specificity 97.8% (95% CI, 97.2–98.4%). Dataset 2: AUC 0.983 (95% CI, 0.977–0.989); reliable region: sensitivity 98.4% (95% CI, 97.3–99.5%), specificity 98.2% (95% CI, 97.4–99.1%). Dataset 3: human experts' performance improved when referring to the HDLS: senior group: sensitivity 0.93 (independent diagnosis) vs 0.96 (referring to HDLS); specificity 0.88 (independent) vs 0.95 (referring to HDLS).
Strengths/limitations: Pros: the system is transparent and interpretable; results were validated on three large validation datasets and were comparable to human experts. Cons: validation datasets included no ethnicities other than Chinese.

Bhuiyan et al. [246]
Task: Glaucoma diagnosis
Output label: Glaucoma-suspect, not-suspect
Data modality: Fundus
Model development: 1546 disc-centered fundus images (AREDS, SiMES, RIM-ONE)
Independent validation: ORIGA (638 gradable images)
Human experts: not reported
Model: Ensemble of Xception, Inception-ResNet-V2, NASNet, Inception-V3
Key results: AUC 0.85; accuracy 83.54%; sensitivity 80.11%; specificity 84.96%.
Strengths/limitations: Pros: screening for glaucoma suspects is important. Cons: CDR is not the only biomarker for glaucoma suspects; other biomarkers, such as CDR asymmetry, could be added in future studies.
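
The entry above combines four CNN backbones; the usual mechanism for such an ensemble is soft voting, i.e. averaging the members' predicted probabilities before thresholding once. A minimal sketch with hypothetical scores (the authors' exact combination rule may differ):

```python
import numpy as np

# Per-model probabilities of "glaucoma suspect" for four images
# (hypothetical numbers, one row per backbone).
member_probs = np.array([
    [0.91, 0.12, 0.65, 0.40],  # Xception
    [0.88, 0.20, 0.55, 0.35],  # Inception-ResNet-V2
    [0.84, 0.15, 0.70, 0.45],  # NASNet
    [0.90, 0.10, 0.60, 0.30],  # Inception-V3
])

ensemble_prob = member_probs.mean(axis=0)    # average the soft scores
labels = (ensemble_prob >= 0.5).astype(int)  # then threshold once
print(ensemble_prob, labels)  # [0.8825 0.1425 0.625 0.375] [1 0 1 0]
```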

Tang et al. [205]
Task: Glaucoma diagnosis
Output label: Glaucoma, non-glaucoma
Data modality: Fundus
Model development: Sanyuan (11,443 images), Tongren (7806 images), Xiehe (4363 images)
Independent validation: REFUGE (1200 images)
Human experts: not reported
Model: AMNet (semi-supervised learning)
Key results: Accuracy 95.75%; sensitivity 87.5%; specificity 96.7%; F1-score 91.9%.
Strengths/limitations: Pros: the model boosted robustness with limited labeled data. Cons: a more diverse validation dataset might be needed to further validate the model.

Alghamdi et al. [241]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Fundus
Model development: RIM-ONE (455 images)
Independent validation: RIGA (750 images)
Human experts: 2 ophthalmologists
Model: TCNN (Transfer Convolutional Neural Network, VGG16); SSCNN (Semi-supervised CNN with self-learning); SSCNN-DAE (Semi-supervised CNN with autoencoder)
Key results: All three deep learning CNN models outperformed both ophthalmologists by clear margins. RIM-ONE accuracy: the two experts attained 59.2% and 55.4%; SSCNN-DAE: AUC 0.95, accuracy 93.8%, sensitivity 98.90%, specificity 90.50%.
Strengths/limitations: Pros: compared with human experts and outperformed them. Cons: small training datasets; external validation was not included.

Li et al. [278]
Task: Glaucoma diagnosis in myopia
Output label: Glaucoma, non-glaucoma
Data modality: RNFL profile
Model development: 2223 eyes
Independent validation: 508 eyes
Human experts: not reported
Model: FCN + RBFN (radial basis function network) + RNFL compensation
Key results: With the RNFL compensation algorithm, the AUC for detecting glaucoma increased from 0.70 to 0.84, from 0.75 to 0.89, from 0.77 to 0.89, and from 0.78 to 0.87 for eyes in the highest 10%, 20%, 30%, and any axial length (AL), respectively.
Strengths/limitations: Pros: an RBFN's good generalization, strong tolerance to input noise, and online learning ability made it possible to interpret the patterns into a reliable compensation. Cons: the RNFL-profile compensation was based on participants aged ≥ 50 years and was not validated in younger participants; the validation dataset used a 1:1 ratio of glaucomatous to non-glaucomatous eyes, which is not the case in real-world data.

Chiang et al. [277]
Task: Primary angle-closure glaucoma (PACG) detection
Output label: POAG, PACG
Data modality: Goniophotography
Model development: 32,635 images
Independent validation: 1000 images
Human experts: 9 graders
Model: CNN (ResNet-50 backbone)
Key results: The CNN achieved excellent performance against single-grader (AUC 0.969) and consensus (AUC 0.952) labels. Agreement between the CNN classifier and consensus labels (κ = 0.746) surpassed that of all non-reference human graders (κ = 0.578–0.702).
Strengths/limitations: Pros: the model performed comparably to human graders. Cons: the model may not generalize to other ethnic groups or to goniophotographs taken with other devices.

Zhao et al. [196]
Task: Cup-to-disc ratio (CDR) estimation, glaucoma screening
Output label: Glaucoma, normal
Data modality: Fundus
Model development: Direct-CSU (934 images)
Independent validation: ORIGA (650 images)
Human experts: not reported
Model: Unsupervised feature representation of fundus images with a CNN (MFPPNet, 3-block DenseNet) + RF
Key results: Glaucoma screening AUC: 0.88. CDR estimation: MAE 0.0606; correlation coefficient r = 0.68.
Strengths/limitations: Pros: estimated CDR more effectively than traditional segmentation-based methods; potential to handle unlabeled data. Cons: further validation on diverse data sources is needed.

Jammal et al. [112]
Task: Predicting RNFL from fundus; glaucoma detection
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 32,820 pairs of fundus photos and SD-OCT scans
Independent validation: not reported
Human experts: 2 graders
Model: M2M DL (ResNet34 backbone)
Key results: The DL algorithm outperformed human graders in detecting signs of glaucomatous damage on fundus photographs. Glaucoma detection AUC: DL 0.801 (95% CI: 0.757–0.845) vs human 0.775 (95% CI: 0.728–0.823); AUPRC: DL 0.810 (95% CI: 0.765–0.851) vs human 0.761 (95% CI: 0.703–0.819). RNFL prediction (DL): MAE 7.39 μm; Pearson's r = 0.832.
Strengths/limitations: Pros: the M2M model provides a quantitative output and outperformed human graders. Cons: using the presence of visual field defects as the gold standard may introduce bias and influence accuracy.

Wang et al. [142]
Task: Glaucoma detection
Output label: Glaucoma, normal
Data modality: OCT images
Model development: HK dataset: 975,400 B-scans (4877 volumes)
Independent validation: Stanford dataset: 246,200 B-scans (1231 volumes)
Human experts: 2 glaucoma experts
Model: 2D-ResNet18-SEMT (semi-supervised multi-task)
Key results: Stanford (volume-based): AUC 0.933, accuracy 86%, F1-score 0.889. Human vs model (HK, volume-based): AUC 0.977 (DL) vs 0.918 (human); accuracy 0.927 (DL) vs 0.912 (human); F1-score 0.941 (DL) vs 0.917 (human).
Strengths/limitations: Pros: semi-supervised learning addressed missing VF measurement labels in the training set; the multi-task network explores the relationship between function and structure, which benefits accuracy. Cons: the training framework is not end-to-end; the hard assignment for VF measurement and the multi-task training operate in a cascaded manner.

Russakoff et al. [119]
Task: Glaucoma diagnosis
Output label: Referable, non-referable glaucoma
Data modality: 3D OCT
Model development: 2805 scans
Independent validation: Hong Kong: 505 eyes; India: 336 eyes
Human experts: not reported
Model: gNet3D (3D CNN)
Key results: AUC: Hong Kong 0.78; India 0.95.
Strengths/limitations: Pros: among the first studies to use machine learning for risk stratification; multinational external datasets spanning the geographic and ethnic distributions of Hong Kong and India. Cons: glaucoma suspects were not included in the external datasets; glaucoma definitions differed between the development and external datasets.

Zaleska-Żmijewska et al. [121]
Task: Glaucoma diagnosis
Output label: Glaucoma, healthy
Data modality: Fundus, IOP
Model development: 1687 images
Independent validation: campaign 1 (C1): 752 images; campaign 2 (C2): 352 images
Human experts: not reported
Model: AlexNet
Key results: Image classifier: accuracy: C1 80%, C2 78%; sensitivity: C1 0.73, C2 0.84; specificity: C1 0.83, C2 0.67. Fundus + IOP: accuracy: C1 71%, C2 79%; sensitivity: C1 0.79, C2 0.92; specificity: C1 0.67, C2 0.42.
Strengths/limitations: Pros: included IOP as a risk factor in the model; IOP inclusion improved sensitivity. Cons: small datasets; performance needs improvement.

Zheng et al. [130]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: SD-OCT images (hand-crafted features (HCFs), peripapillary RNFL OCT images)
Model development: 1501 images
Independent validation: 104 images
Human experts: not reported
Model: Inception-V3
Key results: AUC 0.990 (0.974–1.000); accuracy 0.990 (0.974–1.000); sensitivity 0.981 (at 80% and 90% specificity).
Strengths/limitations: Pros: achieved higher sensitivity and specificity than traditional HCFs. Cons: external validation from different centers or OCT devices is needed; most glaucoma cases were quite severe, which made classification easier in this study; only images of Chinese eyes were used, so the results may not apply to other populations.

Li et al. [131]
Task: Glaucoma detection
Output label: Glaucoma, non-glaucoma
Data modality: VF (pattern deviation probability plots (PDPs), numerical pattern deviation plots (NDPs), and numeric displays (NDs))
Model development: 9022 VFs
Independent validation: phase 1: test 1: 200 VFs, test 2: 406 VFs, test 3: 507 VFs; phase 2: 649 VFs
Human experts: 6 ophthalmologists
Model: iGlaucoma (2D-Fusion-CNN with ND + NDP + PDP input)
Key results: In phase I, the DLS outperformed all six ophthalmologists on the three test sets (AUC 0.834–0.877, sensitivity 0.831–0.922, specificity 0.676–0.709). In phase II, iGlaucoma achieved 0.99 accuracy in recognizing different patterns in the pattern deviation probability plot region, with corresponding AUC, sensitivity, and specificity of 0.966 (0.953–0.979), 0.954 (0.930–0.977), and 0.873 (0.838–0.908).
Strengths/limitations: Pros: developed 'iGlaucoma', a smartphone-application-based deep learning system (DLS) that outperformed human experts. Cons: limited to the Chinese population; the DLS uses only VF, with no other modality included.

Kim et al. [132]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: OCT (RNFL thickness, RNFL deviation maps, GCIPL thickness, GCIPL deviation maps, ocular axial length)
Model development: 8988 images
Independent validation: 1420 images
Human experts: 2 glaucoma specialists
Model: VGG-19
Key results: Diagnostic ability was highest when the DL system used the RNFL thickness map alone; among combination sets, the RNFL + GCIPL deviation maps showed the highest diagnostic ability. The system showed detection patterns similar to those of glaucoma specialists. External validation (RNFL + GCIPL deviation maps): AUC 0.985 (95% CI 0.966–0.995); sensitivity 97.2% at 90% specificity and 98.2% at 80% specificity.
Strengths/limitations: Pros: includes interpretability and comparison with humans, with agreement patterns; multiple modalities and a wide range of severity levels included. Cons: external validation included only good-quality OCT images and only Asian ethnicity.

Christopher et al. [136]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Fundus
Model development: DIGS/ADAGES: 14,822 images; MRCH: 3132 images
Independent validation: Iinan dataset: 215 images; Hiroshima dataset: 171 images; ACRIMA: 705 images
Human experts: not reported
Model: UCSD (ResNet50); UTokyo (ResNet34)
Key results: UTokyo (sequential) obtained the highest performance on the external datasets: AUC: Iinan 0.97 (0.94–0.99), Hiroshima 0.99 (0.99–0.99), ACRIMA 0.86 (0.83–0.89).
Strengths/limitations: Pros: DIGS/ADAGES and MRCH comprised diverse populations; the study reported model performance stratified by disease severity, myopia status, and race. Cons: the datasets had inconsistent glaucoma definitions and labeling strategies.

Maadi et al. [194]
Task: OD/OC segmentation, glaucoma detection
Output label: Glaucoma, non-glaucoma
Data modality: Fundus
Model development: Drishti-GS1: 101 images; RIM-ONE v3: 159 images
Independent validation: REFUGE: 1200 images
Human experts: not reported
Model: Modified U-Net (SE-ResNet50)
Key results: OD/OC segmentation F1-score: OD 0.91, OC 0.79. Diagnosis AUC: 0.939.
Strengths/limitations: Pros: external validation was used and the model performed well. Cons: larger training and validation datasets might be needed for further improvement.

Phene et al. [109]
Task: Detecting referable GON
Output label: Non-referable GON, referable GON
Data modality: Fundus
Model development: 88,126 images
Independent validation: A: 1205 images; B: 9642 images; C: 346 images
Human experts: 10 graders (graded 411 images from validation dataset A)
Model: Inception-v3
Key results: The DL model performed better than humans on a 411-image subset of validation dataset A. AUC: A: 0.945 (0.929–0.960); B: 0.855 (0.841–0.870); C: 0.881 (0.838–0.918).
Strengths/limitations: Pros: the DL algorithm has higher sensitivity and comparable specificity to eye care providers in detecting referable GON in color fundus images; provides insight into which ONH features drive GON assessment by glaucoma specialists. Cons: glaucoma diagnosis is not based on ONH appearance alone but also relies on a compilation of risk factors.

Gómez-Valverde et al. [261]
Task: Glaucoma detection
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 2313 images (RIM-ONE, DRISHTI-GS, ESPERANZA)
Independent validation: not reported
Human experts: 4 glaucoma experts
Model: VGG19
Key results: Human experts (ESPERANZA): specificity 0.8914, sensitivity 0.7662. Model (trained on all three datasets): AUC 0.94; sensitivity 87.01%; specificity 89.01%.
Strengths/limitations: Pros: model performance was comparable to human experts. Cons: no independent validation dataset was provided.

Asaoka et al. [118]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 3132 images (camera: Kowa)
Independent validation: test 1: 205 images (camera: Kowa); test 2: 171 images (camera: Topcon)
Human experts: not reported
Model: ResNet
Key results: Data augmentation improved model performance. With augmentation: AUC: test 1: 0.948 (0.903–0.968), test 2: 0.997 (0.994–1); without augmentation: AUC: test 1: 0.877 (0.828–0.926), test 2: 0.945 (0.913–0.976).
Strengths/limitations: Pros: the model was validated on images from different fundus cameras and showed high diagnostic ability irrespective of camera type. Cons: the model has not been validated across ethnicities.
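
Data augmentation, credited above with the AUC gain, expands the training set with label-preserving perturbations so the model tolerates camera and framing differences. A minimal torchvision sketch; the specific transforms and magnitudes are assumptions, not those used in the study:

```python
from torchvision import transforms

# A typical fundus-photo augmentation pipeline (illustrative choices).
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror left/right eyes
    transforms.RandomRotation(degrees=10),                  # small camera rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # illumination shifts
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),    # slight framing changes
    transforms.ToTensor(),
])
# Applied on the fly to each PIL training image: x = train_augment(img)
```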

Kim et al. [113]
Task: Glaucoma diagnosis and localization
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 2123 images
Independent validation: RIM-ONE r3 (159 images)
Human experts: not reported
Model: ResNet152
Key results: Accuracy 93.5%; sensitivity 92.9%; specificity 92.9%.
Strengths/limitations: Pros: a web application was developed with Grad-CAM output. Cons: larger datasets from different instruments could be incorporated to improve performance.

Asaoka et al. [31]
Task: Early glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Macula SD-OCT (RNFL and GCC thicknesses)
Model development: pretraining: 4316 OCT images; training: 178 eyes (94 POAG, 84 normal)
Independent validation: 196 eyes (114 POAG, 82 normal)
Human experts: not reported
Model: DL transfer model (CNN using an 8×8 grid of RNFL data in the first channel and GCC data in the second channel; uses both pretraining and training)
Key results: With pretraining: AUC 93.7% (90.6–96.8); optimum discrimination at 82.5% sensitivity and 93.9% specificity; specificity 83.3% at 80% sensitivity and 86.6% at 90% sensitivity. Without pretraining: AUC 76.6–78.8%.
Strengths/limitations: Pros: the study demonstrated the importance of pretraining for improving model performance. Cons: results were obtained from a homogeneous population (Japanese patients only); further studies should include OCT data from multiple ethnicities to generalize the results.
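
The pretraining step above is a transfer learning recipe: learn weights on a large auxiliary OCT set, then fine-tune on the small labeled cohort instead of training from scratch. A minimal sketch; the architecture and the "pretrained_oct.pt" file name are assumptions for illustration, not the authors' network:

```python
import torch
import torch.nn as nn

# Tiny CNN over 2-channel 8x8 grids (RNFL, GCC) -- illustrative only.
net = nn.Sequential(
    nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),
)

# Step 1: pretrain on the large auxiliary OCT set and save the weights
# (training loop omitted; "pretrained_oct.pt" is a hypothetical file).
# Step 2: initialize from those weights and fine-tune on the small
# labeled set at a reduced learning rate.
net.load_state_dict(torch.load("pretrained_oct.pt"))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
```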

Al-Aswad et al. [289]
Task: Glaucoma diagnosis
Output label: Glaucoma, non-glaucoma
Data modality: Fundus
Model development: 110 images
Independent validation: not reported
Human experts: 6 ophthalmologists
Model: Pegasus
Key results: AUC: Pegasus 92.6%; ophthalmologists 69.6–84.9%; "best case" consensus scenario 89.1%. Sensitivity: Pegasus 83.7%; ophthalmologists 61.3–81.6%. Specificity: Pegasus 88.2%; ophthalmologists 80.0–94.1%. Agreement with the gold standard: Pegasus 0.715; highest ophthalmologist 0.613.
Strengths/limitations: Pros: Pegasus outperformed 5 of the 6 ophthalmologists in diagnostic performance, with no statistically significant difference between the deep learning system and the "best case" consensus of the ophthalmologists. Cons: small sample size; all fundus images came from the same camera.

Norouzifard et al. [102]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 447 images
Independent validation: HRF (30 images)
Human experts: not reported
Model: Inception-ResNet-V2
Key results: Accuracy: 80%.
Strengths/limitations: Pros: validated on an independent dataset. Cons: both training and validation datasets were small.

Shibata et al. [105]
Task: Glaucoma screening
Output label: Glaucoma, normal
Data modality: Fundus
Model development: 3132 images (1364 glaucoma, 1768 normal)
Independent validation: 110 images (60 glaucoma, 50 normal)
Human experts: 3 residents
Model: ResNet
Key results: AUC: model 96.5% (93.5–99.6%); residents 72.6–91.2%.
Strengths/limitations: Pros: performance was compared with human experts, and the model outperformed them. Cons: the study excluded photographs with features that could interfere with an expert diagnosis of glaucoma, which does not reflect a "real world" setting.

Li et al. [256]
Task: VF classification
Output label: Glaucoma, non-glaucoma
Data modality: VF (VF PD images)
Model development: 4012 VFs
Independent validation: 300 VFs
Human experts: 9 ophthalmologists
Model: CNN
Key results: Model: accuracy 0.876 (95% CI 0.838–0.914), sensitivity 0.932, specificity 0.826. Ophthalmologists: average accuracies of 0.607, 0.585, and 0.626 for resident ophthalmologists, attending ophthalmologists, and glaucoma experts, respectively.
Strengths/limitations: Pros: the CNN achieved higher accuracy than human ophthalmologists and traditional rules. Cons: only pattern deviation images were used as CNN input; preperimetric glaucoma may not be effectively detected by the machine.

Andersson et al. [75]
Task: Glaucoma diagnosis
Output label: Glaucoma, healthy
Data modality: VF (30–2 VF)
Model development: 165 subjects (99 glaucoma, 66 healthy)
Independent validation: not reported
Human experts: 30 physicians
Model: ANN
Key results: ANN: sensitivity 93%, specificity 91%. Physicians: sensitivity 61–96% (mean 83%); specificity 59–100% (mean 90%).
Strengths/limitations: Pros: the ANN performed at least as well as physicians in assessing visual fields for glaucoma diagnosis. Cons: external and larger datasets are needed for further validation.

Goldbaum et al. [12]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: VF (24–2 VF) + age
Model development: 345 eyes (189 normal, 156 glaucoma)
Independent validation: not reported
Human experts: 2 human experts
Model: Mixture of Gaussians (MoG)
Key results: MoG had a significantly greater ROC area than PSD and CPSD; human experts were no better at classifying visual fields than the machine classifiers or the global indices. MoG (PCA): AUC 0.922; sensitivity 0.67 (at specificity 1.0) and 0.79 (at specificity 0.9). Expert 1: sensitivity 0.75, specificity 0.96. Expert 2: sensitivity 0.88, specificity 0.59.
Strengths/limitations: Pros: compared multiple methods based on different parameters. Cons: larger and external validation is needed for further evaluation.
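
MoG classification fits one Gaussian mixture per class (here after PCA compression of the field) and labels a field by the larger class-conditional likelihood. A minimal sketch on synthetic data; the feature layout and hyperparameters are assumptions, not the study's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic stand-ins for 24-2 VF threshold vectors (52 points) + age.
rng = np.random.default_rng(1)
X = rng.normal(size=(345, 53))
y = rng.integers(0, 2, size=345)

Z = PCA(n_components=8).fit_transform(X)        # compress the VF vectors
gm = {c: GaussianMixture(n_components=2, random_state=0).fit(Z[y == c])
      for c in (0, 1)}                          # one mixture per class

# Classify by the larger class-conditional log-likelihood.
log_ratio = gm[1].score_samples(Z) - gm[0].score_samples(Z)
y_pred = (log_ratio > 0).astype(int)
print((y_pred == y).mean())  # training accuracy on the synthetic data
```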

Goldbaum et al. [63]
Task: Glaucoma diagnosis
Output label: Glaucoma, normal
Data modality: VF
Model development: 120 eyes (60 normal, 60 glaucoma)
Independent validation: not reported
Human experts: 2 glaucoma specialists
Model: ANN (two-layer)
Key results: The experts and the network agreed about 74% of the time. ANN: accuracy 67%, sensitivity 65%, specificity 71%. Glaucoma specialists: accuracy 67%, sensitivity 59%, specificity 74%.
Strengths/limitations: Pros: compared with human experts. Cons: small dataset; performance needs improvement and external validation was needed.