From: Artificial intelligence in glaucoma: opportunities, challenges, and future directions
References | Task | Output label | Data modality | No. of samples for model development | No. of samples for independent validation | No. of human experts | Model used | Key results | Strengths/limitations |
---|---|---|---|---|---|---|---|---|---|
Nawaz et al. [166] | Glaucoma detection | Healthy, glaucoma | Fundus | ORIGA (650 images) | HRF (30 images), RIM-ONE (485 images) | – | Bi-directional Feature Pyramid Network (BiFPN) EfficientDet-D0 (EfficientNet-B0 backbone) | Trained on ORIGA, test accuracy: HRF: 98.21%; RIM-ONE: 97.96%. Trained on RIM-ONE, test accuracy: ORIGA: 97.83%; HRF: 98.19%. | Pros: high performance. Cons: validation datasets are small.
Gong et al. [249] | Glaucoma diagnosis | Normal, glaucoma | Fundus | 1000 images | – | 4 doctors | Hierarchical structure (HDLS) AI system + SVM | Doctors' performance improved with AI assistance. Overall performance without AI assistance (round 1): sensitivity: 65%; specificity: 78%; accuracy: 71.5%. Overall performance with AI assistance (round 2): sensitivity: 91%; specificity: 88%; accuracy: 89.5%. | Pros: AI and human comparison. Cons: small number of tested samples (200 each round) and small number of experts; a self-learning effect exists.
Yugha et al. [239] | Glaucoma detection | Healthy, glaucoma | Fundus | ORIGA (650 images) | RIM-1-DL (485 images); HRF (45 images) | – | Bi-directional Feature Pyramid Network (BiFPN) module of EfficientDet-D0 (EfficientNet-B0 backbone) | Accuracy: HRF: 97.89%; RIM 1: 97.64%. | Pros: high performance. Cons: validation dataset is small.
Ko et al. [287] | Glaucoma detection | Non-glaucoma, glaucoma | Fundus | TVGH (944 images) | CHGH (158 images); DRISHTI-GS1 (101 images); RIM-ONE r2 (455 images) | – | EfficientNet-B3 | CHGH: AUC: 0.910 (0.798–1.000); accuracy: 80%; sensitivity: 65%; specificity: 95%. RIM-ONE r2: AUC: 0.624 (0.501–0.748); accuracy: 52.5%; sensitivity: 15%; specificity: 90%. DRISHTI-GS1: AUC: 0.770 (0.558–0.982); accuracy: 55%; sensitivity: 10%; specificity: 100%. | Pros: validated on multiple datasets. Cons: performance was not generalizable to RIM-ONE r2 and DRISHTI-GS1.
Xue et al. [228] | Glaucoma detection; severity classification | Normal, mild, moderate, severe | IOP, fundus, VF | 6131 samples | 240 samples | 8 juniors, 3 seniors, 3 experts | Multi-feature deep learning (MFDL) (DetectionNet, ClassificationNet; ResNet backbone) | MFDL achieved a higher accuracy of 0.842 (95% CI 0.795–0.888) than direct four-class deep learning (DFC-DL, accuracy 0.513 [0.449–0.576]), CFP-based single-feature deep learning (CFP-DL, accuracy 0.483 [0.420–0.547]) and VF-based single-feature deep learning (VF-DL, accuracy 0.725 [0.668–0.782]). Its performance exceeded that of 8 juniors, 3 seniors and 1 expert, and was comparable with 2 glaucoma experts. | Pros: compared with human experts. Cons: validation dataset is small and not multi-center.
Wu et al. [47] | Glaucoma screening; subtyping; early diagnosis | Screening (glaucoma, healthy); subtyping (POAG, PACG); early POAG | Tear metabolic fingerprinting (TMF) | 266 samples | 54 samples | – | Ridge regression (RR) | Identified metabolic biomarkers (Lac, Thr, Mer, Sul, Bar, or DPAE) for glaucoma characterization. Glaucoma screening AUC: 0.856 (95% CI: 0.757–0.954). | Pros: biomarkers were identified; simple model with good performance. Cons: a mass spectrometer is essential for the data; a larger dataset is needed for further validation.
Singh et al. [173] | Glaucoma diagnosis | Normal, glaucoma | Fundus | ACRIMA (705 images), ORIGA (650 images), HRF (30 images) | DRISHTI-GS (101 images); private dataset (33 images) | – | InceptionResNet-V2 | AUC: 0.9042; accuracy: 90%; sensitivity: 86.748%; specificity: 94.11%; F1-score: 91.13%. | Pros: compared multiple models. Cons: small training and validation datasets.
Noury et al. [175] | Glaucoma diagnosis; severity classification | Normal, glaucoma; mild, moderate, severe | SD-OCT ONH scans | 2461 OCT scan volumes | Hong Kong (HK): 1625 scans; India: 672 scans; Nepal: 380 scans | 1 | DiagFind: 3D CNN | AUC: HK: 0.80 (95% CI 0.78–0.82), India: 0.94 (95% CI 0.93–0.96), Nepal: 0.87 (95% CI 0.85–0.90); sensitivity: HK: 0.73 (0.67–0.79), India: 0.93 (0.88–0.99), Nepal: 0.79 (0.68–0.90); specificity: HK: 0.73 (0.61–0.85), India: 0.71 (0.51–0.91), Nepal: 0.79 (0.66–0.92); F1-score: HK: 0.76 (0.75–0.77), India: 0.91 (0.90–0.92), Nepal: 0.80 (0.78–0.83). Testing set from Stanford (100 cases): AUC: 0.92 (95% CI 0.90–0.93); human grader: 0.91. | Pros: validated on real-world datasets from multiple sites. Cons: excluded cases without consensus labels and cases that are difficult for skilled clinicians to diagnose.
Fan et al. [176] | Glaucoma diagnosis | Healthy, glaucoma | Fundus | OHTS: 66,715 images | ACRIMA (705 images); LAG (4854 images); DIGS (9473 images) | – | ResNet-50 | AUC: DIGS: 0.74 (0.69–0.79); ACRIMA: 0.74 (0.70–0.77); LAG: 0.79 (0.78–0.81). Sensitivity (at 85% specificity): DIGS: 0.52; ACRIMA: 0.46; LAG: 0.59. Sensitivity (at 95% specificity): DIGS: 0.30; ACRIMA: 0.29; LAG: 0.42. | Pros: validated on multiple external datasets. Cons: the model was not generalizable to external datasets.
Fan et al. [222] | Glaucoma diagnosis | Healthy, glaucoma | Fundus | OHTS: 66,715 images | DIGS (10,473 images), ACRIMA (705 images), LAG (4854 images), RIM-ONE (455 images), ORIGA (650 images) | – | Data-efficient image Transformer (DeiT) | AUC: OHTS: 0.91 (0.87–0.93); DIGS: 0.77 (0.71–0.82); ACRIMA: 0.74 (0.70–0.77); LAG: 0.88 (0.87–0.89); RIM-ONE: 0.91 (0.88–0.94); ORIGA: 0.73 (0.68–0.77). Sensitivity (at 85% specificity): OHTS: 0.79; DIGS: 0.57; ACRIMA: 0.46; LAG: 0.77; RIM-ONE: 0.83; ORIGA: 0.40. Sensitivity (at 95% specificity): OHTS: 0.56; DIGS: 0.34; ACRIMA: 0.31; LAG: 0.59; RIM-ONE: 0.73; ORIGA: 0.21 (a sketch of reading sensitivity at a fixed specificity off an ROC curve follows the table). | Pros: validated on multiple external datasets; Vision Transformers have the potential to improve generalizability. Cons: cropped images may lose information.
Huang et al. [231] | Glaucoma diagnosis | Normal, glaucoma | Fundus, VF | 1655 samples | 196 samples | – | Probabilistic deep learning model (EfficientNet-B4 backbone) | AUC: 0.98 (0.98–0.99); accuracy: 93% (92–95%); sensitivity: 91% (87–95%); specificity: 95% (94–96%). | Pros: quantifies the uncertainty of the model (see the uncertainty-estimation sketch after this table). Cons: dataset was from the COMPASS instrument; more external validation is needed.
Huang et al. [146] | Glaucoma VF grading | Clear, mild, moderate, severe, diffuse | VF (HFA and Octopus) | 3805 VFs (Octopus); 13,231 VFs (HFA) | 150 VFs (HFA) | 2 ophthalmic clinicians, 6 medical students | Fine-grained grading deep learning system (FGGDL: FGG-O, FGG-H); interactive interface | AI outperformed human experts, and their performance improved with AI assistance. AUC: FGGDL: 0.893 (0.862–0.923); clinician 1: 0.838 (0.801–0.874); clinician 2: 0.833 (0.796–0.869); all medical students performed below 0.80. | Pros: external validation was applied and compared with human experts. Cons: high test–retest variability was not considered.
Li et al. [182] | OD/OC segmentation, glaucoma screening | Glaucoma, non-glaucoma | Fundus | In-house (2440 images) | DRISHTI-GS (101 images); RIM-ONE v3 (159 images) | 4 ophthalmologists | R-DCNN (DAC-ResNet34) | Segmentation results on the in-house testing set were comparable to those of human experts (OD: DC 98.51%, JC 97.07%; OC: DC 97.63%, JC 95.39%). Segmentation results on DRISHTI-GS and RIM-ONE v3 were better than existing studies: DRISHTI-GS: OD: DC 97.23%, JC 94.17%; OC: DC 94.56%, JC 89.92%; RIM-ONE v3: OD: DC 96.89%, JC 91.32%; OC: DC 88.94%, JC 78.21%. Glaucoma screening: AUC: DRISHTI-GS: 0.968; RIM-ONE v3: 0.941. | Pros: compared with human experts and existing studies. Cons: small training and validation datasets.
Li et al. [164] | Glaucoma diagnosis, glaucoma incidence/progression prediction | Diagnosis (glaucoma, non-glaucoma); glaucoma incidence prediction (with/without glaucoma development); glaucoma progression prediction (with/without progression) | Fundus | Diagnosis: 24,054 eyes; incidence prediction: 11,548 eyes; progression prediction: 3425 eyes | Glaucoma diagnosis: external test 1: 6162 images (eyes), external test 2: 824 images (eyes); glaucoma incidence: external test 1: 955 images, external test 2: 719 images; glaucoma progression: external test 1: 337 images, external test 2: 513 images | – | DiagnoseNet, PredictNet | Glaucoma diagnosis: AUC: test 1: 0.94 (0.93–0.94), test 2: 0.91 (0.89–0.93); sensitivity: test 1: 0.89 (0.87–0.90), test 2: 0.92 (0.88–0.96); specificity: test 1: 0.83 (0.81–0.84), test 2: 0.71 (0.67–0.74). Glaucoma incidence prediction: AUC: test 1: 0.89 (0.83–0.95), test 2: 0.88 (0.79–0.97); sensitivity: test 1: 0.84 (0.81–0.86), test 2: 0.84 (0.81–0.86); specificity: test 1: 0.68 (0.43–0.87), test 2: 0.80 (0.44–0.97). Glaucoma progression prediction: AUC: test 1: 0.87 (0.81–0.92), test 2: 0.88 (0.83–0.94); sensitivity: test 1: 0.82 (0.78–0.87), test 2: 0.81 (0.77–0.84); specificity: test 1: 0.59 (0.39–0.76), test 2: 0.74 (0.55–0.88). | Pros: performed multiple tasks (diagnosis, incidence and progression prediction) with multiple external validations. Cons: excluded low-quality images; validation datasets were Chinese only.
Mehta et al. [149] | Glaucoma detection | Healthy, glaucoma, PTG (progress to glaucoma) | Demographic, systemic and ocular data, color fundus, OCT | UK Biobank (2574 eyes: glaucoma 1193, healthy 1283, PTG 98) | 200 eyes | 5 glaucoma-fellowship-trained ophthalmologists | InceptionResnetV4 | The best model, with OCT, color fundus, and systemic and ocular data as input, obtained AUC = 0.967 (95% CI 0.93–1.0); human experts (diagnosis on color fundus only): AUC: 0.79–0.84. | Pros: used multiple modalities and several methods to interpret the DL model (SHAP, saliency maps). Cons: poor fundus image quality may have contributed to the low AUC.
Hemelings et al. [151] | Glaucoma detection, VCDR regression | Glaucoma, non-glaucoma | Fundus | UZL (13,551 images) | REFUGE (1200 images) | – | ResNet50 | AUC: original fundus: 0.87 (95% CI 0.83–0.91); 60% ONH cropping: 0.80 (95% CI 0.76–0.84). | Pros: reports an explainability analysis by cropping the ONH area. Cons: applying masks of fixed size might lead to small variations in visible features across fundus images due to variation in ONH size across the study population.
Thakoor et al. [279] | Glaucoma detection | Glaucoma, non-glaucoma | OCT images (RNFL probability maps) | 737 eyes | 135 eyes | 2 expert OCT readers | InceptionV3 + FC (with testing with concept activation vectors (TCAV)) | The TCAV scores were consistent with the features used by human experts, based on eye fixations. AUC: 0.911. | Pros: applied testing with concept activation vectors (TCAV) for model interpretability. Cons: external validation did not include multi-center datasets.
Thakoor et al. [281] | Glaucoma detection | Glaucoma, non-glaucoma | OCT B-scans, RNFL probability maps | RNFL maps: 737 eyes; B-scans: 771 eyes | RNFL maps: 135 eyes; B-scans: 125 eyes | – | CNN A (ResNet18 + RF) with RNFL map as input | CNN generalizability can be improved with data augmentation, multiple input image modalities, and training on images with confident ratings; choosing a thorough and consistent reference standard (RS) for training and testing improves generalization to new datasets. The best result was obtained with RNFL-map input and data augmentation on CNN A: AUC = 0.918 (95% CI 0.866–0.970), accuracy = 85.9%. | Pros: improved generalizability through several techniques (multiple modalities, consistent labels, data augmentation). Cons: independent validation could be improved with a larger dataset.
Natarajan et al. [143] | Glaucoma detection, OD segmentation | Glaucoma, normal | Fundus | RIGA (750 images), RIM-ONEv2 (455 images) | ACRIMA (705 images), Drishti-GS1 (101 images), RIM-ONEv1 (169 images) | – | UNet-Snet (SqueezeNet) | Glaucoma detection: AUC: ACRIMA: 100%; Drishti-GS1: 99.90%; RIM-ONEv1: 100%. Accuracy: ACRIMA: 99.86%; Drishti-GS1: 97.05%; RIM-ONEv1: 100%. Sensitivity: ACRIMA: 100%; Drishti-GS1: 100%; RIM-ONEv1: 100%. Specificity: ACRIMA: 99.75%; Drishti-GS1: 90.32%; RIM-ONEv1: 100%. | Pros: achieved high performance. Cons: small datasets for training and independent validation.
Kenichi et al. [129] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 3132 images (from an ordinary fundus camera) | 162 images (from an ordinary camera and a smartphone) | – | ResNet34 | AUC: camera: 98.9%; smartphone: 84.2%. Advanced glaucoma: AUC: camera: 99.3%; smartphone: 90.0%. | Pros: compared performance between fundus images from an ordinary camera and a smartphone. Cons: training images were all from an ordinary camera; poor image quality of smartphone fundus photographs.
Xu et al. [242] | Glaucoma diagnosis | Referable glaucomatous optic neuropathy (GON), unlikely GON | Fundus | 1791 images | Dataset 1: 6301 images; dataset 2: 1964 images; dataset 3: 400 images | 12 (4 senior ophthalmologists, 4 junior ophthalmologists, and 4 technicians) | Hierarchical deep learning system (HDLS) (segmentation–classification, Inception-v3 backbone) | The reliable region had higher sensitivity and specificity than the suspicious region. Dataset 1: AUC = 0.981 (95% CI, 0.978–0.985); reliable region: sensitivity = 97.7% (95% CI, 97.0–98.3%), specificity = 97.8% (95% CI, 97.2–98.4%). Dataset 2: AUC = 0.983 (95% CI, 0.977–0.989); reliable region: sensitivity = 98.4% (95% CI, 97.3–99.5%), specificity = 98.2% (95% CI, 97.4–99.1%). Dataset 3: human experts' performance improved when referring to HDLS: senior group: sensitivity: 0.93 (independent diagnosis), 0.96 (referring to HDLS); specificity: 0.88 (independent diagnosis), 0.95 (referring to HDLS). | Pros: the system is transparent and interpretable; results were validated on three large validation datasets and were comparable to those of human experts. Cons: validation datasets did not include ethnicities other than Chinese.
Bhuiyan et al. [246] | Glaucoma diagnosis | Glaucoma-suspect, not-suspect | Fundus | 1546 disc-centered fundus images (AREDS, SiMES, RIM-ONE) | ORIGA (638 gradable images) | – | Ensemble of Xception, Inception-ResNet-V2, NASNet, Inception-V3 | AUC: 0.85; accuracy: 83.54%; sensitivity: 80.11%; specificity: 84.96%. | Pros: screening for glaucoma suspects is important. Cons: CDR is not the only biomarker for glaucoma suspects; other biomarkers, such as CDR asymmetry, could be added in future studies.
Tang et al. [205] | Glaucoma diagnosis | Glaucoma, non-glaucoma | Fundus | Sanyuan (11,443 images), Tongren (7806 images), Xiehe (4363 images) | REFUGE (1200 images) | – | AMNet (semi-supervised learning) | Accuracy: 95.75%; sensitivity: 87.5%; specificity: 96.7%; F1-score: 91.9%. | Pros: the model boosted robustness with limited labeled data (see the pseudo-labeling sketch after this table for one simple semi-supervised approach). Cons: more diverse validation datasets might be needed to further validate the model.
Alghamdi et al. [241] | Glaucoma diagnosis | Glaucoma, normal | Fundus | RIM-ONE (455 images), RIGA (750 images) | – | 2 ophthalmologists | TCNN (transfer convolutional neural network, VGG16); SSCNN (semi-supervised CNN with self-learning); SSCNN-DAE (semi-supervised CNN with autoencoder) | The three deep learning CNN models outperformed both ophthalmologists by clear margins. RIM-ONE: accuracy: the two experts attained 59.2% and 55.4%; SSCNN-DAE: AUC: 0.95; accuracy: 93.8%; sensitivity: 98.90%; specificity: 90.50%. | Pros: compared with human experts and outperformed them. Cons: small training datasets; no external validation.
Li et al. [278] | Glaucoma diagnosis in myopia | Glaucoma, non-glaucoma | RNFL profile | 2223 eyes | 508 eyes | – | FCN + RBFN (radial basis function network) + RNFL compensation | By applying the RNFL compensation algorithm, the AUC for detecting glaucoma increased from 0.70 to 0.84, from 0.75 to 0.89, from 0.77 to 0.89, and from 0.78 to 0.87 for eyes in the highest 10%, 20%, 30% and any axial length (AL), respectively. | Pros: the RBFN's good generalization, strong tolerance to input noise, and online learning ability made a reliable compensation possible. Cons: compensation of the RNFL profile was based on data from participants aged ≥ 50 years and was not validated in younger participants; the validation dataset included a 1:1 ratio of glaucomatous to non-glaucomatous eyes, which is not the case in real-world data.
Chiang et al. [277] | Primary angle closure glaucoma (PACG) detection | POAG, PACG | Goniophotography | 32,635 images | 1000 images | 9 graders | CNN (ResNet-50 backbone) | The CNN achieved excellent performance based on single-grader (AUC = 0.969) and consensus (AUC = 0.952) labels. The agreement between the CNN classifier and consensus labels (κ = 0.746) surpassed that of all non-reference human graders (κ = 0.578–0.702). | Pros: the model obtained performance comparable to human graders. Cons: the model may not generalize to other ethnic groups or to goniophotographs taken with other devices.
Zhao et al. [196] | Cup-to-disc ratio (CDR) estimation, glaucoma screening | Glaucoma, normal | Fundus | Direct-CSU (934 images) | ORIGA (650 images) | – | Unsupervised feature representation of fundus images with a CNN (MFPPNet, 3-block DenseNet) + RF | Glaucoma screening: AUC: 0.88. CDR estimation: MAE: 0.0606; correlation coefficient r = 0.68 (a sketch of the vertical CDR computation from disc/cup masks follows the table). | Pros: estimated the CDR value more effectively than traditional segmentation-based methods; potential to handle unlabeled data. Cons: further validation on diverse data sources is needed.
Jammal et al. [112] | Prediction of RNFL thickness from fundus; glaucoma detection | Glaucoma, normal | Fundus | 32,820 pairs of fundus photos and SD-OCT scans | – | 2 graders | M2M DL (ResNet34 backbone) | The DL algorithm outperformed human graders in detecting signs of glaucomatous damage on fundus photographs. Glaucoma detection: AUC: DL: 0.801 (95% CI: 0.757–0.845); human: 0.775 (95% CI: 0.728–0.823). AUPRC: DL: 0.810 (95% CI: 0.765–0.851); human: 0.761 (95% CI: 0.703–0.819). RNFL prediction (DL): MAE: 7.39 µm; Pearson's r = 0.832. | Pros: the M2M model provides a quantitative output and outperformed human graders. Cons: using the presence of visual field defects as the gold standard may be biased and influence accuracy.
Wang et al. [142] | Glaucoma detection | Glaucoma, normal | OCT images | HK dataset: 975,400 B-scans (4877 volumes) | Stanford dataset: 246,200 B-scans (1231 volumes) | 2 glaucoma experts | 2D-ResNet18-SEMT (semi-supervised multi-task) | Stanford (volume-based): AUC: 0.933; accuracy: 86%; F1-score: 0.889. Human vs. model (HK, volume-based): AUC: 0.977 (DL), 0.918 (human); accuracy: 0.927 (DL), 0.912 (human); F1-score: 0.941 (DL), 0.917 (human). | Pros: semi-supervised learning addressed missing VF measurement labels in the training set; the multi-task learning network explored the relationship between function and structure, which improved accuracy. Cons: the training framework is not end-to-end; the hard assignment for VF measurements and the multi-task training work in a cascaded manner.
Russakoff et al. [119] | Glaucoma diagnosis | Referable, non-referable glaucoma | 3D-OCT | 2805 scans | Hong Kong: 505 eyes; India: 336 eyes | – | gNet3D (3D CNN) | AUC: HK: 0.78; India: 0.95. | Pros: one of the first studies to use machine learning for risk stratification; multinational external datasets across geographic and ethnic distributions from Hong Kong and India. Cons: lack of inclusion of glaucoma suspects in external datasets; different definitions of glaucoma between the development and external datasets.
Zaleska-Żmijewska et al. [121] | Glaucoma diagnosis | Glaucoma, healthy | Fundus, IOP | 1687 images | Campaign 1 (C1): 752 images; campaign 2 (C2): 352 images | – | AlexNet | Image classifier: accuracy: C1: 80%, C2: 78%; sensitivity: C1: 0.73, C2: 0.84; specificity: C1: 0.83, C2: 0.67. Fundus + IOP: accuracy: C1: 71%, C2: 79%; sensitivity: C1: 0.79, C2: 0.92; specificity: C1: 0.67, C2: 0.42. | Pros: includes IOP as a risk factor in the model; IOP inclusion improved sensitivity. Cons: small dataset; performance needs to be improved.
Zheng et al. [130] | Glaucoma diagnosis | Glaucoma, normal | SD-OCT images (hand-crafted features (HCFs), peripapillary RNFL OCT images) | 1501 images | 104 images | – | Inception-V3 | AUC: 0.990 (0.974–1.000); accuracy: 0.990 (0.974–1.000); sensitivity: 0.981 (at 80% and 90% specificity). | Pros: achieved higher sensitivity and specificity than traditional HCFs. Cons: external validation from different centers or OCT devices is needed; most glaucoma cases were quite severe, which made classification easier; only images of Chinese eyes were used, so the results may not be applicable to other populations.
Li et al. [131] | Glaucoma detection | Glaucoma, non-glaucoma | VF (pattern deviation probability plots (PDPs), numerical pattern deviation plots (NDPs), and numeric displays (NDs)) | 9022 VFs | Phase 1: test 1: 200 VFs, test 2: 406 VFs, test 3: 507 VFs; phase 2: 649 VFs | 6 ophthalmologists | iGlaucoma (2D-Fusion-CNN with input of ND + NDP + PDP) | In phase 1, the DLS outperformed all six ophthalmologists in the three test sets (AUC of 0.834–0.877, sensitivity of 0.831–0.922 and specificity of 0.676–0.709). In phase 2, iGlaucoma had 0.99 accuracy in recognizing different patterns in the pattern deviation probability plot region, with corresponding AUC, sensitivity and specificity of 0.966 (0.953–0.979), 0.954 (0.930–0.977), and 0.873 (0.838–0.908). | Pros: developed 'iGlaucoma', a smartphone application-based deep learning system (DLS); its performance outperformed human experts. Cons: limited to the Chinese population; the DLS uses only VF, with no other modality included.
Kim et al. [132] | Glaucoma diagnosis | Glaucoma, normal | OCT (RNFL thickness, RNFL deviation maps, GCIPL thickness, GCIPL deviation maps, ocular axial length) | 8988 images | 1420 images | 2 glaucoma specialists | VGG-19 | Glaucoma-diagnostic ability was highest when the DL system used the RNFL thickness map alone; among combination sets, the RNFL + GCIPL deviation maps showed the highest diagnostic ability. The system showed detection patterns similar to those of glaucoma specialists. External (RNFL + GCIPL deviation map): AUC: 0.985 (95% CI 0.966–0.995); sensitivity (at 90% specificity): 97.2%; sensitivity (at 80% specificity): 98.2%. | Pros: includes interpretability and comparison with humans, with agreement in patterns; multiple modalities and a wide range of severity included. Cons: external validation included only good-quality OCT images and only Asian ethnicity.
Christopher et al. [136] | Glaucoma diagnosis | Glaucoma, normal | Fundus | DIGS/ADAGES: 14,822 images; MRCH: 3132 images | Iinan dataset: 215 images; Hiroshima dataset: 171 images; ACRIMA: 705 images | – | UCSD (ResNet50), UTokyo (ResNet34) | UTokyo (sequential) obtained the highest performance on the external datasets: AUC: Iinan: 0.97 (0.94–0.99); Hiroshima: 0.99 (0.99–0.99); ACRIMA: 0.86 (0.83–0.89). | Pros: DIGS/ADAGES and MRCH comprise diverse populations; the study computed model performance stratified by disease severity, myopia status, and race. Cons: the datasets have inconsistent glaucoma definitions and labeling strategies.
Maadi et al. [194] | OD/OC segmentation, glaucoma detection | Glaucoma, non-glaucoma | Fundus | Drishti-GS1: 101 images; RIM-ONE v3: 159 images | REFUGE: 1200 images | – | Modified U-Net (SE-ResNet50) | OD/OC segmentation: F1-score: OD: 0.91; OC: 0.79. Diagnosis: AUC: 0.939. | Pros: external validation was used and performed well. Cons: larger training and validation datasets might be needed for further improvement.
Phene et al. [109] | Detection of referable GON | Non-referable GON, referable GON | Fundus | 88,126 images | A: 1205 images; B: 9642 images; C: 346 images | 10 graders (graded 411 images in validation dataset A) | Inception-v3 | The DL performed better than humans on a subset (411 images) of validation dataset A. AUC: A: 0.945 (0.929–0.960); B: 0.855 (0.841–0.870); C: 0.881 (0.838–0.918). | Pros: the DL algorithm has higher sensitivity than, and comparable specificity to, eye care providers in detecting referable GON in color fundus images; provides insight into which ONH features drive GON assessment by glaucoma specialists. Cons: glaucoma diagnosis is not based on ONH appearance alone but also relies on the compilation of risk factors.
Gómez-Valverde et al. [261] | Glaucoma detection | Glaucoma, normal | Fundus | 2313 images (RIM-ONE, DRISHTI-GS, ESPERANZA) | – | 4 glaucoma experts | VGG19 | Human experts (ESPERANZA): specificity = 0.8914, sensitivity = 0.7662; model (trained on all 3 datasets): AUC: 0.94; sensitivity: 87.01%; specificity: 89.01%. | Pros: model performance was comparable to human experts. Cons: no independent validation dataset provided.
Asaoka et al. [118] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 3132 images (camera: Kowa) | Test 1: 205 images (camera: Kowa); test 2: 171 images (camera: Topcon) | – | ResNet | Data augmentation improved model performance (see the augmentation sketch after this table). With augmentation: AUC: test 1: 0.948 (0.903–0.968); test 2: 0.997 (0.994–1). Without augmentation: AUC: test 1: 0.877 (0.828–0.926); test 2: 0.945 (0.913–0.976). | Pros: the model was validated on images from different fundus cameras and had high diagnostic ability irrespective of camera type. Cons: the model has not been validated in different ethnicities.
Kim et al. [113] | Glaucoma diagnosis and localization | Glaucoma, normal | Fundus | 2123 images | RIM-ONE r3 (159 images) | – | ResNet152 | Accuracy: 93.5%; sensitivity: 92.9%; specificity: 92.9%. | Pros: a web application was developed with Grad-CAM output. Cons: larger datasets from different instruments could be incorporated to improve performance.
Asaoka et al. [31] | Diagnosis of early glaucoma | Glaucoma, normal | Macular SD-OCT (RNFL and GCC thicknesses) | Pretraining: 4316 OCT images; training: 178 eyes (94 POAG, 84 normal) | 196 eyes (114 POAG, 82 normal) | – | DL transfer (CNN using 8 × 8 grid RNFL data (first channel) and GCC data (second channel), with both pretraining and training) | With pretraining: AUC: 93.7% (90.6–96.8); optimum discrimination at sensitivity of 82.5% and specificity of 93.9%; specificity (at 80% sensitivity): 83.3%; specificity (at 90% sensitivity): 86.6%. Without pretraining: AUC: 76.6–78.8%. | Pros: the study tested the importance of pretraining for improving model performance. Cons: results were obtained in a homogeneous patient population (only Japanese patients); further study should include OCT data from multiple ethnicities to generalize the results.
Al-Aswad et al. [289] | Glaucoma diagnosis | Glaucoma, non-glaucoma | Fundus | 110 images | – | 6 ophthalmologists | Pegasus | AUC: Pegasus: 92.6%; ophthalmologists: 69.6%–84.9%; "best case" consensus scenario: 89.1%. Sensitivity: Pegasus: 83.7%; ophthalmologists: 61.3%–81.6%. Specificity: Pegasus: 88.2%; ophthalmologists: 80.0%–94.1%. Agreement with the gold standard: Pegasus: 0.715; highest ophthalmologist: 0.613. | Pros: Pegasus outperformed 5 of the 6 ophthalmologists in diagnostic performance, and there was no statistically significant difference between the deep learning system and the "best case" consensus of the ophthalmologists. Cons: small sample size; fundus images from the same camera.
Norouzifard et al. [102] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 447 images | HRF (30 images) | – | Inception-ResNet-V2 | Accuracy: 80%. | Pros: validated on an independent dataset. Cons: both training and validation datasets were small.
Shibata et al. [105] | Glaucoma screening | Glaucoma, normal | Fundus | 3132 images (1364 glaucoma, 1768 normal) | 110 images (60 glaucoma, 50 normal) | 3 residents | ResNet | AUC: model: 96.5% (93.5–99.6%); residents: 72.6–91.2%. | Pros: the performance was compared with human experts, and the model outperformed them. Cons: the study excluded photographs with features that could interfere with an expert diagnosis of glaucoma, which is not a "real-world" setting.
Li et al. [256] | VF classification | Glaucoma, non-glaucoma | VF (VF PD images) | 4012 VFs | 300 VFs | 9 ophthalmologists | CNN | Model: accuracy: 0.876 (95% CI 0.838–0.914); sensitivity: 0.932; specificity: 0.826. Ophthalmologists: average accuracies of 0.607, 0.585 and 0.626 for resident ophthalmologists, attending ophthalmologists and glaucoma experts, respectively. | Pros: the CNN achieved higher accuracy than human ophthalmologists and traditional rules. Cons: only pattern deviation images were used as CNN input; preperimetric glaucoma may not be effectively detected by the machine.
Andersson et al. [75] | Glaucoma diagnosis | Glaucoma, healthy | VF (30-2 VF) | 165 subjects (99 glaucoma, 66 healthy) | – | 30 physicians | ANN | ANN: sensitivity: 93%; specificity: 91%. Physicians: sensitivity: 61%–96% (mean 83%); specificity: 59%–100% (mean 90%). | Pros: the ANN performs at least as well as physicians in assessing visual fields for glaucoma diagnosis. Cons: external and larger datasets are needed for further validation.
Goldbaum et al. [12] | Glaucoma diagnosis | Glaucoma, normal | VF (24-2 VF) + age | 345 eyes (189 normal, 156 glaucoma) | – | 2 human experts | Mixture of Gaussians (MoG) | MoG had a significantly greater ROC area than PSD and CPSD. Human experts were not better at classifying visual fields than the machine classifiers or the global indices. MoG (PCA): AUC: 0.922; sensitivity: 0.67 (at specificity = 1), 0.79 (at specificity = 0.9). Expert 1: sensitivity = 0.75, specificity = 0.96; expert 2: sensitivity = 0.88, specificity = 0.59. | Pros: compared multiple methods based on different parameters. Cons: larger and external validations are needed for further evaluation.
Goldbaum et al. [63] | Glaucoma diagnosis | Glaucoma, normal | VF | 120 eyes (60 normal, 60 glaucoma) | – | 2 glaucoma specialists | ANN (2-layer) | The experts and the network agreed about 74% of the time. ANN: accuracy: 67%; sensitivity: 65%; specificity: 71%. Glaucoma specialists: accuracy: 67%; sensitivity: 59%; specificity: 74%. | Pros: compared with human experts. Cons: small dataset; performance needs to be improved and external validation is needed.
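
Several rows above (e.g., Fan et al. [176, 222]) report sensitivity at a fixed specificity (85% or 95%) rather than AUC alone. A minimal sketch of how such a figure can be read off an ROC curve is shown below; the labels and scores are synthetic placeholders, not data from any study in the table.

```python
# Minimal sketch: sensitivity at a fixed specificity from an ROC curve.
# y_true (0 = healthy, 1 = glaucoma) and y_score are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                                # placeholder labels
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)    # placeholder scores

fpr, tpr, _ = roc_curve(y_true, y_score)

def sensitivity_at_specificity(fpr, tpr, specificity):
    """Largest sensitivity (TPR) achievable while FPR <= 1 - specificity."""
    ok = fpr <= (1.0 - specificity)
    return tpr[ok].max() if ok.any() else 0.0

print("AUC:", roc_auc_score(y_true, y_score))
print("Sensitivity at 85% specificity:", sensitivity_at_specificity(fpr, tpr, 0.85))
print("Sensitivity at 95% specificity:", sensitivity_at_specificity(fpr, tpr, 0.95))
```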
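
The OD/OC segmentation and CDR-estimation rows (e.g., Li et al. [182], Zhao et al. [196]) rely on the vertical cup-to-disc ratio: the vertical diameter of the optic cup divided by that of the optic disc. Below is a minimal sketch of computing VCDR from binary segmentation masks; the masks are toy arrays, the row-span definition of vertical diameter is one common convention, and segmentation is only one route to this value (Zhao et al. estimate CDR directly from image features).

```python
# Minimal sketch: vertical cup-to-disc ratio (VCDR) from binary OD/OC masks.
import numpy as np

def vertical_diameter(mask: np.ndarray) -> int:
    """Vertical extent (in pixels) of the rows containing any mask pixel."""
    rows = np.where(mask.any(axis=1))[0]
    return 0 if rows.size == 0 else int(rows.max() - rows.min() + 1)

def vcdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """Vertical cup diameter divided by vertical disc diameter."""
    disc_diameter = vertical_diameter(disc_mask)
    return vertical_diameter(cup_mask) / disc_diameter if disc_diameter else float("nan")

# Toy example: hypothetical disc spanning rows 20-80 and cup spanning rows 35-65.
disc = np.zeros((100, 100), dtype=bool); disc[20:81, 30:70] = True
cup = np.zeros((100, 100), dtype=bool); cup[35:66, 40:60] = True
print(round(vcdr(cup, disc), 2))   # ~0.51
```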
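
Huang et al. [231] report a probabilistic deep learning model that quantifies predictive uncertainty. Their exact formulation is not reproduced here; the sketch below shows one common, generic way to obtain per-image uncertainty, Monte Carlo dropout, using a hypothetical toy classifier rather than the authors' network.

```python
# Minimal sketch: Monte Carlo dropout for per-image predictive uncertainty.
# The backbone and weights are placeholders, not the published model.
import torch
import torch.nn as nn

model = nn.Sequential(                      # hypothetical small classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5), nn.Linear(8, 2),
)

def mc_dropout_predict(model, x, n_samples=20):
    """Keep dropout active at inference time and average softmax outputs."""
    model.train()                           # enables dropout layers
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
        )
    mean = probs.mean(dim=0)                # predictive probability
    std = probs.std(dim=0)                  # spread across samples ~ uncertainty
    return mean, std

x = torch.randn(1, 3, 224, 224)             # placeholder fundus tensor
mean, std = mc_dropout_predict(model, x)
print("P(glaucoma):", mean[0, 1].item(), "+/-", std[0, 1].item())
```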
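
Asaoka et al. [118] and Thakoor et al. [281] note that data augmentation improved performance and generalizability. The pipeline below is an illustrative torchvision example of typical fundus augmentations (flips, small rotations, color jitter); the specific operations and parameter values are assumptions, not the augmentations used in those studies.

```python
# Illustrative fundus augmentation pipeline (assumed operations and parameters).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Usage on a PIL image, e.g.:
# from PIL import Image
# img = Image.open("fundus.jpg").convert("RGB")
# tensor = train_transform(img)
```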
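
Tang et al. [205] and Wang et al. [142] use semi-supervised learning to exploit unlabeled or partially labeled data. The sketch below shows confidence-thresholded pseudo-labeling, one simple semi-supervised strategy, on synthetic features with a scikit-learn classifier; it is not the AMNet or SEMT training procedure, and the threshold and data are placeholders.

```python
# Minimal sketch: confidence-thresholded pseudo-labeling on synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)   # labeled features
X_unlab = rng.normal(size=(400, 16))                                 # unlabeled features

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)            # train on labeled data

probs = clf.predict_proba(X_unlab)
conf = probs.max(axis=1)
keep = conf >= 0.9                         # keep only high-confidence pseudo-labels
pseudo_y = probs.argmax(axis=1)[keep]

# Retrain on labeled + confidently pseudo-labeled samples.
X_aug = np.vstack([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, pseudo_y])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print(f"added {int(keep.sum())} pseudo-labeled samples")
```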