From: Artificial intelligence in glaucoma: opportunities, challenges, and future directions
References | Task | Output label | Data modality | No. of samples for model development | No. of samples for independent validation | No. of human experts | Model used | Key results | Strengths/limitations |
---|---|---|---|---|---|---|---|---|---|
Nawaz et al. [166] | Glaucoma detection | Healthy, glaucoma | Fundus | ORIGA (650 images) | HRF (30 images), RIM-ONE (485 images) | – | Bi-directional Feature Pyramid Network (BiFPN) EfficientDet-D0 (EfficientNet-B0 backbone) | Trained on ORIGA, test accuracy: HRF: 98.21%; RIM-ONE: 97.96%. Trained on RIM-ONE, test accuracy: ORIGA: 97.83%; HRF: 98.19%. | Pros: high performance. Cons: validation datasets are small.
Gong et al. [249] | Glaucoma diagnosis | Normal, glaucoma | Fundus | 1000 images | – | 4 doctors | Hierarchical structure (HDLS) AI system + SVM | Doctors' performance improved with AI assistance. Overall performance without AI assistance (round 1): sensitivity: 65%; specificity: 78%; accuracy: 71.5%. Overall performance with AI assistance (round 2): sensitivity: 91%; specificity: 88%; accuracy: 89.5%. | Pros: AI and human comparison. Cons: small number of tested samples (200 each round) and small number of experts; a self-learning effect exists.
Yugha et al. [239] | Glaucoma detection | Healthy, glaucoma | Fundus | ORIGA (650 images) | RIM-1-DL (485 images); HRF (45 images) | – | Bi-directional Feature Pyramid Network (BiFPN) module of EfficientDet-D0 (EfficientNet-B0 backbone) | Accuracy: HRF: 97.89%; RIM 1: 97.64%. | Pros: high performance. Cons: validation dataset is small.
Ko et al. [287] | Glaucoma detection | Non-glaucoma, glaucoma | Fundus | TVGH (944 images) | CHGH (158 images); DRISHTI-GS1 (101 images); RIM-ONE r2 (455 images) | – | EfficientNet-B3 | CHGH: AUC: 0.910 (0.798–1.000); accuracy: 80%; sensitivity: 65%; specificity: 95%. RIM-ONE r2: AUC: 0.624 (0.501–0.748); accuracy: 52.5%; sensitivity: 15%; specificity: 90%. DRISHTI-GS1: AUC: 0.770 (0.558–0.982); accuracy: 55%; sensitivity: 10%; specificity: 100%. | Pros: validated on multiple datasets. Cons: performance was not generalizable to RIM-ONE r2 and DRISHTI-GS1.
Xue et al. [228] | Glaucoma detection; severity classification | Normal, mild, moderate, severe | IOP, fundus, VF | 6131 samples | 240 samples | 8 juniors, 3 seniors, 3 experts | Multi-feature deep learning (MFDL) (DetectionNet, ClassificationNet; ResNet backbone) | MFDL achieved a higher accuracy of 0.842 (95% CI 0.795–0.888) than direct four-class deep learning (DFC-DL, accuracy 0.513 [0.449–0.576]), CFP-based single-feature deep learning (CFP-DL, accuracy 0.483 [0.420–0.547]) and VF-based single-feature deep learning (VF-DL, accuracy 0.725 [0.668–0.782]). Its performance exceeded that of 8 juniors, 3 seniors and 1 expert, and was comparable with 2 glaucoma experts. | Pros: compared with human experts. Cons: validation dataset is small and not multi-center.
Wu et al. [47] | Glaucoma screening; subtyping; early diagnosis | Screening (glaucoma, healthy); subtyping (POAG, PACG); early POAG | Tear metabolic fingerprinting (TMF) | 266 samples | 54 samples | – | Ridge regression (RR) | Identified metabolic biomarkers (Lac, Thr, Mer, Sul, Bar, or DPAE) for glaucoma characterization. Glaucoma screening AUC: 0.856 (95% CI: 0.757–0.954). | Pros: biomarkers were identified; simple model with good performance. Cons: a mass spectrometer is essential for the data; a larger dataset is needed for further validation.
Singh et al. [173] | Glaucoma diagnosis | Normal, glaucoma | Fundus | ACRIMA (705 images), ORIGA (650 images), HRF (30 images) | DRISHTI-GS (101 images); private dataset (33 images) | – | InceptionResNet-V2 | AUC: 0.9042; accuracy: 90%; sensitivity: 86.748%; specificity: 94.11%; F1-score: 91.13%. | Pros: compared multiple models. Cons: small training and validation datasets.
Noury et al. [175] | Glaucoma diagnosis; severity classification | Normal, glaucoma; mild, moderate, severe | SD-OCT ONH scans | 2461 OCT scan volumes | Hong Kong (HK): 1625 scans; India: 672 scans; Nepal: 380 scans | 1 | DiagFind: 3D CNN | AUC: HK: 0.80 (95% CI 0.78–0.82), India: 0.94 (95% CI 0.93–0.96), Nepal: 0.87 (95% CI 0.85–0.90); sensitivity: HK: 0.73 (0.67–0.79), India: 0.93 (0.88–0.99), Nepal: 0.79 (0.68–0.90); specificity: HK: 0.73 (0.61–0.85), India: 0.71 (0.51–0.91), Nepal: 0.79 (0.66–0.92); F1-score: HK: 0.76 (0.75–0.77), India: 0.91 (0.90–0.92), Nepal: 0.80 (0.78–0.83). Testing set from Stanford (100 cases): AUC: 0.92 (95% CI 0.90–0.93); human grader: 0.91. | Pros: validated on real-world datasets from multiple sites. Cons: excluded cases without consensus labels and cases that are difficult for skilled clinicians to diagnose.
Fan et al. [176] | Glaucoma diagnosis | Healthy, glaucoma | Fundus | OHTS: 66,715 images | ACRIMA (705 images); LAG (4854 images); DIGS (9473 images) | – | ResNet-50 | AUC: DIGS: 0.74 (0.69–0.79); ACRIMA: 0.74 (0.70–0.77); LAG: 0.79 (0.78–0.81). Sensitivity (at 85% specificity): DIGS: 0.52; ACRIMA: 0.46; LAG: 0.59. Sensitivity (at 95% specificity): DIGS: 0.30; ACRIMA: 0.29; LAG: 0.42. | Pros: validated on multiple external datasets. Cons: the model was not generalizable to external datasets.
Fan et al. [222] | Glaucoma diagnosis | Healthy, glaucoma | Fundus | OHTS: 66,715 images | DIGS (10,473 images), ACRIMA (705 images), LAG (4854 images), RIM-ONE (455 images), ORIGA (650 images) | – | Data-efficient image Transformer (DeiT) | AUC: OHTS: 0.91 (0.87–0.93); DIGS: 0.77 (0.71–0.82); ACRIMA: 0.74 (0.70–0.77); LAG: 0.88 (0.87–0.89); RIM-ONE: 0.91 (0.88–0.94); ORIGA: 0.73 (0.68–0.77). Sensitivity (at 85% specificity): OHTS: 0.79; DIGS: 0.57; ACRIMA: 0.46; LAG: 0.77; RIM-ONE: 0.83; ORIGA: 0.40. Sensitivity (at 95% specificity): OHTS: 0.56; DIGS: 0.34; ACRIMA: 0.31; LAG: 0.59; RIM-ONE: 0.73; ORIGA: 0.21 (a sketch of reading sensitivity at a fixed specificity off an ROC curve follows the table). | Pros: validated on multiple external datasets; Vision Transformers have the potential to improve generalizability. Cons: cropped images may lose information.
Huang et al. [231] | Glaucoma diagnosis | Normal, glaucoma | Fundus, VF | 1655 samples | 196 samples | – | Probabilistic deep learning model (EfficientNet-B4 backbone) | AUC: 0.98 (0.98–0.99); accuracy: 93% (92–95%); sensitivity: 91% (87–95%); specificity: 95% (94–96%). | Pros: quantifies the uncertainty of the model (see the uncertainty-estimation sketch after this table). Cons: dataset was from the COMPASS instrument; more external validation is needed.
Huang et al. [146] | Glaucoma VF grading | Clear, mild, moderate, severe, diffuse | VF (HFA and Octopus) | 3805 VFs (Octopus); 13,231 VFs (HFA) | 150 VFs (HFA) | 2 ophthalmic clinicians, 6 medical students | Fine-grained grading deep learning system (FGGDL: FGG-O, FGG-H); interactive interface | AI outperformed human experts, and their performance improved with AI assistance. AUC: FGGDL: 0.893 (0.862–0.923); clinician 1: 0.838 (0.801–0.874); clinician 2: 0.833 (0.796–0.869); all medical students performed below 0.80. | Pros: external validation was applied and compared with human experts. Cons: high test–retest variability was not considered.
Li et al. [182] | OD/OC segmentation, glaucoma screening | Glaucoma, non-glaucoma | Fundus | In-house (2440 images) | DRISHTI-GS (101 images); RIM-ONE v3 (159 images) | 4 ophthalmologists | R-DCNN (DAC-ResNet34) | Segmentation results on the in-house testing set were comparable to those of human experts (OD: DC 98.51%, JC 97.07%; OC: DC 97.63%, JC 95.39%). Segmentation results on DRISHTI-GS and RIM-ONE v3 were better than existing studies: DRISHTI-GS: OD: DC 97.23%, JC 94.17%; OC: DC 94.56%, JC 89.92%; RIM-ONE v3: OD: DC 96.89%, JC 91.32%; OC: DC 88.94%, JC 78.21%. Glaucoma screening: AUC: DRISHTI-GS: 0.968; RIM-ONE v3: 0.941. | Pros: compared with human experts and existing studies. Cons: small training and validation datasets.
Li et al. [164] | Glaucoma diagnosis, glaucoma incidence/progression prediction | Diagnosis (glaucoma, non-glaucoma); glaucoma incidence prediction (with/without glaucoma development); glaucoma progression prediction (with/without progression) | Fundus | Diagnosis: 24,054 eyes; incidence prediction: 11,548 eyes; progression prediction: 3425 eyes | Glaucoma diagnosis: external test 1: 6162 images (eyes), external test 2: 824 images (eyes); glaucoma incidence: external test 1: 955 images, external test 2: 719 images; glaucoma progression: external test 1: 337 images, external test 2: 513 images | – | DiagnoseNet, PredictNet | Glaucoma diagnosis: AUC: test 1: 0.94 (0.93–0.94), test 2: 0.91 (0.89–0.93); sensitivity: test 1: 0.89 (0.87–0.90), test 2: 0.92 (0.88–0.96); specificity: test 1: 0.83 (0.81–0.84), test 2: 0.71 (0.67–0.74). Glaucoma incidence prediction: AUC: test 1: 0.89 (0.83–0.95), test 2: 0.88 (0.79–0.97); sensitivity: test 1: 0.84 (0.81–0.86), test 2: 0.84 (0.81–0.86); specificity: test 1: 0.68 (0.43–0.87), test 2: 0.80 (0.44–0.97). Glaucoma progression prediction: AUC: test 1: 0.87 (0.81–0.92), test 2: 0.88 (0.83–0.94); sensitivity: test 1: 0.82 (0.78–0.87), test 2: 0.81 (0.77–0.84); specificity: test 1: 0.59 (0.39–0.76), test 2: 0.74 (0.55–0.88). | Pros: performed multiple tasks (diagnosis, incidence and progression prediction) with multiple external validations. Cons: excluded low-quality images; validation datasets were Chinese only.
Mehta et al. [149] | Glaucoma detection | Healthy, glaucoma, PTG (progress to glaucoma) | Demographic, systemic and ocular data, color fundus, OCT | UK Biobank (2574 eyes: glaucoma 1193, healthy 1283, PTG 98) | 200 eyes | 5 glaucoma-fellowship-trained ophthalmologists | InceptionResnetV4 | The best model, with OCT, color fundus, and systemic and ocular data as input, obtained AUC = 0.967 (95% CI 0.93–1.0); human experts (diagnosis on color fundus only): AUC: 0.79–0.84. | Pros: used multiple modalities and several methods to interpret the DL model (SHAP, saliency maps). Cons: poor fundus image quality may have contributed to the low AUC.
Hemelings et al. [151] | Glaucoma detection, VCDR regression | Glaucoma, non-glaucoma | Fundus | UZL (13,551 images) | REFUGE (1200 images) | – | ResNet50 | AUC: original fundus: 0.87 (95% CI 0.83–0.91); 60% ONH cropping: 0.80 (95% CI 0.76–0.84). | Pros: reports an explainability analysis by cropping the ONH area. Cons: applying masks of fixed size might lead to small variations in visible features across fundus images due to variation in ONH size across the study population.
Thakoor et al. [279] | Glaucoma detection | Glaucoma, non-glaucoma | OCT images (RNFL probability maps) | 737 eyes | 135 eyes | 2 expert OCT readers | InceptionV3 + FC (with testing with concept activation vectors (TCAV)) | The TCAV scores were consistent with the features used by human experts, based on eye fixations. AUC: 0.911. | Pros: applied testing with concept activation vectors (TCAV) for model interpretability. Cons: external validation did not include multi-center datasets.
Thakoor et al. [281] | Glaucoma detection | Glaucoma, non-glaucoma | OCT B-scans, RNFL probability maps | RNFL maps: 737 eyes; B-scans: 771 eyes | RNFL maps: 135 eyes; B-scans: 125 eyes | – | CNN A (ResNet18 + RF) with RNFL map as input | CNN generalizability can be improved with data augmentation, multiple input image modalities, and training on images with confident ratings; choosing a thorough and consistent reference standard (RS) for training and testing improves generalization to new datasets. The best result was obtained with RNFL-map input and data augmentation on CNN A: AUC = 0.918 (95% CI 0.866–0.970), accuracy = 85.9%. | Pros: improved generalizability through several techniques (multiple modalities, consistent labels, data augmentation). Cons: independent validation could be improved with a larger dataset.
Natarajan et al. [143] | Glaucoma detection, OD segmentation | Glaucoma, normal | Fundus | RIGA (750 images), RIM-ONEv2 (455 images) | ACRIMA (705 images), Drishti-GS1 (101 images), RIM-ONEv1 (169 images) | – | UNet-Snet (SqueezeNet) | Glaucoma detection: AUC: ACRIMA: 100%; Drishti-GS1: 99.90%; RIM-ONEv1: 100%. Accuracy: ACRIMA: 99.86%; Drishti-GS1: 97.05%; RIM-ONEv1: 100%. Sensitivity: ACRIMA: 100%; Drishti-GS1: 100%; RIM-ONEv1: 100%. Specificity: ACRIMA: 99.75%; Drishti-GS1: 90.32%; RIM-ONEv1: 100%. | Pros: achieved high performance. Cons: small datasets for training and independent validation.
Kenichi et al. [129] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 3132 images (from an ordinary fundus camera) | 162 images (from an ordinary camera and a smartphone) | – | ResNet34 | AUC: camera: 98.9%; smartphone: 84.2%. Advanced glaucoma: AUC: camera: 99.3%; smartphone: 90.0%. | Pros: compared performance between fundus images from an ordinary camera and a smartphone. Cons: training images were all from an ordinary camera; poor image quality of smartphone fundus photographs.
Xu et al. [242] | Glaucoma diagnosis | Referable glaucomatous optic neuropathy (GON), unlikely GON | Fundus | 1791 images | Dataset 1: 6301 images; dataset 2: 1964 images; dataset 3: 400 images | 12 (4 senior ophthalmologists, 4 junior ophthalmologists, and 4 technicians) | Hierarchical deep learning system (HDLS) (segmentation–classification, Inception-v3 backbone) | The reliable region had higher sensitivity and specificity than the suspicious region. Dataset 1: AUC = 0.981 (95% CI, 0.978–0.985); reliable region: sensitivity = 97.7% (95% CI, 97.0–98.3%), specificity = 97.8% (95% CI, 97.2–98.4%). Dataset 2: AUC = 0.983 (95% CI, 0.977–0.989); reliable region: sensitivity = 98.4% (95% CI, 97.3–99.5%), specificity = 98.2% (95% CI, 97.4–99.1%). Dataset 3: human experts' performance improved when referring to HDLS: senior group: sensitivity: 0.93 (independent diagnosis), 0.96 (referring to HDLS); specificity: 0.88 (independent diagnosis), 0.95 (referring to HDLS). | Pros: the system is transparent and interpretable; results were validated on three large validation datasets and were comparable to those of human experts. Cons: validation datasets did not include ethnicities other than Chinese.
Bhuiyan et al. [246] | Glaucoma diagnosis | Glaucoma-suspect, not-suspect | Fundus | 1546 disc-centered fundus images (AREDS, SiMES, RIM-ONE) | ORIGA (638 gradable images) | – | Ensemble of Xception, Inception-ResNet-V2, NASNet, Inception-V3 | AUC: 0.85; accuracy: 83.54%; sensitivity: 80.11%; specificity: 84.96%. | Pros: screening for glaucoma suspects is important. Cons: CDR is not the only biomarker for glaucoma suspects; other biomarkers, such as CDR asymmetry, could be added in future studies.
Tang et al. [205] | Glaucoma diagnosis | Glaucoma, non-glaucoma | Fundus | Sanyuan (11,443 images), Tongren (7806 images), Xiehe (4363 images) | REFUGE (1200 images) | – | AMNet (semi-supervised learning) | Accuracy: 95.75%; sensitivity: 87.5%; specificity: 96.7%; F1-score: 91.9%. | Pros: the model boosted robustness with limited labeled data (see the pseudo-labeling sketch after this table for one simple semi-supervised approach). Cons: more diverse validation datasets might be needed to further validate the model.
Alghamdi et al. [241] | Glaucoma diagnosis | Glaucoma, normal | Fundus | RIM-ONE (455 images), RIGA (750 images) | – | 2 ophthalmologists | TCNN (transfer convolutional neural network, VGG16); SSCNN (semi-supervised CNN with self-learning); SSCNN-DAE (semi-supervised CNN with autoencoder) | The three deep learning CNN models outperformed both ophthalmologists by clear margins. RIM-ONE: accuracy: the two experts attained 59.2% and 55.4%; SSCNN-DAE: AUC: 0.95; accuracy: 93.8%; sensitivity: 98.90%; specificity: 90.50%. | Pros: compared with human experts and outperformed them. Cons: small training datasets; no external validation.
Li et al. [278] | Glaucoma diagnosis in myopia | Glaucoma, non-glaucoma | RNFL profile | 2223 eyes | 508 eyes | – | FCN + RBFN (radial basis function network) + RNFL compensation | By applying the RNFL compensation algorithm, the AUC for detecting glaucoma increased from 0.70 to 0.84, from 0.75 to 0.89, from 0.77 to 0.89, and from 0.78 to 0.87 for eyes in the highest 10%, 20%, 30% and any axial length (AL), respectively. | Pros: the RBFN's good generalization, strong tolerance to input noise, and online learning ability made a reliable compensation possible. Cons: compensation of the RNFL profile was based on data from participants aged ≥ 50 years and was not validated in younger participants; the validation dataset included a 1:1 ratio of glaucomatous to non-glaucomatous eyes, which is not the case in real-world data.
Chiang et al. [277] | Primary angle closure glaucoma (PACG) detection | POAG, PACG | Goniophotography | 32,635 images | 1000 images | 9 graders | CNN (ResNet-50 backbone) | The CNN achieved excellent performance based on single-grader (AUC = 0.969) and consensus (AUC = 0.952) labels. The agreement between the CNN classifier and consensus labels (κ = 0.746) surpassed that of all non-reference human graders (κ = 0.578–0.702). | Pros: the model obtained performance comparable to human graders. Cons: the model may not generalize to other ethnic groups or to goniophotographs taken with other devices.
Zhao et al. [196] | Cup-to-disc ratio (CDR) estimation, glaucoma screening | Glaucoma, normal | Fundus | Direct-CSU (934 images) | ORIGA (650 images) | – | Unsupervised feature representation of fundus images with a CNN (MFPPNet, 3-block DenseNet) + RF | Glaucoma screening: AUC: 0.88. CDR estimation: MAE: 0.0606; correlation coefficient r = 0.68 (a sketch of the vertical CDR computation from disc/cup masks follows the table). | Pros: estimated the CDR value more effectively than traditional segmentation-based methods; potential to handle unlabeled data. Cons: further validation on diverse data sources is needed.
Jammal et al. [112] | Prediction of RNFL thickness from fundus; glaucoma detection | Glaucoma, normal | Fundus | 32,820 pairs of fundus photos and SD-OCT scans | – | 2 graders | M2M DL (ResNet34 backbone) | The DL algorithm outperformed human graders in detecting signs of glaucomatous damage on fundus photographs. Glaucoma detection: AUC: DL: 0.801 (95% CI: 0.757–0.845); human: 0.775 (95% CI: 0.728–0.823). AUPRC: DL: 0.810 (95% CI: 0.765–0.851); human: 0.761 (95% CI: 0.703–0.819). RNFL prediction (DL): MAE: 7.39 µm; Pearson's r = 0.832. | Pros: the M2M model provides a quantitative output and outperformed human graders. Cons: using the presence of visual field defects as the gold standard may be biased and influence accuracy.
Wang et al. [142] | Glaucoma detection | Glaucoma, normal | OCT images | HK dataset: 975,400 B-scans (4877 volumes) | Stanford dataset: 246,200 B-scans (1231 volumes) | 2 glaucoma experts | 2D-ResNet18-SEMT (semi-supervised multi-task) | Stanford (volume-based): AUC: 0.933; accuracy: 86%; F1-score: 0.889. Human vs. model (HK, volume-based): AUC: 0.977 (DL), 0.918 (human); accuracy: 0.927 (DL), 0.912 (human); F1-score: 0.941 (DL), 0.917 (human). | Pros: semi-supervised learning addressed missing VF measurement labels in the training set; the multi-task learning network explored the relationship between function and structure, which improved accuracy. Cons: the training framework is not end-to-end; the hard assignment for VF measurements and the multi-task training work in a cascaded manner.
Russakoff et al. [119] | Glaucoma diagnosis | Referable, non-referable glaucoma | 3D-OCT | 2805 scans | Hong Kong: 505 eyes; India: 336 eyes | – | gNet3D (3D CNN) | AUC: HK: 0.78; India: 0.95. | Pros: one of the first studies to use machine learning for risk stratification; multinational external datasets across geographic and ethnic distributions from Hong Kong and India. Cons: lack of inclusion of glaucoma suspects in external datasets; different definitions of glaucoma between the development and external datasets.
Zaleska-Żmijewska et al. [121] | Glaucoma diagnosis | Glaucoma, healthy | Fundus, IOP | 1687 images | Campaign 1 (C1): 752 images; campaign 2 (C2): 352 images | – | AlexNet | Image classifier: accuracy: C1: 80%, C2: 78%; sensitivity: C1: 0.73, C2: 0.84; specificity: C1: 0.83, C2: 0.67. Fundus + IOP: accuracy: C1: 71%, C2: 79%; sensitivity: C1: 0.79, C2: 0.92; specificity: C1: 0.67, C2: 0.42. | Pros: includes IOP as a risk factor in the model; IOP inclusion improved sensitivity. Cons: small dataset; performance needs to be improved.
Zheng et al. [130] | Glaucoma diagnosis | Glaucoma, normal | SD-OCT images (hand-crafted features (HCFs), peripapillary RNFL OCT images) | 1501 images | 104 images | – | Inception-V3 | AUC: 0.990 (0.974–1.000); accuracy: 0.990 (0.974–1.000); sensitivity: 0.981 (at 80% and 90% specificity). | Pros: achieved higher sensitivity and specificity than traditional HCFs. Cons: external validation from different centers or OCT devices is needed; most glaucoma cases were quite severe, which made classification easier; only images of Chinese eyes were used, so the results may not be applicable to other populations.
Li et al. [131] | Glaucoma detection | Glaucoma, non-glaucoma | VF (pattern deviation probability plots (PDPs), numerical pattern deviation plots (NDPs), and numeric displays (NDs)) | 9022 VFs | Phase 1: test 1: 200 VFs, test 2: 406 VFs, test 3: 507 VFs; phase 2: 649 VFs | 6 ophthalmologists | iGlaucoma (2D-Fusion-CNN with input of ND + NDP + PDP) | In phase 1, the DLS outperformed all six ophthalmologists in the three test sets (AUC of 0.834–0.877, sensitivity of 0.831–0.922 and specificity of 0.676–0.709). In phase 2, iGlaucoma had 0.99 accuracy in recognizing different patterns in the pattern deviation probability plot region, with corresponding AUC, sensitivity and specificity of 0.966 (0.953–0.979), 0.954 (0.930–0.977), and 0.873 (0.838–0.908). | Pros: developed 'iGlaucoma', a smartphone application-based deep learning system (DLS); its performance outperformed human experts. Cons: limited to the Chinese population; the DLS uses only VF, with no other modality included.
Kim et al. [132] | Glaucoma diagnosis | Glaucoma, normal | OCT (RNFL thickness, RNFL deviation maps, GCIPL thickness, GCIPL deviation maps, ocular axial length) | 8988 images | 1420 images | 2 glaucoma specialists | VGG-19 | Glaucoma-diagnostic ability was highest when the DL system used the RNFL thickness map alone; among combination sets, the RNFL + GCIPL deviation maps showed the highest diagnostic ability. The system showed detection patterns similar to those of glaucoma specialists. External (RNFL + GCIPL deviation map): AUC: 0.985 (95% CI 0.966–0.995); sensitivity (at 90% specificity): 97.2%; sensitivity (at 80% specificity): 98.2%. | Pros: includes interpretability and comparison with humans, with agreement in patterns; multiple modalities and a wide range of severity included. Cons: external validation included only good-quality OCT images and only Asian ethnicity.
Christopher et al. [136] | Glaucoma diagnosis | Glaucoma, normal | Fundus | DIGS/ADAGES: 14,822 images; MRCH: 3132 images | Iinan dataset: 215 images; Hiroshima dataset: 171 images; ACRIMA: 705 images | – | UCSD (ResNet50), UTokyo (ResNet34) | UTokyo (sequential) obtained the highest performance on the external datasets: AUC: Iinan: 0.97 (0.94–0.99); Hiroshima: 0.99 (0.99–0.99); ACRIMA: 0.86 (0.83–0.89). | Pros: DIGS/ADAGES and MRCH comprise diverse populations; the study computed model performance stratified by disease severity, myopia status, and race. Cons: the datasets have inconsistent glaucoma definitions and labeling strategies.
Maadi et al. [194] | OD/OC segmentation, glaucoma detection | Glaucoma, non-glaucoma | Fundus | Drishti-GS1: 101 images; RIM-ONE v3: 159 images | REFUGE: 1200 images | – | Modified U-Net (SE-ResNet50) | OD/OC segmentation: F1-score: OD: 0.91; OC: 0.79. Diagnosis: AUC: 0.939. | Pros: external validation was used and performed well. Cons: larger training and validation datasets might be needed for further improvement.
Phene et al. [109] | Detection of referable GON | Non-referable GON, referable GON | Fundus | 88,126 images | A: 1205 images; B: 9642 images; C: 346 images | 10 graders (graded 411 images in validation dataset A) | Inception-v3 | The DL performed better than humans on a subset (411 images) of validation dataset A. AUC: A: 0.945 (0.929–0.960); B: 0.855 (0.841–0.870); C: 0.881 (0.838–0.918). | Pros: the DL algorithm has higher sensitivity than, and comparable specificity to, eye care providers in detecting referable GON in color fundus images; provides insight into which ONH features drive GON assessment by glaucoma specialists. Cons: glaucoma diagnosis is not based on ONH appearance alone but also relies on the compilation of risk factors.
Gómez-Valverde et al. [261] | Glaucoma detection | Glaucoma, normal | Fundus | 2313 images (RIM-ONE, DRISHTI-GS, ESPERANZA) | – | 4 glaucoma experts | VGG19 | Human experts (ESPERANZA): specificity = 0.8914, sensitivity = 0.7662; model (trained on all 3 datasets): AUC: 0.94; sensitivity: 87.01%; specificity: 89.01%. | Pros: model performance was comparable to human experts. Cons: no independent validation dataset provided.
Asaoka et al. [118] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 3132 images (camera: Kowa) | Test 1: 205 images (camera: Kowa); test 2: 171 images (camera: Topcon) | – | ResNet | Data augmentation improved model performance (see the augmentation sketch after this table). With augmentation: AUC: test 1: 0.948 (0.903–0.968); test 2: 0.997 (0.994–1). Without augmentation: AUC: test 1: 0.877 (0.828–0.926); test 2: 0.945 (0.913–0.976). | Pros: the model was validated on images from different fundus cameras and had high diagnostic ability irrespective of camera type. Cons: the model has not been validated in different ethnicities.
Kim et al. [113] | Glaucoma diagnosis and localization | Glaucoma, normal | Fundus | 2123 images | RIM-ONE r3 (159 images) | – | ResNet152 | Accuracy: 93.5%; sensitivity: 92.9%; specificity: 92.9%. | Pros: a web application was developed with Grad-CAM output. Cons: larger datasets from different instruments could be incorporated to improve performance.
Asaoka et al. [31] | Diagnosis of early glaucoma | Glaucoma, normal | Macular SD-OCT (RNFL and GCC thicknesses) | Pretraining: 4316 OCT images; training: 178 eyes (94 POAG, 84 normal) | 196 eyes (114 POAG, 82 normal) | – | DL transfer (CNN using 8 × 8 grid RNFL data (first channel) and GCC data (second channel), with both pretraining and training) | With pretraining: AUC: 93.7% (90.6–96.8); optimum discrimination at sensitivity of 82.5% and specificity of 93.9%; specificity (at 80% sensitivity): 83.3%; specificity (at 90% sensitivity): 86.6%. Without pretraining: AUC: 76.6–78.8%. | Pros: the study tested the importance of pretraining for improving model performance. Cons: results were obtained in a homogeneous patient population (only Japanese patients); further study should include OCT data from multiple ethnicities to generalize the results.
Al-Aswad et al. [289] | Glaucoma diagnosis | Glaucoma, non-glaucoma | Fundus | 110 images | – | 6 ophthalmologists | Pegasus | AUC: Pegasus: 92.6%; ophthalmologists: 69.6%–84.9%; "best case" consensus scenario: 89.1%. Sensitivity: Pegasus: 83.7%; ophthalmologists: 61.3%–81.6%. Specificity: Pegasus: 88.2%; ophthalmologists: 80.0%–94.1%. Agreement with the gold standard: Pegasus: 0.715; highest ophthalmologist: 0.613. | Pros: Pegasus outperformed 5 of the 6 ophthalmologists in diagnostic performance, and there was no statistically significant difference between the deep learning system and the "best case" consensus of the ophthalmologists. Cons: small sample size; fundus images from the same camera.
Norouzifard et al. [102] | Glaucoma diagnosis | Glaucoma, normal | Fundus | 447 images | HRF (30 images) | – | Inception-ResNet-V2 | Accuracy: 80%. | Pros: validated on an independent dataset. Cons: both training and validation datasets were small.
Shibata et al. [105] | Glaucoma screening | Glaucoma, normal | Fundus | 3132 images (1364 glaucoma, 1768 normal) | 110 images (60 glaucoma, 50 normal) | 3 residents | ResNet | AUC: model: 96.5% (93.5–99.6%); residents: 72.6–91.2%. | Pros: the performance was compared with human experts, and the model outperformed them. Cons: the study excluded photographs with features that could interfere with an expert diagnosis of glaucoma, which is not a "real-world" setting.
Li et al. [256] | VF classification | Glaucoma, non-glaucoma | VF (VF PD images) | 4012 VFs | 300 VFs | 9 ophthalmologists | CNN | Model: accuracy: 0.876 (95% CI 0.838–0.914); sensitivity: 0.932; specificity: 0.826. Ophthalmologists: average accuracies of 0.607, 0.585 and 0.626 for resident ophthalmologists, attending ophthalmologists and glaucoma experts, respectively. | Pros: the CNN achieved higher accuracy than human ophthalmologists and traditional rules. Cons: only pattern deviation images were used as CNN input; preperimetric glaucoma may not be effectively detected by the machine.
Andersson et al. [75] | Glaucoma diagnosis | Glaucoma, healthy | VF (30-2 VF) | 165 subjects (99 glaucoma, 66 healthy) | – | 30 physicians | ANN | ANN: sensitivity: 93%; specificity: 91%. Physicians: sensitivity: 61%–96% (mean 83%); specificity: 59%–100% (mean 90%). | Pros: the ANN performs at least as well as physicians in assessing visual fields for glaucoma diagnosis. Cons: external and larger datasets are needed for further validation.
Goldbaum et al. [12] | Glaucoma diagnosis | Glaucoma, normal | VF (24-2 VF) + age | 345 eyes (189 normal, 156 glaucoma) | – | 2 human experts | Mixture of Gaussians (MoG) | MoG had a significantly greater ROC area than PSD and CPSD. Human experts were not better at classifying visual fields than the machine classifiers or the global indices. MoG (PCA): AUC: 0.922; sensitivity: 0.67 (at specificity = 1), 0.79 (at specificity = 0.9). Expert 1: sensitivity = 0.75, specificity = 0.96; expert 2: sensitivity = 0.88, specificity = 0.59. | Pros: compared multiple methods based on different parameters. Cons: larger and external validations are needed for further evaluation.
Goldbaum et al. [63] | Glaucoma diagnosis | Glaucoma, normal | VF | 120 eyes (60 normal, 60 glaucoma) | – | 2 glaucoma specialists | ANN (2-layer) | The experts and the network agreed about 74% of the time. ANN: accuracy: 67%; sensitivity: 65%; specificity: 71%. Glaucoma specialists: accuracy: 67%; sensitivity: 59%; specificity: 74%. | Pros: compared with human experts. Cons: small dataset; performance needs to be improved and external validation is needed.
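
Several rows above (e.g., Fan et al. [176, 222]) report sensitivity at a fixed specificity (85% or 95%) rather than AUC alone. A minimal sketch of how such a figure can be read off an ROC curve is shown below; the labels and scores are synthetic placeholders, not data from any study in the table.

```python
# Minimal sketch: sensitivity at a fixed specificity from an ROC curve.
# y_true (0 = healthy, 1 = glaucoma) and y_score are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                                # placeholder labels
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)    # placeholder scores

fpr, tpr, _ = roc_curve(y_true, y_score)

def sensitivity_at_specificity(fpr, tpr, specificity):
    """Largest sensitivity (TPR) achievable while FPR <= 1 - specificity."""
    ok = fpr <= (1.0 - specificity)
    return tpr[ok].max() if ok.any() else 0.0

print("AUC:", roc_auc_score(y_true, y_score))
print("Sensitivity at 85% specificity:", sensitivity_at_specificity(fpr, tpr, 0.85))
print("Sensitivity at 95% specificity:", sensitivity_at_specificity(fpr, tpr, 0.95))
```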
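
The OD/OC segmentation and CDR-estimation rows (e.g., Li et al. [182], Zhao et al. [196]) rely on the vertical cup-to-disc ratio: the vertical diameter of the optic cup divided by that of the optic disc. Below is a minimal sketch of computing VCDR from binary segmentation masks; the masks are toy arrays, the row-span definition of vertical diameter is one common convention, and segmentation is only one route to this value (Zhao et al. estimate CDR directly from image features).

```python
# Minimal sketch: vertical cup-to-disc ratio (VCDR) from binary OD/OC masks.
import numpy as np

def vertical_diameter(mask: np.ndarray) -> int:
    """Vertical extent (in pixels) of the rows containing any mask pixel."""
    rows = np.where(mask.any(axis=1))[0]
    return 0 if rows.size == 0 else int(rows.max() - rows.min() + 1)

def vcdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """Vertical cup diameter divided by vertical disc diameter."""
    disc_diameter = vertical_diameter(disc_mask)
    return vertical_diameter(cup_mask) / disc_diameter if disc_diameter else float("nan")

# Toy example: hypothetical disc spanning rows 20-80 and cup spanning rows 35-65.
disc = np.zeros((100, 100), dtype=bool); disc[20:81, 30:70] = True
cup = np.zeros((100, 100), dtype=bool); cup[35:66, 40:60] = True
print(round(vcdr(cup, disc), 2))   # ~0.51
```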
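
Huang et al. [231] report a probabilistic deep learning model that quantifies predictive uncertainty. Their exact formulation is not reproduced here; the sketch below shows one common, generic way to obtain per-image uncertainty, Monte Carlo dropout, using a hypothetical toy classifier rather than the authors' network.

```python
# Minimal sketch: Monte Carlo dropout for per-image predictive uncertainty.
# The backbone and weights are placeholders, not the published model.
import torch
import torch.nn as nn

model = nn.Sequential(                      # hypothetical small classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5), nn.Linear(8, 2),
)

def mc_dropout_predict(model, x, n_samples=20):
    """Keep dropout active at inference time and average softmax outputs."""
    model.train()                           # enables dropout layers
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
        )
    mean = probs.mean(dim=0)                # predictive probability
    std = probs.std(dim=0)                  # spread across samples ~ uncertainty
    return mean, std

x = torch.randn(1, 3, 224, 224)             # placeholder fundus tensor
mean, std = mc_dropout_predict(model, x)
print("P(glaucoma):", mean[0, 1].item(), "+/-", std[0, 1].item())
```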
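
Asaoka et al. [118] and Thakoor et al. [281] note that data augmentation improved performance and generalizability. The pipeline below is an illustrative torchvision example of typical fundus augmentations (flips, small rotations, color jitter); the specific operations and parameter values are assumptions, not the augmentations used in those studies.

```python
# Illustrative fundus augmentation pipeline (assumed operations and parameters).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Usage on a PIL image, e.g.:
# from PIL import Image
# img = Image.open("fundus.jpg").convert("RGB")
# tensor = train_transform(img)
```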
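
Tang et al. [205] and Wang et al. [142] use semi-supervised learning to exploit unlabeled or partially labeled data. The sketch below shows confidence-thresholded pseudo-labeling, one simple semi-supervised strategy, on synthetic features with a scikit-learn classifier; it is not the AMNet or SEMT training procedure, and the threshold and data are placeholders.

```python
# Minimal sketch: confidence-thresholded pseudo-labeling on synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)   # labeled features
X_unlab = rng.normal(size=(400, 16))                                 # unlabeled features

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)            # train on labeled data

probs = clf.predict_proba(X_unlab)
conf = probs.max(axis=1)
keep = conf >= 0.9                         # keep only high-confidence pseudo-labels
pseudo_y = probs.argmax(axis=1)[keep]

# Retrain on labeled + confidently pseudo-labeled samples.
X_aug = np.vstack([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, pseudo_y])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print(f"added {int(keep.sum())} pseudo-labeled samples")
```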