Deep learning for gastroscopic images: computer-aided techniques for clinicians
BioMedical Engineering OnLine volume 21, Article number: 12 (2022)
Gastric disease is a major health problem worldwide. Gastroscopy is the main method and the gold standard used to screen and diagnose many gastric diseases. However, several factors, such as the experience and fatigue of endoscopists, limit its performance. With recent advancements in deep learning, an increasing number of studies have used this technology to provide on-site assistance during real-time gastroscopy. This review summarizes the latest publications on deep learning applications in overcoming disease-related and nondisease-related gastroscopy challenges. The former aims to help endoscopists find lesions and characterize them when they appear in the view shed of the gastroscope. The purpose of the latter is to avoid missing lesions due to poor-quality frames, incomplete inspection coverage of gastroscopy, etc., thus improving the quality of gastroscopy. This study aims to provide technical guidance and a comprehensive perspective for physicians to understand deep learning technology in gastroscopy. Some key issues to be handled before the clinical application of deep learning technology and the future direction of disease-related and nondisease-related applications of deep learning to gastroscopy are discussed herein.
Gastric disease is a major health problem, with gastric cancer ranking second among the leading causes of cancer-related deaths. Gastroscopy is the main technical method used to screen for and diagnose many gastric diseases and is considered the gold standard. A thin, flexible tube is passed into the stomach, enabling endoscopists to observe stomach lesions directly. The examination shows the condition of the inspected region, and a diagnosis can be confirmed through pathological biopsy of suspicious lesions, which makes gastroscopy the preferred method for examining gastric lesions.
However, endoscopists may make incorrect observations during gastroscopy due to fatigue caused by long working hours or inexperience. Several imaging modalities, such as narrow-band imaging (NBI), magnifying endoscopy (ME), autofluorescence imaging (AFI), and 3D imaging, have emerged. While these new technologies have improved the diagnostic capabilities of gastroscopy, endoscopists should be trained on how to effectively use them.
Therefore, a computer-aided diagnosis system has been developed to improve gastroscopy efficiency and quality in daily clinical practice, becoming a “third eye” for endoscopists. In recent years, deep learning technology has significantly improved the performance of computer-aided diagnosis systems due to continuous breakthroughs in algorithms, hardware performance, computing power, and the accumulation of several labelled endoscopic image datasets.
This review included relevant works published between 2018 and 2020 from the PubMed and Web of Science databases. The keywords “endoscopy gastric artificial intelligence”, “endoscopy gastric computer vision”, “endoscopy gastric convolutional neural network”, “endoscopy gastric deep learning”, “endoscopy stomach artificial intelligence”, “endoscopy stomach computer vision”, “endoscopy stomach convolutional neural network” and “endoscopy stomach deep learning” were used. A total of 493 publications were identified from the database search, and 40 manuscripts were included in the final analysis after screening (as shown in Fig. 1). This review summarizes the on-site application of deep learning during gastroscopy in recent years to provide technical guidance and a comprehensive perspective for physicians to understand what deep learning (DL) technology can do and how that role is achieved.
Chapter II introduces some technical concepts, common networks, and algorithms used in developing a gastroscopy-assisted system, and then presents the four main tasks of gastric image analysis using deep learning technology in detail. Chapter III summarizes existing deep learning applications for solving disease-related challenges in gastroscopy. With these technologies, endoscopists can more accurately identify, locate and diagnose lesions that appear in the viewshed of the gastroscope. Specifically, gastric diseases are classified into Helicobacter pylori infection, gastric cancer and other precancerous conditions, which are covered in the “Helicobacter pylori”, “Gastric cancer” and “Precancerous conditions” sections, respectively. Chapter IV then presents the deep learning applications not directly related to diseases. They help endoscopists screen keyframes from the gastroscopic video stream and comprehensively inspect the entire surface of the oesophagus and stomach. These DL models help prevent endoscopists from overlooking lesions that do not appear in the viewshed of the gastroscope or misdiagnosing lesions in poor-quality frames. The “Informatic frame screening”, “Anatomical classification”, “Artefact detection” and “Depth estimation and 3D reconstruction of the stomach” sections introduce the application of deep learning to informatic frame screening, anatomical classification, artefact detection and depth estimation in gastroscopy, respectively. Chapter V analyses current publications in the research field and indicates the key issues to be addressed before the clinical application of the technology. Furthermore, future perspectives for DL application in disease-related and nondisease-related gastroscopy as well as promising DL technologies and approaches are proposed. Finally, the development trend of DL-based assisted systems in real-time gastroscopy to provide on-site support is discussed.
Technical aspects of deep learning in gastroscopy
Deep learning is a state-of-the-art (SOTA) machine learning technique. Before deep learning, machine learning mainly used handcrafted features, where image patterns such as colour and texture were encoded in a mathematical description. A classifier was then used to analyse the features of each image category during a training process and to classify a new input image. A DL architecture has several hidden layers and can automatically extract and identify numerous high-level, complex features that a traditional machine learning (ML) method cannot analyse.
Convolutional neural networks (CNNs) are the earliest and most commonly used deep neural networks for gastric image analysis. A CNN is particularly well suited to image processing, and its structure includes convolutional layers, pooling layers, and fully connected layers. CNN applications for gastric image analysis can be grouped into four main tasks based on the challenges endoscopists encounter in clinical practice: image classification, object detection, semantic segmentation, and instance segmentation. Figure 2 illustrates the differences among the four main tasks. Recently, recurrent neural networks (RNNs) and generative adversarial networks (GANs) have also been used to further improve the performance of CNN-based gastroscopic image processing methods with regard to these clinical challenges. Unlike CNNs, RNNs efficiently process time-series data because they can remember historical information. Gastroscopic image analysis benefits from combining the information in several adjacent oesophagogastroduodenoscopy (EGD) video frames, taking into account the time sequence of the input and the connection between the previous and next frames, and the internal memory structure of an RNN suits this scenario. Gated recurrent unit networks (GRUs) and long short-term memory (LSTM) networks are commonly used RNN architectures based on practical performance. GANs introduce the idea of adversarial training into deep learning, in which a discriminative model and a generative model compete. The discriminative model learns to distinguish real data from generated data, and the generative model learns to generate new data that conform to the probability distribution of real data. Through the adversarial training of the two neural networks, a GAN can effectively generate new data similar to real data.
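The competition between the two models is commonly written as a minimax objective. The form below follows the original GAN formulation; the symbols G (generator), D (discriminator), p_data (distribution of real images) and p_z (distribution of the random input to the generator) are standard notation and are not used elsewhere in this review.

```latex
\min_{G}\,\max_{D}\; V(D,G)
  = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z\sim p_{z}(z)}\big[\log\big(1-D(G(z))\big)\big]
```

The discriminator is rewarded for assigning high probability to real images and low probability to generated ones, while the generator is rewarded when its outputs are scored as real.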
GANs are used in gastroscopic image analysis mainly for image data enhancement, image style transfer, and image restoration, which address the shortage of endoscopic data and the poor-quality frames in EGD videos. Typical GAN algorithms include DCGANs, CGAN, and CycleGAN.  lists the famous GANs. The “Image classification task”, “Object detection task”, “Semantic segmentation task”, and “Instance segmentation task” sections provide a detailed introduction to the four main tasks of gastric image analysis using deep learning technology.
Image classification task
An image classification task determines the category of a given input image at the image level. It is a basic task in high-level image understanding and can be divided into binary- and multi-classification tasks. After multiple convolution-and-pooling operations via a CNN, an image is classified in the output layer following the requirements. The activation function of the output layer is the only difference between binary- and multi-classification tasks. An image classification task for gastric image analysis mainly includes determining whether a frame is an analysable information frame [12, 13] or contains a lesion [14,15,16,17,18,19,20,21,22], determining frame anatomical position [2, 12, 18, 23,24,25], and the classification of lesion features [19, 26,27,28,29,30,31,32,33,34,35,36,37,38,39]. Some classification networks with high performance in natural image classification, including AlexNet , VGG , GoogLeNet (Inception) series [42,43,44,45], ResNet , ResNeXt , DenseNet , SENet , SqueezeNet , and EfficientNet , can be used in EGD image classification.
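The difference in the output-layer activation can be illustrated with a minimal sketch. It assumes the usual choices, a sigmoid on a single logit for binary classification and a softmax over one logit per class for multi-classification; the logit values and the class meanings in the comments are hypothetical.

```python
import math


def sigmoid(logit):
    """Binary head: one logit -> probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Multi-class head: one logit per class -> probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits produced by the last layer of a classifier.
p_lesion = sigmoid(1.2)               # e.g. "contains a lesion" vs "no lesion"
p_sites = softmax([2.0, 0.5, -1.0])   # e.g. three anatomical positions
predicted_site = max(range(len(p_sites)), key=p_sites.__getitem__)
```

Everything before the output layer can stay the same in both settings, which is why a network pretrained for one classification task is often reused for another by replacing only this final layer.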
Object detection task
Object detection finds all objects in an image, gives the location of each using a bounding box, and classifies each object. An object detection network uses a classification network with a powerful feature extraction capability as its backbone and achieves its goals by changing the output layer structure. An object detection task for gastroscopic images involves detecting, boxing, and classifying lesions [52,53,54,55], artefacts [7, 56], and the anatomical structures of the stomach. The two main families of object detection algorithms are two-stage algorithms using candidate regions, such as RCNN, SPP-Net, fast RCNN, and faster RCNN, and one-stage algorithms based on regression, such as the YOLO series [61,62,63,64,65], SSD, CornerNet, ExtremeNet and CenterNet. While some classic object detection networks have achieved good results in gastroscopic image analysis, some SOTA algorithms with higher performance and less calculation time, such as EfficientDet and CentripetalNet, should be considered because a DL model will ultimately be applied to real-time clinical videos.
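Whether a predicted bounding box counts as a correct detection is usually decided by its overlap with the annotated box. The sketch below computes the intersection over union (IoU) of two boxes; the box coordinates and the 0.5 threshold are illustrative assumptions and are not taken from any of the studies reviewed here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_x1, inter_y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    inter_x2, inter_y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (40, 60, 140, 160)   # hypothetical box output by a detector
annotated = (50, 50, 150, 150)   # hypothetical box drawn by an expert
iou = box_iou(predicted, annotated)
is_hit = iou >= 0.5              # assumed matching threshold
```

With these example boxes the overlap is 8100 pixels and the union is 11,900 pixels, so the IoU is about 0.68 and the prediction would be counted as a hit.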
Semantic segmentation task
Semantic segmentation is a more fine-grained task than object detection: it classifies an image pixel by pixel, assigning a class to every pixel of the entire image. The height and width of the output are the same as those of the input image, and the number of output channels equals the number of categories, so that each channel holds the score of one category at every spatial location. In gastroscopic image analysis, it is mainly used to segment lesion [35, 39, 72, 73] and artefact boundaries and to estimate the depth of endoscopic images for 3D reconstruction of the stomach. Several classic algorithms, such as FCNs, SegNet, U-Net, and the DeepLab series [78,79,80,81], have been used in this field.
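The relation between the multi-channel output and the final label map can be sketched as follows. The example assumes the network output is available as nested lists indexed as scores[class][row][column]; the scores themselves are made up.

```python
def segment(scores):
    """Turn per-class score maps scores[c][y][x] into a label map labels[y][x]."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        # For every pixel, keep the index of the channel with the highest score.
        [max(range(n_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Two channels (0 = background, 1 = lesion) for a hypothetical 2 x 3 image.
scores = [
    [[0.9, 0.8, 0.2], [0.7, 0.1, 0.1]],
    [[0.1, 0.2, 0.8], [0.3, 0.9, 0.9]],
]
label_map = segment(scores)  # same height and width as the input image
```

The label map has one entry per input pixel, which is what allows a lesion boundary to be drawn directly on the endoscopic frame.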
Instance segmentation task
Instance segmentation distinguishes different instances of the same category. For example, semantic segmentation assigns the pixels of multiple lesions to a single category, “lesions”, whereas instance segmentation assigns each pixel to a specific lesion, such as “lesion 1”, “lesion 2” or “lesion 3”. It can be realized by boxing each instance with an object detection algorithm and then performing semantic segmentation within each bounding box. In gastroscopy, an instance segmentation task mainly detects lesions and delineates their margins. Mask RCNN, PANet, and CentripetalNet are among the leading algorithms for this task.
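The difference between the two kinds of output can be made concrete with a small sketch. It is not how Mask RCNN or the other networks named above work: it simply labels the connected regions of a binary “lesion” mask, as a stand-in that shows what a per-instance label map looks like. The mask is hypothetical, and lesions that touch each other would be merged by this simplification.

```python
from collections import deque


def label_instances(mask):
    """Give each 4-connected region of 1s in a binary mask its own id (1, 2, ...)."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_id = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or labels[y][x] != 0:
                continue  # background pixel, or already assigned to a region
            next_id += 1
            labels[y][x] = next_id
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of the current region
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
    return labels, next_id


# Semantic output: every lesion pixel carries the same class, 1.
semantic_mask = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Instance-style output: pixels carry "lesion 1" or "lesion 2".
instance_map, n_lesions = label_instances(semantic_mask)
```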
Deep learning application to disease-related gastroscopy challenges
At this time, available DL models are not like human endoscopists, who can screen multiple diseases and take a biopsy for qualitative analysis at the same time during a gastroscopy. Most gastroscopy DL applications focus on a single disease and achieve a specific clinical task. Therefore, we divide stomach diseases into three categories, Helicobacter pylori (HP), gastric cancer (GC), and other precancerous diseases, and introduce the application of DL in solving specific clinical tasks related to each.
Helicobacter pylori
Helicobacter pylori infection causes chronic atrophic gastritis and intestinal metaplasia, which increase the risk of gastric cancer. Approximately 90% of noncardia gastric cancers are related to HP infection [85,86,87,88]. The redness and swelling of the gastric mucosa during an endoscopy inspection can be used to diagnose an HP infection. However, it is time-consuming, and the accuracy of the results depends on the skill of the endoscopist. Recently, some articles reported a method that detects and diagnoses HP infection using a deep learning model. Itoh et al.  first developed a convolutional neural network using GoogLeNet to differentiate HP-positive from HP-negative in white-light imaging (WLI) images and showed a sensitivity, specificity, and area under the curve (AUC) of 86.7%, 86.7%, and 0.956, respectively. In addition, Zheng et al.  developed a CNN model using ResNet-50 to evaluate HP infection and obtained similar results. Shichijo et al.  constructed a convolutional neural network (GoogLeNet) to ascertain HP infection statuses, including HP-positive, HP-negative, and HP-eradicated. A total of 23,699 images from 847 patients were used to validate the algorithm and showed a diagnostic accuracy of 80%, 84%, and 48% for negative, eradicated, and positive, respectively, similar to the results of experienced endoscopists . Nakashima et al.  constructed a GoogLeNet-based DL model to predict the HP infection status in WLI, blue-light imaging (BLI), and linked colour imaging (LCI) images. The AUCs were 0.66, 0.96, and 0.95 for WLI, BLI, and LCI, respectively. Nakashima et al.  developed two DL models using a 22-layer skip-connection architecture to classify the HP infection status into three similar categories for WLI and LCI images (as shown in Fig. 3). A validation dataset of endoscopic videos of 120 subjects was developed to evaluate computer-aided diagnosis (CAD) systems. 
Comparisons revealed that LCI-based DL diagnoses were more accurate than WLI-based DL diagnoses [uninfected (84.2% vs. 75.0%), currently infected (82.5% vs. 77.5%) and post-eradication (79.2% vs. 74.2%)], indicating that a DL model with image-enhanced endoscopy is a more powerful image diagnostic tool for HP infection than conventional white-light endoscopy.
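Several of the studies above report an area under the curve (AUC). One way to read this value is as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition; the scores are invented for illustration and do not come from any cited study.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for pos in scores_pos:
        for neg in scores_neg:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


hp_positive = [0.9, 0.8, 0.4]       # hypothetical model scores, infected cases
hp_negative = [0.7, 0.3, 0.2, 0.1]  # hypothetical model scores, uninfected cases
value = auc(hp_positive, hp_negative)  # 11 of 12 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts values such as 0.66 for WLI and 0.96 for BLI in context.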
Gastric cancer
Gastric cancer is a common gastrointestinal tumour with rapid progression and high mortality that seriously threatens human life and health [90, 91]. Gastroscopy and pathological biopsy are the gold standards for gastric cancer diagnosis. However, gastroscopy depends on equipment and the diagnostic ability of endoscopists. Therefore, several deep learning models have been recently developed to assist in diagnosing various aspects of gastric cancer.
Gastric cancer prognosis is related to detection time. The 5-year survival rate of advanced gastric cancer is less than 30%, even after surgical treatment. Meanwhile, radical treatment under endoscopy can be used for most cases of early gastric cancer (EGC), with a 5-year survival rate of more than 90%. However, early gastric cancer usually does not have obvious characteristics under endoscopy; only slight local mucosal changes occur, which are difficult to detect. Hirasawa et al. first developed a CNN using single-shot multibox detection (SSD) to automatically detect gastric cancer in endoscopic images. A total of 13,584 endoscopic images were used, and the model correctly detected 71 of 77 GC lesions (92.2% sensitivity) in 2296 stomach images, requiring only 47 s. The missed lesions were superficially depressed and differentiated-type intramucosal cancers, which can be easily misdiagnosed as gastritis. Hirasawa et al. also applied the technology to real-time GC detection in videos. The CNN correctly detected 64 of 68 EGC lesions (94.1% sensitivity) from 68 endoscopic submucosal dissection (ESD) procedures for EGC in 62 patients. The median time for lesion detection after the first appearance on the screen was 1 s. A sample image for the early detection of gastric cancer using their CNN system is shown in Fig. 4. Moreover, they compared the detection ability between the CNN and endoscopists. An independent test set of 2940 images from 140 cases was used for validation. The CNN system showed a significantly higher sensitivity than the 67 endoscopists (58.4% vs. 31.9%) at a faster detection speed (45.5 s vs. 173.0 min). Sakai et al. proposed a GoogLeNet-based model to detect EGC under WLI. The accuracy, sensitivity, and specificity were 87.6%, 80.0%, and 94.8%, respectively. Luo et al. developed another DL system named GRAIDS using DeepLabv3+ to detect upper gastrointestinal (GI) cancer. 
This multicentre, case–control study was performed in six hospitals of different tiers in China. The model was trained and tested on 1,036,496 endoscopic images from 84,424 individuals, which is the largest dataset in this research area to date. GRAIDS showed a sensitivity comparable to that of expert endoscopists (0.942 vs. 0.945) and was superior to competent (0.858) and trainee (0.722) endoscopists. Wu et al.  built a deep convolutional neural network (DCNN) to detect EGC in real-time unprocessed EGD videos and designed a man–machine competition. The DCNN detected EGC with an accuracy of 92.5%, a sensitivity of 94.0%, a specificity of 91.0%, a positive predictive value (PPV) of 91.3%, and a negative predictive value (NPV) of 93.8%, greater than that of endoscopists at all levels. Wang et al. constructed a cloud-based image analysis service to enhance GC screening . Their study is unique due to its deployment method and result integration of all trained CNNs (AlexNet, GoogLeNet, and VGGNet) to obtain the final prediction. The sensitivity of the proposed approach (79.6%) was significantly greater than that of other single-CNN models (61.5% for AlexNet, 68.8% for GoogLeNet, and 69.7% for VGGNet). Yoon et al.  developed a VGG-16-based DL system to detect EGC. The model showed a sensitivity of 91.0% and an AUC of 0.981. Shibata et al.  developed a Mask R-CNN-based detection method for EGC. They collected 1208 healthy and 533 cancer images to perform fivefold cross-validation. The results showed 96% sensitivity with only 0.10 false positives per image, which is acceptable for endoscopists in clinical practice if the performance is not significantly influenced after being applied to video images.
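The accuracy, sensitivity, specificity, PPV and NPV quoted throughout this section all derive from the four cells of a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical: 100 cancer and 100 noncancer cases were assumed because they reproduce the percentages reported above for the DCNN of Wu et al., and they are not taken from that study. The second calculation uses the 71 of 77 lesions reported for Hirasawa et al.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from true/false positive and negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of cancers that were detected
        "specificity": tn / (tn + fp),   # share of noncancers correctly cleared
        "ppv": tp / (tp + fp),           # share of positive calls that were cancer
        "npv": tn / (tn + fn),           # share of negative calls that were noncancer
    }


# Assumed counts (hypothetical) consistent with the reported percentages.
metrics = diagnostic_metrics(tp=94, fp=9, tn=91, fn=6)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}

# Sensitivity from the lesion counts given in the text: 71 of 77 detected.
hirasawa_sensitivity = round(100 * 71 / 77, 1)
```

Sensitivity and specificity do not depend on how common cancer is in the test set, whereas PPV and NPV do, so PPV and NPV from enriched test sets should not be carried over directly to screening populations.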
Unlike GC detection, which emphasizes sensitivity to reduce the rate of missing lesions, in GC diagnosis, a DL model distinguishes benign lesions from GC, emphasizing accuracy, reducing unnecessary biopsies, and minimizing costs. Cho et al.  established CNN models using Inception-Resnet-v2 to automatically classify gastric neoplasms under WLI into five categories [advanced gastric cancer (AGC), early gastric cancer (EGC), high-grade dysplasia (HGD), low-grade dysplasia (LGD), and nonneoplasm]. The CNN model showed lower performance in the prospective validation using 200 images from 200 patients compared with the best endoscopists (five-category accuracy 76.4% vs. 87.6%; cancer 76.0% vs. 97.5%; neoplasm 73.5% vs. 96.5%) but was comparable to that of the worst endoscopist (cancer accuracy 76.0% vs. 82.0%), indicating potential clinical application in classifying gastric cancer or neoplasm. Lee et al.  constructed three CNNs using ResNet-50, VGG-16, and Inception-v4 to differentiate GC from gastric ulcers. ResNet-50 had the highest performance with 77.1% accuracy. Zhang et al.  developed a CNN system using ResNet34 and DeepLabv3 to assist the diagnosis of GC and other gastric lesions. The model was trained on 21,217 images, including five gastric conditions [peptic ulcer (PU), early gastric cancer (EGC), high-grade intraepithelial neoplasia (HGIN), advanced gastric cancer (AGC), submucosal tumours (SMTs)] and normal gastric mucosa. In addition, 1091 other images were used to evaluate the model. The diagnostic accuracy, specificity, and PPV of the CNN were higher than those of endoscopists with over 8 years of experience (accuracy: 78.7% vs. 74.2%; specificity: 91.2% vs. 86.7%; PPV: 55.4% vs. 41.7%). 
GC diagnosis cannot achieve high accuracy under WLI. Image-enhanced endoscopy, such as ME-NBI, provides more structural information on the mucosa and capillaries and is more accurate for distinguishing GC from benign lesions. However, its efficiency relies on endoscopist experience, and endoscopists require substantial effort to learn the skill. Therefore, DL models in this field have been extensively researched. Hu et al. developed a VGG-19-based DL model (accuracy, 77.0%) to classify EGC and noncancerous lesions. Li et al. also developed a CNN system using Inception-v3 to differentiate EGC from noncancerous lesions using ME-NBI images (accuracy, 90.91%; sensitivity, 91.18%; and specificity, 90.64%). Liu et al. developed a ResNet-50-based CNN to classify ME-NBI endoscopic images into chronic gastritis (CGT), low-grade neoplasia (LGN), and EGC (accuracy, 0.96). Examples of the original ME-NBI image and the feature extraction procedure for its classification are provided in Fig. 5. In addition, Ueyama et al. constructed a CNN using ResNet-50 to differentiate EGC from noncancerous mucosa and lesions. A total of 2300 ME-NBI images were used, and the model performed very well: the overall accuracy, sensitivity, specificity, PPV and NPV were 98.7%, 98%, 100%, 100% and 96.8%, respectively. The total time for analysing the test dataset was only 60 s. Horiuchi et al. reported a GoogLeNet-based CNN system to distinguish EGC from gastritis. A total of 1492 EGC and 1078 gastritis ME-NBI images were used for training, and 151 EGC and 107 gastritis images were used to evaluate the diagnostic ability. The accuracy of the model reached 85.3%, and the overall test speed was 0.02 s/image. Horiuchi et al. also conducted a video-based evaluation to compare the performance between expert endoscopists and the CNN model. The study included 174 ME-NBI videos (87 cancerous and 87 noncancerous) and 11 experts. 
The CNN model achieved an accuracy of 85.1%, which was significantly higher than that of two experts, less than that of one expert, and not significantly different from that of the remaining eight experts.
GC type classification
Identifying the type of GC, such as the differentiation status, accurately is critical for determining the surgical strategy and treatment plan. GC with different differentiation statuses shows an obvious difference in images under narrow-band imaging (Fig. 6). Therefore, it can be classified using a deep learning method. Ling et al.  developed a real-time system using VGG-16 to accurately identify the EGC differentiation status from ME-NBI endoscopy. A total of 2217 images from 145 EGC patients and 1870 images from 139 EGC patients were retrospectively collected to train and test the CNN. The performance of the CNN was then compared with that of experts using 882 images from 58 EGC patients. The system correctly predicted the differentiation status of EGCs with an accuracy of 83.3% on the test dataset and achieved superior performance compared with the five experts (86.2% vs. 69.7%). Furthermore, the system was successfully used on real EGC videos.
Determination of GC invasion depth
GC invasion depth is essential in determining the treatment method. For GC in the mucosa or superficial submucosa, endoscopic submucosal dissection (ESD) can be used for radical GC treatment without surgery or chemotherapy because it is minimally invasive and requires only a short hospital stay. However, there are limitations in clinical practice because endoscopists measure the exact depth based on the overall findings and personal experience. Yoon et al.  used a VGG-16 model to classify EGC endoscopic images as T1a (intramucosal) or T1b (submucosal). A total of 11,686 endoscopic images were used to perform fivefold cross-validation, and the AUC for depth prediction reached 0.851. However, undifferentiated-type GC showed a lower accuracy than differentiated-type GC. Zhu et al.  constructed a ResNet-50-based CNN to determine the invasion depth of GC in the mucosa or superficial submucosa (M/SM1) and deep submucosa (SM2). The model obtained an overall accuracy of 89.16%, specificity of 95.56%, PPV of 89.66%, and NPV of 88.97%. The accuracy and specificity were significantly higher than those of endoscopists. Furthermore, Cho et al.  developed a CNN based on DenseNet-161 to discriminate the mucosa-confined and submucosa-invaded GC invasion. The model showed excellent performance. The model accurately identified 6.7% of patients who underwent gastrectomy in an external test for potential ESD, preventing unnecessary operation.
GC margin delineation
It is important to first delineate the GC margin accurately before ESD to achieve endoscopic curative resection in EGC patients. An et al. used a real-time fully convolutional network (UNet++) to delineate the resection margin of EGC under indigo carmine (IC) chromoendoscopy (CE) or white-light endoscopy (WLE). The system (ENDOANGEL) showed an accuracy of 85.7% on the CE images and 88.9% on the WLE images under an overlap ratio threshold of 0.60 relative to expert-labelled manual markers. The system was also tested on ESD videos, and ENDOANGEL predicted the regions covering all areas of high-grade intraepithelial neoplasia and cancers. An et al. also developed a real-time system to accurately delineate EGC margins on ME-NBI endoscopy using the same UNet++ architecture. A total of 928 images from 132 EGC patients and 742 images from 87 EGC patients were used to train and test the system. The model showed an accuracy of 82.7% in differentiated EGC and 88.1% in undifferentiated EGC under an overlap ratio of 0.80. This system achieved superior performance compared with experts and was successfully tested on real-time EGC videos. Shibata et al. developed a segmentation method using Mask R-CNN for EGC regions (as shown in Fig. 7). A total of 1208 healthy and 533 cancer images were collected, and the performance was evaluated via fivefold cross-validation. The average Dice index was 71%, indicating that the proposed scheme is useful for evaluating the invasion region.
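Margin delineation is scored by comparing the predicted region with an expert-drawn region. The sketch below computes the Dice index and an overlap ratio for two binary masks. The text does not define the overlap ratio used by the cited studies, so intersection over union is assumed here; the masks are hypothetical.

```python
def dice_and_overlap(pred, truth):
    """Dice index and intersection over union for two binary masks of equal size."""
    pred_px = {(y, x) for y, row in enumerate(pred) for x, v in enumerate(row) if v}
    true_px = {(y, x) for y, row in enumerate(truth) for x, v in enumerate(row) if v}
    inter = len(pred_px & true_px)
    total = len(pred_px) + len(true_px)
    union = total - inter
    # Two empty masks agree perfectly, so both scores default to 1.0.
    dice = 2 * inter / total if total else 1.0
    overlap = inter / union if union else 1.0
    return dice, overlap


predicted = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 1, 0]]   # hypothetical model output, 5 lesion pixels
expert = [[0, 1, 1, 1],
          [0, 1, 1, 1],
          [0, 0, 0, 0]]      # hypothetical expert annotation, 6 lesion pixels
dice, overlap = dice_and_overlap(predicted, expert)
passes_060 = overlap >= 0.60  # threshold of the kind used to count a margin as correct
```

For the same pair of masks the Dice index is never lower than the intersection over union (here 8/11 versus 4/7), so thresholds stated for one measure cannot be compared directly with values of the other.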
Precancerous conditions
While most precancerous conditions in the stomach are benign and harmless, they can develop into gastric cancer if not diagnosed and treated early. Zhang et al. developed an SSD-based CNN named SSD-GPNet to detect gastric polyps. The network could realize real-time polyp detection at 50 fps and improve the mean average precision (mAP) to 90.4%. Some examples of the results are shown in Fig. 8. Further experiments showed that the network improved polyp detection by over 10%, especially for small polyps. Yan et al. constructed a CNN (EfficientNetB4) using NBI and ME-NBI images to diagnose gastric intestinal metaplasia (GIM). A separate dataset of 477 images (242 GIM and 235 non-GIM) was used as the test set. The performance of the system was not significantly different from that of human experts (sensitivity 91.9% vs. 86.5%; specificity 86.0% vs. 81.4%; accuracy 88.8% vs. 83.8%). Figure 9 displays the classification decision procedure of a CNN using the Grad-CAM method. Zhang et al. constructed a CNN named CAG-Net using DenseNet121 to improve the diagnostic rate of chronic atrophic gastritis (CAG). Fivefold cross-validation was used to train and verify the model (3042 atrophic gastritis images and 2428 normal images). The diagnostic accuracy, sensitivity, and specificity of the model were 0.942, 0.945, and 0.940, respectively. The detection rates of mild, moderate, and severe atrophic gastritis were 93%, 95%, and 99%, respectively. Figure 10 shows interpretable heat maps of the CAG automatic diagnosis procedure.
Deep learning application to nondisease-related gastroscopy challenges
The DL technologies discussed in Chapter III can reach or even exceed experienced endoscopists in many disease-related clinical tasks. However, if a lesion has never entered the viewshed of the gastroscope due to incomplete inspection or the poor quality of video frames during gastroscopy, these systems do not work at all. Therefore, some deep learning technologies not directly related to gastric diseases have also been applied to improve the quality of gastroscopy.
Informatic frame screening
The video stream in clinical endoscopy can output 30 or 60 image frames per second, many of which are useless frames that carry no information. A deep learning model cannot analyse useless frames reliably because of their poor image quality or inappropriate imaging modality. Such frames produce unreliable results, mislead endoscopists, waste considerable computing power, and decrease the real-time performance of the system. Wu et al. developed a DCNN using VGG-16 to identify informatic frames. A total of 12,220 in vitro, 25,222 in vivo, and 16,760 unqualified EGD images from over 3000 patients were used to train the network to identify whether a frame was captured outside the body and whether its quality was sufficient for the next-step analysis. A total of 3000 images (1000 per category) were randomly selected to test the model (accuracy, 97.55%). In addition, Zhang et al. constructed a model of seven convolutional layers, one max-pooling layer, and one fully connected layer to classify video frames into three categories (NBI, informative and noninformative images). The workflow and example results of their proposed method are illustrated in Fig. 11. A total of 34,145 images were used for training, and 6000 images were used for testing (accuracy, 98.77%). Therefore, DL models can screen informatic frames as a preprocessing procedure. Other critical and computationally intensive models can then run only on the informatic frames, reducing the false-positive rate and leading to better real-time performance.
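The preprocessing role of frame screening can be sketched as a simple gate in front of the expensive model. The example is only an outline of the control flow: the two functions passed in stand for a lightweight screening model and a heavier lesion model, and the "sharpness" field, its 0.5 threshold and all names are hypothetical.

```python
def screen_frames(frames, is_informative, analyse):
    """Run the expensive `analyse` step only on frames that pass screening."""
    results = []
    for index, frame in enumerate(frames):
        if is_informative(frame):
            results.append((index, analyse(frame)))
    return results


# Stand-ins for real models: each frame is a dict with a made-up quality score.
frames = [{"sharpness": 0.9}, {"sharpness": 0.1}, {"sharpness": 0.7}]
analysed_frames = []


def toy_screen(frame):
    return frame["sharpness"] >= 0.5


def toy_detector(frame):
    analysed_frames.append(frame)  # records how often the costly step runs
    return "analysed"


kept = screen_frames(frames, toy_screen, toy_detector)
```

In this toy run the blurred middle frame never reaches the detector, so the costly step runs twice instead of three times and cannot raise a false alarm on that frame.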
Anatomical classification
Even when an endoscopist can recognize every gastric cancer that appears under endoscopy, some lesions can be missed because of the wide, curved stomach lumen. Although guidelines for mapping the entire stomach exist, they are often not well followed. Therefore, it is important to develop a practicable and reliable algorithm to guide endoscopists to examine the stomach comprehensively. Takiyama et al. constructed a CNN using GoogLeNet to classify the anatomical location of EGD images into the pharynx, oesophagus, upper stomach, middle stomach, lower stomach, and duodenum. An independent validation set of 17,081 EGD images was used to evaluate the model. The model showed an AUC of 1.00 for laryngeal and oesophageal images and 0.99 for stomach and duodenal images. Wu et al. built a system (WISENSE, currently ENDOANGEL) to classify the anatomical locations of EGD images into 10 and 26 parts. The DCNN showed accuracies of 90% and 65.9% on real-time EGD videos for the two location classification tasks, respectively, comparable to the performance of experts (63.8%). Wu et al. also evaluated the system in a randomized controlled trial to ascertain whether it can reduce the blind spot rate. The blind spot rate was significantly lower in the WISENSE group than in the control group (5.86% vs. 22.46%). Additionally, a clinical trial was conducted to compare the performance of unsedated ultrathin transoral endoscopy (U-TOE), unsedated conventional oesophagogastroduodenoscopy (C-EGD), and sedated C-EGD with or without the system. The blind spot rate was lowest with sedated C-EGD, and the DL system reduced this rate to 3.42%. Because of the fine-grained division of anatomical locations and the variations in EGD performance among individuals in practice, it is difficult to provide an accurate label using a single frame. Therefore, it is more practicable to use information from several adjacent frames. 
However, a CNN can only analyse frames independently. Li et al. combined a DCNN (Inception-v3) with an LSTM to develop a system (IDEA) that monitors blind spots during real-time EGD. A total of 170,297 images and 5779 endoscopic videos were used. The model could divide the EGD examination into 31 sites from the hypopharynx to the duodenum. Representative images identified by IDEA are shown in Fig. 12. In addition, an independent dataset of 3100 EGD images and 129 videos was used to evaluate its performance. The system showed a sensitivity, specificity, and accuracy of 97.18%, 99.91%, and 99.83%, respectively, for images and 96.29%, 93.32%, and 95.30%, respectively, for videos. Furthermore, running on an NVIDIA GTX 1080 Ti, a widely used and affordable GPU, IDEA could process one image in 80 ms, thus meeting the real-time requirement. Zhang et al. designed a CNN using SSD to detect 10 anatomical structures of the upper digestive tract in real time. The method showed a precision of 93.74%. The abovementioned studies are WLI-based. However, image enhancement techniques such as NBI are also commonly used in clinical practice. Igarashi et al. developed an algorithm using AlexNet to classify EGD images into 14 precise anatomical categories under different image-capture conditions. The model showed an accuracy of 0.965 on the validation dataset of 36,072 images.
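To illustrate why adjacent frames help and how a blind spot rate can be tallied, the sketch below smooths per-frame site predictions with a trailing majority vote and then counts the required sites that were never observed. It is a deliberately simple stand-in, not the LSTM used in IDEA or the logic of WISENSE; the function names, the window size, and the four-site list are assumptions chosen for illustration.

```python
from collections import Counter, deque
from typing import Iterable, List, Sequence

# Hypothetical, heavily shortened site list; real systems use 10 to 31 sites.
REQUIRED_SITES = ["antrum", "body", "fundus", "cardia"]


def smooth_labels(labels: Iterable[str], window: int = 5) -> List[str]:
    """Replace each per-frame label with the majority label of a trailing window."""
    recent = deque(maxlen=window)
    smoothed = []
    for label in labels:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


def blind_spot_rate(observed: Iterable[str], required: Sequence[str]) -> float:
    """Fraction of required sites that never appear in the observed labels."""
    seen = set(observed)
    missing = [site for site in required if site not in seen]
    return len(missing) / len(required)


# A single spurious "fundus" prediction appears in the raw per-frame output.
raw = ["antrum", "antrum", "fundus", "antrum", "antrum", "body", "body", "body"]
smoothed = smooth_labels(raw, window=3)
rate = blind_spot_rate(smoothed, REQUIRED_SITES)
print(smoothed, rate)
```

With a window of three frames the isolated "fundus" label is voted out, so the fundus is correctly reported as not yet examined rather than being ticked off by a single misclassified frame.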
Several artefacts, including motion blur, defocus, specular reflection, over- and underexposure of image regions, and the presence of bubbles, fluids and artificial devices, corrupt over 60% of an endoscopy video frame, thus influencing the visual interpretation of the mucosal surface and significantly impeding the detection and quantitative analysis of lesions. Therefore, it is important to identify and localize artefacts to restore video frame quality before developing other computer-assisted diagnosis algorithms. Figure 13 shows the results of three SOTA detection baselines on this task. Ali et al. proposed a deep learning framework to detect and classify six different primary artefacts and restore mildly corrupted frames. The method showed the highest mAP of 49.0 and the lowest computational time of 88 ms. On 10 test videos, the restoration model preserved an average of 68.7% of frames, which is 25% more than were retained from the raw videos. Ali et al. also held a computer vision challenge named Endoscopy Artefact Detection (EAD 2019 and EAD 2020) and presented a comprehensive analysis of the submissions to EAD 2019 and EAD 2020.
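Detection metrics such as mAP are built on the overlap between a predicted box and an annotated box, usually measured as intersection over union (IoU). The sketch below shows that building block only; the box format `(x1, y1, x2, y2)` and the function name are assumptions, and the exact evaluation protocol of the EAD challenge is not reproduced here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted_bubble = (0.0, 0.0, 2.0, 2.0)
annotated_bubble = (1.0, 1.0, 3.0, 3.0)
print(iou(predicted_bubble, annotated_bubble))  # overlap 1, union 7
```

A prediction is typically counted as correct when its IoU with an annotation exceeds a chosen threshold, and average precision is then computed per artefact class from the resulting hits and misses.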
Depth estimation and 3D reconstruction of the stomach
Conventional gastroscopy lacks 3D vision and proper depth perception, which significantly limits diagnostic examinations and therapy delivery. 3D surface reconstruction technology enhances doctors' scene perception in an augmented reality (AR) system, helping to prevent surgical risks caused by low visibility and inexperience. In addition, 3D structural information can significantly improve diagnostic and surgical performance. Figures 14 and 15 explain the procedure of depth estimation and 3D reconstruction.
Recently, Widya et al. [6, 99, 100] used chromoendoscopy video, in which indigo carmine (IC) dye is sprayed on the stomach surface, to reconstruct the entire 3D shape of the stomach with mucosal surface details via the structure from motion (SFM) method. The red channel data yielded complete and comprehensive results. To improve on this work, a generative adversarial network (GAN) was trained for image-to-image style translation between no-IC and IC-sprayed images. As a result, complete stomach 3D reconstruction can be performed without IC dyeing. Ozyoruk et al. proposed an unsupervised monocular depth and pose estimation method that combines residual networks with spatial attention modules to focus on different and highly textured tissue regions. Moreover, they built a comprehensive endoscopic simultaneous localization and mapping (SLAM) dataset consisting of 3D point cloud data from ex vivo porcine gastrointestinal (GI) tract organs.
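Once a depth value is available for each pixel, a 3D point cloud follows from the standard pinhole camera model. The sketch below shows that back-projection step in isolation; it assumes an ideal pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy` are illustrative values) and ignores lens distortion, which is substantial in real endoscopes. It is not the pipeline of Widya et al. or Ozyoruk et al.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map pixel (u, v) with a depth value to camera-frame 3D coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_map_to_points(depth_map: Sequence[Sequence[float]],
                        fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Back-project every pixel with a valid (positive) depth."""
    points = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth > 0:  # zero marks pixels with no depth estimate
                points.append(backproject(u, v, depth, fx, fy, cx, cy))
    return points


# Tiny 2x2 depth map in arbitrary units; one pixel has no estimate.
depth_map = [[0.0, 5.0],
             [5.0, 5.0]]
cloud = depth_map_to_points(depth_map, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
print(len(cloud), backproject(120.0, 100.0, 20.0, 100.0, 100.0, 100.0, 100.0))
```

Point clouds from successive frames can then be aligned using the estimated camera poses to build up a surface, which is the role played by SFM or SLAM in the methods cited above.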
In recent years, an increasing number of DL algorithms have been developed and successfully applied to natural image processing, driven by advances in deep learning theory and continuous improvement in hardware performance. The use of deep learning in gastroscopy-assisted diagnosis is a new research hotspot. This review included 40 related papers, and the number of papers published has increased year by year. The articles included 29 applications related to diseases (see Table 1, mainly gastric cancer and Helicobacter pylori infection) and 10 not related to diseases (see Table 2, mainly monitoring the anatomical structure of the stomach to reduce blind spots). One paper also reported a system combining disease-related and nondisease-related applications to automatically detect EGC without blind spots. Figure 16 summarizes the publications cited in this review.
To date, some systems using DL in gastroscopy have worked under real-time video conditions and achieved technical indicators comparable to those of expert endoscopists in both disease-related and nondisease-related applications. However, some key issues should be addressed before clinical use. First, most studies used retrospective datasets based on high-quality static images. When these models are applied to real-time video analysis, performance tends to drop because video frames are of relatively poor quality. Therefore, more prospective studies using video images are needed. Additionally, the datasets used in current research are small because of patient privacy and the high cost of labelling images, and non-negligible selection bias exists. Although the performance reported in each study was high, the algorithms cannot be compared because there is no unified benchmark dataset comparable to ImageNet and MS COCO for natural image analysis. Large-scale open-access databases, similar to the SUN database for colonoscopy, are needed. Furthermore, the clinical value can only be known by deploying the system in hospitals, which requires the approval of relevant regulatory authorities. Although some regulatory-approved DL systems are available for colonoscopy [104,105,106,107], there is no such system for gastroscopy. Therefore, major regulatory authorities [Food and Drug Administration (FDA, US); Pharmaceuticals and Medical Devices Agency (PMDA, Japan); National Medical Products Administration (NMPA, China); European Conformity (CE, Europe)] should give more attention to regulatory considerations for deep learning technologies in gastroscopy.
Future perspective for disease-related DL application to gastroscopy
It is necessary to develop a system that can detect the key diseases of the stomach simultaneously, so that the system is comprehensive in a pathological sense. The systems reviewed here are each sensitive to only one disease, such as GC or HP, and are mutually exclusive. This is not effective in clinical practice and can easily hinder an endoscopist's examination. For instance, an HP detection system is not sensitive to GC lesions. Such a system cannot alert the endoscopist when a GC lesion appears on the screen, which leads to missed lesions.
Furthermore, the system should achieve higher performance on some disease subtypes that endoscopists easily miss, such as lesions with a specific pathological status, a specific location, or a specific size; otherwise, if high technical metrics are achieved only on some lesions that are rarely ignored by physicians, then the system will have no great clinical significance.
In addition, deep learning applications in gastroscopy lag most cutting-edge research by 2 to 3 years. Most state-of-the-art deep learning algorithms have not yet been applied to disease screening under gastroscopy. For instance, a 3D object detection algorithm can significantly improve the detection of flat lesions compared with a 2D object detection algorithm because it provides depth information. Algorithms sensitive to small objects occupying only a few image pixels, camouflaged objects that are difficult to distinguish from the background, and few-shot or even zero-shot objects that rarely appear in small datasets have been developed and applied to natural images and are relevant to gastroscopic lesion detection. However, they have not yet been applied to gastroscopic image analysis. In this research field, researchers often take algorithms that have achieved good results on natural images and apply transfer learning directly, without adapting the network structure using prior knowledge to make it more suitable for endoscopic image analysis. However, endoscopic images differ significantly from natural images in colour and texture. Therefore, doctors need to cooperate with DL algorithm engineers.
Future perspective for nondisease-related DL application to gastroscopy
Deep learning for nondisease-related applications enables disease-related applications to achieve better performance.
First, a nondisease-related DL model should help a disease-related model detect and diagnose lesions efficiently enough to meet the real-time requirement. Therefore, more lightweight models with fewer parameters and fewer inference calculations should be adopted. In addition, the model should screen out frames with no information (motion blur, defocus, etc.) and those captured with an unsuitable imaging modality (WLI, NBI, ME, etc.). A relatively time-consuming disease-related model should analyse only the informative frames that remain after screening. In addition, the most appropriate endoscopy imaging modality for each task setting should be clarified.
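The real-time requirement can be made concrete with a little arithmetic: at a given frame rate there is a fixed time budget per frame, and a model slower than that budget can only analyse every n-th frame. The sketch below uses the 30 and 60 frames per second and the 80 ms per image figures quoted earlier in this review; the function names are hypothetical and the calculation ignores pipelining and batching.

```python
import math


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def frames_per_analysis(fps: float, latency_ms: float) -> int:
    """Smallest stride n such that analysing every n-th frame keeps up with the stream."""
    return max(1, math.ceil(latency_ms * fps / 1000.0))


for fps in (30, 60):
    budget = frame_budget_ms(fps)
    stride = frames_per_analysis(fps, latency_ms=80.0)
    print(f"{fps} fps: {budget:.1f} ms per frame, analyse every {stride} frame(s)")
```

Under these assumptions an 80 ms model keeps pace with a 30 fps stream only by analysing every third frame, which is one reason lightweight models and frame screening are attractive.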
For the gastroscopy coverage rate, a nondisease-related DL model should enable a disease-related model to inspect the stomach comprehensively, covering the entire mucosal surface without visual obstruction. Combinations of deep neural networks such as CNNs, RNNs and GANs should be explored. Currently, researchers perform anatomical classification of video frames to ensure that gastroscopy is comprehensive. However, the performance of this method decreases as the anatomical classification becomes more detailed (for example, when the stomach is divided into 26 regions rather than 10). Combining a CNN with an RNN, which is better suited to sequential video data, has significantly improved performance and supported classification into as many as 31 regions.
Furthermore, some significant additional functions for gastroscopy can be realized using DL technology to overcome clinical limitations. For instance, monocular visual odometry with deep learning could be used to accurately measure lesion size, which is important for the diagnosis, treatment, and prognosis of a lesion. Currently, endoscopists estimate lesion size by comparing the lesion with a reference object such as biopsy forceps, which introduces non-negligible errors. In some nonmedical fields, such as autonomous driving, visual measurement technology based on deep learning is a hot research direction. In the field of endoscopy, the newest research showed clear boundaries in estimated depth by resampling pixels around occlusion boundaries. One obstacle identified when depth reconstruction was first applied to colonoscopy is that tissue texture is patient-specific. Monocular methods are attractive because they require no additional attachments. However, the images obtained are identical when the camera motion (trajectory) and the scene are scaled by the same factor, since the epipolar constraint is an equation equal to 0 and is unchanged by such scaling. Therefore, the absolute object scale cannot be obtained via monocular-based methods alone. Solving the lack of a measurement scale in monocular endoscopy will become an important challenge.
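The scale ambiguity described above can be checked numerically. The sketch below projects a 3D point into two views of an ideal pinhole camera, then scales both the scene and the camera translation by the same factor and projects again. The camera is assumed to translate without rotating and to have unit focal length; all names and numbers are illustrative.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def project(point: Vec3, cam_pos: Vec3, focal: float = 1.0) -> Tuple[float, float]:
    """Pinhole projection for a camera at cam_pos looking along +z (no rotation)."""
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    return (focal * x / z, focal * y / z)


def scale(vec: Vec3, factor: float) -> Vec3:
    return (vec[0] * factor, vec[1] * factor, vec[2] * factor)


lesion_point = (1.0, 2.0, 10.0)   # a point on the mucosa, arbitrary units
cam_first = (0.0, 0.0, 0.0)       # camera position in the first frame
cam_second = (0.5, 0.0, 0.0)      # camera position after a small sideways move

original = [project(lesion_point, cam_first), project(lesion_point, cam_second)]

# Make the scene four times larger and the camera motion four times longer.
factor = 4.0
scaled = [project(scale(lesion_point, factor), scale(cam_first, factor)),
          project(scale(lesion_point, factor), scale(cam_second, factor))]

same = all(math.isclose(a, b) for o, s in zip(original, scaled) for a, b in zip(o, s))
print(original, scaled, same)
```

Both configurations produce the same pixel coordinates in both views, so no algorithm that sees only the images can tell a small nearby lesion from a large distant one. Recovering metric size therefore needs extra information, such as an object of known size in view or a known camera displacement.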
Promising techniques and approaches of DL for gastroscopy
Currently, several cutting-edge DL technologies have attracted widespread attention in the field of natural image processing. They have been proven to bring great improvements in medical image processing, such as MRIs, CTs, and X-rays, but have never been applied to gastroscopic image processing.
In terms of network architecture, a transformer based on an attention mechanism can extract more global image features than a CNN. Representative approaches such as ViT, DETR, SETR, and Swin-T have obtained better results than CNNs for the classification, detection, and segmentation of natural images. In the field of medical image processing, recent models such as MedT, Swin-UNet, and SpecTr have achieved SOTA performance on brain ultrasound image segmentation, gland microscope image segmentation, and multiorgan CT image segmentation.
In addition, network architecture search (NAS) is another direction of network architecture development. There is a large difference in semantics between medical images and natural images. Therefore, a network structure that achieves good results on natural images is not necessarily suitable for medical images. Redesigning a network structure for medical images requires a wealth of expertise. A NAS algorithm can reduce the need for prior knowledge and automatically search for an optimal network structure. Some well-known works in the NAS field, such as the DARTS series [121,122,123] and ProxylessNAS , have achieved surprising performance in natural image analysis. Recently, some studies on medical image processing have introduced NAS. For example, NAS-UNet , AutoDeepLab , MS-NAS , and BiX-NAS  have achieved SOTA performance on medical image segmentation.
For the training paradigm, self-supervised learning is a promising technology. Because medical images are complex, doctors with professional knowledge are required to annotate them. As a result, labelled medical image datasets are usually small. In contrast, unlabelled raw medical images are relatively easy to obtain. To address this problem, self-supervised learning methods such as the MoCo series [129,130,131], SimCLR series [132, 133], and BYOL can be considered; they are trained using unlabelled data and have achieved performance comparable to that of supervised learning methods on natural image datasets. Studies based on these approaches, such as MoCo-CXR and MedAug, have recently been applied to detect abnormalities in chest X-ray images.
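Contrastive methods such as MoCo and SimCLR learn from unlabelled images by pulling two augmented views of the same image together and pushing views of different images apart (BYOL uses a different objective without negative pairs). The sketch below computes an InfoNCE-style loss for a single query on hand-written 2D embeddings. It is a minimal illustration in plain Python, not the training code of any cited method; the temperature value and the vectors are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.1) -> float:
    """-log( exp(s_pos / t) / sum_i exp(s_i / t) ), with the positive at index 0."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


query = (1.0, 0.0)            # embedding of one augmented view of a frame
same_frame = (0.9, 0.1)       # embedding of another view of the same frame
other_frames = [(0.0, 1.0), (-1.0, 0.0)]

good = info_nce(query, same_frame, other_frames)
# If the encoder instead placed the matching view far from the query, the loss rises.
bad = info_nce(query, (0.0, 1.0), [(0.9, 0.1), (-1.0, 0.0)])
print(good, bad)
```

Minimizing this loss over many unlabelled frames encourages an encoder to produce embeddings that can later be fine-tuned with a much smaller labelled set.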
Regarding the optimization procedure, currently applied optimizers usually utilize the gradient descent of the loss function to find an optimal solution. However, these optimization technologies are susceptible to the local optimal trap. Recently, some meta-heuristic algorithms, such as the Aquila Optimizer (AO) , Reptile Search Algorithm (RSA)  and Arithmetic Optimization Algorithm (AOA) , have been employed to solve a variety of complicated optimization problems. These optimization algorithms are able to perform a global search in the available search space of a problem to ensure that the final solution is close to the global optimum, which demonstrates the potential to improve the optimization process of developing DL models for gastroscopy.
Based on the findings mentioned above, we suggest that a DL-based assistance system providing on-site support for real-time gastroscopy should be developed by combining disease-related and nondisease-related deep learning applications. Four development trends of deep learning in gastroscopy can be observed from the literature cited in this review: (1) real-time performance is improved; (2) coverage comprehensiveness (in both a spatial sense and pathological sense) is achieved; (3) detection sensitivity is enhanced; and (4) diagnosis accuracy is increased. However, there is still a gap before these systems can be applied in clinical practice. In the future, after a single function has been validated at the algorithm level using technical metrics such as sensitivity, specificity, PPV, and NPV, which are easily affected by the distribution of the test dataset, it is important to test the complete system using clinical indicators. Another potential research direction is to conduct multicentre randomized controlled trials to test whether the system can improve the performance of endoscopists in an actual clinical environment, reduce the blind spot rate, increase the detection rate, and reduce the incidence of fatal, high-burden, and poor-prognosis diseases such as advanced cancers. Furthermore, exploring more cutting-edge DL algorithms and their potential applications in gastroscopy remains future work for the research community. In conclusion, deep learning has the potential to improve the efficiency and quality of gastroscopy in the near future. However, endoscopists should first understand what DL can do and how to use it.
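The dependence of technical metrics on the test set distribution is easiest to see for PPV and NPV. The sketch below applies Bayes' rule to a hypothetical detector with fixed sensitivity and specificity and varies only the disease prevalence; the 95% figures and the two prevalence values are illustrative assumptions, not results from any study in this review.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical detector, two different test populations.
ppv_balanced, npv_balanced = predictive_values(0.95, 0.95, prevalence=0.50)
ppv_screening, npv_screening = predictive_values(0.95, 0.95, prevalence=0.01)

print(f"balanced test set:  PPV={ppv_balanced:.3f}, NPV={npv_balanced:.3f}")
print(f"screening setting:  PPV={ppv_screening:.3f}, NPV={npv_screening:.3f}")
```

On a balanced test set the PPV equals 0.95, but at 1% prevalence it falls to roughly 0.16, so most alerts would be false alarms. This is one reason why results on curated datasets need to be confirmed with clinical indicators in the intended population.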
Availability of data and materials
Abbreviations
NBI: Narrow band imaging
CNN: Convolutional neural network
RNN: Recurrent neural network
GAN: Generative adversarial network
AUC: Area under curve
LCI: Linked colour imaging
SSD: Single shot multibox detection
EGC: Early gastric cancer
ESD: Endoscopic submucosal dissection
DCNN: Deep convolutional neural network
PPV: Positive predictive value
NPV: Negative predictive value
AGC: Advanced gastric cancer
HGIN: High-grade intraepithelial neoplasia
mAP: Mean average precision
GIM: Gastric intestinal metaplasia
U-TOE: Ultrathin transoral endoscopy
LSTM: Long short-term memory networks
SFM: Structure from motion
SLAM: Simultaneous localization and mapping
References
Hamashima C, Group SR, Guidelines GDS. Update version of the Japanese guidelines for gastric cancer screening. Jpn J Clin Oncol. 2018;48(7):673–83.
Li YD, Zhu SW, Yu JP, Ruan RW, Cui Z, Li YT, et al. Intelligent detection endoscopic assistant: An artificial intelligence-based system for monitoring blind spots during esophagogastroduodenoscopy in real-time. Dig Liver Dis. 2020;34:9.
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
Kanayama T, Kurose Y, Tanaka K, Aida K, Satoh Si, Kitsuregawa M, et al., editors. Gastric cancer detection from endoscopic images using synthesis by GAN. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2019. New York: Springer.
Widya AR, Monno Y, Okutomi M, Suzuki S, Gotoda T, Miki K. Stomach 3D reconstruction based on virtual chromoendoscopic image generation. Annu Int Conf IEEE Eng Med Biol Soc. 2020;2020:1848–52.
Ali S, Zhou F, Bailey A, Braden B, East JE, Lu X, et al. A deep learning framework for quality assessment and restoration in video endoscopy. Med Image Anal. 2021;68:101900.
Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. 2015.
Mirza M, Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014.
Zhu J-Y, Park T, Isola P, Efros AA, editors. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision; 2017.
The GAN Zoo. https://github.com/hindupuravinash/the-gan-zoo. 2017.
Wu L, Zhang J, Zhou W, An P, Shen L, Liu J, et al. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut. 2019;68(12):2161–9.
Xu Z, Tao Y, Wenfang Z, Ne L, Zhengxing H, Jiquan L, et al. Upper gastrointestinal anatomy detection with multi-task convolutional neural networks. Healthc Technol Lett. 2019;6(6):176–80.
Itoh T, Kawahira H, Nakashima H, Yata N. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endoscopy Int Open. 2018;6(2):E139–44.
Nakashima H, Kawahira H, Kawachi H, Sakaki N. Artificial intelligence diagnosis of Helicobacter pylori infection using blue laser imaging-bright and linked color imaging: a single-center prospective study. Ann Gastroenterol. 2018;31(4):462–8.
Sakai Y, Takemoto S, Hori K, Nishimura M, Ikematsu H, Yano T, et al. Automatic detection of early gastric cancer in endoscopic images using a transferring convolutional neural network. Annu Int Conf IEEE Eng Med Biol Soc. 2018;2018:4138–41.
Wang H, Ding S, Wu D, Zhang Y, Yang S. Smart connected electronic gastroscope system for gastric cancer screening using multi-column convolutional neural networks. Int J Prod Res. 2019;57(21):6795–806.
Wu L, Zhou W, Wan X, Zhang J, Shen L, Hu S, et al. A deep neural network improves endoscopic detection of early gastric cancer without blind spots. Endoscopy. 2019;51(6):522–31.
Yoon HJ, Kim S, Kim JH, Keum JS, Oh SI, Jo J, et al. A lesion-based convolutional neural network improves endoscopic detection and depth prediction of early gastric cancer. J Clin Med. 2019;8:9.
Zheng W, Zhang X, Kim JJ, Zhu X, Ye G, Ye B, et al. High accuracy of convolutional neural network for evaluation of Helicobacter pylori infection based on endoscopic images: preliminary experience. Clin Transl Gastroenterol. 2019;10(12):e00109.
Zhang Y, Li F, Yuan F, Zhang K, Huo L, Dong Z, et al. Diagnosing chronic atrophic gastritis by gastroscopy using artificial intelligence. Dig Liver Dis. 2020;52(5):566–72.
Yan T, Wong PK, Choi IC, Vong CM, Yu HH. Intelligent diagnosis of gastric intestinal metaplasia based on convolutional neural network and limited number of endoscopic images. Comput Biol Med. 2020;126:89.
Takiyama H, Ozawa T, Ishihara S, Fujishiro M, Shichijo S, Nomura S, et al. Automatic anatomical classification of esophagogastroduodenoscopy images using deep convolutional neural networks. Sci Rep. 2018;8:8.
Chen D, Wu L, Li Y, Zhang J, Liu J, Huang L, et al. Comparing blind spots of unsedated ultrafine, sedated, and unsedated conventional gastroscopy with and without artificial intelligence: a prospective, single-blind, 3-parallel-group, randomized, single-center trial. Gastrointest Endosc. 2020;91(2):332-9.e3.
Igarashi S, Sasaki Y, Mikami T, Sakuraba H, Fukuda S. Anatomical classification of upper gastrointestinal organs under various image capture conditions using AlexNet. Comput Biol Med. 2020;124:3.
Cho BJ, Bang CS, Park SW, Yang YJ, Seo SI, Lim H, et al. Automated classification of gastric neoplasms in endoscopic images using a convolutional neural network. Endoscopy. 2019;51(12):1121–9.
Lee JH, Kim YJ, Kim YW, Park S, Choi YI, Kim YJ, et al. Spotting malignancies from gastric endoscopic images using deep learning. Surg Endosc. 2019;33(11):3790–7.
Shichijo S, Endo Y, Aoyama K, Takeuchi Y, Ozawa T, Takiyama H, et al. Application of convolutional neural networks for evaluating Helicobacter pylori infection status on the basis of endoscopic images. Scand J Gastroenterol. 2019;54(2):158–63.
Zhu Y, Wang QC, Xu MD, Zhang Z, Cheng J, Zhong YS, et al. Application of convolutional neural network in the diagnosis of the invasion depth of gastric cancer based on conventional endoscopy. Gastrointest Endosc. 2019;89(4):806-15.e1.
Cho BJ, Bang CS, Lee JJ, Seo CW, Kim JH. Prediction of submucosal invasion for gastric neoplasms in endoscopic images using deep-learning. J Clin Med. 2020;9:6.
Horiuchi Y, Aoyama K, Tokai Y, Hirasawa T, Yoshimizu S, Ishiyama A, et al. Convolutional neural network for differentiating gastric cancer from gastritis using magnified endoscopy with narrow band imaging. Dig Dis Sci. 2020;65(5):1355–63.
Horiuchi Y, Hirasawa T, Ishizuka N, Tokai Y, Namikawa K, Yoshimizu S, et al. Performance of a computer-aided diagnosis system in diagnosing early gastric cancer using magnifying endoscopy videos with narrow-band imaging (with videos). Gastrointest Endosc. 2020;92(4):856.
Hu H, Gong L, Dong D, Zhu L, Wang M, He J, et al. Identifying early gastric cancer under magnifying narrow-band images via deep learning: a multicenter study. Gastrointest Endosc. 2020;89:8.
Li L, Chen Y, Shen Z, Zhang X, Sang J, Ding Y, et al. Convolutional neural network for the diagnosis of early gastric cancer based on magnifying narrow band imaging. Gastric Cancer. 2020;23(1):126–32.
Ling T, Wu L, Fu Y, Xu Q, An P, Zhang J, et al. A deep learning-based system for identifying differentiation status and delineating the margins of early gastric cancer in magnifying narrow-band imaging endoscopy. Endoscopy. 2020;2324:788.
Liu X, Wang C, Bai J, Liao G. Fine-tuning pre-trained convolutional neural networks for gastric precancerous disease classification on magnification narrow-band imaging images. Neurocomputing. 2020;392:253–67.
Nakashima H, Kawahira H, Kawachi H, Sakaki N. Endoscopic three-categorical diagnosis of Helicobacter pylori infection using linked color imaging and deep learning: a single-center prospective study (with video). Gastric Cancer. 2020;23(6):1033–40.
Ueyama H, Kato Y, Akazawa Y, Yatagai N, Komori H, Takeda T, et al. Application of artificial intelligence using a convolutional neural network for diagnosis of early gastric cancer based on magnifying endoscopy with narrow-band imaging. J Gastroenterol Hepatol. 2020;26:78.
Zhang L, Zhang Y, Wang L, Wang J, Liu Y. Diagnosis of gastric lesions through a deep convolutional neural network. Dig Endosc. 2020;7:45.
Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. 2015.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al., editors. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015.
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z, editors. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016.
Szegedy C, Ioffe S, Vanhoucke V, Alemi A, editors. Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2017.
He K, Zhang X, Ren S, Sun J, editors. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016.
Xie S, Girshick R, Dollár P, Tu Z, He K, editors. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017.
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, editors. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017.
Hu J, Shen L, Sun G, editors. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018.
Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. 2016.
Tan M, Le QV. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. 2019.
Hirasawa T, Aoyama K, Tanimoto T, Ishihara S, Shichijo S, Ozawa T, et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer. 2018;21(4):653–60.
Ishioka M, Hirasawa T, Tada T. Detecting gastric cancer from video images using convolutional neural networks. Dig Endosc. 2019;31(2):e34–5.
Zhang X, Chen F, Yu T, An J, Huang Z, Liu J, et al. Real-time gastric polyp detection using convolutional neural networks. PLoS ONE. 2019;14(3):e0214133.
Ikenoyama Y, Hirasawa T, Ishioka M, Namikawa K, Yoshimizu S, Horiuchi Y, et al. Detecting early gastric cancer: Comparison between the diagnostic ability of convolutional neural networks and endoscopists. Dig Endosc. 2020;34:2.
Zhang YY, Xie D. Detection and segmentation of multi-class artifacts in endoscopy. J Zhejiang Univ Sci B. 2019;20(12):1014–20.
Girshick R, Donahue J, Darrell T, Malik J, editors. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2014.
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.
Girshick R, editor. Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision; 2015.
Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2016;39(6):1137–49.
Redmon J, Divvala S, Girshick R, Farhadi A, editors. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016.
Redmon J, Farhadi A, editors. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017.
Redmon J, Farhadi A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767. 2018.
Bochkovskiy A, Wang C-Y, Liao H-YM. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934. 2020.
Jocher G. Yolov5. https://github.com/ultralytics/yolov5. 2020.
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al., editors. Ssd: Single shot multibox detector. In: European conference on computer vision; 2016: Springer.
Law H, Deng J, editors. Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018.
Zhou X, Zhuo J, Krahenbuhl P, editors. Bottom-up object detection by grouping extreme and center points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019.
Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q, editors. Centernet: Keypoint triplets for object detection. In: Proceedings of the IEEE International Conference on Computer Vision; 2019.
Tan M, Pang R, Le QV, editors. Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Dong Z, Li G, Liao Y, Wang F, Ren P, Qian C, editors. Centripetalnet: Pursuing high-quality keypoint pairs for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Luo H, Xu G, Li C, He L, Luo L, Wang Z, et al. Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. Lancet Oncol. 2019;20(12):1645–54.
An P, Yang D, Wang J, Wu L, Zhou J, Zeng Z, et al. A deep learning method for delineating early gastric cancer resection margin under chromoendoscopy and white light endoscopy. Gastric Cancer. 2020;23(5):884–92.
Ozyoruk KB, Incetan K, Coskun G, Gokceler GI, Almalioglu Y, Mahmood F, et al. Quantitative Evaluation of Endoscopic SLAM Methods: EndoSLAM Dataset. arXiv preprint arXiv:2006.16670. 2020.
Long J, Shelhamer E, Darrell T, editors. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015.
Badrinarayanan V, Kendall A, Cipolla R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481–95.
Ronneberger O, Fischer P, Brox T, editors. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention; 2015: Springer.
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. 2014.
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834–48.
Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. 2017.
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H, editors. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV); 2018.
Shibata T, Teramoto A, Yamada H, Ohmiya N, Saito K, Fujita H. Automated detection and segmentation of early gastric cancer from endoscopic images using mask R-CNN. Appl Sci Basel. 2020;10:11.
He K, Gkioxari G, Dollár P, Girshick R, editors. Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision; 2017.
Liu S, Qi L, Qin H, Shi J, Jia J, editors. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018.
Sakaki N, Momma K, Egawa N, Yamada Y, Kan T, Ishiwata J. The influence of Helicobacter pylori infection on the progression of gastric mucosal atrophy and occurrence of gastric cancer. Eur J Gastroenterol Hepatol. 1995;7:S59-62.
Uemura N, Okamoto S, Yamamoto S, Matsumura N, Yamaguchi S, Yamakido M, et al. Helicobacter pylori infection and the development of gastric cancer. N Engl J Med. 2001;345(11):784–9.
IARC Helicobacter pylori Working Group. Helicobacter pylori eradication as a strategy for preventing gastric cancer. Lyon, France: International Agency for Research on Cancer (IARC Working Group Reports, No. 8). 2014.
Take S, Mizuno M, Ishiki K, Hamada F, Yoshida T, Yokota K, et al. Seventeen-year effects of eradicating Helicobacter pylori on the prevention of gastric cancer in patients with peptic ulcer; a prospective cohort study. J Gastroenterol. 2015;50(6):638–44.
Watanabe K, Nagata N, Shimbo T, Nakashima R, Furuhata E, Sakurai T, et al. Accuracy of endoscopic diagnosis of Helicobacter pylori infection according to level of endoscopic experience and the effect of training. BMC Gastroenterol. 2013;13(1):1–7.
Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136(5):E359–86.
Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65(2):87–108.
Katai H, Ishikawa T, Akazawa K, Isobe Y, Miyashiro I, Oda I, et al. Five-year survival analysis of surgically resected gastric cancer cases in Japan: a retrospective analysis of more than 100,000 patients from the nationwide registry of the Japanese Gastric Cancer Association (2001–2007). Gastric Cancer. 2018;21(1):144–54.
Sumiyama K. Past and current trends in endoscopic diagnosis for early stage gastric cancer in Japan. Gastric Cancer. 2017;20(1):20–7.
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D, editors. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision; 2017.
Ali S, Zhou F, Braden B, Bailey A, Yang S, Cheng G, et al. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Sci Rep. 2020;10:1.
Ali S. EAD Challenge: Multi-class artefact detection in video endoscopy. https://ead2019.grand-challenge.org. 2019.
Ali S. Endoscopy Artefact Detection and Segmentation (EAD2020). https://ead2020.grand-challenge.org/. 2020.
Ali S, Dmitrieva M, Ghatwary N, Bano S, Polat G, Temizel A, et al. Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy. Med Image Anal. 2021:102002.
Widya AR, Monno Y, Imahori K, Okutomi M, Suzuki S, Gotoda T, et al. 3D reconstruction of whole stomach from endoscope video using structure-from-motion. In: 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Eng Med Biol Soc Conf Proc. 2019;2019:3900–4.
Widya AR, Monno Y, Okutomi M, Suzuki S, Gotoda T, Miki K. Whole stomach 3D reconstruction and frame localization from monocular endoscope video. IEEE J Transl Eng Health Med. 2019;7:8.
İncetan K, Celik IO, Obeid A, Gokceler GI, Ozyoruk KB, Almalioglu Y, et al. VR-Caps: A Virtual Environment for Capsule Endoscopy. Med Image Anal. 2021;70:101990.
Ali S, Zhou F, Bailey A, Braden B, East J, Lu X, et al. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073. 2019.
Misawa M, Kudo S-e, Mori Y, Hotta K, Ohtsuka K, Matsuda T, et al. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointest Endosc. 2020.
Medtronic. Medtronic launches the first artificial intelligence system for colonoscopy at United European Gastroenterology Week 2019. 2019.
PENTAX Medical. HOYA Group PENTAX Medical cleared CE mark for DISCOVERY™, an AI assisted polyp detector. 2019.
Fujifilm Corporation. Fujifilm acquires CE mark and launches CAD EYE, a function of colonic polyp detection utilizing AI technology, in Europe. 2020.
Cybernet Systems Co., Ltd. EndoBRAIN, an artificial intelligence system that supports optical diagnosis of colorectal polyps, was approved by PMDA (Pharmaceuticals and Medical Devices Agency), a regulatory body in Japan. 2018.
Walradt T, Brown JRG, Alagappan M, Lerner HP, Berzin TM. Regulatory considerations for artificial intelligence technologies in GI endoscopy. Gastrointest Endosc. 2020;92(4):801–6.
Kisantal M, Wojna Z, Murawski J, Naruniec J, Cho K. Augmentation for small object detection. arXiv preprint arXiv:1902.07296. 2019.
Fan D-P, Ji G-P, Sun G, Cheng M-M, Shen J, Shao L, editors. Camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Fan Q, Zhuo W, Tang C-K, Tai Y-W, editors. Few-shot object detection with attention-RPN and multi-relation detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Ramamonjisoa M, Du Y, Lepetit V, editors. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Nadeem S, Kaufman AJ. Depth reconstruction and computer-aided polyp detection in optical colonoscopy video frames. 2016.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S, editors. End-to-end object detection with transformers. In: European Conference on Computer Vision; 2020: Springer.
Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, et al., editors. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. 2021.
Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: Gated axial-attention for medical image segmentation. arXiv preprint arXiv:2102.10662. 2021.
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv preprint arXiv:2105.05537. 2021.
Yun B, Wang Y, Chen J, Wang H, Shen W, Li Q. SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation. arXiv preprint arXiv:2103.03604. 2021.
Liu H, Simonyan K, Yang Y. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055. 2018.
Xu Y, Xie L, Zhang X, Chen X, Qi G-J, Tian Q, et al. PC-DARTS: Partial channel connections for memory-efficient architecture search. arXiv preprint arXiv:1907.05737. 2019.
Chen X, Xie L, Wu J, Tian Q, editors. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019.
Cai H, Zhu L, Han S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. 2018.
Weng Y, Zhou T, Li Y, Qiu X. Nas-unet: Neural architecture search for medical image segmentation. IEEE Access. 2019;7:44247–57.
Liu C, Chen L-C, Schroff F, Adam H, Hua W, Yuille AL, et al., editors. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019.
Yan X, Jiang W, Shi Y, Zhuo C, editors. Ms-nas: Multi-scale neural architecture search for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2020: Springer.
Wang X, Xiang T, Zhang C, Song Y, Liu D, Huang H, et al. BiX-NAS: Searching Efficient Bi-directional Architecture for Medical Image Segmentation. arXiv preprint arXiv:2106.14033. 2021.
Chen X, Xie S, He K. An empirical study of training self-supervised visual transformers. arXiv preprint arXiv:2104.02057. 2021.
Chen X, Fan H, Girshick R, He K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. 2020.
He K, Fan H, Wu Y, Xie S, Girshick R, editors. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
Chen T, Kornblith S, Swersky K, Norouzi M, Hinton G. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. 2020.
Chen T, Kornblith S, Norouzi M, Hinton G, editors. A simple framework for contrastive learning of visual representations. In: International conference on machine learning; 2020: PMLR.
Grill J-B, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. 2020.
Sowrirajan H, Yang J, Ng AY, Rajpurkar P. MoCo-CXR: MoCo Pretraining Improves Representation and Transferability of Chest X-ray Models. arXiv preprint arXiv:2010.05352. 2020.
Vu YNT, Wang R, Balachandar N, Liu C, Ng AY, Rajpurkar P. MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. arXiv preprint arXiv:2102.10663. 2021.
Abualigah L, Yousri D, Abd Elaziz M, Ewees AA, Al-qaness MA, Gandomi AH. Aquila Optimizer: A novel meta-heuristic optimization Algorithm. Comput Ind Eng. 2021;157:107250.
Abualigah L, Abd Elaziz M, Sumari P, Geem ZW, Gandomi AH. Reptile Search Algorithm (RSA): A nature-inspired meta-heuristic optimizer. Expert Syst Appl. 2022;191:116158.
Abualigah L, Diabat A, Mirjalili S, Abd Elaziz M, Gandomi AH. The arithmetic optimization algorithm. Comput Methods Appl Mech Eng. 2021;376:113609.
This research was supported by the National Key Research and Development Project (Grant Nos. 2019YFC0117901 and 2017YFC0110802), the National Major Scientific Research Instrument Development Project (Grant No. 81827804), the Robotics Institute of Zhejiang University (Grant No. K11806), and the Key Research and Development Plan of Zhejiang Province (Grant Nos. 2017C03036 and 2018C03064).
The authors declare that they have no competing interests.
Cite this article
Jin, Z., Gan, T., Wang, P. et al. Deep learning for gastroscopic images: computer-aided techniques for clinicians. BioMed Eng OnLine 21, 12 (2022). https://doi.org/10.1186/s12938-022-00979-8