Time-lapse system
Time-lapse (TL) system is part of the IVF incubator, which is used to register embryo development during its cultivation (see Fig. 6). It captures images of an embryo at certain time intervals (in our case, every 5 min) and stores the images. Typically, such a system consists of three main components: (1) a light source, (2) microscope optics and (3) a video camera. Usually, red light at 650 nm is used to illuminate an embryo, which is cultivated in a specially designed culturing dish, called a culture coin. Microscope optics magnify the embryo cells by 20 times. The TL system is equipped with a 2-megapixels video camera that allows the capture of an embryo in a 121 × 121 μm area. The TL system uses a special mirror (prism) that concentrates light and directs it to the embryo and camera sensor.
Embryo assessment is based on the time intervals between cell cleavages, which are visually registered. The embryo is considered of high quality when the cleavage time intervals fit the normative data. Intervals that are too short or too long between cleavages signal the abnormal development of an embryo, which might lead to pregnancy failure. The TL system facilitates the recording of embryo development for up to 5 days at 5-min intervals to create the sequence of images. Modern time-lapse incubators such as ESCO Miri TL have optical microscopes with which is possible to capture a human embryo at seven different focal planes for more information. Now, embryologists must evaluate each individual image in the sequence and decide which embryo is suitable for transfer. It is a complicated task not only because the embryo can behave unexpectedly during its development, but also because of the massive image data set that includes over 10,000 images per embryo, which must be manually assessed. A skilled embryologist requires less than 2 min to annotate one embryo in the case where embryos do not have a high percentage of fragmentation. Usually, IVF patients have up to 5 or 10 embryos. Henceforth, the manual annotation of all embryos for one patient can take up to 20 min. The automated annotation system can do the same work 10 times faster and without human intervention.
Therefore, an automated detection system of embryo development is presented in the paper that consists of two main components: (1) the localisation of an embryo in an image and (2) the identification of embryo development stages with the aim to identify abnormal division patterns. Since the detection of an embryo localisation in an image is a crucial step, the algorithm is proposed that uses a Haar feature-based cascade classifier to determine the rough embryo location and specify the accurate location with the help of the radiating lines.
Automatic detection of embryo location
Cascade classifier
One of the main steps in this research is to automatically determine embryo location. IVF embryos usually have a round shape with brighter edges. A cascade classifier was trained on a sample containing images with the target object labelled as positives, with negative images containing none of these objects. After the classifier is trained, it can be applied to identify targets in the image. In order to investigate the entire frame, the search window is moved across the image. The search window of a classifier can be easily changed when the size of the target object is unknown. In this case, the search should be performed several times using all possible search window sizes, which are placed on all possible locations in the image [26,27,28].
Cascading is a particular case of ensemble model that is built from several classifiers that are sequentially connected. Learning is a multi-stage process where an extension of the original data by the insertion of new attributes is performed in each step. This process accelerates image processing multiple times, as there is no need to check all of the features that are already learned. Haar-like features (see Fig. 7c) are usually used as inputs to the basic classifiers.
As seen in Fig. 7, Haar-like features are extracted from adjacent rectangular regions at a specific location in a search window. Then, the difference between the sums of the pixel intensities in each region is computed. The numerical value of one Haar-like feature is computed using integral images. The integral images are two-dimensional lookup tables in the form of matrix of the same size as the original image. Each element in the integral image is a sum of all pixels located on the up-left position of the original image. The numerical value or the sum S of Haar-like feature is expressed using formula
$$\begin{aligned} S=I(\textit{C})+I(\textit{A})-I(\textit{B})-I(\textit{D}), \end{aligned}$$
where A, B, C and D are the points, which belong to the integral image I. The sum S depends on the type of Haar-like feature to be selected. Usually, a large number of Haar-like features must be retrieved to describe the target object with sufficient accuracy. Therefore, these features are fed into a cascade classifier to construct a strong learner.
Proposed algorithm for the detection of embryo location
By default, a cascade classifier allows us quickly to determine the approximate location of an embryo; however, this is not sufficient for solving our problem. Therefore, the embryo location detection algorithm is developed (see Algorithm 1). Embryo detection consists of two main processing steps. The first step involves the application of a cascade classifier for the detection of rough location. A more accurate location of the embryo is then estimated in the next step using the radiating lines over the image filtered by a Sobel filter. Two Sobel operators \(G_x\) and \(G_y\) are used in this work, which are expressed as
$$\begin{aligned} G_x =\begin{bmatrix} -1&0&1\\ -2&0&2\\ -1&0&1\\ \end{bmatrix}, \quad \\ G_y =\begin{bmatrix} 1&2&1\\ 0&0&0\\ -1&-2&-1\\ \end{bmatrix}, \end{aligned}$$
where \(G_x\) is the image gradient in horizontal direction and \(G_y\) is the image gradient in vertical direction. Absolute gradient value G is given by
$$\begin{aligned} \begin{vmatrix} G \end{vmatrix} = \sqrt{G^2_x+G^2_y}. \end{aligned}$$
The proposed algorithm uses a gray-scale image as an input. The rectangular region of interest (ROI) is returned after the execution of the algorithm. The input image is processed in different scales in order to locate an embryo of the correct size (steps 3–10). If all Haar-like features are applied to satisfy the condition in step 7, then the rough location of the embryo is detected (step 8). A more accurate location (ROI*) of the embryo is estimated in steps 11–15. Sobel filter [29] is used to find the approximate gradient magnitude at each point in the gray-scale image at the ROI (step 11). The radiating lines at each point of the detected square are drawn based on the given parameters, such as line length and the angle between lines. For this purpose, Bresenham’s line-drawing algorithm [30] is applied (step 13). Please refer to Appendix A, for a more detailed explanation of this algorithm. The sum of gradient magnitude for each concentric circle is determined at each point located on the lines. The result of this step is a histogram of obtained values (see Appendix B). The point estimate is computed by determining the maximal value in the histogram and its distance from the centre (step 14).
The advantage of the proposed algorithm is the ability to strengthen edges at a substantially equal distance from the central point. Although Sobolev gradient-based optimisers have been used in some previous studies [31,32,33], the method proposed in this work efficiently uses the traditional optimiser. In addition, the proposed approach is suitable for detecting weak and round curves in a noisy background since it provides successful results without an extra step for noise reduction or intensity normalisation, as seen in previous studies [34, 35]. In comparison, noise reduction is usually applied based on the determined noise types or levels while using traditional methods [36, 37]. For the further processing of images, it is important that the entire embryo is correctly cropped, which is the basis for the determining the cell size, monitoring embryo development stages and then classifying them into defined classes.
Alternatively, this task could be solved using object detection methods such as Local Binary Patterns (LBP) or Histogram of Oriented Gradients (HOG). Both methods were tested, but the cascade classifier was selected for further development. The HOG and LBP methods lack localisation accuracy because they require a high-contrast image, where the target object is captured with sharp edges. Moreover, these methods fail in detecting partially overlapped, noisy or blurred objects, as well as they are too sensitive to object rotation and the location of a region of the target object [38,39,40,41]. An embryo image captured using a time-lapse system is slightly blurry and the boundaries of the embryo are too fuzzy; therefore, methods that are able to generalise the results should be employed.
Identification of embryo development stage by developing a convolutional neural network-based classification system
The identification of early-stage embryo development is formulated as a multi-class prediction problem with the aim to identify the cell number during the division process until day 5 of embryo development. The first attempt to solve the given problem incorporated the use of principal component analysis (PCA) and SVM. A cascade classifier was used to detect the location of the embryo in the image. PCA was for the reduction of data dimensionality and feature extraction. SVM was trained to classify different cell stages based on PCA features. The combination of a cascade classifier, PCA and SVM gave approximately 85% classification accuracy. Therefore, we employed CNNs to construct an embryo cell classification system, since CNNs have become one of the most widely used models of deep learning and demonstrate high accuracy performance results in various image recognition tasks [42, 43]. A general CNNs architecture consists of several convolutions, pooling, and fully connected layers. A convolutional layer computes the output of neurons that are connected to the local regions in the input. A pooling layer reduces the spatial size of the representation in order to minimise the number of parameters and computations in the network. These layers are followed by fully connected layers leading to the Softmax layer, which is the final classifier. Two popular architectures, AlexNet and VGG16, were selected for the present experiments (see Fig. 8). Experimental investigations were executed on a Windows 10 machine with 16.0 GB of RAM installed with an Intel Core i7-7700K 4.20GHz CPU. Less than 45 ms were required to process one image and around 1 min (depending on the number of incubating days) was required to analyse entire embryo development from the beginning to end.
AlexNet demonstrates high classification results in different types of applications while retaining a simple and clear structure [44]. As a result, the network of this architecture is easy to implement. The small number of parameters does not require large computational and memory resources. This architecture consists of five convolutional layers and three fully connected layers. AlexNet includes max pooling and makes use of a rectified linear unit (ReLU) nonlinearity which allows training of the network much faster compared to using a common activation function (e.g. tanh or sigmoid) together with data augmentation and dropout regularisation in order to avoid overfitting.
VGG16 network [45] is an improvement over AlexNet by providing the deeper architecture. A total of 16 layers exist in this architecture, including 13 convolutional layers and 3 fully connected (FC) layers followed by a Softmax classifier. In VGG16, large kernel-sized filters in the first convolutional layers (\(11\times 11\), \(5 \times 5\)) are replaced with multiple \(3 \times 3\) filters that are used in all 13 convolutional layers. Max pooling layers use only a \(2 \times 2\) px window with stride of 2. For all convolutional layers, the stride and padding are set to 1 px.
Comparison of these two architectures reveals that VGG16 has twice as many parameters (\(\sim\)527 MB of required memory) as AlexNet (\(\sim\)232 MB of required memory), which makes it more likely to observe VGG16 demonstrating \(\sim\)15% higher classification accuracy over AlexNet [46]. However, the computational complexity of VGG16 is very high, being 10 times greater than that of AlexNet. Notably, AlexNet is one of a few CNNs models capable of achieving super real-time performance with very small batch sizes, thus allowing it to reduce the consumption of system memory (e.g. a batch size of 1 requires less than 1 GB memory). In this research, both architectures are used to explore and estimate their possibilities of achieving high accuracy results (more than 90%) in identifying a total cell number in images of an embryo.
The classification model has been implemented using MatConvNet [47], an open-source implementation of CNNs in the MATLAB environment that can be easily extended in order to develop new CNNs architectures. Specific software and hardware requirements exist for deep learning model implementations, such as MATLAB 2016a (or later version), a C\C++ compiler, and a computer with a CUDA-enabled NVIDIA GPU supporting compute capability 2.0 or above.
In general, different types of measures are used to evaluate the performance of the selected classifiers. In the multi-class setting, the outcome is produced for many predefined classes \(\{C_1, \ldots , C_i, \ldots ,C_K\}\), where K is the class cardinality [20, 21]. Accordingly, for an individual class \(C_i\), the main counts are defined as true positives \(TP_i\), false positives \(FP_i\), false negatives \(FN_i\), and true negatives \(TN_i\). These are the main entrances for the confusion matrix. A list of measures used to assess the performance of a multi-class predictor is richer compared to binary classification. The conventional performance measures are modified to consider the class distribution resulting in macro-averaging or micro-averaging computation. A macro-average defines the performance treating all classes equally, whereas a micro-average considers the contributions of all classes to compute the selected measure. Obviously, in a multi-class setting, a micro-average is preferable if the class imbalance is prominent.