Automatic analysis of 2D polyacrylamide gels in the diagnosis of DNA polymorphisms

Introduction The analysis of polyacrylamide gels is currently carried out manually or automatically. In the automatic method, there are limitations related to the acceptable degree of distortion of lane and band continuity. The available software cannot deal satisfactorily with this type of situations. Therefore, the paper presents an original image analysis method devoid of the aforementioned drawbacks. Material This paper examines polyacrylamide gel images from Li-Cor DNA Sequencer 4300S resulting from the use of the electrophoretic separation of DNA fragments. The acquired images have a resolution dependent on the length of the analysed DNA fragments and typically it is MG×NG=3806×1027 pixels. The images are saved in TIFF format with a grayscale resolution of 16 bits/pixel. The presented image analysis method was performed on gel images resulting from the analysis of DNA methylome profiling in plants exposed to drought stress, carried out with the MSAP (Methylation Sensitive Amplification Polymorphism) technique. Results The results of DNA polymorphism analysis were obtained in less than one second for the Intel Core™ 2 Quad CPU Q9300@2.5GHz, 8GB RAM. In comparison with other known methods, specificity was 0.95, sensitivity = 0.94 and AUC (Area Under Curve) = 0.98. Conclusions It is possible to carry out this method of DNA polymorphism analysis on distorted images of polyacrylamide gels. The method is fully automatic and does not require any operator intervention. Compared with other methods, it produces the best results and the resulting image is easy to interpret. The presented method of measurement is used in the practical analysis of polyacrylamide gels in the Department of Genetics at the University of Silesia in Katowice, Poland.


Introduction
Modern DNA analysis is used in many areas of life sciences, from biology [1] to forensic medicine or microwave analysis [2,3]. For many such cases, the analysis of DNA is associated with electrophoresis carried out on polyacrylamide gels, an universal analytical technique used to separate DNA fragments by size. The advantages of using polyacrylamide gels are low cost of staining separated DNA fractions and also easily interpretable analysis results. The obtained results are compared manually or semiautomatically. The manual method involves a manual selection of lanes and bands that are in the analysed area. Most often this occurs by selecting interesting bands on the printed analysis result (Figure 1) or by placing the lines along the lane on the computer screen. This analysis involves finding the location of the subsequent bands along the lane that was marked by the operator. In this case, errors occur: due to different print quality depending on the type of printer (contrast, type of paper used etc.), related to the participation of human factors, i.e.: the impact of experience, fatigue, sensitivity to illusions related to the impact of the expected result on the course of the analysis and an individual threshold of distinguishing bands from the background.
A major difficulty is also the amount of time devoted to the same analysis and the lack of reproducibility of measurements.
Known methods and software for automatic analysis have disadvantages mainly related to the analysis of gel images in which individual bands are not located on a straight lane (Figure 1). There are also problems with the proper separation of lanes and detection of bands which are arranged close to each other. The analysis of the DNA fragments, observed as bands on a gel image, can thus be divided into two parts: the separation of lanes and the separation of bands in each lane. The result of detection of band positions is most often the matrix L DNA containing the value "1" in the places where a band occurs and "0" in the other places. The number of rows of the matrix L DNA corresponds to the number of positions of all the bands, and the number of columns corresponds to the number of gel lanes [4,5]. Since the matrix L DNA is, by definition, a binary matrix, further analysis and comparison of results for subsequent lanes is easy. Therefore, a key issue is appropriate separation of lanes and bands for each lane related to image analysis and processing.
The first works on the analysis and processing of polyacrylamide gel images obtained from electrophoresis are from the 80's, for example, the works of L. Lipkln [6] or Stanley et al. [7]. These relate to the basic methods of analysis of image brightness for each lane. The authors of [8] does not include any information on how to separate individual lanes. The authors assume that they are arranged perfectly parallel. Similar assumptions are in [9,10]. The authors of [11] from 2001 present the analysis of individual bands using information about the brightness gradient. Bands are defined depending on the distance between the changes of the gradient sign. This method is useless when two neighbouring bands are connected or there is uneven brightness on the whole gel. In other works, different methods of image analysis and processing are used, e.g.: active contour [12], the Gaussian distribution [11], fuzzy c-means algorithm [13] or statistical analysis [14]. Another group of works is devoted to the development of these methods. For example, the works of J. Pizzonia [15] and L. Carol [16], GILE software (Gel-Image-Extractor) [17] or [18][19][20][21][22][23]. In [18], gels in large scale were analysed, [19] used the method of least squares, and [20] shows a method of using morphological operations (erosion) in the analysis of ROI (Region Of Interest) of gels. The aforementioned GILE software [17] is not the only available software. There are other applications for automatic or semi-automatic analyses of 2D gels, such as GelQuant [24] [30], PDQuest [31], Progenesis Samespots [32] or REDFIN [33] and many others. A wide range of available programs for gel image analysis enables to obtain satisfactory results in the case of simple gels with individual lanes arranged in parallel. If there are artefacts, connected lanes or bands, this group of software [19,[24][25][26][27][28][29][30][31][32][33] allows for their manual editing. In these cases, the method is semi-automatic or fully manual. Therefore, more sophisticated methods of image analysis must be used or the analysis algorithm must be profiled precisely to the specified problem (a given type of gels). One such method proposed by the authors is described below. It is characterized by a new approach to the analysis of polyacrylamide gels which provides: fully automatic measurement of the band position, automatic determination of the lane position in cases of their local distortion, results in the form of a matrix of band occurrences (for all lanes). A special feature that distinguishes the approach presented below from other wellknown methods, is the correct algorithm operation in cases of changes in lane thickness.

Material
This paper examines polyacrylamide gel images from Li-Cor DNA Sequencer 4300S resulting from the use of the electrophoretic separation of DNA fragments. The acquired images have a resolution dependent on the length of the analysed DNA fragments and typically it is M G ×N G =3806×1027 pixels. The images are saved in TIFF format with a grayscale resolution of 16 bits/pixel. The images of banding patterns of DNA amplification products were obtained after digestion with two enzymes, namely HpaII and MspI, used in the DNA methylome profiling method, MSAP (Methylation Sensitive Amplification Polymorphism).
The analysis was performed on DNA isolated from plants exposed to drought stress at four time points t 1 , t 2 , t 3 and t 4 and from control plants at one time point (t 5 ), which gave a total of 5 points (the first two time points t 1 and t 2 are shown in Table 1). At Table 1 The arrangement of biological and technical replicates on the gel -the first two time points   Table 1). The reaction was carried out in two technical replicates for each enzyme, for a total of four trials for each of the three biological replicates. The trials are arranged on gels vertically in successive lanes according to Table 1. Banding patterns were analysed by assessing the presence or absence of a band for a given track by transforming the gel image into a matrix consisting of "0" or "1", where "0" means no band and "1" means its presence. Further analysis involves designation of the dominant banding pattern for all replicates, both the technical as well as biological ones. What is subject to assessment are the differences in banding patterns between the points t 1, t 2, t 3, t 4 and t 5 . A total of 20 gel images were analysed in each of 60 tracks with different levels of distortion.

Implementation of the new method
The new method, proposed by the authors, is based on the analysis of gel images which is carried out in two separate stages: analysis of lanes and analysis of bands. In both stages there are similar problems with the detection of objects and the removal of their redundancy (interference). Both the first analysis and the latter one require prefiltration of the image and normalization described below.

Preprocessing
In the first stage, the image L G with a resolution of M G ×N G =3806×1027 pixels is subjected to filtration with a median filter whose mask size is 3×3 pixels (the result is the image L F ) [34][35][36]. The median filter mask size was chosen based on the image resolution and the size of possible artefacts present in the image (minor defects, CCD errors and noise of the electronics). In a further step, the image L F undergoes normalization operation from the range of brightness levels corresponding to 16 bits (2 16 ) to the floating range 0-1. The image L O thus obtained is subjected to further processing steps.
Processingthe analysis of lanes The input image L O , after filtration and normalization, was subjected to the analysis of lanes. It involves the operation of closure with a structural element SE sized M SE ×N SE =40×1 pixels [37][38][39][40][41]. The size of the structural element SE was chosen in such a way as to highlight the changes in brightness between individual lanes in the gel image. When decreasing the resolution M SE , the contrast between individual lanes decreases. An increase in the resolution M SE causes the loss of information about the angle changes of the lane. When increasing N SE , on the other hand, the contrast between adjacent lanes decreases. The resulting image L C is further analysed in the subsequent rows. For each row in the range m∈[0,M G ], level differences of brightness y C (n) are analysed in relation to the filtered function y T (n) of y C (n). The value of y C (n) is brightness for the selected m i.e. y C (n) =L C (m,n) for m=const, while the value of y T (n) results from the filtration of y C (n) with an averaging filter sized 1×10. The averaging filter size is chosen once and is equal to the typical lane width, which in this case is 10 pixels. The resulting differences are shown in Figures 2 and 3 where differences below zero (black bands in the white background) form the white pixels and the other ones form black pixels. The resulting image L B is the final stage of the analysis of lanes. For further analysis, it will be used in conjunction with the source image L O .
Processingthe analysis of bands  [42][43][44][45][46]. Due to such situations, the analysis of individual bands and their location for individual lanes must be performed in smaller ROIs. The ROI size must be no less than the width of a single lane and not greater than an average distance between the artefacts. In practice, it appears that the best results are obtained for ROI sizes that are 10 to 20 times the width of the lane [47,48]. In the analysed case, it is M ROI ×N ROI =200×200 pixels. The image L B is divided into M G /M ROI in rows and N G /N ROI in columns. In total, for the resolution of the image L G equal to M G ×N G =3806×1027 pixels, there are, after rounding, 95 ROIs for the analysis. For each ROI, the lanes were labelled, which gave the image L IND . Then, band detection is carried out for each lane in an analogous manner to lane detection [49][50][51]. The difference   There are shown differences between y C (n) and y T (n) which provide valuable information about local changes in brightness. Depending on the value of these differences, decisions are made about the detection of lanes whose width must fall within the adopted range. Narrower objects are considered as interference and the wider are split into smaller ones.  Figure 5. The parameter Δm i , which enables the separation of combined bands, will be used in further analysis. In practice, it is most convenient to adopt the acceptable range of variation of Δm i covering the range of 50-150% of a typical band width. Below this range (<50%), a detected object is considered to be interference, whereas above this range (>150%) a detected object is considered as a combination of two bands. Another considered parameter p r of thresholding y PG i was chosen on the basis of the analysis of sensitivity SPC and specificity TPR. The values of sensitivity and specificity were determined by comparing the performed automatic analysis with the manual analysis carried out by an expert for 20 images containing 60 lanes each. Depending on the detection or omission of a band in any of the trials, the results were determined as false negative FN, false positive FP, true negative TN and true positive TP. The obtained results of FN, FP, TN, TP for the optimal value of p r =9% are shown in Table 2 (SPC=0.95, TPR=0.94).
The last stage of the analysis is the conversion of individual band positions to the matrix L DNA in which columns correspond to subsequent lanes and rows to the location of and Δm i . This matrix is further verified in terms of reproducibility between technical replicates (Table 1), biological replicates and finally differences in the DNA structure. Then, these differences are easy to mark automatically (a comparison of adjacent columns with the operation xor). In this case, the matrix L DNA has a constant number of columns equal to the number of lanes, whereas the number of rows is variable and depends on the threshold value, namely p r . This matrix has a resolution of 5×60 to 200×60 pixels for typical conditions. The next section shows the comparison of the quality of the obtained results with other methods described in other works.

Comparison with other methods
The comparison of the quality of the results was carried out on 20 available gel images containing 60 lanes each, which gave a total of 1200 lanes. Two image analysis algorithms, known from the literature, were implemented; method 1 - [5] and method 2 - [11]: Method 1proposed by I. Bajla et al. in [5] - Figure 7A. This method involves filtration with a non-linear two-dimensional filter. Then, lane detection, smoothing and the analysis of peaks in the background area of the lane are carried out in the resulting image. In the next step, an operator manually corrects false results. Obtained results are shown in the form of bands in a gel diagram. This method is not fully automatic. The operator must manually correct the falsely detected bands. The number of wrongly identified bands varies and is highly dependent on the operator's individual features, mainly contrast threshold below which the band is considered as interference.  the authors for the database of gel images described in [11] are 6.7% and 12.8% when using the N. Otsu thresholding method [52].
Method 3described in this paper - Figure 7C. All 3 methods were implemented according to the descriptions in [5,11] and in accordance with the block diagrams shown in Figure 7A,B and C. The results depend on three elements: the degree of lane tortuosity, the threshold of band distinction and shifts in the position of bands for individual lanes.
The degree of lane tortuosity influences, to a significant extent, the distinction error of the band position in the axis 0x. The consequence of winding lanes or their uneven thickness is that the error of the correct assignment of a band to a lane for the 1200 analysed lanes for methods 1 and 2 is very large and highly dependent on the amplitude of lane tortuosity ( Figure 4E). In method 3, discussed in this paper, the error does not exceed a few per cent (the exact comparison is carried out in the next paragraph). The advantage of method 3 over methods 1 and 2 results from the lane area analysis. However, the problem with method 3 is an appropriate separation of lanes when the shift does not fall within the range of the ROI analysis. Although methods 1 and 2 enable manual correction of the obtained results, it consumes a lot of time and requires operator intervention in the results. Method 3, on the other hand, is fully automatic.
The threshold of band distinction is highly dependent on the adopted methodology of gel image analysis. In method 1, the band width on one-dimensional waveform is specified. The changes in the median values in front of and behind the band are analysed. Based on this comparison, a new band is recognized on a given lane. Due to the constant band width adopted in method 1 Δm i =const. This method cannot deal with the proper detection of bands which are close to each other, especially in situations when the combined total width of the bands is not close to a multiple of their width (2*Δm i ), and these situations often occur in practice [50,51,53]. In method 2, two closely situated bands are well separated even when their combined width is not equal to a multiple of their width.  [5] and Jiann-Der Lee in [11]. Visible differences relate primarily to the main analysis of data. In the proposed method, the analysis consists of two steps: lane detection followed by band detection. In the other compared methods, the main element of the algorithm is band detection, whereas lane identification is neither analyzed nor presented.
In each method, the selection (manual or automatic [52,54]) of the brightness threshold p r , that determines the visibility of bands, becomes dominant. As a result, for methods 1, 2 and 3, the threshold p r was changed while observing changes in specificity and sensitivity of the ROC curves (Receiver Operating Characteristic). The results are shown in Figure 8. The best results were obtained for method 3, i.e.: SPC=0.95, TPR=0.94. In the case of methods 1 and 2, low sensitivity and specificity are due to the lack of manual correction of the results. For method 1, specificity and sensitivity are as follows: SPC=0.72, TPR=0.5. For method 2, they are: SPC=0.71, TPR=0.72. Manual correction of the results obtained in methods 1 and 2 improves the results to the ideal values. Therefore, in the case of manual correction of the results, they are always only slightly better than the results obtained from the presented automatic method 3 (SPC=0.95, TPR=0.94 and AUC=0.98) - Table 3.
Comparing the results for the analysed lanes of all gels, the following conclusions can be drawn: -method 1 enables to obtain satisfactory results for little complex analysis in which there is no need to analyse lane tortuosity; it is possible to manually correct the incorrect results, -method 2 enables to obtain good results for complex analysis; it is possible to manually correct the incorrect results, -method 3 enables to obtain good results fully automatically even in the case of winding lanes; it is fully automatic.
In addition to methods 1 and 2, the described method 3 can be compared with many other known methods. These are the ones mentioned in the introduction, or for example, Figure 8 ROC graph. Dependency of sensitivity and specificity changes for the changes in the cut-off threshold p r for methods 1, 2 and 3 without manual correction of the results. Changes in p r are in the range from 0 to 1 for the compared methods 1, 2 and 3. The best results were obtained for method 3 presented in this article. Methods 1 and 2 have worse results. For manual correction of the results, possible in methods 1 and 2, the obtained results are perfect. the method described in detail by Caridade C. in [55]. The author describes the method (GEIAS) of gel image analysis and the correction of their wrong position -rotation. For 12 images (1082 bands in total) analysed in [55] the error is 9.2%. The images are also rotated in the angular range from -10°to 10°at 0.5°increments. The resistance of the proposed algorithm GEIAS to rotation is then analysed. The method described in [55] does not cover the width distortion of individual lanes which often occurs in practice. It is one of the major differences in comparison to method 3 proposed in this paper.
Method 3, in the described cases, enables to obtain the best results, but it does have its drawbacks. These include: -limited resistance to lane width distortion -especially in situations when they occur together with the noise of image acquisition, -limited resistance to the decay of individual lanes -for example due to the errors in image acquisition or other (biological) factors, -the need for single introduction of selected parameters of the algorithm for each new gel image acquisition device.
Elimination of these defects is difficult in practice and will be the subject of the authors' future works.

Conclusions
The proposed methodology for the analysis and processing of polyacrylamide gel images enables to perform an automatic and repeatable measurement of the position of lanes and bands. This method is superior to the previously presented methods, described in [5] and [11], in cases of gel deformation. In such situations, each track width is different and it is difficult to identify to which lane the recognized band belongs.
Proper identification of bands and lanes in the described method enables to obtain the output image L DNA . This image (L DNA ) contains information about the band positions in rows and information about the lane positions in columns. The number of recognized bands is highly dependent on the threshold p r which is determined manually or automatically. The best results (SPC=0.95, TPR=0.94 and AUC=0.98) were obtained for p r =8%. The results are worse for other threshold values ( Figure 6).
The presented measurement method is used in practical analysis of polyacrylamide gels in the Department of Genetics at the University of Silesia in Katowice, Poland.
Further studies concern the correction of the obtained results shown in Figure 6. In the case of differences between repeats shown in the image L DNA , it is necessary to move back to the image L O and analyse the brightness of the area of interest. Depending on the comparison, correction must be made in the image L DNA . However, this