An improved parallel fuzzy connected image segmentation method based on CUDA
BioMedical Engineering OnLine volume 15, Article number: 56 (2016)
Abstract
Purpose
The fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time becomes very long. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Zhuge et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed that adds a correction step for the edge points. The improved algorithm greatly enhances the calculation accuracy.
Methods
The improved method is iterative. In the first iteration, the affinity computation strategy is changed and a lookup table is employed to reduce memory use. In the second iteration, the voxels that are wrong because of asynchronism are updated again.
Results
Three CT sequences of hepatic vessels of different sizes were used in the experiments, with three different seeds. An NVIDIA Tesla C2075 was used to evaluate the improved method on these three data sets. Experimental results show that the improved algorithm achieves faster segmentation than the CPU version and higher accuracy than CUDA-kFOE.
Conclusions
The calculation results of the improved method were consistent with the CPU version, which demonstrates that the method corrects the edge-point calculation error of the original CUDA-kFOE. As the experimental results show, the proposed method has a comparable time cost and fewer errors than the original CUDA-kFOE. In the future, we will focus on automatic acquisition methods and automatic processing.
Background
Vessel segmentation is important for the evaluation of vascular-related diseases and has applications in surgical planning. Vascular structure is a reliable landmark for localizing a tumor, especially in liver surgery. Therefore, accurately extracting the liver vessels from CT slices in real time is the most important factor in preliminary examination and hepatic surgical planning.
In recent years, many methods of vascular segmentation have been proposed. For example, Gooya et al. [1] proposed a level-set-based geometric regularization method for vascular segmentation. Yi et al. [2] used a locally adaptive region growing algorithm to segment vessels. Jiang et al. [3] employed a region growing method based on spectrum information to perform vessel segmentation.
In 1996, Udupa et al. [4] introduced a theory of fuzzy objects for n-dimensional digital spaces based on a notion of fuzzy connectedness of image elements, and presented algorithms for extracting a specified fuzzy object and for identifying all fuzzy objects present in the image data. Many medical applications of fuzzy connectedness have been proposed, including multiple abdominal organ segmentation [5], tumor segmentation [6], vascular segmentation in the liver, and so on. Based on the fuzzy connectedness algorithm, Harati et al. [6] developed a fully automatic and accurate method for tumor region detection and segmentation in brain MR images. Liu et al. [7] presented a method for brain tumor volume estimation via MR imaging and fuzzy connectedness.
However, as the size of medical data increases, the sequential FC algorithm, which depends on the sequential performance of the CPU, becomes very time-consuming. On the other hand, parallel technology has developed in many domains, for example high-throughput DNA sequence alignment using GPUs [8] and accelerated advanced MRI reconstruction on GPUs [9]. Therefore, some researchers have proposed parallel implementations of FC. An OpenMP-based FC was proposed in 2008, in which the authors adapted a sequential fuzzy segmentation algorithm to multiprocessor machines [10]. Thereafter, Zhuge et al. [11] presented the CUDA-kFOE algorithm, which is based on NVIDIA's compute unified device architecture (CUDA) platform. CUDA-kFOE computes the fuzzy affinity relations and the fuzzy connectedness relations as CUDA kernels and executes them on the GPU. The authors improved their method in 2011 [12] and 2013 [13]. However, their method has a high computational cost because it is iterative and lacks inter-block communication on the GPU [13].
In this paper, we propose a novel solution to the limited communication capability between threads of different blocks. The purpose of our study is to improve the implementation of CUDA-kFOE and enhance its calculation accuracy on the GPU. The main contributions of the proposed method are twofold. Firstly, the improved method does not need a large amount of memory for large data sets, since we use a lookup table. Secondly, the voxels that are wrong because of asynchronism are updated again and corrected in the last iteration of the proposed method.
The paper is organized as follows. In the "Background" section, we first summarize the literature on fuzzy connectedness and the CPU-based FC algorithms. A brief description of fuzzy connectedness and of the original CUDA-kFOE is then presented in the "Fuzzy connectedness and CUDA executing model" and "Previous work" sections, respectively. The proposed improved CUDA-kFOE is explained in the "Methods" section. The experiments and conclusions are given in the "Results and discussion" and "Conclusions" sections, respectively.
Fuzzy connectedness and CUDA executing model
Fuzzy connectedness
The fuzzy connectedness segmentation method [14] was first proposed by Udupa et al. in 1996. The idea of the algorithm is to separate the target from the background by comparing the connectivity of seed points in the target area with that of seed points in the background area.
Let X be any reference set. A fuzzy subset A of X is a set of ordered pairs, \(A=\left\{ \left( x,\mu _{A}(x) \right) \mid x\in X \right\}\),
where \(\mu _{A}:X\rightarrow [0,1]\) is the membership function of A in X. A fuzzy relation \(\rho\) in X is a fuzzy subset of \(X\times X\), \(\rho =\left\{ \left( \left( x,y \right) ,\mu _{\rho }\left( x,y \right) \right) \mid x,y\in X \right\}\), where \(\mu _\rho :X\times X\rightarrow [0,1]\).
In addition, \(\rho\) is reflexive if \(\forall x\in X, \mu _\rho \left( x,x \right) =1\); \(\rho\) is symmetric if \(\forall x,y\in X, \mu _\rho \left( x,y \right) =\mu _\rho \left( y,x \right)\); \(\rho\) is transitive if \(\forall x,z \in X, \mu _\rho \left( x,z \right) =\max _{y \in X}[\min (\mu _\rho \left( x,y \right) ,\mu _\rho (y,z))]\).
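The three properties above can be checked mechanically on a small finite set. The following Python sketch is only an illustration: the set X and the membership values are hypothetical, and the relation is stored as a dictionary keyed by ordered pairs.

```python
from itertools import product

def is_reflexive(mu, X):
    """mu(x, x) == 1 for every x in X."""
    return all(mu[(x, x)] == 1.0 for x in X)

def is_symmetric(mu, X):
    """mu(x, y) == mu(y, x) for every pair in X x X."""
    return all(mu[(x, y)] == mu[(y, x)] for x, y in product(X, repeat=2))

def is_transitive(mu, X):
    """Max-min transitivity: mu(x, z) == max over y of min(mu(x, y), mu(y, z))."""
    return all(
        mu[(x, z)] == max(min(mu[(x, y)], mu[(y, z)]) for y in X)
        for x, z in product(X, repeat=2)
    )

# Hypothetical reference set and membership values, chosen for illustration.
X = ["a", "b", "c"]
strengths = {("a", "b"): 0.8, ("b", "c"): 0.5, ("a", "c"): 0.5}

mu = {(x, x): 1.0 for x in X}
for (x, y), value in strengths.items():
    mu[(x, y)] = value
    mu[(y, x)] = value

print(is_reflexive(mu, X), is_symmetric(mu, X), is_transitive(mu, X))
```

Lowering the value for the pair (a, c) to 0.2 keeps the relation reflexive and symmetric but breaks transitivity, because the path through b still has strength 0.5.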
Let \(C=(C,f)\) be a scene over \((Z^n,a)\). If a fuzzy relation k in C is reflexive and symmetric, we call k a fuzzy spel affinity in C. We define \(\mu _k\) as
where \(g_1,g_2\) are Gaussian functions of \(\frac{f(c)+f(d)}{2}\) and \(\frac{f(c)-f(d)}{2}\), respectively. The mean and variance of \(g_1\) are computed from the intensity of the objects in the fuzzy scene, and \(g_2\) is a zero-mean Gaussian.
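As a rough illustration of how such an affinity can be evaluated for a pair of adjacent voxels, the Python sketch below assumes the weighted-sum form that is common in the fuzzy connectedness literature, \(\mu _k(c,d)=w_1 g_1+w_2 g_2\). The weights `w1` and `w2`, the parameter names and the use of unnormalized Gaussians are assumptions made for illustration and are not taken from this paper.

```python
import math

def gaussian(x, mean, sigma):
    """Unnormalized Gaussian with peak value 1 at x == mean."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def affinity(fc, fd, obj_mean, obj_sigma, diff_sigma, w1=0.5, w2=0.5):
    """Assumed affinity of two adjacent voxels with intensities fc and fd.

    g1 scores how close the mean intensity is to the object intensity;
    g2 is a zero-mean Gaussian of the half difference (homogeneity term).
    """
    g1 = gaussian((fc + fd) / 2.0, obj_mean, obj_sigma)
    g2 = gaussian((fc - fd) / 2.0, 0.0, diff_sigma)
    return w1 * g1 + w2 * g2

# Hypothetical object statistics: mean 100, sigma 10; difference sigma 5.
print(affinity(100, 102, 100.0, 10.0, 5.0))  # similar voxels: close to 1
print(affinity(100, 200, 100.0, 10.0, 5.0))  # dissimilar voxels: close to 0
```

Both terms are even in the difference and symmetric in the sum, so swapping the two voxels leaves the value unchanged, as the symmetry requirement demands.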
CUDA executing model
The basic strategy of CUDA is for all computing threads to run logically in parallel. In practice, a task is divided into thread blocks according to the capabilities of the CUDA device, and the GPU automatically distributes the blocks to the streaming multiprocessors (SMs). Figure 1 shows how blocks are mapped from the software level to the hardware level. In this procedure, all SMs run in parallel and independently. This means that blocks running on different SMs do not execute synchronization instructions with one another [15].
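To make the division into blocks concrete, the following Python sketch maps voxel coordinates to block indices and tests whether two adjacent voxels fall into different blocks. The block shape is a hypothetical choice, and the volume size is one of the data set sizes reported in the experiments; the helper names are illustrative only.

```python
import math

def grid_dim(volume_dim, block_dim):
    """Number of thread blocks needed along each axis (ceiling division)."""
    return tuple(math.ceil(v / b) for v, b in zip(volume_dim, block_dim))

def block_of(voxel, block_dim):
    """Index of the block that owns a voxel, one thread per voxel."""
    return tuple(v // b for v, b in zip(voxel, block_dim))

def crosses_block_edge(c, d, block_dim):
    """True if voxels c and d are handled by different thread blocks."""
    return block_of(c, block_dim) != block_of(d, block_dim)

volume = (512, 512, 131)  # data set size reported in the experiments
block = (8, 8, 8)         # hypothetical block shape

print(grid_dim(volume, block))                            # (64, 64, 17)
print(crosses_block_edge((6, 0, 0), (7, 0, 0), block))    # False
print(crosses_block_edge((7, 0, 0), (8, 0, 0), block))    # True
```

Pairs such as the last one are the edge points for which no synchronization between blocks is available.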
Previous work
In this section, we briefly introduce the CUDA-kFOE algorithm proposed by Zhuge et al., in which kFOE is well parallelized. The CUDA-kFOE algorithm consists of two parts.

1. Affinity computation. We use Eq. (2) to compute the affinity of the voxel pair (c, d), and the resulting affinity \(\mu _k (c,d)\) is stored in dedicated GPU device memory.
2. Updating fuzzy connectivity. Computing the fuzzy connectivity is by nature a single-source shortest-path (SSSP) problem, and parallelizing SSSP is challenging. Fortunately, the CUDA-based SSSP algorithm proposed by Harish and Narayanan solves this problem [16]. With the computing capability of Eq. (2), atomic operations are employed to resolve the conflicts that arise when multiple threads access the same address, which essentially achieves SSSP parallelization; the algorithm is presented in [11].
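The shortest-path view of the second part can be sketched sequentially. The Python code below is a minimal Dijkstra-style reference, assuming that the connectedness of a voxel is the strength of its best path from the seed and that the strength of a path is its weakest affinity. It is not the CUDA kernel of [11]; the small graph and all names are hypothetical.

```python
import heapq

def fuzzy_connectedness(adj, seed):
    """Best (max-min) path strength from seed to every reachable node.

    adj[c] maps each neighbor d of c to the affinity mu_k(c, d).
    """
    conn = {seed: 1.0}
    heap = [(-1.0, seed)]          # max-heap emulated with negated strengths
    done = set()
    while heap:
        neg_strength, c = heapq.heappop(heap)
        if c in done:
            continue               # stale heap entry
        done.add(c)
        for d, mu in adj[c].items():
            strength = min(-neg_strength, mu)   # weakest link on the path
            if strength > conn.get(d, 0.0):
                conn[d] = strength
                heapq.heappush(heap, (-strength, d))
    return conn

# Hypothetical 4-node graph with symmetric affinities.
adj = {
    0: {1: 0.9, 3: 0.6},
    1: {0: 0.9, 2: 0.4},
    2: {1: 0.4, 3: 0.7},
    3: {0: 0.6, 2: 0.7},
}
print(fuzzy_connectedness(adj, seed=0))
```

Node 2 is reached with strength 0.6 through node 3 rather than 0.4 through node 1, which is the max-min analogue of choosing the shorter of two routes.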
Methods
Performance analysis and improvement
In the first step of the CUDA-kFOE algorithm, a very large amount of memory must be allocated to store the six-adjacent affinities when processing large CT series. In addition, CUDA-kFOE produces errors in some voxels because different blocks cannot easily be executed synchronously.
To overcome these drawbacks of the CUDA-kFOE algorithm, in this section we propose an improved two-iteration method that is easy to implement and more accurate. The main advantages of the improved method are as follows.

1. The proposed algorithm needs less memory than CUDA-kFOE when processing large data sets. (We change the affinity computation strategy by using a lookup table to reduce memory use.)
2. The proposed algorithm does not need the CPU to handle extra computation and achieves more accurate results. (The main idea is to process the voxels that are wrong because of asynchronism a second time; these voxels are processed again in the last iteration.)
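One way a lookup table can replace stored affinities is sketched below in Python. The sketch assumes integer intensities and the weighted two-Gaussian affinity form used for illustration only; the table layout, the weights and all names are hypothetical and are not the authors' implementation. Because one Gaussian depends only on the sum of two intensities and the other only on their absolute difference, two small tables indexed by those quantities suffice.

```python
import math

def gaussian(x, mean, sigma):
    """Unnormalized Gaussian with peak value 1 at x == mean."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def build_tables(max_intensity, obj_mean, obj_sigma, diff_sigma):
    """Precompute both Gaussians for every possible sum and difference."""
    sum_table = [gaussian(s / 2.0, obj_mean, obj_sigma)
                 for s in range(2 * max_intensity + 1)]
    diff_table = [gaussian(k / 2.0, 0.0, diff_sigma)
                  for k in range(max_intensity + 1)]
    return sum_table, diff_table

def affinity_from_tables(fc, fd, sum_table, diff_table, w1=0.5, w2=0.5):
    """Affinity of two adjacent voxels via two table lookups (assumed form)."""
    return w1 * sum_table[fc + fd] + w2 * diff_table[abs(fc - fd)]

# Hypothetical 8-bit intensities and object statistics.
sum_table, diff_table = build_tables(255, 100.0, 10.0, 5.0)
print(len(sum_table) + len(diff_table))            # 767 table entries
print(affinity_from_tables(100, 120, sum_table, diff_table))
```

Under these assumptions the table size depends only on the intensity range, whereas storing six affinities per voxel grows with the volume size.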
Let us analyze the performance of CUDA-kFOE. Consider starting the CUDA-kFOE algorithm from a single seed and computing the fuzzy scene breadth-first. Figure 2 illustrates the processing of edge points, where red points are points whose neighbors need to be updated and blue points are points being updated. As the red points propagate fuzzy affinity outward, a race condition is triggered when they reach the edge of a block, because the fuzzy affinity must then be propagated between different blocks. The outward propagation from the seed point has the shape of a tree, so no path forms a cycle. The calculation can therefore be seen as the generation of a tree structure rooted at the seed point.
In Fig. 2, pixels 1, (2, 4), 3 and 5 are located in different thread blocks. Pixels 1, 2 and 3 are in the \(C_1(c)\) array, and pixels 4 and 5 are points being updated, which are neighbors of pixel 2. Consider the worst case: because thread blocks run in no fixed order, when \(f_{min}>f(e)\) is evaluated, pixel 5 is influenced by both pixel 2 and pixel 3. There are six possible running orders:

(a) \(2\rightarrow 5, 3\rightarrow 5;\)
(b) \(3\rightarrow 5, 2\rightarrow 5;\)
(c) \(1\rightarrow 3, 1\rightarrow 2, 3\rightarrow 5, 2\rightarrow 5;\)
(d) \(1\rightarrow 3, 1\rightarrow 2, 2\rightarrow 5, 3\rightarrow 5;\)
(e) \(2\rightarrow 1, 2\rightarrow 5, 1\rightarrow 3, 3\rightarrow 5;\)
(f) \(3\rightarrow 1, 3\rightarrow 5, 1\rightarrow 2, 2\rightarrow 5;\)
Because updating pixel 5 only requires selecting the maximum of the fuzzy affinity values propagated from pixels 2 and 3, the order in situations (a) and (b) does not influence the propagated result. Therefore, situations (a) and (b) do not generate errors due to thread block asynchrony. In situations (c) and (d), if pixel 1 does not influence the values of pixels 2 and 3, the results are the same as in situations (a) and (b). However, if pixel 1 influences pixel 2 or 3, pixel 5 is in turn influenced by the updates of pixels 2 and 3. In that case, if \(2\rightarrow 5\), \(3\rightarrow 5\) or \(3\rightarrow 5\), \(2\rightarrow 5\) runs first, the new value of pixel 1 does not reach pixel 5, so pixel 5 cannot obtain the correct value. We can therefore run a correction iteration to propagate the correct value of pixel 1. Two iterations solve the problem in situations (c) and (d). In situations (e) and (f), the propagation crosses three thread blocks. This is analogous to situations (c) and (d), so three iterations solve the asynchrony problem.
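The effect of the running order can be reproduced with a toy model. The Python sketch below is not the CUDA kernel: it applies each propagation step exactly once, in a given order, to hypothetical connectedness and affinity values, and shows that an unfavorable order leaves pixel 5 with a stale value that a second pass corrects.

```python
def relax(conn, aff, order):
    """Apply each propagation u -> v once, in the given order (max-min update)."""
    for u, v in order:
        conn[v] = max(conn[v], min(conn[u], aff[(u, v)]))
    return conn

# Hypothetical affinities and starting values; pixel 1 has just been raised to 0.9.
aff = {(1, 2): 0.8, (1, 3): 0.7, (2, 5): 0.9, (3, 5): 0.9}
start = {1: 0.9, 2: 0.5, 3: 0.4, 5: 0.0}

# Pixels 2 and 3 reach pixel 5 before they have received the update from pixel 1.
bad_order = [(2, 5), (3, 5), (1, 2), (1, 3)]
# Pixel 1 updates pixels 2 and 3 first, as a synchronized run would.
good_order = [(1, 2), (1, 3), (2, 5), (3, 5)]

one_pass_bad = relax(dict(start), aff, bad_order)
two_pass_bad = relax(dict(one_pass_bad), aff, bad_order)
one_pass_good = relax(dict(start), aff, good_order)

print(one_pass_bad[5], two_pass_bad[5], one_pass_good[5])  # 0.5 0.8 0.8
```

After one pass in the unfavorable order pixel 5 holds 0.5 instead of 0.8; the second pass propagates the updated value of pixel 2 and restores agreement with the synchronized order.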
Improved algorithm and implementation
The flow chart of the improved GPU implementation, which is modified from Ref. [13], is illustrated in Fig. 3. The pseudocode of the proposed method is given in the following algorithm.
As shown in the procedure of the algorithm, the improved CUDA-kFOE is an iterative algorithm. In the first iteration, only one voxel participates in computing the affinity and updating the six-adjacent connectivity. As the number of iterations increases, more and more voxels are computed in parallel, until no thread performs any update operation, which means that every voxel value in \(C_1\) is false. In step 6 of the improved CUDA-kFOE algorithm, we use an atomic operation for consistency [16], since more than one thread may access the same address simultaneously during the update. In addition, the edges between different blocks cannot be easily controlled, which may cause erroneous values for the voxels at block edges. Therefore, we use two iterations to solve the problem.
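The iteration pattern described above, in which flagged voxels update their neighbors until no flag remains, can be sketched on the CPU as follows. This Python code is a sequential stand-in, not the authors' kernel: the dictionary `active` plays the role of the \(C_1\) flag array, each loop body stands for one GPU thread, and the comparison-and-store would be an atomic maximum on the device. The graph and all names are hypothetical.

```python
def frontier_fuzzy_connectedness(adj, seed):
    """Propagate max-min connectedness from seed until no voxel is flagged.

    adj[c] maps each neighbor d of c to the affinity mu_k(c, d).
    Returns the connectedness map and the number of sweeps performed.
    """
    conn = {c: 0.0 for c in adj}
    conn[seed] = 1.0
    active = {c: False for c in adj}   # stands in for the C_1 flag array
    active[seed] = True
    sweeps = 0
    while any(active.values()):
        next_active = {c: False for c in adj}
        for c in adj:                  # on the GPU: one thread per voxel
            if not active[c]:
                continue
            for d, mu in adj[c].items():
                strength = min(conn[c], mu)
                if strength > conn[d]:     # on the GPU: an atomic max
                    conn[d] = strength
                    next_active[d] = True  # d must propagate next sweep
        active = next_active
        sweeps += 1
    return conn, sweeps

# Hypothetical 4-node graph with symmetric affinities.
adj = {
    0: {1: 0.9, 3: 0.6},
    1: {0: 0.9, 2: 0.4},
    2: {1: 0.4, 3: 0.7},
    3: {0: 0.6, 2: 0.7},
}
conn, sweeps = frontier_fuzzy_connectedness(adj, seed=0)
print(conn, sweeps)
```

In the first sweep only the seed is active, and the number of active voxels grows in later sweeps, which mirrors the behavior described for the GPU version.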
Results and discussion
In the experiments, the accuracy of the proposed method is evaluated by comparison with the original CUDA-kFOE and the CPU version of FC under the same conditions. The CPU source code of fuzzy connectedness is from the Insight Segmentation and Registration Toolkit (ITK).
The experiments were run on a DELL Precision WorkStation T7500 Tower equipped with two quad-core 2.93 GHz Intel Xeon X5674 CPUs. It runs Windows 7 (64 bit) with 48 GB of memory. We use an NVIDIA Quadro 2000 for display and an NVIDIA Tesla C2075 for computing. The NVIDIA Tesla C2075 is equipped with 6 GB of memory and 14 multiprocessors, each of which consists of 32 CUDA cores. Table 1 shows the data sets used in the experiments and the running time and accuracy of the CPU version, the original GPU version and the improved GPU version. Error points are defined as the differences between the CPU version and the GPU version, and the result is displayed in a new image.
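The error-point measure can be written down in a few lines. The Python sketch below assumes that both results are available as flat sequences of voxel labels of equal length; the function name and the sample data are hypothetical.

```python
def error_mask(cpu_scene, gpu_scene):
    """Return a 0/1 mask marking voxels where the two results differ."""
    if len(cpu_scene) != len(gpu_scene):
        raise ValueError("results must cover the same voxels")
    return [int(a != b) for a, b in zip(cpu_scene, gpu_scene)]

# Hypothetical segmentation results for six voxels.
cpu = [1, 1, 0, 1, 0, 0]
gpu = [1, 0, 0, 1, 0, 0]

mask = error_mask(cpu, gpu)
print(mask, sum(mask))  # [0, 1, 0, 0, 0, 0] 1
```

The mask corresponds to the difference image, and its sum is the number of error points.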
Figure 4a shows the result of the original CUDA-kFOE in one slice, and Fig. 4b shows the result of the improved CUDA-kFOE. There are error points in the result of the original CUDA-kFOE compared with our improved one. We mark one region with a red rectangle in the results to demonstrate the error points. The region is enlarged in the upper-left corner of the results, where we can clearly see that pixels are missing in the result of the original CUDA-kFOE compared with the improved one.
Figure 5 compares the performance of the original CUDA-kFOE and the improved one on data sets of different sizes. In each row, column (a) shows one slice of the original CT series; columns (b) and (c) show the original fuzzy scene and the threshold segmentation result, respectively; column (d) shows the points that differ between the original GPU version and the CPU version. From top to bottom, the data set size is \(512\times 512\times 131\) in the first row, \(512\times 512\times 261\) in the second row and \(512\times 512\times 576\) in the third row. The results show that the larger the vasculature, the more differing points are generated.
In addition, the improved method is further evaluated with different iteration directions, as shown in Table 2. The results are also visualized in Fig. 6. They show that the accuracy is higher and the number of error points is smaller when more adjacent edges are chosen during the iterations.
The time cost of each iteration direction is shown in Fig. 7. For each data set, the time cost changes only slightly as the number of iteration directions increases, because in the proposed two-iteration method most points have already reached their correct values and only a few threads participate in the recomputation step.
Conclusions
In this study, we proposed an improved CUDA-kFOE to overcome the drawbacks of the original one. The improved CUDA-kFOE works in two iterations and has two advantages. Firstly, the improved method does not need a large amount of memory for large data sets, since we use a lookup table. Secondly, the voxels that are wrong because of asynchronism are updated again in the last iteration of the improved CUDA-kFOE. To evaluate the proposed method, three data sets of different sizes were used. As the experiments demonstrate, the improved CUDA-kFOE has a comparable time cost and fewer errors than the original one. In the future, we will study automatic acquisition methods and fully automatic processing.
Abbreviations
CUDA: compute unified device architecture
FC: fuzzy connectedness
CUDA-kFOE: CUDA version of FC
CT: computed tomography
MR: magnetic resonance
SM: streaming multiprocessor
References
Gooya A, Liao H, Matsumiya K, Masamune K, Masutani Y, Dohi T. A variational method for geometric regularization of vascular segmentation in medical images. IEEE Trans Image Process. 2008;17(8):1295–312.
Yi J, Ra JB. A locally adaptive region growing algorithm for vascular segmentation. Int J Imaging Syst Technol. 2003;13(4):208–14.
Jiang H, He B, Fang D, Ma Z, Yang B, Zhang L. A region growing vessel segmentation algorithm based on spectrum information. Comput Math Methods Med. 2013;2013:743870. doi:10.1155/2013/743870.
Saha PK, Udupa JK, Odhner D. Scale-based fuzzy connected image segmentation: theory, algorithms, and validation. Comput Vis Image Underst. 2000;77(2):145–74.
Zhou Y, Bai J. Multiple abdominal organ segmentation: an atlas-based fuzzy connectedness approach. IEEE Trans Inform Technol Biomed. 2007;11(3):348–52.
Harati V, Khayati R, Farzan A. Fully automated tumor segmentation based on improved fuzzy connectedness algorithm in brain MR images. Comput Biol Med. 2011;41(7):483–92.
Liu J, Udupa JK, Odhner D, Hackney D, Moonis G. A system for brain tumor volume estimation via MR imaging and fuzzy connectedness. Comput Med Imaging Graph. 2005;29(1):21–34.
Lu M, Zhao J, Luo Q, Wang B, Fu S, Lin Z. GSNP: a DNA single-nucleotide polymorphism detection system with GPU acceleration. In: 2011 International conference on parallel processing (ICPP). New York: IEEE; 2011. p. 592–601.
Stone SS, Haldar JP, Tsao SC, Sutton B, Liang ZP, et al. Accelerating advanced MRI reconstructions on GPUs. J Parallel Distrib Comput. 2008;68(10):1307–18.
Garduño E, Herman GT. Parallel fuzzy segmentation of multiple objects. Int J Imaging Syst Technol. 2008;18(5–6):336–44.
Zhuge Y, Cao Y, Miller RW. GPU accelerated fuzzy connected image segmentation by using CUDA. In: Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual international conference of the IEEE. New York: IEEE; 2009. p. 6341–4.
Zhuge Y, Cao Y, Udupa JK, Miller RW. Parallel fuzzy connected image segmentation on GPU. Med Phys. 2011;38(7):4365–71.
Zhuge Y, Ciesielski KC, Udupa JK, Miller RW. GPU-based relative fuzzy connectedness image segmentation. Med Phys. 2013;40(1):011903.
Udupa JK, Samarasekera S. Fuzzy connectedness and object definition: theory, algorithms, and applications in image segmentation. Graph Model Image Process. 1996;58(3):246–61.
Kirk DB, Hwu WW. Programming massively parallel processors: a hands-on approach. Oxford: Newnes; 2012.
Harish P, Narayanan P. Accelerating large graph algorithms on the GPU using CUDA. In: High performance computing – HiPC 2007. Berlin: Springer; 2007. p. 197–208.
CUDA Toolkit Documentation, v6.0. Santa Clara (CA, USA): NVIDIA Corporation; 2014.
Authors' contributions
LSW and SHH developed the algorithm. LSW and DL carried out the experiments and drafted the manuscript. LSW, SHH and DL analyzed the data and provided suggestions and comments. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61301010, 61327001, 61271336), the Natural Science Foundation of Fujian Province (Grant No. 2014J05080), the Research Fund for the Doctoral Program of Higher Education (20130121120045) and the Fundamental Research Funds for the Central Universities (Grant Nos. 2013SH005, 20720150110).
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wang, L., Li, D. & Huang, S. An improved parallel fuzzy connected image segmentation method based on CUDA. BioMed Eng OnLine 15, 56 (2016). https://doi.org/10.1186/s12938-016-0165-2