 Research
 Open Access
GPU-accelerated voxel-driven forward projection for iterative reconstruction of cone-beam CT
BioMedical Engineering OnLine volume 16, Article number: 2 (2017)
Abstract
Background
For cone-beam computed tomography (CBCT), which has been playing an important role in clinical applications, iterative reconstruction algorithms are able to provide image quality superior to that of the classical FDK algorithm. However, the computational speed of iterative reconstruction is a notable issue for CBCT, and the forward projection calculation is one of its most time-consuming components.
Method and results
In this study, the cone-beam forward projection problem using the voxel-driven model is analysed, and a GPU-based acceleration method for CBCT forward projection is proposed, with the method rationale and implementation workflow detailed as well. For method validation and evaluation, computational simulations are performed, and the calculation times of different methods are collected. The proposed method performs effectively in handling the inter-thread interference problem, and an acceleration ratio of more than 100 is achieved relative to a single-threaded CPU implementation.
Conclusion
The voxel-driven forward projection calculation for CBCT is highly parallelized by the proposed method, and we believe it will serve as a critical module for developing iterative reconstruction and correction methods for CBCT imaging.
Background
Cone-beam computed tomography (CBCT) has advanced to become a widely available and commonly used imaging modality in clinical applications, such as dental diagnostics [1], image-guided radiotherapy [2], intraoperative navigation [3], and implant planning [4], and has broadened its usage in new settings, including breast cancer screening and endodontics [5]. However, due to the insufficient data conditioning caused by the circular trajectory, CBCT images are susceptible to artefacts, noise, and the scatter effect [6]. To improve image quality, increasing research effort has been directed towards iterative reconstruction algorithms [7].
For iterative methods, most of the computation time is spent iteratively calculating the forward and back projections, which are indispensable components for modelling the imaging geometry and X-ray physics. Because CBCT uses high-resolution flat panel detectors, the computational load becomes a major issue when an iterative reconstruction algorithm is used. Thanks to the advent of graphics processing units (GPUs), massive computational power has been unleashed [8, 9]. In principle, forward and back projections can be generated in either a line-driven or a voxel-driven approach. Although both methods deliver equivalent results with identical theoretical complexities, the compute operations differ in numerical implementation, as shown in Table 1 [9]. Porting the algorithm from CPUs to GPUs is not straightforward, because concurrent threads write data to GPU memory in a scattered manner [10]. These scatter operations can cause the inter-thread interference (or thread-racing) problem with write hazards. Since gather operations benefit from faster memory reads and are more efficient than scatter operations, using unmatched projector-backprojector pairs in iterative methods has become a common workaround, as in [11-13], which use a ray-driven technique as the projector and a voxel-driven technique as the backprojector. Nevertheless, Zeng [14] has proved that this bypass scheme mathematically induces the iterative process to diverge from the true values, and thus matched projector/backprojector pairs are preferred for their mathematical stability and robustness to noise.
Several compute models have been proposed as matched forward/back projector pairs, including the distance-driven [15] and separable-footprint [16] approaches, and some have been successfully GPU-accelerated with specific strategies [17-19]. Among these models, the voxel-driven method is extensively used to perform CBCT forward and back projections for its low complexity. While the voxel-driven backprojection is easy to accelerate on GPUs, the implementation of its matched forward projector, being a scatter operation in nature (as in Table 1), is embarrassingly non-parallel, and, to our knowledge, its efficient GPU-based acceleration has not been reported yet.
In this study, a GPU acceleration method is presented for calculating voxel-driven forward projections for CBCT iterative reconstruction. This paper is organized as follows: the voxel-driven projection algorithm and the inter-thread interference problem are first investigated in the "Voxel-driven model and inter-thread interference study" section; based on this analysis, the proposed GPU acceleration method is detailed in the "Combating strategy by optimizing thread-grid allocation" section, with a brief workflow in the "Implementation outline" section; for method validation, computational simulations are performed, with results given in the "Experiment and results" section; finally, several issues are discussed and the major conclusions are drawn in the "Discussion and conclusion" section.
Methods
Voxel-driven model and inter-thread interference study
For a typical CBCT scanner, the patient (or scanned object) is kept stationary, while the X-ray source and the flat panel detector rotate simultaneously around the object in a circular trajectory. To facilitate the mathematical description, the scanned object is discretized as a three-dimensional image matrix, and the flat panel detector as a two-dimensional grid, as in Fig. 1a. In the voxel-driven method, the values of the image matrix are assumed to be located at the centre of each cubic voxel. To generate the two-dimensional forward projections of the image matrix for CBCT, the algorithm can be summarized in three steps: (1) draw a virtual line from the source (S) to a voxel centre (F(x,y,z)), which represents an X-ray pencil beamlet cast through the voxel; (2) extend the line from the voxel to intersect the detector plane at one point (U(u,v)), which represents the position where the traversing beamlet reaches the flat panel detector; (3) scatter the image value of the voxel into the adjacent detector units as a simplified model of X-ray signal detection, as illustrated in Fig. 1b.
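The three steps above can be sketched in a simplified 2D (fan-beam) setting. The function below is an illustrative CPU reference, not the paper's GPU code, and all names and parameter conventions (sad, sdd, det_size, centring of the detector) are our own assumptions.

```python
import numpy as np

def voxel_driven_forward_projection_2d(image, beta, sad, sdd,
                                       pixel_size, n_det, det_size):
    """Voxel-driven forward projection of one view (2D fan-beam sketch).

    Each pixel centre is ray-cast from the source to the detector line and
    its value is scattered into the two adjacent detector bins with linear
    weights. Geometry names (sad, sdd, ...) are illustrative assumptions.
    """
    ny, nx = image.shape
    sino = np.zeros(n_det)
    # source position for projection angle beta (isocentre at the origin)
    sx, sy = -sad * np.cos(beta), -sad * np.sin(beta)
    # unit vector along the detector line (perpendicular to the central ray)
    dx, dy = -np.sin(beta), np.cos(beta)
    for iy in range(ny):
        for ix in range(nx):
            # step 1: pixel centre in world coordinates
            px = (ix - (nx - 1) / 2.0) * pixel_size
            py = (iy - (ny - 1) / 2.0) * pixel_size
            # step 2: extend the source->pixel ray to the detector;
            # t is the depth of the pixel along the central ray direction
            t = (px - sx) * np.cos(beta) + (py - sy) * np.sin(beta)
            mag = sdd / t                                 # magnification
            u = ((px - sx) * dx + (py - sy) * dy) * mag   # detector coord.
            # step 3: scatter the pixel value into the two adjacent bins
            f = u / det_size + (n_det - 1) / 2.0
            i0 = int(np.floor(f))
            w = f - i0
            if 0 <= i0 < n_det:
                sino[i0] += (1.0 - w) * image[iy, ix]
            if 0 <= i0 + 1 < n_det:
                sino[i0 + 1] += w * image[iy, ix]
    return sino
```

On the GPU each pixel (voxel) becomes one thread, and the two `+=` scatter writes in step 3 are exactly where neighbouring threads can collide.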
Conventionally, thread grids are allocated to adjacent voxels in axial planes, as reconstructed images are usually displayed in the axial direction. To facilitate the analysis of the inter-thread problem, the three-dimensional forward projection scenario in CBCT is simplified to two dimensions, as illustrated in Fig. 2. The beamlets from the X-ray source (S) go through each voxel and are cast onto the detector. For two arbitrary voxels, the distance between their ray-casting intersections on the detector, Δu, can be derived from the imaging geometry relationship as

\(\Delta u = F_{g} \cdot \Delta v \cdot \cos \beta \quad (1)\)

where β is the projection angle, \(F_{g}\) the geometric factor, and Δv the distance between the voxels.
For two neighbouring voxels, Δv is equal to the voxel size, i.e.:

\(\Delta v = S_{voxel} \quad (2)\)
In the meantime, we can also rewrite the distance between the ray-casting intersections, Δu, using the detector unit size as:

\(\Delta u = \Delta N \cdot S_{unit} \quad (3)\)

where ΔN represents the relative distance normalized by the detector unit size \(S_{unit}\).
The geometric factor \(F_{g}\) can be derived according to the imaging geometry and written as

\(F_{g} = \frac{SDD}{SVD} \quad (4)\)

where SVD stands for the source-to-voxel distance, and SDD the source-to-detector distance.
For a typical CBCT system, the image voxel size is selectable by the user. Since radiologists usually prefer the highest resolution for more image detail, the voxel size can be expressed as

\(S_{voxel} = \frac{SAD}{SDD} \cdot S_{unit} \quad (5)\)

where SAD stands for the source-to-axis distance.
When we replace the respective terms of Eq. (1) with Eqs. (2)–(5), the relative distance between the projection intersections of two neighbouring voxels is rewritten as:

\(\Delta N = \frac{SAD}{SVD} \cdot \cos \beta \quad (6)\)
Due to the cone-beam effect in CBCT, the value of the first term, \(\left( {\frac{SAD}{SVD}} \right)\), changes along the beamlet but is always around 1; the second term, cos β, is always less than or equal to 1. Hence ΔN is at most about 1, meaning the ray-casting intersections of neighbouring voxels are at most about one detector unit apart.
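Using the scanner geometry given later in the "Experiment and results" section (SAD = 80 cm), the relation ΔN = (SAD/SVD)·cos β can be checked numerically. The helper below is an illustrative sketch, with the SVD values chosen only for demonstration:

```python
import math

def delta_n(sad, svd, beta_deg):
    """Relative intersection spacing of two neighbouring voxels,
    DeltaN = (SAD / SVD) * cos(beta), normalized by the detector unit size."""
    return (sad / svd) * math.cos(math.radians(beta_deg))

# SAD = 80 cm; voxels slightly nearer to / at / farther from the isocentre.
for svd in (75.0, 80.0, 85.0):
    print(svd, delta_n(80.0, svd, 0.0))
```

Since ΔN stays around or below 1, neighbouring voxels land within roughly one detector unit of each other, so scattered writes from adjacent threads are bound to collide.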
Meanwhile, in Fig. 2, we can see that adjacent voxels in the same plane are sometimes cast into adjacent detector grids (as with Ray 1 and Ray 3 in Fig. 2) and sometimes into different grids (as with Ray 3 and Ray 4). Moreover, voxels whose beamlet paths are very close to each other (as with Ray 1 and Ray 2) are projected into the same detector grid. Note that, since each thread is assigned to one image voxel and each detector grid to one tally address on the GPU, when two voxels are cast into adjacent or identical detector grids, the two underlying threads will try to write data to the same memory address on the GPU simultaneously, which leads to write hazards. This is what we call the inter-thread interference, or thread-racing, problem.
This analysis implies that if the thread grids are allocated to image voxels closely one by one, some threads will race against each other when accessing GPU memory. Unless a specific strategy is taken, this phenomenon is certain to happen. Meanwhile, Fig. 2 also shows that if thread grids are allocated in axial planes (horizontally), the worst case appears in the central axial plane at all projection angles. For the planes above or below the central axial plane, however, the impact of the thread-racing (inter-thread interference) problem is softened by the cone-beam geometry.
Note that although the geometric analysis above is based on the axial plane, because of the symmetry of the cone-beam geometry about the central axis, the discussion also applies to the vertical planes, and similar conclusions can be drawn when the thread grids are allocated in the vertical planes.
Combating strategy by optimizing thread-grid allocation
Based on the discussion above, the inter-thread interference phenomenon always occurs to some degree, which is the major hindrance to GPU acceleration. To combat the problem, we need a concrete solution that reduces the occurrence frequency as far as possible and serializes the residual racing threads within the same process. Building on the idea that the cone-beam geometry can be used to soften the impact of thread racing, we propose a strategy of optimizing the thread-grid allocation to achieve GPU acceleration. The method comprises three key steps:

(a) Allocate thread grids in the vertical planes (vertically)
We denote the axial direction as the horizontal direction (as in Fig. 3a), and the coronal and sagittal directions as the vertical directions (as in Fig. 3b). By allocating threads vertically, the thread-racing frequency of the voxels along the same X-ray path is much decreased. However, as a side effect, the worst case of inter-thread interference shifts from the central axial plane at all projection angles to the vertical planes at angles perpendicular to the detector plane, where β is equal to 90° or 270°. The second step is needed to resolve this side effect.

(b) Interchange the thread-plane direction at the critical projection angles
In fact, as illustrated in Fig. 3c, the worst case of inter-thread interference induced by Step (a) can easily be resolved by interchanging the thread-plane direction from the coronal planes to the sagittal planes at certain projection angles. We call the angles at which the thread-plane direction is interchanged the critical angles. The critical angles depend on the imaging and scanner specifications, including SAD, SDD, and \(S_{unit}\), but can easily be obtained by simulation.
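As a sketch, the interchange rule reduces to a simple angle test. The function name is our own, and the default critical angles are the empirical values reported later in this paper (80°, 100°, 260°, 280°); real critical angles should be obtained by simulation as described above.

```python
def thread_plane(beta_deg, critical=(80.0, 100.0, 260.0, 280.0)):
    """Choose the thread-grid plane orientation for a projection angle.

    Between the first/second and third/fourth critical angles the beamlets
    run nearly perpendicular to the coronal planes, so the thread grid is
    switched to sagittal planes there; otherwise coronal planes are used.
    """
    b = beta_deg % 360.0
    c1, c2, c3, c4 = critical
    if c1 <= b < c2 or c3 <= b < c4:
        return "sagittal"
    return "coronal"
```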

(c) Serialize the residual interfering threads by atomic operations
With the two steps above, the thread-racing occurrence frequency can be much decreased. To handle the residual threads that still interfere with each other, we use GPU atomic operations to serialize the read-and-write operations among these threads. The mechanism of atomic operations is like an address access lock: at any moment, only one thread is authorized to access the address, and all the others are forced to wait in a queue [20].
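The effect of serializing conflicting writes can be illustrated on the CPU with NumPy, where `np.add.at` plays the role that CUDA's `atomicAdd` plays on the GPU. This is an analogy only, not the GPU implementation itself:

```python
import numpy as np

# Two "threads" (array elements) scatter into the SAME detector bin.
values = np.array([1.0, 2.0, 3.0])
bins = np.array([0, 1, 1])          # indices 1 and 1 collide

# Naive scattered write: the colliding update is lost (last write wins),
# the CPU analogue of the inter-thread write hazard on the GPU.
naive = np.zeros(3)
naive[bins] += values               # bin 1 receives only 3.0; the 2.0 is lost

# Serialized accumulation: np.add.at applies every update exactly once,
# the same guarantee atomicAdd gives to racing GPU threads.
safe = np.zeros(3)
np.add.at(safe, bins, values)       # bin 1 receives 2.0 + 3.0 = 5.0
```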
Implementation outline
The key idea of the acceleration method is described above. For reference, the core framework is depicted as pseudocode in Table 2. Once the initialization on the GPU is completed, the key processes can be implemented as a single CUDA kernel function.
Experiment and results
For method validation, computational simulations are performed using the Shepp-Logan phantom. The simulation scenario specifications are similar to the geometry of our in-house CBCT scanner [21]: the flat panel detector has 512 × 512 units, and the size of each unit is 0.127 mm; the source-to-axis distance is 80 cm, and the source-to-detector distance 100 cm; projections are calculated over 360° with a 1° interval.
The program is deployed on a Windows Server 2012 workstation with 32-bit single precision. The CPU is an Intel Xeon E5-2620, which offers two processors with 12 cores running at 2.1 GHz. The GPU is an NVIDIA Tesla K20M with compute capability 3.5 and 2496 cores running at 0.71 GHz. For comparison, the voxel-driven forward projection method is programmed and deployed on the same platform. Since multi-threaded parallelization of the voxel-driven forward projection algorithm on the CPU would also have to deal with inter-thread interference among CPU threads, which is beyond the scope of this study, the algorithm is implemented on a single CPU thread, and the single-threaded running time is recorded as the benchmark for performance assessment. In addition, to achieve higher accuracy, an 8-subvoxel splitting strategy is used: each voxel is first divided into 8 cubic subvoxels, and then each subvoxel is forward projected onto the detector with 1/8 of the parent voxel's value. Note that the recorded times account only for the forward projection kernel, excluding the time for transferring data between the CPU and GPU.
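The 8-subvoxel splitting can be sketched as follows; the function name and coordinate convention are our own, assuming each cubic voxel is halved along every axis so that each subvoxel centre is offset by a quarter of the voxel size:

```python
import numpy as np
from itertools import product

def subvoxel_centres(centre, voxel_size):
    """Centres of the 8 cubic subvoxels of one voxel (8-subvoxel splitting).

    Each subvoxel has half the parent's edge length, so its centre is offset
    by a quarter of the voxel size along each axis; each subvoxel is forward
    projected with 1/8 of the parent voxel's value.
    """
    cx, cy, cz = centre
    q = voxel_size / 4.0
    return [(cx + sx * q, cy + sy * q, cz + sz * q)
            for sx, sy, sz in product((-1, 1), repeat=3)]
```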
To obtain the optimal interchange angles, or critical angles, we first ran the GPU-enabled program without the thread-plane interchange and collected the calculation times (green curve in Fig. 4). Then we interchanged the thread plane at 45°, 135°, 225°, and 315°, and recorded the new calculation times (blue curve in Fig. 4). When the two time curves were plotted together, they intersected each other, and the intersection angles were the optimal interchange angles. In this scenario, the optimal interchange angles are 80°, 100°, 260°, and 280°, which are then used as the critical angles for thread-plane direction interchange.
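Reading the critical angles off the intersections of the two timing curves can be sketched as a sign-change search on their difference. The function below is an illustrative sketch: the curve data would come from the measured per-angle calculation times, and the linear-interpolation refinement is our assumption.

```python
import numpy as np

def crossing_angles(angles, t_no_swap, t_swap):
    """Angles where two per-projection-angle timing curves intersect.

    The critical angles are read off where the difference of the two curves
    changes sign; linear interpolation refines each crossing location.
    """
    d = np.asarray(t_no_swap, dtype=float) - np.asarray(t_swap, dtype=float)
    crossings = []
    for i in range(len(d) - 1):
        if d[i] == 0.0:
            crossings.append(float(angles[i]))
        elif d[i] * d[i + 1] < 0.0:
            # interpolate between the two samples bracketing the sign change
            frac = d[i] / (d[i] - d[i + 1])
            crossings.append(float(angles[i] + frac * (angles[i + 1] - angles[i])))
    return crossings
```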
For reference, the GPU processing time when using atomic operations to resolve all race conditions is plotted as the black curve, and the time without thread-plane interchange is drawn as the green curve in Fig. 4. The computation times of the different methods are listed in Table 3, with the CPU computation time as the benchmark. The GPU acceleration ratio of the proposed method is as high as 105.
Discussion and conclusion
As detailed in the "Methods" section, the proposed method consists of three key steps, of which the first two mainly aim to reduce the inter-thread interference occurrence frequency. The results in Table 3 show that both steps contribute to the calculation acceleration, and Fig. 4 reveals their respective roles: (1) comparing the allocation of thread grids in axial planes (black curve) with that in vertical planes (green curve), the optimization of the thread-grid plane saves more than 20% of the processing time; (2) comparing the methods with and without interchanging the thread-grid plane direction (the red and green curves, respectively), this operation performs effectively in reducing the peak compute time.
In addition, Fig. 4 shows a stair-jump effect in the calculation time (blue curve) after the thread grids are interchanged from coronal planes to sagittal planes. Since the three-dimensional image matrix is stored voxel by voxel at linear memory addresses on the GPU, when a thread accesses memory it not only reads the data at the specified address but also loads the data at adjacent addresses into the GPU cache for possible further use; this mechanism is known as memory coalescing, which is highly beneficial for fast data access [22]. In our method, thread grids are initially bound to voxels stored at coalescing addresses. When the thread-plane direction is interchanged, the address coalescing condition is broken, and data access takes more time.
To investigate the dependence of the critical angles on the CBCT geometric specifications, several scenarios were set up with SAD/SDD ranging from 0.6 to 1. The critical angles were obtained in the same way as in the "Experiment and results" section. Only a slight dependence is observed, and the critical angles in the different scenarios are fairly close to each other, at around 80°, 100°, 260°, and 280°. This implies that, from a practical perspective, this set of critical angles performs effectively and can be used as empirical values.
In summary, we propose a GPU acceleration method for calculating voxel-driven forward projections for cone-beam CT. The method is composed of three key steps and is easy to implement. The experimental results demonstrate its effectiveness and efficiency in handling the inter-thread interference problem, and an acceleration ratio as high as 105 has been achieved. It should be noted that the CPU implementation runs on a single thread. A multi-core CPU implementation using 6 cores (for example, using OpenMP and streaming SIMD extensions (SSE)) would run faster and thus reduce the speedup ratio, but a specific approach would also be required to combat the thread-racing problem on the CPU.
In addition, a more sophisticated forward projection method could achieve improved accuracy. For example, Long et al. [16] proposed a voxel-driven method combining a full voxel model with a detector unit response. In their method, the boundaries of each cubic voxel are first ray-cast onto the detector to generate a polygonal pattern, and this pattern is then multiplied by a trapezoid/rectangular function to produce the corresponding forward projection footprint. As discussed in [16], highly realistic projection images can be delivered, but at the expense of a tremendous increase in computational complexity compared with the method proposed here. Meanwhile, as in [17], it is also a scatter operation in nature, so special GPU acceleration approaches are likewise required to combat the thread-racing problem (denoted read-modify-write errors in [17]). Therefore, to some extent, method selection is a trade-off between approximation accuracy and computational complexity, and it depends on the application requirements.
We believe the proposed acceleration method can serve as a critical module for developing iterative reconstruction and correction methods for CBCT imaging; indeed, it has already been incorporated into our iterative algorithm development platform and works properly [23]. Since the algorithm is programmed for research purposes only, we believe that, with further code optimization, a higher speedup can be achieved.
Abbreviations
GPU: graphics processing unit
CT: computed tomography
CBCT: cone-beam computed tomography
SDD: source-to-detector distance
SAD: source-to-axis distance
SVD: source-to-voxel distance
References
 1.
Horner K, Islam M, Flygare L, Tsiklakis K, Whaites E. Basic principles for use of dental cone beam computed tomography: consensus guidelines of the European Academy of Dental and Maxillofacial Radiology. Dentomaxillofacial Radiol. 2009;38:187–95.
 2.
Sharp GC, Jiang SB, Shimizu S, Shirato H. Prediction of respiratory tumour motion for real-time image-guided radiotherapy. Phys Med Biol. 2004;49:425.
 3.
Dobbe JGG, Curnier F, Rondeau X, Streekstra GJ. Precision of image-based registration for intraoperative navigation in the presence of metal artifacts: application to corrective osteotomy surgery. Med Eng Phys. 2015;37:524–30.
 4.
Chang SH, Lin CL, Hsue SS, Lin YS, Huang SR. Biomechanical analysis of the effects of implant diameter and bone quality in short implants placed in the atrophic posterior maxilla. Med Eng Phys. 2012;34:153–60.
 5.
Patel S, Durack C, Abella F, Shemesh H, Roig M, Lemberg K. Cone beam computed tomography in endodontics—a review. Int Endod J. 2015;48:3–15.
 6.
Schulze R, Heil U, Groß D, Bruellmann DD, Dranischnikow E, Schwanecke U, et al. Artefacts in CBCT: a review. Dentomaxillofacial Radiol. 2011;40:265–73.
 7.
Beister M, Kolditz D, Kalender WA. Iterative reconstruction methods in Xray CT. Phys Med. 2012;28:94–108.
 8.
Eklund A, Dufort P, Forsberg D, LaConte SM. Medical image processing on the GPU—past, present and future. Med Image Anal. 2013;17:1073–94.
 9.
Pratx G, Xing L. GPU computing in medical physics: a review. Med Phys. 2011;38:2685–97.
 10.
Flores L, Vidal V, Mayo P, Rodenas F, Verdu G. Iterative reconstruction of CT images on GPUs. Conf Proc IEEE Eng Med Biol Soc. 2013;2013:5143–6.
 11.
Zhao X, Hu J, Zhang P. GPU-based 3D cone-beam CT image reconstruction for large data volume. Int J Biomed Imaging. 2009;2009:1–8.
 12.
Noël PB, Walczak AM, Xu J, Corso JJ, Hoffmann KR, Schafer S. GPUbased cone beam computed tomography. Comput Methods Programs Biomed. 2010;98:271–7.
 13.
Hillebrand L, Lapp RM, Kyriakou Y, Kalender WA. Interactive GPU-accelerated image reconstruction in cone-beam CT. Proc SPIE. 2009;7258:72582A-1–72582A-8.
 14.
Zeng GL, Gullberg GT. Unmatched projector/backprojector pairs in an iterative reconstruction algorithm. IEEE Trans Med Imaging. 2000;19:548–55.
 15.
De Man B, Basu S. Distance-driven projection and backprojection in three dimensions. Phys Med Biol. 2004;49:2463–75.
 16.
Long Y, Fessler JA, Balter JM. A 3D forward and back-projection method for X-ray CT using separable footprints. IEEE Trans Med Imaging. 2010;29:3–6.
 17.
Wu M, Fessler J. GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction. In: Proceedings of the international meeting on fully 3D image reconstruction; 2011. p. 56–9.
 18.
Gao H. Fast parallel algorithms for the xray transform and its adjoint. Med Phys. 2012;39:7110–20.
 19.
Nguyen VG, Jeong J, Lee SJ. GPU-accelerated iterative 3D CT reconstruction using an exact ray-tracing method for both projection and backprojection. In: 2013 IEEE nuclear science symposium and medical imaging conference; 2013. p. 1–4.
 20.
NVIDIA. CUDA C programming guide. 2015. p. 1–261. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
 21.
Du Y, Wang X, Xiang X, Liu B. Automatic X-ray inspection for the HTR-PM spherical fuel elements. Nucl Eng Des. 2014;280:144–9.
 22.
Cook S. CUDA programming: a developer's guide to parallel computing with GPUs. Newnes; 2012.
 23.
Du Y, Wang X, Xiang X, Wei Z. Evaluation of hybrid SART + OS + TV iterative reconstruction algorithm for optical-CT gel dosimeter imaging. Phys Med Biol. 2016;61:8425–39.
Authors’ contributions
YD carried out most of the study, numerical implementation and statistical analysis. GY helped in algorithm development and programming. XG and XX helped in the algorithm optimization. YP contributed in the result analysis and manuscript review. All authors read and approved the final manuscript.
Acknowledgements
This work was jointly supported by the National Natural Science Foundation of China (Nos. 61571262, 11575095), and National Key Research and Development Program (No. 2016YFC0105406).
Competing interests
The authors declare that they have no competing interests.
Availability of data and supporting materials
The simulation phantom is derived from the general Shepp-Logan phantom. The pseudocode of the algorithm is listed in Table 2, and the computing and programming environment is detailed in the "Experiment and results" section. As part of an iterative reconstruction and artefact correction algorithm development platform, the code is not open-source at the moment.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Du, Y., Yu, G., Xiang, X. et al. GPU accelerated voxel-driven forward projection for iterative reconstruction of cone-beam CT. BioMed Eng OnLine 16, 2 (2017). https://doi.org/10.1186/s12938-016-0293-8
Keywords
 Cone-beam CT
 GPU
 Forward projection