Multi-view 3D skin feature recognition and localization for patient tracking in spinal surgery applications

Minimally invasive spine surgery is dependent on accurate navigation. Computer-assisted navigation is increasingly used in minimally invasive surgery (MIS), but current solutions require the use of reference markers in the surgical field for both patient and instrument tracking. To improve reliability and facilitate clinical workflow, this study proposes a new marker-free tracking framework based on skin feature recognition. Maximally Stable Extremal Regions (MSER) and Speeded Up Robust Features (SURF) algorithms are applied for skin feature detection. The proposed tracking framework is based on a multi-camera setup for obtaining multi-view acquisitions of the surgical area. Features can then be accurately detected using MSER and SURF and afterward localized by triangulation. The triangulation error is used for assessing the localization quality in 3D. The framework was tested on a cadaver dataset and in eight clinical cases. The detected features for the entire patient datasets were found to have an overall triangulation error of 0.207 mm for MSER and 0.204 mm for SURF. The localization accuracy was compared to a system with conventional markers, serving as a ground truth. An average accuracy of 0.627 and 0.622 mm was achieved for MSER and SURF, respectively. This study demonstrates that skin feature localization for patient tracking in a surgical setting is feasible. The technology shows promising results in terms of detected features and localization accuracy. In the future, the framework may be further improved by exploiting extended feature processing using modern optical imaging techniques for clinical applications where patient tracking is crucial.

Manni et al. BioMed Eng OnLine (2021) 20:6

Background
The insertion of pedicle screws is a critical step in spine fixation surgery. Conventional open surgery is performed through a mid-line incision where the posterior aspect of the spine is exposed. However, there is a trend toward increased use of minimally invasive surgical techniques, due to reductions in blood loss, length of hospital stay, and surgical site infections [1]. Minimally invasive surgery (MIS) is performed through small skin incisions where the vertebrae are reached by use of tubular retractors [2]. Due to the reduced visibility during MIS procedures, intraoperative imaging such as fluoroscopy is frequently used. However, to reduce radiation exposure and increase accuracy, a number of computer-assisted navigation solutions have been devised [3][4][5][6]. Clinical studies have shown that the use of intraoperative three-dimensional (3D) imaging coupled to a navigation system leads to higher accuracies than competing technologies [7]. All navigation technologies require co-registration of the patient and the pre- or intraoperative images to allow tracking of both patient and surgical instruments relative to the medical images. Conventional navigation solutions typically include infra-red camera systems tracking a dynamic reference frame attached to a vertebra [8]. Efforts have been made to design patient tracking methods based on unobtrusive markers or no markers at all. One such system using non-invasive optical markers has been described by Malham et al. [9,10] (SpineMask, Stryker, Kalamazoo, Michigan, USA). The system enables high accuracy placement of minimally invasive lumbar pedicle screws. Markerless tracking solutions have been used experimentally on phantoms in other surgical fields. However, studies on spine surgery are lacking [11,12]. A robot system using light to track the bony anatomy and performing pedicle screw placements was recently presented. 
The device was validated on cervical vertebrae phantoms, reaching a mean positional error of 0.28 ± 0.16 mm [13]. The Microsoft Hololens uses surface matching for tracking and has been used experimentally in non-medical phantoms with an accuracy ranging from 9 to 45 mm [14], while in a spine phantom study, an accuracy of roughly 5 mm was achieved [15]. The navigation technology used in this study is an augmented-reality surgical navigation (ARSN) system relying on adhesive optical skin markers for motion tracking and compensation [16,17]. Four high-resolution optical cameras are integrated in the flat detector of a C-arm with cone-beam computed tomography (CBCT) capability. The markers are recognized by the cameras and their relative positions in space are used to create a virtual reference grid, which is co-registered to the patient during CBCT acquisition. Optical markers attached to the patient's skin have been used for respiratory motion tracking [18] and for medical imaging applications [19]. The use of optical markers for motion tracking is combined with digital image correlation and tracking techniques [20][21][22][23]. Recently, Xue et al. [24] demonstrated that ink dots on the skin could be video tracked with high precision and that the post-processing retrieved more detailed information compared to marker-based methods. Similarly, direct tracking of spine features and tracking of skin features using hyperspectral cameras for spine surgery have recently been proven feasible.
In this study, a new markerless tracking technology using grey-scale video cameras based on skin feature detection was evaluated. Image analysis techniques were applied to detect and track natural features of the skin. Refraining from optical markers for motion tracking has several advantages. First, the workflow of the procedure can be improved by simplifying the protocol for patient preparation and by increasing the reliability of tracking during the surgical procedure. Second, the risk of losing sight of the markers is eliminated when skin features are used as a reference. The well-known feature detection algorithms Maximally Stable Extremal Regions (MSER) and Speeded Up Robust Features (SURF) were applied to detect and extract skin features such as moles and pigment spots. These methods were chosen, since they offer good reproducibility under different image views, being invariant to rotation, scaling, and affine transformation [25][26][27][28]. The proposed 3D-localization framework used multi-view geometry principles to perform image rectification, enhance feature detectability, improve feature matching, and calculate and assess each triangulated feature. The sum of squared differences (SSD) was used as a feature matching metric on scan lines between multiple-view acquisitions. To remove 3D outliers, a second feature selection step was applied, specifically targeting z-coordinate mismatches after the triangulation of all pairs of matched features. The final inliers were evaluated by computing the overall mean triangulation error. In summary, the contribution of this paper is an alternative to marker-based tracking. We hypothesize that the camera feed provides enough detail for skin feature detection and tracking. The sub-millimeter localization accuracy achieved was sufficient for surgical navigation. The framework included 3D reconstruction and feature localization over multiple-view acquisitions. 
It was validated on eight clinical spinal surgery cases performed in an academic tertiary medical center. This paper concentrates on the application of skin feature detection techniques to achieve accurate markerless tracking in spinal surgery. In the development of tracking systems, feature detectors and descriptors are widely investigated, since they demand the highest percentage of the processing time. The former is dependent on available image information, while the latter defines the encoding [29]. Aspects such as adaptability to image transformation and mismatched features need to be evaluated, as they can potentially affect tracking [30].
In this paper, we evaluate different feature detectors and extractors (SURF and MSER) by studying the number of inliers and the overall localization error on multi-view images from several spine patients, subjected to different illumination conditions. In addition, to strengthen the stability of tracking by eliminating mismatched multi-view keypoints and to improve image matching, we propose a 3D outlier removal step that restricts matches to keypoints lying on the same epipolar line. The 3D triangulations were obtained only from the matches satisfying the epipolar constraint. The contributions are: (1) building a computer vision framework for preprocessing optical skin images, detecting and matching local invariant image regions for two different image views; (2) assessing the most stable feature detection approach for reliability and accuracy; (3) improving the image matching by introducing a 3D epipolar constraint; (4) validating the methodology on optical images acquired in eight patients to assess the 3D-localization error for the matched features.
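As a minimal sketch of the epipolar-constrained matching idea described above, the following Python function pairs features from two rectified views only when their y-coordinates fall on (nearly) the same scan line, and keeps the candidate with the lowest sum of squared differences (SSD) between descriptors. The function name, the `row_tol` and `max_ssd` parameters, and the toy three-element descriptors are illustrative assumptions, not the framework's actual implementation.

```python
from typing import List, Optional, Sequence, Tuple

# A feature is reduced here to (y-coordinate, descriptor vector).
Feature = Tuple[float, Sequence[float]]


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def match_on_scan_lines(
    feats_i: List[Feature],
    feats_j: List[Feature],
    row_tol: float = 1.0,   # hypothetical scan-line tolerance in pixels
    max_ssd: float = 0.5,   # hypothetical acceptance threshold
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of matched features.

    A feature in view i is compared only with features in view j whose
    y-coordinate differs by at most row_tol, i.e. candidates on the same
    scan line of the rectified pair. The lowest-SSD candidate is kept if
    its SSD does not exceed max_ssd.
    """
    matches: List[Tuple[int, int]] = []
    for i, (y_i, d_i) in enumerate(feats_i):
        best_j: Optional[int] = None
        best_cost = max_ssd
        for j, (y_j, d_j) in enumerate(feats_j):
            if abs(y_i - y_j) > row_tol:
                continue  # different scan line: skipped without computing SSD
            cost = ssd(d_i, d_j)
            if cost <= best_cost:
                best_j, best_cost = j, cost
        if best_j is not None:
            matches.append((i, best_j))
    return matches


# Toy example with made-up descriptors (not real SURF output).
view_i = [(100.0, [0.1, 0.9, 0.3]), (250.0, [0.7, 0.2, 0.5])]
view_j = [
    (250.4, [0.7, 0.25, 0.5]),
    (100.2, [0.1, 0.85, 0.3]),
    (400.0, [0.1, 0.9, 0.3]),  # identical descriptor, but on another row
]
matches = match_on_scan_lines(view_i, view_j)
print(matches)
```

Note that the third feature in `view_j` has a descriptor identical to the first feature in `view_i`, yet it is never considered because it lies on a different scan line. This sketch does not enforce one-to-one matching; a practical implementation would likely need that as well.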

Results
Optical data for assessing the 3D localization were collected from two sources: a cadaver study and a prospective clinical observational study. The cadaver study was performed according to all applicable laws and directives. The clinical study was approved by the local ethics committee and all enrolled patients signed informed consent. The data that support the findings of this study were generated by Philips Electronics B.V., Best, The Netherlands and Karolinska University Hospital, Stockholm, Sweden. All images of the datasets were acquired at the same UHD resolution of 2592 pixels by 1920 lines. The first dataset consisted of one multi-view acquisition, thus four images, of a cadaver. MSER and SURF were applied to several selected regions to perform the first multi-view experiment for skin localization. The second dataset consisted of image data from eight patients included in a spine navigation study and taken during the surgical procedures. The data were used to perform two different experiments: first a skin feature localization, and later an optical marker localization with ground-truth comparison. In Table 1, the total number of analyzed frames during the acquisitions and the corresponding acquisition times are reported. All patients were classified by the physicians as Fitzpatrick Skin Type I, II, or III. The first feasibility study was performed by analyzing the skin of the cadaver. The localization system was used for four flat selected regions with the c1|c3 camera pair. The features detected for this dataset were triangulated with a mean triangulation error of 0.239 mm for MSER and 0.218 mm for SURF. Due to its intrinsic functional operation, the MSER algorithm detects multiple blobs located at the same coordinates. This explains why MSER seems to have the capability to detect more features than SURF. The discarded-feature ratio of the matched features to the selected inliers was 3.96 ± 0.80 and 2.93 ± 0.45 for MSER and SURF, respectively. 
The clinical dataset involved patients undergoing open surgical procedures via mid-line incisions along the spine. Several planar regions were carefully selected for some patients (2nd, 4th, and 7th), where the skin was partially covered by blood. The performance results of the localization framework for all eight patients in the study are shown in Tables 2 and 3. Descriptive statistics for triangulation error for each detection method are reported in Tables 2 and 3. Figure 1 shows two examples of MSER and SURF feature detection and corresponding matches at the same location for an image pair. Over a total of 4934 (MSER) and 1727 (SURF) features, mean triangulation errors of 0.207 and 0.204 mm were reached for MSER and SURF, respectively. An important observation was that 75% of the detected features had a triangulation error within 0.3 mm (Fig. 2), appropriate for spinal surgery applications. The discarding ratio of the matched features to the selected inliers in this case was 3.73 ± 2.69 and 2.61 ± 1.70 for MSER and SURF, respectively. The median errors show a similar variability in the triangulation error when SURF and MSER are used (Fig. 3a, b). The variability and the outliers may be caused by lighting differences or limited visibility of the skin area [31]. The triangulation error using SURF and MSER for each individual case is depicted in Fig. 4.
Two-sample t tests were used to assess differences when using MSER and SURF. A p value of less than 0.05 was considered statistically significant. No statistically significant differences between the two methods were found (p > 0.05). A two-sample t test was also performed to assess the accuracy of the markerless approach, which was found to be superior (p < 0.05) compared to the ground truth (marker-based detection). A statistically significant difference was also found when detecting skin features among different patients (p < 0.001). This can reflect differences in the number of analyzed frames per patient, illumination conditions, number of detected features, and skin type. In this case, the F test rejects the null hypothesis at the default 5% significance level and suggests that the true variance is greater than 25%. The computation time for the skin feature detection was on average 0.19 and 1.86 s per frame when SURF and MSER were used, respectively. Per-patient results are visualized in Fig. 5a. With a mean of 5 fps, SURF is most suitable for a future real-time implementation. The preprocessing step reached a computation time of 1.14 s. Notably, for real-time navigation, shortened preprocessing times may be achieved with improved lighting conditions.
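As a quick check of the frame rates implied by these timings (a worked calculation from the reported per-frame averages, not an additional measurement):

$$\text{fps}_{\text{SURF}} = \frac{1}{0.19\ \text{s}} \approx 5.3, \qquad \text{fps}_{\text{MSER}} = \frac{1}{1.86\ \text{s}} \approx 0.54$$

These rates cover feature detection only. If the 1.14 s preprocessing step had to be repeated for every frame (an assumption, since the text does not state how often it runs), the combined SURF rate would be about $1/(0.19 + 1.14)\ \text{s} \approx 0.75$ fps.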

Marker localization and ground-truth comparison
The marker detection and ground-truth comparisons were performed by applying both the MSER and SURF feature detection algorithms to detect the optical markers positioned on the patient. The mean triangulation error of the tracked markers was 0.290 mm for MSER and 0.303 mm for SURF, as shown in Table 4. Relative to the ground truth, an average Euclidean distance of 0.627 mm for MSER and 0.622 mm for SURF was reached (Table 4). Descriptive statistics for triangulation error and Euclidean distance when the detection methods are applied to the optical markers are reported in Table 4. Figure 6a, b shows the frequency distributions of triangulation errors and Euclidean distances for the MSER and SURF detection methods. Notably, the thresholding performed for segmenting the optical markers prior to applying the feature detection can cause a non-ideal identification of the markers and decrease the triangulation accuracy. This is the main reason why the coordinates of the triangulated markers differ slightly with respect to the ground truth. However, all markers are triangulated with sub-millimeter accuracy, resulting in a triangulation error comparable to the one obtained with the skin features.

Discussion
This feasibility study proposes a new, accurate, and unobtrusive approach for skin feature localization which can be used for patient tracking in surgical navigation. This result was achieved with the direct detection of features on the patient's skin using high-resolution grey-scale cameras and the subsequent analysis of the captured multi-view images. The framework is based on MSER and SURF feature detection algorithms to localize the visual skin features. The accuracy of roughly 0.6 mm at skin level achieved with the current framework should be seen in light of previous results obtained by the ARSN system relying on adhesive skin markers. In a recent study using ARSN, Burström et al. [32] demonstrated  [33][34][35][36][37][38]. An off-the-shelf video camera and a 3D surface scanner were used to create a representation of the surface of the patient. Previously segmented structures could then be projected back on the patient. This system made dynamic tracking of the patient possible and reached a high accuracy of 1.5 mm. It has been used for radiotherapy and in craniofacial surgery, but no use in spine surgery has been published [39]. Microsoft Hololens has recently been used in many applications. The obtained accuracies range from 9 to 45 mm in non-medical phantoms to roughly 5 mm in a spine phantom study [14,15]. A recent robotic study using a structured light camera for markerless tracking of the bony anatomy reached a precision of 0.28 ± 0.16 mm [13]. A study using an approach similar to the one used here for spine feature tracking reached an accuracy of 0.5 mm [31]. When exploring the use of hyperspectral imaging (HSI) to detect skin features in 2D on healthy volunteers, an accuracy better than 0.25 mm was achieved [40]. In the current study, the feasibility of 3D-localization of skin features in patients undergoing spine surgery was demonstrated, employing a preexisting surgical navigation system using optical cameras. 
The use of grey-scale cameras rather than HSI is motivated by several factors. First, HSI is highly dependent on proper lighting conditions. Surgical lights illuminating the skin surface can interfere with the image acquisition process [40]. Second, the HSI system did not reach deeper than 1 mm in the skin, limiting its added value. Third, the integration of two or more HS cameras in the navigation system, to enable stereo-vision, would come at a considerable cost. In this scenario, the use of grey-scale cameras represents a good compromise. The results obtained by the proposed framework, using grey-scale cameras, are thus well in line with these previous results. A markerless tracking framework has the advantage of building a virtual reference grid that cannot be dislodged or completely obscured during surgery. Furthermore, compared to conventional dynamic reference frames that track a single vertebra, a markerless framework can track the entire spine and compensate for inherent movements within the spinal column during surgery. In this study, several datasets were used for validation. MSER showed a better capability of detecting features of observable anatomical skin details. It was visually verified that MSER provided a higher number of detected features, contributing to a better plane selection in the 3D outlier removal step. The multi-camera system enabled triangulation of each feature, to obtain an accurate 3D triangulation performance. This performance can potentially be evaluated by automatically computing the triangulation error continuously and in real time within a software function for navigation. The proposed framework may simplify the existing patient preparation procedure and improve the reliability of the tracking process by relying on skin features instead of optical markers or reference frames.

Limitations
The limitations of this study are the small sample size and the retrospective setup. The results were validated on eight clinical cases. The addition of more cases would strengthen the conclusions. A prospective study comparing different modes of patient

Conclusion
This study demonstrates the feasibility of skin feature localization by exploiting an optical, multi-view, grey-scale camera system combined with image analysis and tracking techniques. The system has been tested on several patients undergoing spinal surgery, achieving sub-millimeter accuracy. This study can be the basis for future surgical applications where optical patient tracking is required.

Image preprocessing
The principles of multi-view geometry [41] assume a pinhole camera model, which was applied to correct the camera images with respect to the intrinsic parameters. As a preprocessing step, the Contrast-Limited Adaptive Histogram Equalization (CLAHE) algorithm was used to maximize the detection of skin features and reduce the noise during the acquisition [42]. The fundamental matrix F, computed with the normalized eight-point algorithm, was used for image rectification [43]. The fundamental matrix F was defined by

$$x'^{\top} F x = 0 \qquad (1)$$

for any pair of matching points $x'$ and $x$ in the two images, expressed in the same coordinate system. The obtained pixel points were imposed on the corrected input images, to enable feature detection.
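To illustrate what the fundamental matrix expresses, the sketch below evaluates the standard epipolar constraint $x'^{\top} F x = 0$ for candidate point pairs in homogeneous pixel coordinates. The matrix used is the textbook fundamental matrix of an ideally rectified pair, for which the constraint reduces to both points sharing the same image row. It is an assumption chosen for illustration, not a matrix estimated from the study's cameras, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Pixel = Tuple[float, float]

# Fundamental matrix of an ideally rectified image pair
# (illustrative assumption: pure horizontal offset between the views).
F_RECTIFIED: Matrix3 = (
    (0.0, 0.0, 0.0),
    (0.0, 0.0, -1.0),
    (0.0, 1.0, 0.0),
)


def epipolar_residual(F: Matrix3, x: Pixel, x_prime: Pixel) -> float:
    """Return x'^T F x for pixel points x = (u, v) and x' = (u', v').

    Both points are lifted to homogeneous coordinates (u, v, 1).
    A residual of zero means the pair satisfies the epipolar constraint.
    """
    xh = (x[0], x[1], 1.0)
    xph = (x_prime[0], x_prime[1], 1.0)
    Fx = [sum(F[r][k] * xh[k] for k in range(3)) for r in range(3)]
    return sum(xph[r] * Fx[r] for r in range(3))


# Same row in both views: constraint satisfied.
same_row = epipolar_residual(F_RECTIFIED, (320.0, 240.0), (295.5, 240.0))
# Eleven rows apart: constraint violated.
other_row = epipolar_residual(F_RECTIFIED, (320.0, 240.0), (295.5, 251.0))
print(same_row, other_row)
```

For this particular matrix the residual equals the row difference $v - v'$, so a match is admissible only when both points lie on the same scan line. This is why, after rectification, the search for correspondences can be limited to scan lines.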

Feature detection, extraction, and matching
Let us consider an image pair captured with different cameras $c_i$ and $c_j$, from different views: $I_{c_i}$ and $I_{c_j}$. For both images, a corresponding set of $n(c_i)$ and $n(c_j)$ features, respectively, was extracted and saved in dedicated object ensembles $F_{c_i}$ and $F_{c_j}$, to capture all information of the detected features $f(c, n)$:

$$F_{c_i} = f(c_i, 1), f(c_i, 2), \ldots, f(c_i, n(c_i)), \qquad (2)$$

$$F_{c_j} = f(c_j, 1), f(c_j, 2), \ldots, f(c_j, n(c_j)). \qquad (3)$$

MSER and SURF algorithms were applied to detect blob-like features. Afterward, the SURF feature descriptor was applied for feature extraction. The regions of interest (ROIs) on the skin were selected manually and saved as bounding-box coordinates for future iterations. A manual selection was performed, because the regions of the same subject were located within different views. Unfortunately, it was not possible to manually select precisely the same region boundary within multiple views. For this reason, the matching process contained some outliers, which were filtered out at a later step of the processing. Attention was paid to select regions where the skin was as flat as possible. The chosen skin area was located around the optical markers (used as ground truth), so that both the markers and skin area had the same illumination conditions. This manual extraction may represent a limitation for a real-time application. For every detected feature, it was necessary to extract a feature vector, known as a descriptor, that provided information about the feature, in this case the pixels surrounding the center of the blob. For this purpose, the SURF algorithm was used to extract the feature vector [27]. This method was adopted, since it offers high reproducibility, even under different viewing conditions. The descriptor vectors were then saved in the following sets of dedicated descriptor vectors:

$$\Delta_{c_i} = \delta(c_i, 1), \delta(c_i, 2), \ldots, \delta(c_i, n(c_i)), \qquad (4)$$

$$\Delta_{c_j} = \delta(c_j, 1), \delta(c_j, 2), \ldots, \delta(c_j, n(c_j)), \qquad (5)$$

where every descriptor $\delta(c, n)$ consisted of a SURF descriptor vector and the y-coordinate of a specific feature $n$ from a generic camera $c$.
Using the dedicated descriptor vectors $\Delta_{c_i}$ and $\Delta_{c_j}$, the feature matching step matched the $n(c_i)$ features detected in one view against the $n(c_j)$ features from another view and provided an index of correspondences between the two dedicated descriptor vectors $\Delta_{c_i}$ and $\Delta_{c_j}$, and the feature vectors $F_{c_i}$ and $F_{c_j}$. These correspondences were obtained by computing the SSD between the SURF descriptor vectors of those features lying within the scan lines of interest. At this point, incorporating the epipolar constraint was crucial, since it restricts the matching process to features that are shifted along an epipolar line within a specific range. Thanks to this top-to-bottom scan-line stereo matching, matches between features that lie on different epipolar lines were omitted, to reduce the computational cost and maximize the chance of a good match. This step returned two vectors with the indexes of the matched features. Using these indexes, it was possible to build two new dedicated object ensembles $M_{c_i}$ and $M_{c_j}$ of equal size, containing the matched features.

Feature triangulation and outlier removal
In computer vision, triangulation allows determination of the 3D position of a point, given that the positions of the same point are matched in at least two alternative views [41]. This was achieved with two or more lines projected from each camera center to the respective point on the camera plane. In practice, the projected lines did not always intersect in the same 3D point. It was therefore important to evaluate and quantify the accuracy of this method. In this feasibility study, the triangulation error was used as an evaluation metric to obtain an index of the triangulation accuracy, and the 3D point locations were then used to perform a benchmark against an existing tracking system. The triangulation error was computed by locating the shortest line segment connecting the two projected lines. The center of this segment was taken as the triangulated point, and its length as the triangulation error, expressed in millimeters (or in micrometers). The triangulation function of the ARSN system returns the 3D Cartesian coordinates of the triangulated point and the corresponding triangulation error through the projections from the camera centers, resulting in: