Statistical learning methods as a preprocessing step for survival analysis: evaluation of concept using lung cancer data
© Behera et al; licensee BioMed Central Ltd. 2011
Received: 13 September 2011
Accepted: 8 November 2011
Published: 8 November 2011
Statistical learning (SL) techniques can address non-linear relationships and small datasets but do not provide an output that has an epidemiologic interpretation.
A small set of clinical variables (CVs) for stage-1 non-small cell lung cancer patients was used to evaluate an approach for using SL methods as a preprocessing step for survival analysis. A stochastic method of training a probabilistic neural network (PNN) was used with differential evolution (DE) optimization. Survival scores were derived stochastically by combining CVs with the PNN. Patients (n = 151) were dichotomized into favorable (n = 92) and unfavorable (n = 59) survival outcome groups. These PNN derived scores were used with logistic regression (LR) modeling to predict favorable survival outcome and were integrated into the survival analysis (i.e. Kaplan-Meier analysis and Cox regression). The hybrid modeling was compared with the respective modeling using raw CVs. The area under the receiver operating characteristic curve (Az) was used to compare model predictive capability. Odds ratios (ORs) and hazard ratios (HRs) were used to compare disease associations with 95% confidence intervals (CIs).
The LR model with the best predictive capability gave Az = 0.703. While controlling for gender and tumor grade, the OR = 0.63 (CI: 0.43, 0.91) per standard deviation (SD) increase in age indicates increasing age confers unfavorable outcome. The hybrid LR model gave Az = 0.778 by combining age and tumor grade with the PNN and controlling for gender. The PNN score and age translate inversely with respect to risk. The OR = 0.27 (CI: 0.14, 0.53) per SD increase in PNN score indicates those patients with decreased score confer unfavorable outcome. The tumor grade adjusted hazard for patients above the median age compared with those below the median was HR = 1.78 (CI: 1.06, 3.02), whereas the hazard for those patients below the median PNN score compared to those above the median was HR = 4.0 (CI: 2.13, 7.14).
We have provided preliminary evidence showing that the SL preprocessing may provide benefits in comparison with accepted approaches. The work will require further evaluation with varying datasets to confirm these findings.
Statistical learning (SL) techniques with kernel mappings can provide benefits when addressing complicated decision problems [1–3]. These techniques can capture non-linear input-output characteristics, operate on small datasets with feature correlation, and do not require modeling or distribution assumptions. These attributes come with tradeoffs: the methods do not provide an output that has a useful epidemiologic interpretation, and their training often requires specialized techniques. In contrast, logistic regression (LR) modeling, Kaplan-Meier analysis, and Cox regression provide important epidemiologic interpretations and are used extensively due to their availability. This report advances our earlier simulation work in adapting SL methods for epidemiologic application (see Appendix).
Lung cancer is the leading cause of cancer-related mortality in the world, with more than a million deaths each year. Lung cancer is often diagnosed at an advanced stage because early detection has been elusive. Recent evidence indicates that lung cancer mortality can be reduced when high-risk patients are screened with a low-dose computerized tomography (CT) scan. Before this promising approach is incorporated into general practice, several important outstanding clinical issues have to be addressed [6, 7]. For patients with early stage lung cancer, local therapy with surgical resection is associated with the best survival outcomes. This is limited to those with non-small cell lung cancer (NSCLC), which accounts for approximately 85% of all cases of lung cancer in the United States. Despite optimal surgical resection, recurrence of disease is noted in 30-75 percent of patients, depending on the initial stage. Development of prognostic models for predicting survival outcomes for patients with NSCLC after resection will have important healthcare implications.
To adapt an SL methodology for epidemiologic application, a problem in NSCLC survival prognosis was analyzed for stage-1 patients using a relatively small set of variables collected routinely for patients of this kind, similar to those investigated previously. A probabilistic neural network (PNN) was combined with LR modeling and survival analyses (i.e. Kaplan-Meier analysis and Cox regression) to demonstrate proof of concept. This hybrid approach combines the strengths of the SL methodology with these important epidemiologic techniques. The PNN is a statistically inspired neural network that uses a kernel mapping [10, 11] to estimate the underlying probabilities. For the LR modeling comparisons, the NSCLC dataset was dichotomized into two groups comprising patients with favorable or unfavorable survival outcomes. Raw clinical variables and a new patient score variable formed with the modified PNN were considered as prognostic factors. Additionally, the PNN output was used as the study variable and compared with age using survival analysis. There are weight parameters within the PNN that must be estimated properly. Differential evolution (DE) was used for this optimization problem. Stochastic methods were developed to provide feedback to the DE optimization and to derive the patient PNN scores. We also evaluated this new system with the simulated datasets and methods described previously, as discussed in the Appendix.
The dataset comprised data from 151 NSCLC patients who underwent surgical resection from 2002 to 2006. All data were selected retrospectively and consecutively. Stage-1 patients who had complete case ascertainment for the variables under consideration were selected. Ninety-two (n1) of these patients were alive at last contact (censored), and 59 (n2) patients died (incident) during the course of the contact interval. The clinical variables abstracted from the patient files included age (i.e. age of the patient at the time of procedure), gender (binary), history of smoking (binary), histology sub-type (four categories), and tumor grade. Past or current smokers were categorized as smokers (yes); all other patients were categorized as non-smokers (no). The four histological sub-types were: adenocarcinoma (AC), squamous cell carcinoma (SCC), large cell carcinoma (LCC), and adenosquamous carcinoma (ASC). Tumor grade is a 1-3 integer scale describing cancer cell differentiation (a measure of abnormality) derived from pathology reports. These data were collected under a protocol approved by the Western Institutional Review Board.
Favorable Outcome and Survival Analysis
The non-interaction LR model was used to predict favorable and unfavorable survival outcome by dichotomizing the population into two groups. The 92 censored patients were designated as the favorable survival outcome group, defined as group-1 (i.e. the censored group). The 59 patients who died were designated as the unfavorable survival outcome group, defined as group-2 (i.e. the incident group). Other methods of dichotomizing the population were considered but discarded, as discussed in the Results Section. Overall survival (OS) time was measured as the interval between the date of procedure and the date of death for a given patient, when applicable. Censor time was measured as the interval between the date of the procedure and the date that a given patient was censored, when applicable. The LR model was referenced to predict the probability of a favorable outcome. Age was treated as a continuous variable with integer accuracy, and grade was treated as a continuous variable taking three integer states (grades 1-3). Histology (four-state) and gender (two-state) were treated as categorical variables. Age and grade were combined to form a continuous patient score (or z) using a variation of the PNN. This choice follows from the (non-hybrid) LR modeling findings and from the fact that both were treated as continuous variables, whereas the remaining variables were categorical or binary and not strictly amenable to probability density estimation. Kaplan-Meier product-limit estimators and Cox regression were used for the survival probability curve analyses. In this analysis, two groups were formed by choosing the median age and the median PNN score as the separation points. The other relevant variables were introduced with both age and PNN score to evaluate their influence on the respective survival probability curves.
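The product-limit estimate mentioned above can be illustrated with a short Python sketch. It is not the SAS procedure used in the study; the function name and the toy follow-up times are hypothetical, and the sketch only shows how the survival probability is updated at each death time while censored patients leave the risk set without changing the estimate.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimates (illustrative sketch).

    times  : follow-up time per patient (OS time or censor time)
    events : 1 if the patient died (incident), 0 if censored
    Returns a list of (time, survival_probability), one entry per death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving time t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical toy data: five patients, three deaths, two censored.
toy_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
```

To mirror the stratified analysis, the same function would be called once per group, for example on the patients above and below the median age or median PNN score.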
For the LR modeling comparisons, odds ratios (ORs) were used to assess measurement association. For age and PNN score (i.e. the continuous variables), the LR model coefficients were re-scaled to provide ORs per standard deviation (SD) change for each variable. The ORs for grade were cited per unit increase. The area under the receiver operating characteristic curve (Az) was used to measure the predictive capability of a given model. The Az was estimated with three methods. First, to assess the SL training and patient scores, the definition of Az was applied using the respective distributions. Secondly, the Az quantities for the LR models were generated within the SAS (SAS Institute, NC) software package using the output of the LR model (same interpretation as provided by the first method). For the Kaplan-Meier analysis, chi-square Wilcoxon (more sensitive to shorter term survival differences) and log-rank (more sensitive to longer term survival differences) tests were used for differences in stratification. Hazard ratios (HRs) were estimated with Cox regression. Thirdly, Az was also derived from Cox regression, where it is a measure of the agreement between the model and the actual time-to-event outcome. For the ORs and HRs, 95% confidence intervals (CIs) were provided. The survival analysis was also performed with SAS.
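As an illustration of the first method, Az can be computed directly from two score distributions as the probability that a randomly chosen favorable-group score exceeds a randomly chosen unfavorable-group score, with ties counted as one half (the Mann-Whitney form of the ROC area). The sketch below assumes that larger scores indicate a favorable outcome; the function name and the toy scores are hypothetical.

```python
def az_from_scores(favorable, unfavorable):
    """Area under the ROC curve from two lists of scores (pairwise definition).

    Counts the fraction of (favorable, unfavorable) pairs in which the
    favorable score is larger; ties contribute one half.
    """
    wins = 0.0
    for a in favorable:
        for b in unfavorable:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(favorable) * len(unfavorable))


# Hypothetical toy scores, not study data.
toy_az = az_from_scores([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```

An Az of 0.5 corresponds to no separation between the two groups and an Az of 1.0 to complete separation.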
Probabilistic Neural Network and Kernel Methods
The multivariate normalization factors were not important because both g1 and g2 contained the same sigma-weights. These scores were used with LR modeling and the survival analysis. Because the above expression is always positive and can be large, we used z = ln(patient score) in the analyses as the PNN derived patient score and applied a range compression technique to reduce statistical outlier interference in the LR modeling.
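A minimal Python sketch of a log-ratio score of this kind follows. It assumes, as is standard for a PNN with Parzen kernels, that g1 and g2 are Gaussian kernel density estimates built from the group-1 and group-2 samples with one sigma-weight per input variable, and that the patient score is the ratio g1/g2, so that z = ln(g1) - ln(g2). The kernel form, the averaging over samples, the assumption that inputs are already scaled, and all names are illustrative assumptions; the range compression step is not shown.

```python
import math


def log_parzen(w, samples, sigmas):
    """Log of an unnormalized Gaussian Parzen estimate at point w.

    w       : tuple of input values for one patient (e.g. scaled age, grade)
    samples : list of tuples for one outcome group
    sigmas  : one kernel width (sigma-weight) per input variable
    """
    exponents = [
        -sum((wj - sj) ** 2 / (2.0 * sg ** 2) for wj, sj, sg in zip(w, s, sigmas))
        for s in samples
    ]
    m = max(exponents)  # log-sum-exp keeps very small kernel values from underflowing
    return m + math.log(sum(math.exp(e - m) for e in exponents)) - math.log(len(samples))


def pnn_z(w, group1, group2, sigmas):
    """z = ln(g1 / g2); positive values favor group-1 (favorable outcome)."""
    return log_parzen(w, group1, sigmas) - log_parzen(w, group2, sigmas)


# Hypothetical toy samples (scaled age, scaled grade), not study data.
favorable = [(0.2, 0.0), (0.3, 0.5)]
unfavorable = [(0.8, 1.0), (0.7, 0.5)]
z_toy = pnn_z((0.25, 0.0), favorable, unfavorable, sigmas=(0.2, 0.5))
```

Because the same sigma-weights enter both estimates, the Gaussian normalization constants cancel in the ratio, which is why they are omitted here.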
Probabilistic Neural Network Training and Operation
[Table values not recovered. Surviving column labels: "Incident mean/SD or %", "Censored mean/SD or %". Surviving row labels: "Grade and Gender adjusted", "Male vs Female", "z (Age and Grade)", "Male vs. Female".]
The DE training for the modified PNN resulted in two sigma-weights, σ1 = 0.013610961 and σ2 = 0.35805283, for age and grade, respectively. Using Nt = 1 produced training Az values between 0.700 and 0.830. Choosing Nt = 5 gave consistent findings and was used in the analysis. The stochastic cross-validation performance coinciding with these weights gave Az = 0.710 with SE = 0.03 after three generations (G = 3), which is in agreement with the Az derived from the holdout cross-validation analysis (see Appendix). We used these parameters to generate z for each patient with Nsc = 20. Processing age and grade separately through the PNN gave Az = 0.656 for age and Az = 0.538 for grade, which are statistically similar to the Az values obtained when assessing these variables individually without the PNN processing. The continuous hybrid LR findings are shown at the bottom of Table 2. The combined effect shows that for an SD increase in z (SD = 1.69), the respective patient is about 4.15 times more likely to experience a favorable survival outcome (or, inversely, an incident group member is 0.24 times as likely to experience a favorable outcome) with Az = 0.763, which was significantly larger (p = 0.0062) than that provided by the respective age and grade LR model. Due to the way the PNN was defined, increasing z was protective, whereas increasing age was not. Adjusting for gender increased the predictive capability of the model with Az = 0.778 (SE = 0.03), although the gender OR lost significance. Gender also reduced the association for z with OR = 3.67 per SD increase, which was a stronger association than that provided by age in the corresponding model. The Az derived from the hybrid model (z and gender) was significantly greater than that of the corresponding LR model with age, grade, and gender (p = 0.0173). As above, including histology-type or smoking status with z had a marginal influence on the relationships (not shown).
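The relationship between an LR coefficient, the OR per SD change, and its reciprocal can be checked with a few lines of Python. The sketch assumes the usual re-scaling OR = exp(β · SD), where β is the LR coefficient per unit of the variable; the function names are hypothetical, and the only inputs are the ORs and the SD quoted above.

```python
import math


def or_per_sd(beta, sd):
    """Odds ratio for a one-SD increase, given the per-unit LR coefficient."""
    return math.exp(beta * sd)


def beta_from_or_per_sd(or_sd, sd):
    """Per-unit LR coefficient implied by an odds ratio per SD increase."""
    return math.log(or_sd) / sd


# Values quoted in the text: OR = 4.15 per SD increase in z, with SD = 1.69.
beta_z = beta_from_or_per_sd(4.15, 1.69)
or_check = or_per_sd(beta_z, 1.69)      # recovers 4.15 up to rounding error

# Reciprocals give the association in the opposite direction.
inverse_unadjusted = round(1.0 / 4.15, 2)        # 0.24
inverse_gender_adjusted = round(1.0 / 3.67, 2)   # 0.27
```

The reciprocal form expresses the same association for a one-SD decrease in z, which is why an OR of 4.15 and an OR of 0.24 describe the same effect.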
Age and z relationships
We used the OS and censor times to form two groups because of the separation between the respective distribution means. The favorable group had a mean censor time of 3.97 years (i.e. mean known OS time, which is a lower limit assuming these patients did not expire the day after study contact), whereas the incident group had a mean OS time of 2.20 years (data not shown). The minimum censor time (2.35 years) is greater than the mean OS time for the incident group, which supports the validity of the dichotomization method. Other methods of dichotomizing the population were considered, such as choosing a cut-point at a given OS time, but this technique added ambiguity for those censored on the left side of the cut-point and left few samples on the right side of the cut-point when four or five year OS times were considered as the demarcation.
Hazard relationships for dichotomous age and z
Age Hazard Ratio
1.72 (1.02, 2.90)
1.78 (1.06, 3.02)
1.64 (0.96, 2.78)
Grade Gender adjusted
1.68 (0.99, 2.85)
z Hazard Ratio
0.25 (0.14, 0.47)
0.28 (0.15, 0.53)
Survival probability statistical test summaries
Dichotomous Age over Strata
Dichotomous Age and Gender over Strata
Dichotomous z over Strata
Dichotomous z and Gender
A technique for incorporating SL methods with epidemiologic analyses was illustrated. The approach used ensemble averaging based on bootstrap sampling. These preliminary findings indicate that the hybrid approach provided benefits. With these data, the hybrid approach provided greater Az in the logistic regression modeling and stronger hazard relationships in the survival analyses than the accepted approaches. The internal validity of our findings is supported by the analysis provided in the Appendix. This approach represents a framework that is easily generalized. We used the SL output as the input into the LR model and survival analysis, essentially combining the strengths of the various modeling techniques. In this capacity, the SL device was operating as a front-end preprocessing step for these accepted analysis techniques. Processing the SL output with these approaches provides a mechanism for converting the SL output into epidemiologic metrics (i.e. ORs and HRs). We used a relatively simple SL device to demonstrate the concept with a two-class problem. This specific approach can be extended to include more than two classes (e.g. death, greater than three, and five year survival benchmarks). The PNN applies to multiclass problems as well, and multinomial logistic regression can address multiple level outcomes. It could be argued that the LR modeling was suboptimal because the time-to-event variable resolution was reduced to a coarser dichotomous variable. However, for a specific set of variables, the LR output provides a different metric (i.e. probability of having a favorable outcome) than that provided by Cox regression (i.e. instantaneous relative risk). Thus, the resolution reduction is the price paid for an alternative output. More generally, the same hybrid approach is applicable to the output of any other type of SL method or decision device (e.g. support vector machines, kernel based partial least squares, or other types of neural networks).
There are several limitations with our findings. The analysis was performed with a limited number of samples derived retrospectively. Censoring limits the survival time estimations. Although DE is a robust approach, there is no guarantee that it will converge, so the findings may be less than optimal. The generation termination limit, G = 3, was set empirically because we found that letting the process evolve over many generations produced weights that were too finely tuned and did not provide performance consistency between the training evaluation and the final score assessments. Because the dataset was limited, further evaluation using both simulation methods and holdout cross-validation with the z-score was provided in the Appendix. The findings from the hybrid modeling will require further evaluation with different datasets to show generalization. In principle, to use a system such as the one illustrated here in practice, the sampled patient population should be representative of stage-1 lung cancer patients in general. Operating this system with new datasets would relegate the current dataset to the role of w_i in the final score generation, with prospective samples used as w (without further training) to generate z for assessing survival probabilities or predicting favorable outcomes. The SL method was trained with a dichotomized survival output, which was the same output used to train the hybrid LR model. Therefore, the corresponding hybrid LR model (or the survival curve separation) could be confounded by the z variable if the choice of kernel or weights were suboptimal. Determining the optimal kernel was beyond the scope of this research.
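For readers unfamiliar with DE, the following Python sketch shows one common variant (often called DE/rand/1/bin) with a fixed generation limit. It is not the authors' implementation: the variant, the population size, the scale factor F, the bound clipping, and the toy fitness function are all assumptions made for illustration. In the study the fitness feedback was the training Az of the PNN, for which the toy function below is only a stand-in.

```python
import random


def differential_evolution(fitness, bounds, pop_size=10, F=0.5, Cr=0.9,
                           generations=3, seed=0):
    """Maximize `fitness` over box bounds with a basic DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [fitness(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from member i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for j in range(dim):
                if rng.random() < Cr or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            f_trial = fitness(trial)
            # Greedy selection: the trial replaces member i only if no worse.
            if f_trial >= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = max(range(pop_size), key=lambda k: fit[k])
    return pop[best], fit[best]


def toy_fitness(p):
    """Hypothetical stand-in for a training Az; peaks at (0.3, 0.1)."""
    return -((p[0] - 0.3) ** 2 + (p[1] - 0.1) ** 2)


weights, score = differential_evolution(toy_fitness, [(0.0, 1.0), (0.0, 1.0)])
```

Stopping after a small number of generations, as with G = 3 above, acts as a form of early stopping that limits how finely the weights are tuned to the training samples.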
Generalizing the LR model and incorporating kernel based techniques into epidemiologic survival analyses represent a diverse field of inquiry. Earlier research used a PNN and LR modeling to predict survival in early stage NSCLC but did not fuse the models. Logistic regression is a member of a family of generalized linear models. Replacing the LR argument with various forms of smooth functions has provided benefits in the study of colon cancer, heart disease, and infant mortality. Other researchers have incorporated univariate kernel density estimations for studying prostate cancer, health disparities, and nutrient intake. Similarly, univariate kernel density estimations have been used to estimate summary measures that were incorporated into LR modeling in fast-food consumption studies. Our work differs from this other work in that the PNN application makes no assumption concerning the functional relationship of the variables under study and we incorporated the measures into LR.
An SL methodology comprising DE optimization, a kernel mapping, and stochastic ensemble averaging was presented as an illustration of how to generalize widely used analysis techniques. The technique gives the SL methodology an epidemiologic interpretation. Although we used a specific example, the framework applies to all situations where LR modeling and survival analysis are appropriate. The approach can be easily modified to include as many input variables as required, and new samples can be added to the training procedure with the proper clinical feedback, indicating that the system can learn continually; because of its relative simplicity, this does not impose heavy computer processing demands. The system will require further evaluation with different datasets before it can be applied in practice.
Additional evaluation was performed to assess the PNN z score method that included a simulation study and holdout cross-validation analysis.
A simulation was performed to assess the training, optimization, and patient scoring system shown in Figure 1 under ideal conditions. We used the same simulation methods with two correlated input features and a non-linear separation boundary as described in our earlier work. As before, we used 200 samples per class, giving 400 samples in total for the training dataset. We used this training dataset to estimate the sigma-weights using the algorithm described above (Figure 1). We used the same stochastic averaging (Nt = 5 and Nsc = 20) and bootstrap methods. We stopped the differential evolution optimization at G = 3 as above, which gave two sigma-weights (0.291156797, 0.0872920) with a training Az = 0.987. The training dataset was then used for w_i in the score generation using independent data. We then simulated an evaluation dataset of the same dimension (200 per class, giving 400 samples in total) that was not used in the sigma-weight generation. These new samples were then used as w in the stochastic score generation and evaluated. This evaluation gave Az = 0.979. This shows that, in principle, the system is viable and that the training distribution must be representative of the population. It is also worth noting that the separation provided by this modified PNN system was larger than that described previously using a different statistical learning system when processing the same type of simulated datasets (i.e. Az ≈ 0.950).
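The stochastic score generation can be sketched in Python as an average over bootstrap populations drawn from the fixed training samples. The sketch assumes that each outcome group is resampled with replacement to its original size and that the Nsc individual scores are averaged arithmetically; both points, along with the function names and the toy scoring function, are assumptions rather than details taken from the study.

```python
import random


def bootstrap_average_score(w, group1, group2, score_fn, n_boot=20, seed=0):
    """Average score for sample w over n_boot bootstrap training populations.

    w        : new sample to be scored
    group1/2 : fixed training samples for each outcome group
    score_fn : callable (w, boot_group1, boot_group2) -> float,
               for example a PNN log-ratio score
    n_boot   : number of bootstrap populations (Nsc = 20 in the text)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes fixed.
        boot1 = [rng.choice(group1) for _ in group1]
        boot2 = [rng.choice(group2) for _ in group2]
        total += score_fn(w, boot1, boot2)
    return total / n_boot


def toy_score(w, g1, g2):
    """Hypothetical one-dimensional score: closer to the group-1 mean is higher."""
    m1 = sum(g1) / len(g1)
    m2 = sum(g2) / len(g2)
    return abs(w - m2) - abs(w - m1)


averaged = bootstrap_average_score(2.0, [1.0, 1.5, 2.5], [4.0, 5.0, 6.5], toy_score)
```

Averaging over resampled training populations reduces the dependence of any single patient score on the particular training samples drawn.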
To assess the internal validity of the approach, we used the scheme shown in Figure 1 with one main difference. Two patient samples (one sample from each group) were selected at random and held out of the training process (i.e. leave-two-out cross-validation). To slow the DE convergence, we set Cr = 0.1. The system comprising the remaining n-2 patients was trained for 20 DE generations for each holdout pair. These n-2 samples were used for training and for generating training z scores (age combined with grade with the PNN) and Az values. For each DE generation, a bootstrap population was generated from the fixed n-2 population and an Az was generated. The weights that gave the largest Az over the 20 DE generations were used to generate the z scores for the two held-out samples (holdout pair). We used stochastic averaging for the output scores, where 20 bootstrap populations were generated from the fixed n-2 training samples (generating 20 scores for each of the two left-out samples). This process cycled (i.e. choosing another pair at random, leaving a new n-2 training population for the next 20 DE generations) until all patients received a score. The resulting leave-two-out cross-validation gave Az = 0.700, indicating the approach was internally valid.
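The holdout loop can be outlined in Python as follows. Because the two groups differ in size, the sketch assumes that samples from the smaller group are reused once they have all been held out and that the first score obtained for a patient is kept; the text does not state how this was handled, so both choices are assumptions. The callable `train_and_score` is a hypothetical placeholder for the DE training and bootstrap-averaged scoring on the remaining n-2 samples.

```python
import random


def leave_pair_out(group1, group2, train_and_score, seed=0):
    """Hold out one sample per group at a time until every sample is scored.

    train_and_score(train1, train2, held1, held2) -> (score1, score2)
    must train only on train1/train2 and return scores for the held-out pair.
    Returns two lists of scores aligned with group1 and group2.
    """
    rng = random.Random(seed)
    order1 = list(range(len(group1)))
    order2 = list(range(len(group2)))
    rng.shuffle(order1)
    rng.shuffle(order2)
    scores1, scores2 = {}, {}
    n_pairs = max(len(order1), len(order2))
    for k in range(n_pairs):
        i = order1[k % len(order1)]  # cycles through the smaller group again
        j = order2[k % len(order2)]
        train1 = [x for m, x in enumerate(group1) if m != i]
        train2 = [x for m, x in enumerate(group2) if m != j]
        s1, s2 = train_and_score(train1, train2, group1[i], group2[j])
        scores1.setdefault(i, s1)  # keep the first score if a sample is revisited
        scores2.setdefault(j, s2)
    return ([scores1[i] for i in range(len(group1))],
            [scores2[j] for j in range(len(group2))])


def toy_train_and_score(train1, train2, held1, held2):
    """Hypothetical stand-in: score is the distance to the opposite group mean."""
    m1 = sum(train1) / len(train1)
    m2 = sum(train2) / len(train2)
    return abs(held1 - m2), -abs(held2 - m1)


z1, z2 = leave_pair_out([1.0, 2.0, 3.0], [10.0, 20.0], toy_train_and_score)
```

The held-out scores for the two groups can then be compared with the pairwise Az definition to obtain a cross-validated Az.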
The authors thank Dr. Robert C. Hermann, Northwest Oncology Center, Marietta GA, for his efforts in the data collection for this project. Drs. Owonikoko, Khuri, and Ramalingam are Distinguished Cancer Scholars of the Georgia Cancer Coalition.
1. Vapnik VN: The Nature of Statistical Learning Theory. Second edition. NY: Springer; 2000.
2. Vapnik VN: Statistical Learning Theory. NY: John Wiley & Sons, Inc.; 1998.
3. Shawe-Taylor J, Cristianini N: Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press; 2004.
4. Heine JJ, Land WH, Egan KM: Statistical learning techniques applied to epidemiology: a simulated case-control comparison study with logistic regression. BMC Bioinformatics 2011, 12: 37. doi:10.1186/1471-2105-12-37
5. Manser RL, Irving LB, Byrnes G, Abramson MJ, Stone CA, Campbell DA: Screening for lung cancer: a systematic review and meta-analysis of controlled trials. Thorax 2003, 58(9):784–789. doi:10.1136/thorax.58.9.784
6. Bach PB: Inconsistencies in findings from the early lung cancer action project studies of lung cancer screening. J Natl Cancer Inst 2011, 103(13):1002–1006. doi:10.1093/jnci/djr202
7. Team NLSTR, Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, Fagerstrom RM, Gareen IF, Gatsonis C, Marcus PM, et al.: Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011, 365(5):395–409.
8. Montesinos J, Bare M, Dalmau E, Saigi E, Villace P, Nogue M, Angel Segui M, Arnau A, Bonfill X: The changing pattern of non-small cell lung cancer between the 90 and 2000 decades. Open Respir Med J 2011, 5: 24–30. doi:10.2174/1874306401105010024
9. Specht DF: Probabilistic neural networks. Neural Networks 1990, 3: 109–118. doi:10.1016/0893-6080(90)90049-Q
10. Parzen E: On estimation of a probability density function and mode. Annals of Mathematical Statistics 1962, 33(3):1065–1076. doi:10.1214/aoms/1177704472
11. Cacoullos T: Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics 1966, 18(1):179–189. doi:10.1007/BF02869528
12. Price KV, Storn RM, Lampinen JA: Differential Evolution: A Practical Approach to Global Optimization. Heidelberg: Springer; 2005.
13. Hosmer DW, Lemeshow S: Applied Logistic Regression. Second edition. NY: John Wiley & Sons, Inc.; 2000.
14. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143(1):29–36.
15. Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med 2004, 23(13):2109–2123. doi:10.1002/sim.1802
16. Mercer J: Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character 1909, 209: 415–446. doi:10.1098/rsta.1909.0016
17. Land WH Jr, Margolis D, Kallergi M, Heine JJ: A Kernel Approach for Ensemble Decision Combinations with two-view Mammography Applications. International Journal of Functional Informatics and Personalised Medicine 2010, 3(2):157–182. doi:10.1504/IJFIPM.2010.037152
18. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC; 1993.
19. Albain KS, Crowley JJ, LeBlanc M, Livingston RB: Survival determinants in extensive-stage non-small-cell lung cancer: the Southwest Oncology Group experience. J Clin Oncol 1991, 9(9):1618–1626.
20. Marchevsky AM, Patel S, Wiley KJ, Stephenson MA, Gondo M, Brown RW, Yi ES, Benedict WF, Anton RC, Cagle PT: Artificial neural networks and logistic regression as tools for prediction of survival in patients with Stages I and II non-small cell lung cancer. Mod Pathol 1998, 11(7):618–625.
21. Zhao LP, Kristal AR, White E: Estimating relative risk functions in case-control studies using a nonparametric logistic regression. Am J Epidemiol 1996, 144(6):598–609.
22. Abrahamowicz M, du Berger R, Grover SA: Flexible modeling of the effects of serum cholesterol on coronary heart disease mortality. Am J Epidemiol 1997, 145(8):714–729. doi:10.1093/aje/145.8.714
23. Gage TB, Fang F, O'Neill E, Stratton H: Maternal age and infant mortality: a test of the Wilcox-Russell hypothesis. Am J Epidemiol 2009, 169(3):294–303.
24. Savage CJ, Lilja H, Cronin AM, Ulmert D, Vickers AJ: Empirical estimates of the lead time distribution for prostate cancer based on two independent representative cohorts of men not subject to prostate-specific antigen screening. Cancer Epidemiol Biomarkers Prev 2010, 19(5):1201–1207. doi:10.1158/1055-9965.EPI-09-1251
25. Osypuk TL, Acevedo-Garcia D: Are racial disparities in preterm birth larger in hypersegregated areas? Am J Epidemiol 2008, 167(11):1295–1304. doi:10.1093/aje/kwn043
26. Vercambre MN, Fournier A, Boutron-Ruault MC, Clavel-Chapelon F, Ringa V, Berr C: Differential dietary nutrient intake according to hormone replacement therapy use: an underestimated confounding factor in epidemiologic studies? Am J Epidemiol 2007, 166(12):1451–1460. doi:10.1093/aje/kwm162
27. Moore LV, Diez Roux AV, Nettleton JA, Jacobs DR, Franco M: Fast-food consumption, diet quality, and neighborhood exposure to fast food: the multi-ethnic study of atherosclerosis. Am J Epidemiol 2009, 170(1):29–36. doi:10.1093/aje/kwp090
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.