Development and validation of a RNA binding protein gene pair-associated prognostic signature for prediction of overall survival in hepatocellular carcinoma

Background Increasing evidence has demonstrated the correlation between hepatocellular carcinoma (HCC) prognosis and RNA binding proteins (RBPs) dysregulation. Thus, we aimed to develop and validate a reliable prognostic signature that can estimate the prognosis for HCC. Methods Gene expression profiling and clinical information of 374 HCC patients were derived from the TCGA data portal. The survival-related RBP pairs were determined using univariate cox-regression analysis and the signature was built based on LASSO analysis. All patients were divided patients into high-and low-risk groups according to the optimal cut off of the signature score determined by time-dependent receiver operating characteristic (ROC) curve analysis. The predictive value of the signature was further validated in an independent cohort. Results A 37-RBP pairs signature consisting of 61 unique genes was constructed which was significantly associated with the survival. The RBP-related signature accurately predicted the prognosis of HCC patients, and patients in high-risk groups showed poor survival in two cohorts. The novel signature was an independent prognostic factor of HCC in two cohorts (all P < 0.001). Furthermore, the C-index of the prognostic model was 0.799, which was higher than that of many established risk models. Pathway and process enrichment analysis showed that the 61 unique genes were mainly enriched in translation, ncRNA metabolic process, RNA splicing, RNA modification, and translational termination. Conclusion The novel proposed RBP-related signature based on relative expression orderings could serve as a promising independent prognostic biomarker for patients with HCC, and could improve the individualized survival prediction in HCC.


Background
Hepatocellular carcinoma (HCC) was the most common primary malignancy of the liver and its incidence rate is increasing [1]. It has been commonly known that many risk factors contribute to HCC carcinogenesis and progression, including chronic hepatitis B virus (HBV)/hepatitis C virus (HCV) infection, diabetes mellitus, alcohol abuse, obesity, nonalcoholic fatty liver disease, metabolic diseases, autoimmune hepatitis, and exposure to dietary toxins such as aflatoxins [2]. Recently, it has reported that the incidence of HCC all over the world is highly heterogeneous owe to different risk factors, and that most HCC patients occur in South-eastern Asia and Saharan Africa, where HBV infection is the leading risk factor [3]. Despite the rapid progress over the past few decades in earlier diagnosis and treatment of HCC, the long-term prognosis remains poor, and the 5-year survival rate remains below 20% [4]. Surgery resection remains the most effective treatment of HCC, and it has markedly improved the overall survival (OS) of HCC patients. However, the long-term survival rate is still low [4,5]. The classic tumor-nodemetastasis (TNM) staging, as a classic prognostic model, helps predict HCC prognosis and is widely used in current clinical practice [6]. However, the predictive efficacy of TNM model is still far from satisfying. HCC is a highly heterogeneous malignancy with substantially variable clinical outcomes, the prognoses of patients with the same TNM stage may varied due to inherent clinical and molecular diversities, and even among patients with HCC who are diagnosed as the same TNM stage and received similar clinical management, survival outcomes are various, suggesting that TNM provides incomplete prognosis information [7,8]. Therefore, novel valid and robust prognostic signatures are indispensable to improve risk prediction and offer better information for guiding personalized therapy.
RNA-binding proteins (RBPs) are inherently pleiotropic proteins, which bind RNA through one or more spherical RNA-binding domains and control RNA stabilization, degradation and modification at the post-transcriptional level [9]. So far, more than 1,500 RBP genes have been confirmed by genome-wide screening in human genome [10]. Emerging evidences have demonstrated that RBPs are critical in the regulation of human cancer progression by influencing multifaceted cellular functions [11,12]. RBPs play a vital role in post-transcription and dysfunction of RBPs expression is closely related to various diseases including cancer and immune disorders [13]. Considering the importance of post-transcriptional regulation of RBPs, aberrantly deregulated RBPs are closely related to the occurrence and progression of cancers. To best of our knowledge, very few RBP based signatures have been established for HCC prognosis prediction.
With the rapid development of sequencing and precision medicine, increasing evidence indicated that gene signatures at mRNA level had promising potential in predicting HCC prognosis. Numerous studies have proposed various signatures for survival prediction in patients with HCC [14][15][16][17][18][19]. For example, Liu et al. [20] have established a novel six-gene signature for HCC prognosis prediction, but only one external validation cohort was used to validate the performance of the predicted model without comparison of its performance with other existed biomarkers. Li et al. developed and identified a seven-gene signature related to the DNA repair process to predict survival in HCC, but without any external validation the performance of the predicted model [21]. However, none of these signatures have yet been widely accepted in routine clinical practice because of technical limitations and evaluation difficulties. Recently, a novel established method based on the within-sample relative expression orderings of gene pairs has been proposed. It can overcome the disadvantages of gene expression data normalization and scaling, and has yielded robust results in various studies [22,23].
Therefore, based on 1542 RBP genes [10], we used two cohorts to develop and validate an individualized prognostic signature for HCC. We compared this signature with other previous prognostic signatures to validate the predictive effectiveness and accuracy of the novel signature.

Prognostic signature construction
The analysis process of present study is shown in Fig. 1. A total of 647 RBP genes were common among two datasets, and 23,353 RRGPs were constructed in two cohorts. Using univariate Cox regression analysis, we identified 581 prognostic RRGPs that were significantly associated with patient OS (P < 0.05). To determine the optimal model for predicting prognosis, the prognostic RRGPs were used to build prognostic signature by using LASSO penalized Cox regression on the TCGA cohort. The risk score was calculated as combination of gene' expression values weighted by regression coefficient derived from LASSO regression. After 1000 iterations, we identified the 37 gene pairs to construct an RRGP risk signature (Fig. 2). The 37-RRGP prognostic signature information is shown in Table 1.

Validation and assessment of the novel signature
The clinical data of patients in the internal validation group and the external validation group is displayed in Table 2. We calculated the risk score for each patient as mentioned   previously in the two groups. The optimal cut-off of the signature risk score for separating patients into the high-or low-risk groups was set at -0.241 using time-dependent ROC curve analysis at 5 years (Fig. 3). It was revealed that the high-risk group presented a poorer OS than the low-risk group (P < 0.001; HR = 6.43, 95%CI 4.15-9.96) (Fig. 4a). The patients in the ICGC cohort were classified into high-and low-risk groups based on the same cutoff value in the TCGA cohort. It was confirmed that high-risk group exhibited a poorer OS than the low-risk group (P < 0.001; HR = 6.87, 95%CI 2.43-19.38) (Fig. 4b). The Kaplan-Meier curves indicated that the risk score was a stable prognostic marker for patients with HCC stratified by age (< 60 or ≥ 60), sex (male or female), stage (I-II or III-IV), and grade (I-II or III-IV) ( Fig. 4c-j). Furthermore, the AUC values of the prognostic signature risk score for the 1-year, 3-year and 5-year OS using the time-dependent ROC curves in the TCGA cohort were 0.83, 0.86, and 0.83, respectively (Fig. 5a). As expected, the robust predictive performance of the signature was further validated in the ICGC cohort with the 1-, and 3-year survival rates of 0.75 and 0.73, respectively, (Fig. 5b), which demonstrated that the predictive ability of our prognostic signature was robust and accurately. To further explore the prognostic power of the signature for other clinical factors, univariate and multivariate Cox proportional hazards regression analyses were used to the TCGA cohort. The univariate and multivariate Cox regression analysis showed that the novel signature risk score was an independent prognostic factor for predicting OS of HCC after adjustment for by age, gender, grade, and stage in the TCGA cohort (HR = 9.492, 95%CI 6.196-14.539; P < 0.001, Fig. 6a, b), and further confirmed in the external cohort (HR = 2.680, 95%CI 1.424-5.046, P = 0.002, Fig. 6c-d).

Comparison with other established prognostic signatures
To further confirm the prognostic performance of the model, we built prognostic risk models of age, sex, stage and grade and compared these models with the novel a b  signature risk score. The 37-RRGP signature achieved a higher predictive accuracy than the risk models of age, sex, grade, stage, and their combined models with C-index of 0.799 (Fig. 7). The RBP prognostic signature was verified as a robust complement    to clinical and pathological factors for the prognosis assessment of HCC patients with C-index of 0.803. We further compared the novel established signature with twelve published molecular signatures [16][17][18][19][20][21][24][25][26][27][28][29], which were all prognostic signatures for HCC survival prediction. More importantly, the C-index of the prognostic model yielded much higher value than that of other risk models (Fig. 7). Thus, the novel prognostic signature was more effective than the previous signatures in the prognosis of HCC patients.

Biological processes related to the RBP signature
Functional enrichment of the RBP genes relevant to the signature in the TCGA cohort were mostly involved in RNA modification, translation, translational termination, nuclear-transcribed mRNA catabolic process, RNA splicing, ribonucleoprotein complex biogenesis, and ncRNA metabolic process (Fig. 8). The enrichment of related biological processes provided evidence of molecular mechanisms affected by the RBP signature and, therefore, contributed to predict the prognosis of HCC.

Discussion
HCC is a heterogeneous solid malignancies and the prognosis of HCC mainly depends on the degree of surgical resection and intervention [18]. Therefore, it's important to accurate predict survival and adopt novel therapy methods in time for the patients at high risk of mortality. Microarray technology, which is a highly efficient and accurate transcriptional expression technology, has been successfully used in the identifying of molecular biomarkers of nearly all human malignant cancers. With the rapid developments in gene chips as well as high-throughput sequencing, gene signatures based on mRNA expression levels have exhibited significant potential in predicting HCC prognosis. Many risk-coefficient models based on a multigene mRNA expression signature has been reported and confirmed to be an independent prognostic predictor for OS in HCC and could classify patients into high-and low-risk group with notably different OS [14, 18-21, 24, 27]. Although a lot of multigene prognostic signatures have been developed for HCC patients, but the accuracy of their prognosis predictions remains far from satisfying [14-18, 24, 25]. Therefore, there is an urgent to build robust prognostic signatures to predict the survival of patients with HCC. In the last years, post-transcriptional regulation of gene expression has stimulated much interest in studies focused on RBPs and the interactions with their RNA targets [30]. Previous studies have demonstrated that RBPs were abnormally expressed in different cancer types, which affected the translation of mRNA into protein, and were indeed involved in carcinogenesis [31]. For example, SERBP1 (Serpine1 mRNA-binding protein 1) is a member of the RG/RGG family of RNA-binding proteins and has been recognized as a novel oncogenic factor in glioblastoma, and the high SERBP1 expression correlates with poor patient survival and adverse response to chemo-and radiotherapy [32]. RNAbinding protein CELF1 contributes to migration, invasion, and chemotherapy resistance by targeting ETS2 in colorectal cancer and could be a potential diagnostic and prognostic marker [33]. However, little attention has been paid to the molecular characteristics of RBP genes interaction in HCC. Therefore, there is a need to develop and validate of a robust RBP genes signature for prediction of the prognosis of HCC.
In this study, we used the within-sample relative expression orderings to develop a signature based on 37-RRGP for HCC patients and validated its accuracy and effectiveness in an independent cohort via comprehensive bioinformatics methods. Univariate Cox regression analysis was used to identify RRGPs significantly associated with OS in HCC. We found that 581 gene pairs were associated with OS, with p < 0.05. We identified the 37 RRGPs that predicted survival in patients with a LASSO penalized Cox regression on the TCGA cohort. These gene pairs were used to derive a signature risk score. Based on the cutoff of the risk score, patients could effectively classify into low-risk and highrisk groups with distinct outcomes. Furthermore, the same conclusion was reached in survival analysis stratified by age, sex, stage, and grade. We performed univariate and multivariate Cox regression analysis to investigate the combined ability of the signature risk score and other clinicopathological factors to predict survival. It was revealed that the novel signature risk score may be an independent prognostic factor for patients with HCC. Besides, an independent dataset from ICGC was used as validation set to ensure the robustness of our results. The results confirmed that the risk score is a stable, independent prognostic indicator and indicates that the risk score could be of important significance for patients with HCC as an effective clinical classification tool. However, most of existing signatures have not been widely used in clinical practice, which may owe to multiple factors. First, many traditional prognostic signatures failed to validate their findings in another independent cohort [17,21]. More importantly, the diversity of data also represents a major challenge, and data processing across various sequencing platforms require suitable normalization and elimination of the batch effects. In addition, the cutoff in previous signatures could not be used across multiple datasets. These disadvantages severely limit their clinical application. In this study, we performed gene pairwise analysis to identify reliable biomarkers for prognosis of HCC based on the within-sample relative expression orderings of genes, which are robust against to experimental batch effects [34,35]. Using this algorithm, the bias caused by gene normalization was eliminated. Furthermore, the signature and cutoff value could be used across different datasets, which was an important advantage in our study. In addition, the RBP-based prognostic signature identified in the present study performed well compared with twelve existing prognostic signatures, which was another vital advantage in our study. Furthermore, the time-dependent ROC analysis showed that it performed well in 1-, 3, and 5 years for HCC OS prediction. All these results demonstrated that the novel signature could provide an accurate prognosis of patients with HCC. The RBP genes involved in the signature were mainly involved in RNA modification, translation, translational termination, RNA splicing, nuclear-transcribed mRNA catabolic process, ribonucleoprotein complex biogenesis, and ncRNA metabolic process. Previous studies revealed that RBPs can bind to their target RNAs in a structure or sequence-dependent manner to form ribonucleoprotein complexes that regulate processes ranging from mRNA stability to RNA processing, splicing, localization, export, as well as translation at the post-transcriptional level [36]. Post-transcriptional regulation by non-coding RNAs has been reported involved in RBP expression in cancer. For instance, the expression of HuR protein is antagonized by miR-519 and miR-125a, remarkably inhibiting the proliferation of colon, cervical, breast, and ovarian carcinoma cells [37,38]. It was revealed that ribonucleoprotein granule is a crucial region that executes protein biosynthesis. The alteration of ribonucleoprotein influences the translation processing and related to cancer progression [39]. RNA binding protein SERBP1, which regulates mRNA translation, has been identified as a target of the tumor suppressor miR-218 in HCC and confirmed associated with cell migration/invasion and epithelial mesenchymal transition [40]. A recent study revealed that methyltransferase-like 3 (METTL3) suppressed the expression of morphological effect on genitalia 1 (SMG1) through m6A modification-mediated miR-873-5p up-regulation, therefore, serving an oncogenic role in HCC [41]. The number of HCC cells is significantly reduced when RNA binding protein NSUN6 is overexpressed, revealing that the occurrence and development of LIHC is related to alterations of tRNACys and tRNAThr biogenesis [42]. These evidences suggested that the signature may be concerned with HCC-related biological pathways and their functional dysregulations may be closely related to the survival of HCC.
Nevertheless, some notable limitations must be acknowledged. The main limitation of our findings is its retrospective nature, and more prospective studies should be conducted to validate the findings. Moreover, the underlying mechanism through which the identified RBP genes contribute to the initiation and progression of HCC still requires further evaluated using RT-PCR or IHC.

Conclusion
We developed and validated of a novel robust prognostic signature based on relative RBP genes expression orderings, which could serve as a promising independent prognostic biomarker for patients with HCC. The novel signature could improve the individualized outcome prediction in HCC. The RBP-related signature provided new insights into the identification of HCC patients with a high risk of mortality.

HCC data sources and data preprocessing
Level-three transcriptome RNA-sequencing data (HTSeq-FPKM) of 374 primary HCC samples, as well as their clinical follow-up information were downloaded from the TCGA data portal (https ://cance rgeno me.nih.gov/). Another RNA-seq dataset of 240 primary HCC patients together with corresponding clinical information were accessed from the International Cancer Genome Consortium database (ICGC, https ://dcc.icgc. org/, LIRI-JP), which was used as cohort for external validation of the signature. For the TCGA cohort, the expression values at probe level (probe ID) were converted to the corresponding gene symbol according to the annotation files without further standardization. When several probes matched to an identical gene symbol, the mean value was calculated as the expression value of this gene. We matched the ID numbers of the patients with their corresponding mRNA expression profile and clinical data and excluded patients whose ID numbers failed to match. Only selected patients with complete overall survival (OS) information were used for further analysis. We downloaded 1542 RBP genes as mentioned above. The RBP genes expression matrixes were further extracted from the two publicly datasets, respectively.

Prognostic signature construction
Before formal statistical analysis, we first filtered out RBPs measured on all the platforms with relatively high variation (determined by median absolute deviation > 0.5) to decrease the false discoveries [43,44]. The gene expression levels were compared pairwise in a given sample or profile to compute a score for each RBP-related gene pair (RRGP). In a pairwise comparison, if the expression value of the first RBP gene was higher than that of the second one, the output score of this RRGP was 1; otherwise, the output was 0, according to the proposed method [15,22]. The advantage of analyzing genes in a pairwise comparison is without the requirement for standardization process for individualized analysis. A summary of 23,353 shared RRGPs in two datasets were included as the initial candidate factors for prognosis prediction. Univariate analysis was performed to identify prognostic RRGPs associated with OS (P < 0.05). The RRGPs was further reduced while maintaining high accuracy. Therefore, LASSO penalized Cox regression was used to establish a more stable prognostic signature using an R package glmnet after 1000 iterations with tenfold cross-validation. Subsequently, the risk score of the prognostic RBP signature for each sample was calculated by the relative expression level of RRGPs with weighted by the estimated regression coefficient derived from the LASSO regression model. Risk score = (Exprgenepair-1 × Coefgenepair-1) + (Exprgenepair-2 × Coefgenepair-2) + … + (Exprgenepair-n × Coefgenepair-n). The patients were divided into high-and low-risk groups according to the RRGPs score cutoff, which was determined by a time-dependent ROC curve at 5 years.

Validation and assessment of the novel signature
We adopted the nearest neighbor estimation (NNE) method to plot the ROC curve. Time-dependent ROC with AUC at 1, 3, and 5 years was used to explore the prognostic accuracy in both cohorts. The OS difference between the low-risk and high-risk groups were evaluated by the log-rank test and Cox regression analysis. We then integrated the signature risk score with existing clinical and pathologic variables for multivariate Cox regression analysis. The Kaplan-Meier survival curves in patients with HCC stratified by sex, age, grade, and stage were used to further validate the performance of the prognostic signature. In addition, we compared the prognostic signature with other clinicopathological features and twelve recent established prognostic signatures by the Harrell's concordance index (C-index) with 1,000 bootstraps resamples in the TCGA cohort.

Functional enrichment of the RBP genes in the prognostic signature
To understand the underlying biological mechanisms of the novel RBP-related prognostic signature, functional enrichment analysis was conducted among the 61 unique genes using the Metascape database, which is a biologist-oriented resource for the analysis of systems-level datasets [45].

Statistical analysis
Survival curves were generated using the Kaplan-Meier method and the differences in survival curves were compared by the log-rank test using the 'survival' package. Multivariate analyses were conducted using the Cox proportional hazards regression model and hazard ratios (HR) with their 95% confidence interval (CI) were calculated. The ROC curves were performed by an R package "survivalROC". The R package rms was used to compare other models with the RRGP prognostic signature. A p-value < 0.05 was considered to be significant. All statistical analyses were performed using R (version 3.6.3; https ://www.r-proje ct.org/).