Public transcriptome database-based selection and validation of reliable reference genes for breast cancer research

Background Quantitative reverse transcription-polymerase chain reaction (qRT-PCR) is the most sensitive technique for evaluating gene expression levels. Choosing appropriate reference genes (RGs) is critical for normalizing and evaluating changes in the expression of target genes. However, uniform and reliable RGs for breast cancer research have not been identified, limiting the value of target gene expression studies. Here, we aimed to identify reliable and accurate RGs for breast cancer tissues and cell lines using the RNA-seq dataset. Methods First, we compiled the transcriptome profiling data from the TCGA database involving 1217 samples to identify novel RGs. Next, ten genes with relatively stable expression levels were chosen as novel candidate RGs, together with six conventional RGs. To determine and validate the optimal RGs we performed qRT-PCR experiments on 87 samples from 11 types of surgically excised breast tumor specimens (n = 66) and seven breast cancer cell lines (n = 21). Five publicly available algorithms (geNorm, NormFinder, ΔCt method, BestKeeper, and ComprFinder) were used to assess the expression stability of each RG across all breast cancer tissues and cell lines. Results Our results show that RG combinations SF1 + TRA2B + THRAP3 and THRAP3 + RHOA + QRICH1 showed stable expression in breast cancer tissues and cell lines, respectively, and that they displayed good interchangeability. We propose that these combinations are optimal triplet RGs for breast cancer research. Conclusions In summary, we identified novel and reliable RG combinations for breast cancer research based on a public RNA-seq dataset. Our results lay a solid foundation for the accurate normalization of qRT-PCR results across different breast cancer tissues and cells. Supplementary Information The online version contains supplementary material available at 10.1186/s12938-021-00963-8.


Identification of candidate RGs based on a public transcriptomic database
Transcriptome sequencing data of 1217 breast cancer samples were obtained from the TCGA database. Next, 15,450 unigenes that were identified after processing were evaluated by Fragments Per Kilobase Million (FPKM) (high expression level, FPKM ≥ 10), coefficients of variation (CV) (low variability as determined by CV ≤ 40%), fold change (FC)-5% (the top 5% of 1217 samples divided by the lowest 5%, FC-5% ≤ 5), and dispersion measure (DPM) (DPM ≤ 0.3) values. The results for the different statistical algorithms, shown in Fig. 1, were as follows: (1) FPKM: A total of 4723 genes satisfied the requirement (30.57% of 15,450, the blue area in Fig. 1A). Gene overlap between the four algorithms was identified using a Venn diagram with 4-color blocks (blue, purple, green, and red), showing that 197 genes satisfied all four requirements (Fig. 1E). Of these 197 genes, 10 genes (SF1, TARDBP, THRAP3, QRICH1, TRA2B, SRSF3, YY1, DNAJC8, RNF10, and RHOA) were selected as novel candidate RGs due to their higher FPKM values and easier primers design. In addition, GUSB, TUBA1A, RPL13A, and B2M, which previous studies suggested being stable RGs in . The 197 RGs that met various criteria were identified by Venn diagram analysis (E), and among these, 10 genes were selected as the novel candidate RGs. FPKM, gene fragments per kilobase of exon model per million mapped reads; CV, coefficient of variation; DPM, dispersion measure; FC-5%, the fold change between the top 5% highest expression levels divided by the bottom 5% within the 1217 samples Page 4 of 19 Song et al. BioMedical Engineering OnLine (2021) 20:124 breast cancer research, and two classical RGs, ACTB and GAPDH, were also considered. These genes were ranked based on their CV values (shown in Table 1).

Primer specificity and amplification efficiency for qRT-PCR
A total of 20 paired primers including 16 candidate RGs and 4 target genes were designed for qRT-PCR experiments. The unigene name, ENSid, gene description, primer sequences, theoretical Tm (°C), product size, primer efficiency (E), and coefficient of determination (R 2 ) are summarized in Table S1. The primer efficiency for all 20 genes ranged from 90.0% for YY1 to 105.4% for DNAJC8, and correlation coefficients varied from 0.996 (ACTB) to 0.999 (B2M, YY1). All paired primers showed adequate specificity (Additional file 1: Fig. S1).

Ct values of candidate reference genes
The mean Ct values (average of 3 technical replicates) for all 16 RGs are shown in Fig. 2 and Additional file 4:  However, the Ct values of the breast cancer cell lines were overall lower than those of breast cancer tissues (Fig. 2B). A similar result of standard deviations was obtained in the breast cancer cells. To estimate the gene expression stability of these candidate RGs, more scientific algorithms will have to be introduced and used.

Expression stability of candidate reference genes
In this study, the qRT-PCR data matrix was analyzed using five differential algorithms: geNorm, NormFinder, BestKeeper, ΔCtmethod, and ComprFinder.

geNorm analysis
Gene expression stability was evaluated by the M value using geNorm analysis. This program determines the pairwise variation of each gene with all other analyzed genes under the same experimental conditions: the lower the M value, the more stable the gene expression. In the breast cancer tissue group, the three most stably expressed genes (with the lowest M values) were SF1, THRAP3, and TARDBP, while GAPDH, DNAJC8, and B2M were the least stably expressed genes ( Table 2). In the breast cancer cell group THRAP3, RHOA, and QRICH1 were the top three stably expressed genes, while B2M, TUBA1A, and ACTB were the least stably expressed genes (Table 3). Among all samples, TARDBP was the most stably expressed gene, followed by SF1 and QRICH1. Conversely, TUBA1A, B2M, and ACTB were the least stably expressed RGs (Additional file 5: Table S3).

NormFinder analysis
Based on variance analysis to calculate the stable value of each gene, a higher Nor-mFinder value indicates a less stably expressed gene. In the breast cancer tissue group, TRA2B, RHOA, and THRAP3 were the most stable genes, and DNAJC8, GAPDH, and B2M were the most unstable genes ( Table 2). In the breast cancer cell group, THRAP3, DNAJC8, and RPL13A were the three most stably expressed genes, while TUBA1A, B2M, and ACTB were the least stably expressed genes (Table 3). For all breast cancer tissue and cell line samples, THRAP3, RHOA, QRICH1 were the most stably expressed genes, and TUBA1A, B2M, ACTB were the least stably expressed RGs (Additional file 5: Table S3).

BestKeeper analysis
To further analyze the expression stability of the RGs, BestKeeper was applied, in which a lower standard-value indicates a more stably expressed RG. As shown in Table 2, in the breast cancer tissue group DNAJC8, RPL13A, and TARDBP were the most stably expressed genes, while ACTB, B2M, and GAPDH were the least stably expressed genes (shown in Table 2). In the breast cancer cell line group, THRAP3, RHOA, and QRICH1 were the three most stably expressed genes, while B2M, ACTB, and TUBA1A were the least stably expressed genes (shown in Table 3). For all samples combined, DNAJC8, RPL13A, and TUBA1A were the most stably expressed genes, while GAPDH, RNF10, and ACTB were the least stably expressed RGs (Additional file 5: Table S3).

ΔCt analysis
According to the ΔCt method, TRA2B, RHOA, and THRAP3 were the most stably expressed genes, while DNAJC8, GAPDH, and B2M were the least stable genes in the breast cancer tissue group (Table 2), which was consistent with the analysis according to NormFinder. In addition, THRAP3, TARDBP, and YY1 were the most stably expressed genes in the breast cancer cell lines, while B2M, TUBA1A, and ACTB were the least stably expressed genes (Table 3). For all samples combined, THRAP3, RHOA, and QRICH1 were the most stably expressed genes, while TUBA1A, B2M, and ACTB were the least stable RGs (Additional file 5: Table S3).

A comprehensive ranking of the four methods examined
The ComprFinder algorithm was carried out to obtain a final score which was used to rank the potential RGs. In the breast tumor tissue group, the 3 most stably expressed RGs were SF1, TRA2B, and THRAP3 (Table 2). In the breast cancer cell lines, THRAP3, RHOA, and QRICH1 were the most stably expressed RGs (Table 3). For all samples combined, we ranked the RGs from the highest to the lowest stability as follows:  Table S3). These promising results warranted further validation of the selected RGs.

Validation of the selected genes (1): comparison of target gene profiles when using different normalized RGs
To verify the reliability of the selected RGs, the expression profiles of MAPK3, MAPK9, FAAH, and HIF1A were determined in different breast cancer tissues and cell lines. Our results indicated that SF1, TRA2B, and THRAP3 were the top 3 stably expressed RGs in breast cancer tissues and that THRAP3, RHOA, and QRICH1 were the top 3 stably expressed RGs in breast cancer cell lines. Moreover, five genes (SF1, TRA2B, THRAP3, RHOA, and QRICH1) were the top 5 stably expressed candidate RGs in all samples. Therefore, we considered the multi-RG combination SF1 + TRA2B + THRAP3 + RHOA + QRICH1 as the most promising choice for breast cancer research (both in breast cancer tissues and cell lines). Thus, the multi-gene combinations including SF1 + TRA2B + THRAP3 + RHOA + QRICH1, SF1 + TRA2B + THRAP3, THRAP3 + RHOA + QRICH1, SF1 + THRAP3, THRAP3 + RHOA, and single RGs including SF1, TRA2B, THRAP3, RHOA, and QRICH1 were compared. In addition, ACTB, GAPDH, and ACTB + GAPDH were also used for comparison with the novel RGs. In total, 13 different multi-RG combinations or single RGs were assessed. For multiple gene combinations, the geometric average of their Ct value was calculated. The relative gene expression level was calculated as 2 −ΔCt , where ΔCt = Δ (Ct Target gene -Ct RGs ).
As shown in Fig. 3A, the expression of MAPK3 was significantly higher (P < 0.05) in HR + HER2− cancer tissue than in para-carcinoma tissue or benign tumor tissue when assessed by 5 or 3 multi-gene RG combinations. However, the expression pattern of MAPK3 changed when we used single or 2 multi-gene RG combinations, such as SF1 + THRAP3, SF1, RHOA, or QRICH1. Importantly, when we investigated the least . The error bars represent the SEM, and the independent-sample t-test was performed between any two groups, *P < 0.05, **P < 0.01, n = 6 for each BC tissue group, n = 3 for each BC cell strain. The patterns of target gene expression were different between those normalized by the most stable RGs and those normalized by the least stable RGs stably expressed RGs (ACTB, GAPDH, or ACTB + GAPDH), the expression of MAPK3 was significantly changed compared with the most stably expressed RGs. As shown in Fig. 3B, when using 3 or 5 multi-gene combinations, the expression level of the MAPK9 gene was higher in HR + HER2 − cancer tissue than in para-carcinoma tissue (P < 0.05), while there was no significant difference between para-carcinoma tissue and benign tumor tissue. This may lead to small errors when using single or 2 multigene combinations. For example, when the less stably expressed genes ACTB, GAPDH, or ACTB + GAPDH, were used for data normalization, the expression of MAPK9 did not show a clear expression trend compared with those of 3 or 5 multi-gene combinations.
The complete relative expression levels (2 −ΔCt ) of MAPK3, MAPK9, FAAH, and HIF1A genes normalized using all 13 types of single or multiple-RG combinations are listed in Additional file 6: Table S4 and Additional file 7: Table S5.

Validation of the selected genes (2): the relationship among different normalized RGs
Based on the method described in our previous study [28], the relationship among different normalized RGs was explored. As shown in Additional file 2: Fig. S2, there was a high correlation (R 2 from 0.815 to 0.979 in breast cancer tissues, and R 2 from 0.927 to 0.995 in breast cancer cell lines) between stable RGs and SF1 + TRA2B + THRAP3 + RHOA + QRICH1. There was also a moderate-to-high correlation (R 2 from 0.621 to 0.709 in breast cancer tissues, and R 2 from 0.600 to 0.916 in breast cancer cell lines) between unstable RGs and SF1 + TRA2B + THRAP3 + RHOA + QRICH1. There were few differences between the most stably expressed RGs and the least stably expressed RGs. Therefore, we performed additional analyses of their normalized efficacy, including a correlation analysis on the p-value yielded by the t-test analysis (see Method section).

The importance of reference genes
The qRT-PCR technique is one of the most valuable and reliable research tools to quantify the expression of a target gene under different experimental conditions. Proper use of RGs is necessary to get a reliable estimate of gene expression in different types of breast cancer tissues and cell lines to avoid detecting variations that are not cancer-specific [29][30][31]. Therefore, the selection of the appropriate RGs for breast cancer research is important when using qRT-PCR to quantify gene expression. Many studies use a single endogenous control for normalization, which can influence the statistical results and may lead to erroneous data interpretation [2,32]. In fact, in the present study, no single RGs were identified that were stably expressed in all tissues or cell types across different types of breast cancer [7,33,34]. Theoretically, RGs should be stably expressed in all samples, and their expression levels should be unaffected by the external environment, e.g., during tumorigenesis [35]. The selection and validation of RGs have to be corroborated by using a large number of samples [36,37]. To implement this idea, in this study we collected a large number (n = 87) of samples including 6 types of breast cancer tissues and 7 types of breast cancer cell lines. This allowed us to obtain strong results and conclusions. There was a great diversity of samples in our study for the following reasons: (a) both benign and malignant tumor types were chosen; (b) breast cancer samples following neoadjuvant chemotherapy were included; (c) the breast cancer cell lines included overexpression and knock-down groups. With the above caveats explained, we propose that we have identified combinations of RGs that have high applicability in breast cancer research and treatment.

Confirm the application of RGs
In our study, we used five algorithms to determine the stability of the expression of 16 candidate RGs across several different types of breast tumors and breast cancer cell lines. We found, that SF1 was the most stably expressed gene according to geNorm and ComprFinder but ranked fifth by NormFinder and fourth by Best-Keeper and the ΔCt method in breast cancer tissue samples. On the other hand, NormFinder and the ΔCt method recommended TRA2B as most appropriate for normalizing expression in breast cancer tissue samples. Surprisingly, THRAP3 was the most stably expressed gene according to all five algorithms in breast cancer cell lines.
The ideal reference gene shows a constant level of expression that does not vary by tissue or cell type and is not influenced by the treatment that is applied. However, numerous studies have shown that no gene is permanently and stably expressed under all circumstances. Therefore, reference genes must be evaluated for each breast cancer type and each experimental setup and multiple gene combinations must be used. Even for the same algorithm, the results varied between breast cancer tissues and cell lines. The top three genes for breast cancer tissues and cell lines were SF1 + TRA2B + THRAP3 and THRAP3 + RHOA + QRICH1, respectively, and therefore a total of 5 RGs (SF1, TRA2B, THRAP3, RHOA, QRICH1) should be considered. Unfortunately, determination of the expression of all five RGs simultaneously would require a lot of effort.
There are no specific literature reports prescribing how many candidate RGs should be used for qRT-PCR-dependent studies [38]. In particular, it is unknown which single or multiple gene combinations (SF1 + TRA2B + THRAP3 + RHOA + QRICH1, SF1 + TRA2B + THRAP3, THRAP3 + RHOA + QRICH1, SF1 + THRAP3, THRAP3 + RHOA, SF1, TRA2B, THRAP3, RHOA, or QRICH) should be used. Considering that our results indicate that the single gene performances of both novel and traditional RGs are not adequate, we propose that these types of studies should not be based on the use of single RGs, even if they are top-level RGs. The double gene combinations SF1 + THRAP3 and THRAP3 + RHOA showed similar gene expression profiles consistent with SF1 + TRA2B + THRAP3 + RHOA + QRICH1, SF1 + TRA2B + THRAP3, and THRAP3 + RHOA + QRICH1. However, the SF1 + THRAP3 combination behaved similarly to the 3 or 5-gene combinations except for the MAPK3 and MAPK9 expression. Meanwhile, the THRAP3 + RHOA combination behaved similarly to the 3 or 5-gene combinations except for the MAPK9 expression. Therefore, considering the need for normalization accuracy, double RGs are not the optimal choice either.
The expression pattern of target genes was the same when 3-gene combinations or 5-gene combinations were used and they can be applied to various factors in breast cancer research. However, 3 RGs is a more manageable number for normalizing qRT-PCR experiments than 5 RGs. Therefore, we recommend that SF1 + TRA2B + THRAP3 and THRAP3 + RHOA + QRICH1 be adopted as the RG combinations for breast cancer tissue and cell line research, respectively. In the case of studies including both breast cancer tissue and cell line research, the THRAP3 + RHOA + QRICH1 combination would be optimal. The previous RGs comparison The target genes that were used in this study are involved in different biological processes of breast carcinogenesis and metastasis. Particularly, tumorigenesis, proliferation, apoptosis, and invasion are associated with many genes and signaling pathways. For example, genes such as MAPK3 and MAPK9 encoding MAP kinases of the ERK signal pathway participate in transcription factor regulation of many biological processes [39,40]. Recently, novel results have indicated proteins that serve important roles during the process of cancer development. FAAH is a membrane-bound protein belonging to the serine hydrolase family of enzymes that plays a significant role in the termination of signaling of fatty acid amides (FAAs), a class of bioactive lipids, both in the central nervous system and in some cancer tissues [41]. Hypoxia-inducible factors (such as HIF1A) play an important role in the development of tumors, thus the study of these factors is indispensable for cancer research [42,43]. Therefore, to confirm the roles of these genes on the vital regulatory mechanisms in breast cancer, we compared the potential role of novel RGs (SF1, TRA2B, THRAP3, RHOA, and QRICH1) vs. traditional RGs (ACTB, and GAPDH) in the normalization of target gene expression.

Our proposal
In this study, we did not merely verify the use of conventional RGs, but also identified and selected more appropriate novel RGs for breast cancer research. Our results show, that the use of a single RGs should be avoided for breast cancer research. Similarly, the use of double RGs is not recommended. These findings are similar to what has been suggested in most of the studies using transcriptomic datasets [44]. As far as we know, only one previous study reported on the role of RGs in the normalization of breast cancer gene expression studies. The previous studies of RGs used traditional RGs, and other breast cancer studies were also based on traditional RG [3,7]. In the present study, a large number of biological samples were provided for determination and validation, and multiple algorithms were used for evaluation, with the RNA-seq dataset being used for prediction and selection. Therefore, in terms of both the number and quality of RGs, this study is a significant step forward from previous studies. Our results suggest that the recommended number of RG is at least three for breast cancer tissues or cell lines. Nevertheless, these promising results require further verification of target genes in order to obtain more reliable data sets.

Limitations and future research suggestions
Although this study was based on a large amount of transcriptome data to predict the new RGs, and a large number of breast cancer samples were used for confirmation and verification, we still cannot guarantee that our research results apply to all breast cancer types, especially those rare disease types, such as medullary breast carcinoma. In addition, mutations of gene expression always exist, and the number of samples in our study was limited. Therefore, our final recommendation may not be an absolute perfect choice, but a relatively better choice. Sequencing technology is widespread with the development of genomics, and large amounts of accumulated data need to be interpreted from a multi-disciplinary perspective in order to choose suitable RGs [45]. Some emerging technologies and methods for data mining require us to borrow and learn, such as the multi-objective Particle Swarm Optimization [46], and the meta-heuristic optimization algorithm [47]. In future work, more novel algorithms need to be developed for explaining how normalization affects breast cancer expression data gathered by qRT-PCR, which will allow us to improve the accuracy and standardization across study systems.

Conclusions
In this study, we tested 16 different candidate RGs in six different breast cancer tissues and seven breast cancer cell lines, using five different statistical algorithms for evaluation.
Our results indicate that SF1 + TRA2B + THRAP3 and THRAP3 + RHOA + QRICH1 are promising RG combinations for efficient gene normalization under different conditions. Furthermore, the availability of these RGs and the stability of their expression in various tumor tissues and cells will allow performing future studies focusing on genes essential for breast cancer biology, and choosing a reliable and appropriate RG combination will allow more accurate assessments of differential gene expressions in breast cancer research.

Breast cancer tumor
Breast tumor and para-carcinoma tissues were supplied by the Breast Tumor Biobank of the Three Gorges Hospital Affiliated with Chongqing University. Fresh tissues were obtained from patients with written informed consent and with permission of the Three Gorges Hospital Affiliated with Chongqing University Clinical and the Laboratory Research Ethical Council. All tissues were stored frozen at -80 ℃ after pathologic evaluation. We collected a total of 66 tissue samples including benign tumor tissues (n = 6), as well as tissues from four subtypes of breast cancer including HR + /HER2 − (n = 6), HR + /HER2 + (n = 6), HR −/HER2 − (n = 6), HR −/HER2 + (n = 6), and their paired para-carcinoma tissues (n = 6 each) from 24 patients who were diagnosed with breast cancer and from 6 patients who were diagnosed with breast cancer and then were treated with NAC before surgery. The para-carcinoma tissue samples had been taken from outside of the histopathological tumor border (3 cm) in the same excisional biopsy specimen. The clinical patient information is shown in Additional file 10: Table S8. . Moreover, we have constructed the MDA-MB-231 cell lines overexpressing CNR2 or CNR2 knock-down using lentiviruses (Genechem, Shanghai, China). All culture media were supplemented with 20U/mL penicillin, 100 mg/mL streptomycin, and 10% heat-inactivated fetal bovine serum (FBS, Gibco, Australia). Cells were grown at 37 ℃ in a humidified atmosphere including 5% CO 2 . At the end-point of each experiment, the final pH of the supernatant was always measured by a digital pH-meter (pH 301, HANNA Instruments, USA).

Total RNA extraction and cDNA synthesis
Total RNA was isolated with RNAiso Plus (Takara, Dalian, China) using the phenol-chloroform method. Extracted RNA was quantified using Nanodrop One (Ther-moFisher, Wilmington, USA) and its integrity was checked on a 1% agarose gel. Only RNA samples with A260/A280 ratios between 1.9 and 2.2 and A260/A230 ratios greater than 2.0 were used for cDNA synthesis. Total RNA (1 μg) was reverse-transcribed into cDNA using random primers or an oligo dT primer using a PrimeScript RT reagent Kit with gDNA Eraser (Takara, Dalian, China), according to the manufacturer's protocol [48]. All cDNA samples were diluted 1:8 with RNase-free water and stored at -20 ℃.

Selection of candidate reference genes
The transcriptome sequencing dataset of 1217 breast cancer samples was downloaded from the TCGA database (https:// www. curel ine. com/ the-cancer-genome-atlas. html) (Fig. 5A). After obtaining the gene fragments per kilobase of exon model per million mapped reads (FPKM), transcripts that exhibited low levels (FPKM = 0 appearing over