The MalaCards disease universe
To help overcome the impediment of disease name unification stemming from source heterogeneity, we obtained 85,000 disease terms from 15 sources that were examined in a predefined order of importance, and used text unification heuristics to define 19,289 main names and their associated 65,000 aliases. The 15 annotative sections for each disease also include disease-related summaries and publications, disease categories and classifications, symptoms, gene variations, drugs, clinical trials, genetic tests and animal models. Several MalaCards sections (summaries, symptoms and aliases) are incorporated into the GeneCards database. This allows the content of such sections to be searchable in the context of a gene-centric database.
Additionally, MalaCards defines a hierarchical disease relatedness scheme, which includes aliases, disease families, related diseases and disease SuperPaths (Fig. 1). SuperPaths are integrated biological pathways from 12 information sources, shown in the PathCards database, a member of the GeneCards suite [14]. This hierarchical disease relatedness scheme constitutes a major MalaCards strength for NGS interpretation, allowing one to gradually augment the sphere of disease definition pertinent to the genes of interest. Currently, the first level (aliases), is available to the GeneCards search index, and therefore also to VarElect (see below). Higher relatedness levels are in the process of being implemented within GeneCards. Further, related diseases form a basis for defining a disease network which we are now using as a platform for disease neighbourhood scrutiny based on MalaCards disease annotations, such as drugs, symptoms and anatomical context.
The gene-disease network
Crucial for MalaCards’ NGS interpretation capacities is its well-defined scored gene-disease network. We have generated such a network based on MalaCards data. This network has two types of nodes—genes and diseases, with only one type of edge—that which connects a disease to a gene (Fig. 2). The network is weighted and directional; hence, each pair of nodes is associated with two types of scores: D-G signifies the importance of a gene for the disease, and G-D—the association strength of a disease to the gene. D-G is displayed in the genes section of MalaCards [6, 14] and G-D is portrayed in the disorders section of GeneCards. The latter is also portrayed within the “disease-gene” table affiliated with GeneCards.
A conservative level of gene-disease connections is essential for enhancing the NGS interpretation specificity of VarElect. This prompted us to define three levels of association strength in the gene-disease network (Fig. 3). When considering all disease-gene relationships in our knowledge-base, i.e. all cases in which D-G >0, a very large network is obtained, connecting 22,280 genes to 11,580 disease entries, with some diseases having as many as 5000 associated genes. To provide more specific information, we applied filtration heuristics to reduce the number of gene disease connections (e.g. removing publications that provide more than 5 gene-disease associations, typically large scale studies which are more prone to providing noisy data). This reduced the connected gene count to 10,615, with no more than 300 genes per disease, and with nearly all diseases having 25 genes or fewer each. The number of gene-disease associations was reduced to 14.6%, involving 47% of the genes prior to filtration. Our gene-disease network is based on this connectivity level. Finally, it was deemed necessary to highlight gene-disease relations that are curated and evidence-based, for which reason we defined elite genes as the associations coming from curated sources such as OMIM and Orphanet. The elite gene-disease network has 4500 genes and 8000 diseases, with 99% of diseases having 10 or fewer genes (Fig. 3). The disease nodes of this reduced network are termed “elite diseases”, and its edges are likewise termed “elite edges”. An indication of the effectiveness of the data/text mining sources is that after the filtering step they still retain 75% of the elite edges in the network.
The VarElect NGS interpretation tool—direct gene to phenotype associations
NGS analyses identify non-reference variants in a specific subject. The total count of such variants ranges from tens of thousands for whole-exome sequencing to hundreds of thousands for whole-genome sequencing. Typically, only one or just a few variants are expected to be significant for the relevant disorder. In a filtering stage, parameters such as genetic model, rarity in the population, predicted protein impact, and gene evolutionary conservation help shorten the variants list to a few hundred variants, or even just a few dozen variants. Further focusing towards the identification of the causative disease genes requires NGS interpretation, i.e. seeking relationships between a variant-harboring gene and specific phenotype/disease terms. This is sometimes done manually, by consulting numerous heterogeneous databases. An alternative pipeline is to take advantage of a computerized knowledgebase. The GeneCards suite, including GeneCards, MalaCards and PathCards (the database of human biological pathways) [9] offers an effective way to do the latter, with a wealth of automatically-mined information, integration algorithms, as well as powerful unconstrained search and scoring capabilities linked to the VarElect NGS interpreter [11] (Fig. 4).
VarElect, as well as TGex—a broader VarElect-based NGS annotation and interpretation platform [11], infer direct and indirect links between genes, and diseases/phenotypes. The direct mode relies on the combined power of GeneCards and MalaCards, i.e. their shared gene-disease network, as well as their textual association capacities. The evidence for the obtained gene-phenotype relations is portrayed via the “MiniCards” mechanism originally instituted in the GeneCards search [5], and which displays the hit context within the card, with hyperlinks to the source databases. VarElect compares favorably with several popular NGS phenotyping tools [11], and it therefore provides an efficient way of ranking NGS-inferred genes by the strength of relatedness to disease keywords.
To date, VarElect has helped solve ~20 clinical cases in our own laboratory (e.g. [16–19]). In one such case, we studied a 6 year-old boy of Bukharian ancestry with atypical epilepsy combined with retinitis pigmentosa. Following whole-exome sequencing of the patient and his two healthy parents, we identified 81 rare homozygous variants in the patient, which were in a heterozygous state in both parents. We submitted this “medium-sized list” of 63 mapped genes along with the phenotype search terms; ‘epilepsy OR macular OR retinitis’ to VarElect. The resulting analysis placed CLN6 first with a long margin (Fig. 4). The patient had a homozygous missense variation (V148D) in this gene with zero population frequency, and protein impact genomic evolutionary rate profiling (GERP) score [20] of 5.14 (“highly damaging”). The mutation has not been previously reported in the context of this specific syndrome. Further functional studies are being performed to prove causality. Importantly, the result would not have materialized with GeneCards data or MalaCards data alone, underscoring the importance of their joint contribution to VarElect. Following this discovery, the patient was clinically diagnosed with accuracy, enabling appropriate genetic counselling and preimplantation diagnosis for the family in the event of future pregnancies.
Indirectly-inferred connections in VarElect
In the indirect (or “guilt by association”) mode, VarElect can capitalize on the GeneCards suite’s varied gene-to-gene relationships to identify the relevance of NGS-derived genes that have no relationship to the entered phenotype terms on their own (Fig. 4). A major contribution to this comes from protein–protein interactions, as well as integrated pathway information from PathCards [8].
Pathway databases represent collections of genes and their interactions, mapped onto biological processes. Most of the information relevant to pathway definition comes from high-throughput protein–protein interactions [21–23], and from specific studies of cellular function having different emphasis and coverage [24]. There are significant inconsistencies caused by author preferences, and incompatibilities in data formats. Thus, pathways often provide an idiosyncratic view of biological mechanisms. PathCards was constructed to address such challenges, by aggregating 3215 pathways from 12 sources into 1073 SuperPaths, thus reducing redundancy, and maximizing informativeness [8].
VarElect’s indirect mode may further avail itself of the GenesLikeMe suite member (formerly GeneDecks Partner Hunter [13]). This tool relates genes to each other by shared attributes. These include sequence paralogy, GO terms, protein domains, mouse phenotypes, publications and tissue expression patterns. RNA expression patterns are derived from adapting GTEx data [25], and protein expression patterns stem from our human integrated protein expression database (HIPED), encompassing the integration of 4 proteomic data sources [26]. In VarElect, GenesLikeMe facilitates the generation of scored gene-to-gene matrices based on user-selected weighted combinations of attributes.
A good illustration of the strength of this approach is the case of a family diagnosed with systemic capillary leak syndrome. As previously reported [11], VarElect identified a promising candidate gene for this condition (TLN1), which was indirectly related to the phenotype through implicating genes associated with the phenotype.
Disease-based pathways
In PathCards, pathways are defined by biological insights regarding the functionalities of sets of genes. We note that PathCards regards each pathway as a “bag of genes” without regard to topological features such as “gene A activates gene B”. We have explored the use of this kind of simplified yet highly useful “compositional” view to delineate an alternative definition of pathway boundaries that reflect the disease-gene network. We propose to regard all of the genes for a given disease as a “disease pathway” (Fig. 5a). This is done by using the MalaCards integrated and filtered list of genes for each disease. While PathCards already includes a small number of pathways named after a disease, such as “Parkinson’s Disease Pathway”, our endeavour is much more comprehensive, defining thousands of novel pathway-like entities. We note that genes associated with a specific disease may be completely unrelated molecularly. In other words, genes belonging to a disease pathway seldom belong to the same standard biological pathways (Fig. 5b). Thus, in the realm of VarElect, a tool aimed at disease-gene relations, such newly inferred gene-to-gene relations may be beneficial for NGS disease interpretation. An example is as follows: In VarElect, if the search term is a disease name, all genes related to the disease in our disease-gene network will become hits. On the other hand, if one searches with a phenotype term, the indirect mode kicks into action. This shows not only the genes directly related to the term, but also all genes that reside within the same disease pathway shared with a gene associated with the phenotype (Fig. 4).
To further enhance the gene-to-gene relations gleaned through disease pathways, we used the PathCards algorithm [8] in order to coalesce disease pathways into disease SuperPaths (Fig. 6). The power of such action is in casting a wider net, whereby genes that do not share any individual disease pathway become related through the sharing of a disease SuperPath. Such a process may also be viewed as a means for augmenting the scope of disease relatedness (Fig. 1). Disease pathways become connected by mutual gene sharing (signifying an edge in a disease SuperPath), thus defining a collection of disease SuperPaths (Fig. 6 bottom left). This brings about a situation of indirect disease relations. This is exemplified in the specific case shown in Fig. 6 central part, whereby Gaucher disease type iiic and Gaucher disease type 3 have overlapping disease pathways (i.e. are related by gene sharing, Fig. 6 top right), signified by having a direct edge between them. The same is true for Gaucher disease type 3 and chitotriosidase deficiency. But significantly, Gaucher disease type iiic and chitotriosidase deficiency are only indirectly related (Fig. 6): they are 2 edges away, thus they share no associated genes. But they are linked via the fact that there is a third disease pathway (Gaucher disease type 3) that significantly shares genes with both.
Preparing for whole-genome sequencing analyses
GeneCards encompasses a comprehensive collection of 101,976 ncRNA entries, integrated from 15 different data sources [3]. An integration algorithm is used, based on mapping to genomic coordinates via GeneLoc, another GeneCards suite tool [9]. VarElect can make use of this wealth of information to assist in identifying the involvement of ncRNA gene variations in disease.
Addressing transcription regulation, GeneCards is in the process of introducing regulatory element entries, with a special focus on the rather uncharted realm of distant-acting enhancers. This is based on the Ensembl regulatory build [27] as well as several other sources. GeneCards now displays a UCSC custom track showing all genes along with regulatory elements, so that users can inspect and judge relationships among them. In parallel, we are constructing a probability-based model for these associations, which will provide scores for regulatory element variations in relation to individual genes, allowing pipeline analyses by VarElect of variants residing outside the exome [28].