Learning to rank diversified results for biomedical information retrieval from multiple features
© Wu et al.; licensee BioMed Central Ltd. 2014
Published: 11 December 2014
Unlike traditional information retrieval (IR), diversity-oriented IR considers the relationships between documents in order to promote novelty and reduce redundancy, thereby providing diversified results that satisfy various user intents. Diversity IR is especially important in the biomedical domain, as biologists sometimes want diversified results pertinent to their query.
A combined learning-to-rank (LTR) framework is built from a general ranking model (gLTR) and a diversity-biased model. The former is learned from general ranking features by a conventional learning-to-rank approach. The latter is constructed with additional diversity-indicating features, which are extracted from the topics of the retrieved passages, detected using Wikipedia, and from the ranking order produced by the general learning-to-rank model. The final ranking results are given by a combination of both models.
Compared with the baselines BM25 and DirKL on the 2006 and 2007 collections, gLTR achieves 0.2292 (+16.23% and +44.1% over BM25 and DirKL, respectively) and 0.1873 (+15.78% and +39.0% over BM25 and DirKL, respectively) in terms of aspect-level mean average precision (Aspect MAP). The LTR method outperforms gLTR on the 2006 and 2007 collections, with 4.7% and 2.4% improvements in Aspect MAP, respectively.
The learning-to-rank method is an effective approach for biomedical information retrieval, and the diversity-biased features are beneficial for promoting diversity in ranking results.
How to promote diversity in ranking for information retrieval has become a very active topic [1–7] in the past decade. One of the major reasons is the increasing demand for novelty and for disambiguation of user queries, which have been described as Intrinsic Diversity and Extrinsic Diversity, respectively. Beyond the relevance between documents and the query, diversity IR considers the relationships among documents in the ranking order to promote diversity and reduce redundancy. In essence, promoting diversity means providing various aspects of information in the ranked results list, and reducing redundancy means reducing repeatedly mentioned information.
Diversity IR has drawn great attention and has been shown to be beneficial in previous studies when the query turns out to be ambiguous. This holds especially in the biomedical IR scenario investigated in the TREC 2006 and 2007 Genomics tracks, where biologists tend to query a certain type of entity covering different aspects related to the question, for example genes, proteins, diseases, and mutations.
In the TREC 2006 Genomics track, the University of Wisconsin re-ranked the passages using a clustering-based approach named GRASSHOPPER to promote ranking diversity. GRASSHOPPER is an alternative to MMR and its variants, with a principled mathematical model and strong empirical performance on artificial data sets. Later, in the 2007 track, most teams tried to obtain aspect-level performance through their passage-level results, instead of working on aspect-level retrieval directly [11–13].
Recent works [14, 4] show that Wikipedia can be used as an external knowledge resource to facilitate biomedical IR. In these studies, Wikipedia is used as an encyclopaedia to help detect the topics of documents. The novelty of the detected topics is measured by a binary novelty measurement or by survival models, and is used for re-ranking to promote the diversity of the whole ranking list.
Besides the methods mentioned above, some recent papers address diversity IR using learning-to-rank methods. One typical work directly learns a diversified ranking of documents based on users' clicking behavior, with an algorithm that maximizes the probability that a relevant document is found in the top k positions of a ranking. Another work optimizes variants of traditional IR metrics, such as NDCG and ERR, by rewarding aspect coverage and thereby penalizing redundancy. However, during model learning these methods use only general features and consider no diversity-related features. To the best of our knowledge, no learning-to-rank algorithm addresses specific features that reflect the novelty of a single document and the diversity of the whole ranking list. We believe that with such a general representation, the benefits of learning-to-rank may not have been fully exploited, because the novelty and diversity characteristics of ranking lists are ignored. We argue that it is promising to define and use diversity-reflecting features to better model diversity information.
In this paper, we propose several features that capture the diversity of documents and construct a combined learning-to-rank framework (LTR) by integrating a general ranking model with a diversity-biased model. Our approach measures the novelty of the documents' topics together with the diversity of the ranking list, and we find a way to combine these dynamically changing features with learning-to-rank technology. In our proposed framework, first, the general ranking model is learned from general ranking features by a conventional learning-to-rank approach. Second, diversity features are extracted based on the topics of the retrieved passages, detected using Wikipedia, and on the ranking order produced by the general learning-to-rank model. Third, a diversity-biased ranking model is constructed from the diversity-indicating features together with the conventional features. The final ranking results are given by a combination of both models.
The major contributions of this paper are two-fold. First, we propose several diversity-reflecting features by studying the relationship among documents. Second, we propose a learning-to-rank framework to combine the diversity-biased model with a general ranking model learned from the common features. Extensive experiments on the TREC 2006 and 2007 Genomics tracks [12, 17] demonstrate the effectiveness of our proposed diversity-favored learning-to-rank approach.
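The combined model scores a document d for a query Q as a weighted sum of the two component models:

$$\mathrm{LTR}(d, Q) = \alpha \cdot \mathrm{gLTR}(d, Q) + \beta \cdot \mathrm{dLTR}(d, Q) \tag{1}$$

where gLTR(d, Q) is the general learning-to-rank model, dLTR(d, Q) is the diversity-biased model, and α and β are parameters that control the weights of the two parts, with β = 1 − α.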
To deploy our proposed learning-to-rank framework in practice, a general ranking model is first learned from a set of training queries with their associated relevance assessments. Next, for the first-pass retrieval results obtained from the general ranking model, we use Wikipedia Miner to extract their related topics. From this ranking list and the topic information, we generate the diversity-biased features for each query-passage pair. Finally, the diversity-biased learning-to-rank model is learned from all of these features.
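The weighted combination of the general and diversity-biased scores (with β = 1 − α) can be illustrated with a minimal sketch. The function name `combine_scores`, the dictionary-based score containers, and the passage identifiers are hypothetical, and the sketch assumes both models emit scores on comparable scales; the source does not describe any score normalization.

```python
from typing import Dict, List, Tuple


def combine_scores(
    g_scores: Dict[str, float],
    d_scores: Dict[str, float],
    alpha: float,
) -> List[Tuple[str, float]]:
    """Rank passages by alpha * gLTR + (1 - alpha) * dLTR.

    g_scores and d_scores map a (hypothetical) passage id to the score
    given by the general and the diversity-biased model, respectively.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if g_scores.keys() != d_scores.keys():
        raise ValueError("both models must score the same passages")
    beta = 1.0 - alpha  # the source fixes beta = 1 - alpha
    combined = {
        pid: alpha * g_scores[pid] + beta * d_scores[pid] for pid in g_scores
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    g = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
    d = {"p1": 0.1, "p2": 0.6, "p3": 0.8}
    for pid, score in combine_scores(g, d, alpha=0.5):
        print(pid, round(score, 3))
```

Setting `alpha=1.0` reproduces the ordering of the general model alone, and `alpha=0.0` that of the diversity-biased model alone.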
General learning to rank model
General features extraction
Features for general learning-to-rank model.
Term frequency inverse document frequency.
Okapi BM25 model.
The DFR version of BM25.
An algorithm derived from the divergence from randomness (DFR) framework.
A DLH hyper-geometric DFR model (parameter free).
KL-divergence language model with Dirichlet smoothing.
Hiemstra's language model.
Proximity of Query Terms: Intuitively, the closer the query terms occur to each other in a document, the more likely the document is to be relevant.
It can be seen that our general features contain different paradigms of state-of-the-art IR models, which are usually used as strong baselines in previous studies.
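As an illustration of the query-term proximity feature, the sketch below computes the minimum distance, in token positions, between occurrences of two different query terms. The source does not state which proximity measure is used, so the choice of minimum pairwise distance, the function name `min_pair_distance`, and the whitespace tokenization are all assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Set


def min_pair_distance(
    doc_tokens: Sequence[str], query_terms: Set[str]
) -> Optional[int]:
    """Smallest positional gap between two distinct matched query terms.

    Returns None when fewer than two distinct query terms occur in the
    document, since no pairwise distance is defined in that case.
    """
    positions: Dict[str, List[int]] = {}
    for index, token in enumerate(doc_tokens):
        if token in query_terms:
            positions.setdefault(token, []).append(index)
    if len(positions) < 2:
        return None
    return min(
        abs(i - j)
        for term_a, term_b in combinations(positions, 2)
        for i in positions[term_a]
        for j in positions[term_b]
    )


if __name__ == "__main__":
    text = "the brca1 gene interacts with p53 in tumor cells where brca1 mutation"
    print(min_pair_distance(text.split(), {"brca1", "p53"}))
```

A smaller value means the query terms occur closer together, which under the intuition above suggests a more likely relevant document.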
Learning to rank algorithm
where S_Λ(D; Q) is a scoring function parameterized by a vector of parameters Λ, computed for each query Q and each document D in the document set; the evaluation metric is computed over the orderings induced by the scoring function S; and M_Λ is the parameter space over Λ.
Diversity-biased learning to rank
Additional features for diversity-biased learning-to-rank model.
Number of relevant aspects the passage contains.
Number of irrelevant aspects the passage contains.
Number of new relevant aspects the passage contains compared with previously ranked passages.
Number of relevant aspects that already exist in previously ranked passages.
Ratio of passages that contain new aspects to all previously ranked passages.
Ratio of the number of relevant aspects to all aspects before the current rank position.
Ratio of unique relevant aspects to all aspects before the current rank position.
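The seven diversity features listed above can be sketched as a single pass over a ranked list. This is one possible reading of the feature descriptions, not the authors' implementation: the feature names, the representation of each passage as a set of aspect labels, the availability of a set of relevant aspects, and the convention of returning 0.0 when a ratio has an empty denominator are all assumptions.

```python
from typing import Dict, List, Set


def diversity_features(
    ranked_aspects: List[Set[str]], relevant: Set[str]
) -> List[Dict[str, float]]:
    """Compute per-position diversity features for a ranked passage list.

    ranked_aspects[i] holds the aspects detected in the passage at rank i;
    relevant holds the aspects judged relevant to the query (assumed given).
    """
    features: List[Dict[str, float]] = []
    seen_relevant: Set[str] = set()  # unique relevant aspects seen so far
    total_prior = 0      # aspect occurrences in earlier passages
    relevant_prior = 0   # relevant aspect occurrences in earlier passages
    novel_passages = 0   # earlier passages that added a new relevant aspect

    for rank, aspects in enumerate(ranked_aspects):
        rel = aspects & relevant
        new_rel = rel - seen_relevant
        features.append({
            "n_relevant": float(len(rel)),
            "n_irrelevant": float(len(aspects - relevant)),
            "n_new_relevant": float(len(new_rel)),
            "n_seen_relevant": float(len(rel & seen_relevant)),
            "ratio_novel_passages": novel_passages / rank if rank else 0.0,
            "ratio_relevant_prior":
                relevant_prior / total_prior if total_prior else 0.0,
            "ratio_unique_relevant_prior":
                len(seen_relevant) / total_prior if total_prior else 0.0,
        })
        # Update the running state only after the features are emitted, so
        # each passage is compared with strictly earlier passages.
        if new_rel:
            novel_passages += 1
        seen_relevant |= rel
        total_prior += len(aspects)
        relevant_prior += len(rel)
    return features


if __name__ == "__main__":
    ranking = [{"a", "b", "x"}, {"a", "c"}, {"x"}]
    for row in diversity_features(ranking, relevant={"a", "b", "c"}):
        print(row)
```

Because every feature depends on the passages ranked earlier, the values change whenever the underlying ranking changes.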
Features extraction and model strategy
Our assumption is that a perfectly diversified ranking list exists. By learning from the general features, which represent the value of each individual query-passage pair, and from the diversity features, which capture the novelty and diversity of the whole ranking list, we can obtain an oracle ranking model to guide ranking on new datasets.
As can be seen in the previous section, the diversity features reflect the relationship between the current document and the previously ranked documents. Feature extraction therefore depends on a particular document ranking, and the quality of the features is potentially affected by that ranking list. This simulates the process of generating diversified documents based on previously ranked documents in the re-ranking paradigm for promoting novelty and diversity, where the document for each position is chosen to maximize the diversity of the whole ranking list. Accordingly, these diversity features should be extracted sequentially, in rank order. There are different ways to generate diversity features:
Once for all: The diversity features are generated according to the initial ranking given by the general learning-to-rank model, and the oracle model is learned from all features once.
Dynamic update: After the diversity features of the documents in the ith top-K subset are determined, the oracle learning-to-rank model is re-learned; consequently the general ranking is updated, which causes the diversity features to be re-generated.
Heuristically, the second strategy might be better; however, we argue that it is very time-consuming and complicated in practice. Therefore, in this paper we adopt the first strategy for diversity feature generation.
To evaluate the proposed approach, we use the TREC 2006 and 2007 Genomics track full-text collection as the test corpus, which consists of 162,259 documents from 49 genomics-related journals indexed by MEDLINE [17, 12], with 64 queries in total. Three levels of retrieval metrics were measured in TREC 2006, namely Passage MAP, Aspect MAP, and Document MAP, and one more, Passage2 MAP, was proposed in the TREC 2007 Genomics track [17, 12].
Gold-standard relevance and aspect judgments are provided for the officially released legal spans of passages. For the sake of generality, we use only the relevance information when generating the training file for the general learning-to-rank model. We define a passage as a maximal span of consecutive text within a single document that does not include any HTML paragraph tag. Following this principle, we extract passages from the metadata and index. In constructing the training dataset for learning-to-rank, we compare the extracted passages with the officially defined passages that have gold-standard relevance, and assume that whenever there is an overlap, the relevance of the officially released passage carries over to the extracted passage.
The parameters of the learning-to-rank algorithm are optimized using a greedy boosting method in a 2-fold cross-validation setting, in which the best model is selected according to Document MAP. The parameters α and β in Equation 1 are also tuned by 2-fold cross-validation.
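The passage definition and the overlap-based label transfer described above can be sketched as follows. The regular expression for paragraph tags, the `Span` type with character offsets, and the half-open interval convention are assumptions made for illustration; the source does not specify how offsets or tags are represented.

```python
import re
from typing import List, NamedTuple

# Matches opening and closing HTML paragraph tags such as <p>, <P class="x">, </p>.
PARAGRAPH_TAG = re.compile(r"</?p\b[^>]*>", flags=re.IGNORECASE)


def split_passages(html: str) -> List[str]:
    """Split text into maximal spans that contain no paragraph tag."""
    return [piece for piece in PARAGRAPH_TAG.split(html) if piece.strip()]


class Span(NamedTuple):
    doc_id: str
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share at least one character in the same document."""
    return a.doc_id == b.doc_id and a.start < b.end and b.start < a.end


def label_passages(extracted: List[Span], gold_relevant: List[Span]) -> List[int]:
    """Label an extracted passage 1 if it overlaps any gold relevant span."""
    return [
        int(any(overlaps(passage, gold) for gold in gold_relevant))
        for passage in extracted
    ]


if __name__ == "__main__":
    print(split_passages("<p>First.</p><p>Second one.</p>"))
    extracted = [Span("d1", 0, 100), Span("d1", 100, 200), Span("d2", 0, 50)]
    gold = [Span("d1", 90, 110), Span("d2", 50, 80)]
    print(label_passages(extracted, gold))
```

Under this rule a single gold span that straddles a paragraph boundary marks both neighbouring extracted passages as relevant.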
Results and analyses
Comparison with baseline
Performance Comparison with Baselines on 2006 Collection.
Performance Comparison with Baselines on 2007 Collection.
As can be seen from Table 3 and Table 4, when diversity features are used for learning a ranking model, performance improvements over the three strong baselines BM25, DirKL and gLTR are obtained on all levels of MAP metrics on both the 2006 and 2007 Genomics track tasks. We attribute the larger improvement in Passage MAP than in Aspect MAP to the paragraph-based indexing of the original data and to the way we generate the training dataset for learning-to-rank: the relevance of a passage is contributed by all embedded paragraphs that are relevant, even though they may refer to different topics of the query.
It is noticeable that the improvements in Document MAP are also remarkable. This shows that the diversity features benefit not only diversity but also general relevance performance. When diversity information is used to train the model, passages that are both relevant and cover various topics are favored by the ranking model. This is promising: when designed properly, diversity features both improve general IR metrics and promote diversity in ranking.
Comparison with TREC results
Performance Comparison with TREC 2006 Submissions.
Performance Comparison with TREC 2007 Submissions.
Comparison with re-ranking method
Comparison with Re-Ranking Method on 2006 Collection.
Comparison with Re-Ranking Method on 2007 Collection.
As shown in Table 7 and Table 8, our method achieves performance improvements over the re-ranking method on all metrics on both the 2006 and 2007 collections. We attribute this to the diversity-representative features proposed in this paper and to the use of learning-to-rank technology. Learning-to-rank has demonstrated strength in integrating multiple sources of features when constructing a model. As with other machine learning methods, features play an important role in learning-to-rank. As shown in the previous section, diversity-representative features give the learning-to-rank method a greater opportunity to capture novelty and diversity information in the ranking list, which results in a better ranking model.
In summary, from the results and analyses we can conclude that our proposed diversity features are representative of the diversity information of a ranking list and help improve the ranking model within the combined learning-to-rank framework proposed in this paper.
Effect of control parameter
In this section, we evaluate how the parameters α and β in the framework affect retrieval performance. Because β = 1 − α, we present the results under different settings of α only; specifically, we sweep over the values 0.1, 0.2, ..., 0.9.
In particular, for each dataset we conduct a 2-fold cross-validation, where each fold randomly chooses half of the topics for training and the remaining for testing, and vice versa. The overall retrieval performance is averaged over the two test topic sets.
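The 2-fold protocol and the sweep over α can be sketched as below. The `evaluate` callable is a hypothetical stand-in for running the combined model with a given α on a set of topics and returning a metric such as Aspect MAP; the fixed random seed and the function names are also assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Candidate values 0.1, 0.2, ..., 0.9, as swept in the text.
ALPHAS: List[float] = [round(0.1 * i, 1) for i in range(1, 10)]


def two_fold_split(topics: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split the topics into two halves."""
    shuffled = list(topics)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def cross_validate_alpha(
    topics: Sequence[int],
    evaluate: Callable[[float, List[int]], float],
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Tune alpha on one half, test on the other, then swap and average.

    Returns the mean test score and the alpha chosen in each fold.
    """
    fold_a, fold_b = two_fold_split(topics, seed)
    test_scores: List[float] = []
    chosen: List[float] = []
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        best_alpha = max(ALPHAS, key=lambda alpha: evaluate(alpha, train))
        chosen.append(best_alpha)
        test_scores.append(evaluate(best_alpha, test))
    return sum(test_scores) / len(test_scores), chosen


if __name__ == "__main__":
    # Toy metric that peaks at alpha = 0.7 regardless of the topics.
    def toy_metric(alpha: float, topic_ids: List[int]) -> float:
        return 1.0 - abs(alpha - 0.7)

    print(cross_validate_alpha(range(28), toy_metric))
```

Selecting α only on the training half keeps the reported test score free of tuning on the test topics.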
We also notice that when α is set to 1, the combined model in Equation 1 reduces to gLTR, the general model, and when α is set to 0, it reduces to the diversity-biased model; neither setting obtains the best result. This shows the necessity and effectiveness of the combination. For some metrics (e.g., Document MAP on the 2007 collection and Aspect MAP on both collections), the best result occurs when α is in the range 0.6 to 0.8. We therefore suggest an empirical setting of α between 0.6 and 0.8 when no training data is available.
In this paper, we have applied learning-to-rank technology to biomedical information retrieval and proposed a combined learning-to-rank model that integrates a general ranking model and a diversity-biased model. The general ranking model proved to be effective, and with the help of the diversity-biased model the retrieval results improve further. The diversity-biased model is learned from both general features and diversity-favoring features in order to reward ranking lists with low redundancy and high diversity. The diversity-reflecting features, which are defined in terms of the topic relationships among passages in ranking order, appear to contribute to promoting the diversity of results.
Publication of this article has been funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Early Researcher Award/Premier's Research Excellence Award and the IBM Shared University Research (SUR) Award. We also would like to thank IBM Canada for providing IBM BladeCenter blade servers to conduct experiments reported in the paper.
This article has been published as part of BioMedical Engineering OnLine Volume 13 Supplement 2, 2014: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): BioMedical Engineering OnLine. The full contents of the supplement are available online at http://www.biomedical-engineering-online.com/supplements/13/S2.
- Carbonell J, Goldstein J: The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 1998, 335–336.
- Wang J, Zhu J: Portfolio theory of information retrieval. SIGIR 2009, 115–122.
- Santos RLT, Macdonald C, Ounis I: Exploiting query reformulations for web search result diversification. WWW 2010, 881–890.
- Yin X, Huang X, Li Z: Promoting ranking diversity for biomedical information retrieval using wikipedia. ECIR 2010, 495–507.
- An X, Huang JX: Boosting novelty for biomedical information retrieval through probabilistic latent semantic analysis. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR '13. ACM, New York, NY, USA; 2013:829–832.
- Huang X, Hu Q: A bayesian learning approach to promoting diversity in ranking for biomedical information retrieval. SIGIR 2009, 307–314.
- Chen Y, Yin X, Li Z, Hu X, Huang J: A LDA-based approach to promoting ranking diversity for genomics information retrieval. BMC Genomics 2012, 13(3):1–10.
- Radlinski F, Bennett PN, Carterette B, Joachims T: Redundancy, diversity and interdependent document relevance. SIGIR Forum 2009, 43(2):46–52. doi:10.1145/1670564.1670572
- Goldberg AB, Andrzejewski D, Gael JV, Settles B, Zhu X, Craven M: Ranking biomedical passages for relevance and diversity: University of Wisconsin, Madison at TREC Genomics 2006. TREC 2006.
- Zhu X, Goldberg AB, Van J, Andrzejewski GD: Improving diversity in ranking using absorbing random walks. Physics Laboratory University of Washington 2007, 97–104.
- Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Mork JG, Ruch P, Ruiz ME, Smith LH, Wilbur WJ, Aronson AR: Combining resources to find answers to biomedical questions. TREC 2007.
- Hersh W, Cohen A, Ruslen L, Roberts P: TREC 2007 Genomics track overview. TREC 2007.
- Zhou W, Yu CT: TREC genomics track at UIC. TREC 2007.
- Yin X, Huang JX, Li Z, Zhou X: A survival modeling approach to biomedical search result diversification using wikipedia. IEEE Transactions on Knowledge and Data Engineering 2013, 25(6):1201–1212.
- Radlinski F, Kleinberg R, Joachims T: Learning diverse rankings with multi-armed bandits. ICML 2008, 784–791.
- Santos RLT, Macdonald C, Ounis I: On the suitability of diversity metrics for learning-to-rank for diversity. SIGIR 2011, 1185–1186.
- Hersh W, Cohen AM, Roberts P, Rekapalli HK: TREC 2006 genomics track overview. TREC 2006.
- Liu TY: Learning to rank for information retrieval. Found Trends Inf Retr 2009, 3(3):225–331.
- Metzler D, Bruce Croft W: Linear feature-based models for information retrieval. Inf Retr 2007, 10(3):257–274. doi:10.1007/s10791-006-9019-z
- Bendersky M, Metzler D, Croft WB: Learning concept importance using a weighted dependence model. WSDM 2010, 31–40.
- Robertson SE, Walker S, Hancock-Beaulieu MM: Large test collection experiments on an operational, interactive system: Okapi at TREC. IPM 1995, 31(3):345–360.
- Zhai C, Lafferty JD: Model-based feedback in the language modeling approach to information retrieval. CIKM 2001, 403–410.
- Amati G, Joost C, Rijsbergen V: Probabilistic models for information retrieval based on divergence from randomness. TOIS 2002, 20:357–389. doi:10.1145/582415.582416
- Hiemstra D: Using language models for information retrieval. PhD thesis, University of Twente; 2001.
- Tao T, Zhai C: An exploration of proximity measures in information retrieval. SIGIR 2007, 295–302.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.