Evaluating Retrieval System Effectiveness

Subscribers: 344,000
Published on: 2016-09-05
Video Link: https://www.youtube.com/watch?v=Tw4guy9X8U0
Duration: 1:11:30
Views: 1,941


One of the primary motivations for the Text REtrieval Conference (TREC) was to standardize retrieval system evaluation. While the Cranfield paradigm of using test collections to compare system output had been introduced decades before the start of TREC, the particulars of how it was implemented differed across researchers, making evaluation results incomparable. The validity of test collections as a research tool was in question, not only from those who objected to the reliance on relevance judgments, but also from those who were concerned about whether the approach could scale. With the notable exception of Sparck Jones and van Rijsbergen's report on the need for larger, better test collections, there was little explicit discussion of what constituted a minimally acceptable experimental design, and no hard evidence to support any position.

TREC has succeeded in standardizing and validating the use of test collections as a retrieval research tool. The repository of runs over common collections that have been submitted to TREC has enabled empirical determination of the confidence that can be placed in a conclusion that one system is better than another under a given experimental design. In particular, the reliability of such a conclusion has been shown to depend critically on both the evaluation measure and the number of questions used in the experiment.

This talk summarizes the results of two more recent investigations based on the TREC data: the definition of a new measure, and evaluation methodologies that look beyond average effectiveness. The new measure, named bpref for binary preferences, is as stable as existing measures but is much more robust in the face of incomplete relevance judgments, so it can be used in environments where complete judgments are not possible. Using average effectiveness scores hampers failure analysis because the averages hide an enormous amount of variance, yet more focused evaluations are unstable precisely because of that variation.
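The abstract refers to bpref but does not define it. As a rough illustration only, the sketch below computes one commonly cited formulation of bpref (Buckley and Voorhees, SIGIR 2004): for each judged relevant document retrieved, the score is reduced in proportion to the number of judged nonrelevant documents ranked above it, and unjudged documents are ignored entirely, which is what makes the measure robust to incomplete judgments. The function name, the judgment format, and the toy data are illustrative assumptions, not anything taken from the talk.

def bpref(ranked_docs, judgments):
    """Binary preference (bpref) for a single topic.

    ranked_docs: list of document ids in system-ranked order.
    judgments:   dict mapping document id -> True (judged relevant) or
                 False (judged nonrelevant). Documents absent from the
                 dict are unjudged and simply skipped.
    """
    R = sum(1 for rel in judgments.values() if rel)        # judged relevant
    N = sum(1 for rel in judgments.values() if not rel)    # judged nonrelevant
    if R == 0:
        return 0.0
    denom = min(R, N)

    score = 0.0
    nonrel_above = 0   # judged nonrelevant docs seen so far in the ranking
    for doc in ranked_docs:
        if doc not in judgments:
            continue                         # unjudged: no reward, no penalty
        if judgments[doc]:
            if denom > 0:
                # Penalize by the fraction of the first R judged nonrelevant
                # documents that were ranked above this relevant document.
                score += 1.0 - min(nonrel_above, R) / denom
            else:
                score += 1.0                 # no judged nonrelevant docs at all
        else:
            nonrel_above += 1
    return score / R


# Toy check: one unjudged document ("d4") sits in the middle of the ranking.
ranking = ["d1", "d2", "d4", "d3", "d5"]
qrels = {"d1": True, "d2": False, "d3": True, "d5": False}   # d4 is unjudged
print(bpref(ranking, qrels))   # 0.75: d1 incurs no penalty, d3 incurs 1/2

In this sketch, relevant documents that are never retrieved still count in R but add nothing to the sum, so they pull the score down. Published variants of bpref differ slightly, for example in whether the denominator is R or min(R, N); the version above uses min(R, N).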




Other Videos By Microsoft Research


2016-09-05  Folklore of Network Protocol Design (Anita Borg Lecture)
2016-09-05  Toolkit for Construction and Maintenance of Extensible Proof Search Tactics
2016-09-05  ME++
2016-09-05  Structural Comparison of Executable Objects
2016-09-05  Indifference is Death: Responsibility, Leadership, & Innovation
2016-09-05  TQFTs and tight contact structures on 3-manifolds
2016-09-05  Wireless Embedded Networks/The Ecosystem and Cool Challenges
2016-09-05  Data Mining & Machine Learning to empower business strategy
2016-09-05  Some uses of orthogonal polynomials
2016-09-05  Approximation Algorithms for Embedding with Extra Information and Ordinal Relaxation
2016-09-05  Evaluating Retrieval System Effectiveness
2016-09-05  Exploiting the Transients of Adaptation for RoQ Attacks on Internet Resources
2016-09-05  Specification-Based Annotation Inference
2016-09-05  Emotion Recognition in Speech Signal: Experimental Study, Development and Applications
2016-09-05  Text summarization: News and Beyond
2016-09-05  Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Size Distribution
2016-09-05  Learning and Inferring Transportation Routines
2016-09-05  Raising the Bar: Integrity and Passion in Life and Business: The Story of Clif Bar, Inc.
2016-09-05  Revelationary Computing, Proactive Displays and The Experience UbiComp Project
2016-09-05  The Design of A Formal Property-Specification Language
2016-09-05  Data Harvesting: A Random Coding Approach to Rapid Dissemination and Efficient Storage of Data



Tags:
microsoft research