Over the summer I served as a Google Summer of Code mentor for David Nemeskey, a PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and to implement some state-of-the-art ranking models on top of the new framework.
These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc.) and queries (term, phrase, spans, etc.). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.
Relevance ranking is the heart of a search engine, and I hope the additional models and flexibility will improve the experience of using Lucene, whether you have been frustrated with tuning TF/IDF weights and find that an alternative model works better for your case, have found it difficult to integrate custom logic that your application needs, or just want to experiment.
Highlights of the new scoring features:
- New ranking algorithms, in addition to Lucene’s Vector Space Model.
- Added key statistics to the index format to support additional scoring models:
  - Term- and field-level statistics for collection frequencies and deriving averages.
  - Additional document-level statistics for computing normalization factors.
- Decoupled matching from ranking in Lucene’s core search classes:
  - Customize scoring without digging into the “guts”.
  - Customize explanations: essential for debugging relevance issues.
- Powerful low-level Similarity API, supporting advanced use cases:
  - Incorporate per-document values from Column Stride Fields into the score.
  - Use different scoring parameters or algorithms for different fields.
  - Fuse multiple scoring algorithms into a combined score.
- Convenient high-level SimilarityBase for everything else:
  - Write your own scoring function in one Java method.
  - Easy access to available index statistics.
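
To make the "one Java method" idea more concrete, here is a minimal standalone sketch of what a scoring function over index statistics can look like. It deliberately does not use Lucene's classes: `ScoringSketch`, `Stats`, `Scorer`, and `scoreField` are hypothetical stand-ins invented for illustration, and the formulas are the textbook Okapi BM25 and a simple TF-IDF variant, not necessarily what Lucene computes. The last helper hints at how different fields could be given different scoring functions.

```java
import java.util.Map;

/** Standalone illustration only; none of these types are part of Lucene. */
public class ScoringSketch {

    /** Hypothetical holder for the kind of statistics an index can expose. */
    static final class Stats {
        final long docCount;          // documents in the collection
        final long docFreq;           // documents containing the term
        final double avgFieldLength;  // average field length, in tokens

        Stats(long docCount, long docFreq, double avgFieldLength) {
            this.docCount = docCount;
            this.docFreq = docFreq;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** A scoring function: statistics, term frequency, document length -> score. */
    interface Scorer {
        double score(Stats stats, double freq, double docLen);
    }

    /** Textbook Okapi BM25 for a single term, written as one method. */
    static double bm25(Stats s, double freq, double docLen, double k1, double b) {
        double idf = Math.log(1.0 + (s.docCount - s.docFreq + 0.5) / (s.docFreq + 0.5));
        // Length normalization: longer-than-average documents are penalized.
        double norm = k1 * (1.0 - b + b * docLen / s.avgFieldLength);
        return idf * freq * (k1 + 1.0) / (freq + norm);
    }

    /** A simple TF-IDF variant with square-root length normalization. */
    static double tfIdf(Stats s, double freq, double docLen) {
        double idf = 1.0 + Math.log((double) s.docCount / (s.docFreq + 1.0));
        return Math.sqrt(freq) * idf / Math.sqrt(docLen);
    }

    /** Chooses a scorer by field name, falling back to a default scorer. */
    static double scoreField(Map<String, Scorer> perField, Scorer fallback,
                             String field, Stats s, double freq, double docLen) {
        return perField.getOrDefault(field, fallback).score(s, freq, docLen);
    }

    public static void main(String[] args) {
        Stats stats = new Stats(1000, 10, 100.0);
        Scorer bodyScorer = (s, f, len) -> bm25(s, f, len, 1.2, 0.75);
        Map<String, Scorer> perField = Map.of("body", bodyScorer);

        // "body" uses BM25; any other field falls back to TF-IDF.
        System.out.println(scoreField(perField, ScoringSketch::tfIdf, "body", stats, 3, 100));
        System.out.println(scoreField(perField, ScoringSketch::tfIdf, "title", stats, 1, 5));
    }
}
```

In this sketch the matching logic is nowhere to be seen: the scorer only receives statistics and returns a number, which is the separation of matching from ranking described in the list above.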
For more information about this GSoC project, take a look at its wiki page.