This is part 11 in a (never ending?) series of articles on Indexing and Searching the ISFDB.org data using Solr.
When we left off last time, we had used a domain specific biasing function to improve the order of our results so popular Authors and Titles surfaced at the top of results. Today we’re going to look at using DisMax to make further improvements.
(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_10 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_11 tag containing the end result of this article.)
Popular != What I Want
Using a score boost based on popularity gave us some quick wins in making “good” docs bubble up easily, and it’s the type of solution Product Managers and Sales folks really love because it shows the “hot” stuff front and center, but it can also annoy users who are interested in the “long tail”. Sometimes, they may not even be looking for the tip of that tail — take for instance an author search for Sterling.
Bruce Sterling is a popular Sci-Fi author who has published almost 200 novels/stories, and anyone searching the ISFDB Data would be reasonable in expecting his name to be the first result for “Sterling”. Since we’ve got a filter on doc_type:AUTHOR, you would certainly expect him to be at the top of a list of folks named Sterling.
Instead what we get on our page #1 of results is…
- Ray Bradbury
- Bruce Sterling
- Gregory Benford
- Edmond Hamilton
- Terry Brooks
- Sterling E. Lanier
- Amy Sterling Casil
- William Morrison
- Sterling Lanier
- Kenneth Sterling
…there’s hardly a “Sterling” among them!
The reason is simple and straightforward, and somewhat clear just from the UI view. We can see that “Ray Bradbury” has a pseudonym of “Brett Sterling”. It’s not a big stretch to imagine that he might be more popular than “Bruce Sterling”, and the explain toggle shows us that this is in fact the case…
- Ray Bradbury
  451896.44 = (MATCH) boost(catchall:sterling,sum(int(views),int(annualviews))), product of:
    8.268163 = (MATCH) weight(catchall:sterling in 560416), product of:
      0.99999994 = queryWeight(catchall:sterling), product of:
        8.268164 = idf(docFreq=443, maxDocs=636658)
        0.12094583 = queryNorm
      8.268164 = (MATCH) fieldWeight(catchall:sterling in 560416), product of:
        1.0 = tf(termFreq(catchall:sterling)=1)
        8.268164 = idf(docFreq=443, maxDocs=636658)
        1.0 = fieldNorm(field=catchall, doc=560416)
    54655.0 = sum(int(views)=40015,int(annualviews)=14640)
- Bruce Sterling
  327739.88 = (MATCH) boost(catchall:sterling,sum(int(views),int(annualviews))), product of:
    18.488174 = (MATCH) weight(catchall:sterling in 560504), product of:
      0.99999994 = queryWeight(catchall:sterling), product of:
        8.268164 = idf(docFreq=443, maxDocs=636658)
        0.12094583 = queryNorm
      18.488176 = (MATCH) fieldWeight(catchall:sterling in 560504), product of:
        2.236068 = tf(termFreq(catchall:sterling)=5)
        8.268164 = idf(docFreq=443, maxDocs=636658)
        1.0 = fieldNorm(field=catchall, doc=560504)
    17727.0 = sum(int(views)=12092,int(annualviews)=5635)
Looking at the other results and their score explanations, it’s easy to see pseudonyms affecting the other results in the same way (or in the case of Terry Brooks: the birthplace of “Sterling, Illinois”).
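To make the arithmetic in those two explanations concrete, here is a small sketch that recomputes both scores from the numbers shown above. It assumes the classic Lucene TF-IDF formulas (tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), and queryNorm for a single-term query is 1 / idf); the function names are mine, not anything from Solr.

```python
import math

MAX_DOCS = 636658   # maxDocs from the explain output
DOC_FREQ = 443      # docFreq for catchall:sterling


def classic_idf(doc_freq, max_docs):
    # Assumed classic Lucene formula: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def boosted_score(term_freq, views, annual_views, field_norm=1.0):
    idf = classic_idf(DOC_FREQ, MAX_DOCS)
    query_norm = 1.0 / idf                 # single-term query: 1 / sqrt(idf^2)
    query_weight = idf * query_norm        # works out to roughly 1.0
    field_weight = math.sqrt(term_freq) * idf * field_norm
    popularity = views + annual_views      # the multiplicative boost
    return query_weight * field_weight * popularity


bradbury = boosted_score(term_freq=1, views=40015, annual_views=14640)
sterling = boosted_score(term_freq=5, views=12092, annual_views=5635)
print(bradbury, sterling)
```

Both values should land within rounding distance of the 451896.44 and 327739.88 in the explain output. The text match for Bruce Sterling is a bit more than twice as strong, but the popularity multiplier for Ray Bradbury is about three times larger, so Bradbury wins.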
Not All Fields Are Created Equal
It would be easy to fall into a trap of micro-tuning a divisor on the popularity boost to try to make it more subtle, but ultimately the problem is that we are searching against a “catchall” field containing all of the text from all of the other fields, and in reality not all fields are created equal. Bruce Sterling may have the term “Sterling” in his catchall field 5 times compared to Ray Bradbury’s 1, but what should really matter is which fields the term appears in. We could change our catchall field to only include the canonical name of an author instead of their pseudonyms, but that’s a very black/white solution that would hurt folks searching on pseudonyms (or looking for authors from Illinois). What we need is a shade of grey that lets us give more weight to some fields than others.
DisMax is a QParser that I’ve written about before. If you want all the gory details, I suggest you read that article, but for now the quick takeaway is that DisMax lets you configure different fields to search against with different weights.
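As a rough illustration of the idea (not of Solr’s actual implementation), a “disjunction max” score takes the best weighted per-field score and optionally adds a fraction of the remaining ones via a tie breaker. The per-field scores and weights below are made-up numbers, and the function name is hypothetical.

```python
def dismax_score(field_scores, weights, tie=0.0):
    """Toy disjunction-max: best weighted field score, plus tie * the rest.

    Only fields listed in `weights` are searched, mirroring the qf param.
    """
    weighted = [field_scores.get(field, 0.0) * weight
                for field, weight in weights.items()]
    best = max(weighted, default=0.0)
    return best + tie * (sum(weighted) - best)


# Hypothetical weights: name matches count far more than catchall matches.
weights = {"canonical_name": 10.0, "catchall": 0.1}

# Hypothetical raw per-field scores for the term "sterling".
bradbury_score = dismax_score({"catchall": 8.27}, weights)
sterling_score = dismax_score({"canonical_name": 9.0, "catchall": 18.49}, weights)
print(bradbury_score, sterling_score)
```

With weights like these, a match in the name field dominates, while a document that only matches in catchall still gets a small non-zero score and stays in the result set.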
To keep things simple to start, I’m going to ignore “Title” documents completely, and focus solely on “Author” docs (since different types of documents contain different fields). Without changing my configs at all, I can use URL params to experiment with some different uses of DisMax to search specific fields with various weightings…
- Just searching two name fields with equal weight
- Add the catchall field in, but weight it much less than the other fields
- Give canonical_name a much higher weight than the other fields
- Add the popularity boost back into the mix
(Note: in this last instance, we have to move the defType=dismax into the q param’s local params, so it will be used to pick the nested parser. defType is only the default type of parser for the “main” query at whatever level it’s used; it doesn’t recurse down to other query strings that get parsed.)
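For readers who want to reproduce experiments like the four above, here is a sketch of what the request params might look like. The host and path, the second name field (legal_name), and every weight are assumptions for illustration only; canonical_name, catchall, doc_type, views and annualviews are the names that appear in this article. The last variant assumes the popularity boost is applied with Solr’s boost QParser, with defType supplied as a local param.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed host and path
SECOND_NAME_FIELD = "legal_name"                   # hypothetical field name

COMMON = {"fq": "doc_type:AUTHOR", "debugQuery": "true"}

# All weights here are placeholders, not the values used in the project.
VARIANTS = {
    "equal": {"q": "sterling", "defType": "dismax",
              "qf": f"canonical_name {SECOND_NAME_FIELD}"},
    "with_catchall": {"q": "sterling", "defType": "dismax",
                      "qf": f"canonical_name {SECOND_NAME_FIELD} catchall^0.1"},
    "name_heavy": {"q": "sterling", "defType": "dismax",
                   "qf": f"canonical_name^10 {SECOND_NAME_FIELD} catchall^0.1"},
    # defType moves into the local params so the nested query uses DisMax.
    "boosted": {"q": "{!boost b=sum(views,annualviews) defType=dismax}sterling",
                "qf": f"canonical_name^10 {SECOND_NAME_FIELD} catchall^0.1"},
}


def solr_url(params):
    # urlencode takes care of escaping spaces, carets and braces.
    return SOLR_SELECT + "?" + urlencode({**COMMON, **params})


URLS = {name: solr_url(params) for name, params in VARIANTS.items()}
for name, url in URLS.items():
    print(name, url)
```

Building the URLs from a dict keeps the escaping out of sight, which matters once the q param starts carrying local params with braces and spaces.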
We’ve now got some results that look fairly decent: matches in the canonical_name field are heavily weighted and considered really important, but matches anywhere in the document will still be returned as results. In the future we might want to better leverage the pf param of DisMax to only weight fields heavily if they contain all of the terms in a query, but for now we’ve definitely got some incremental improvement.
But What About Titles?
Before we call it a day, we have to think about the “Title” situation. We’re still searching the catchall field, so matching titles are still returned, but since they don’t have a chance of matching any of the heavily weighted fields, the scores from DisMax can be so low that even extremely popular titles will score lower than authors who just happen to have names that are similar to their titles. I’m sure Pete Lion worked very hard on the cover art for the one book he worked on, but does it really make sense that a search for lion should return him before The Lion, the Witch and the Wardrobe (the most popular title in the ISFDB)?
One approach we could take would be to use copyField directives or DIH transformers to create more “common” fields that would exist in all types of documents, and use those in our DisMax options. I may do that down the road, but in the meantime we can gain parity for Title documents by adding the title field to the qf with a comparable boost to canonical_name, so “good” matches on Title docs will get decent scores.
Last But Not Least: Fix Some UI Bugs
When I added the multiplicative boost last week, and switched to using
q param as an “invariant” so that it would always be applied and could never be overridden. This works well, and I updated the text-box in the UI to know about
q param, which gets ignored.
Since I’m using DisMax as my QParser for the
fq for these links, and rely on the default query (now specified for DisMax using
The one other “Bug” I wanted to fix today is the bug in my brain that somehow let me get this far in working on this project without ever adding an external link from each search result to the main ISFDB.org page for the specific Author/Title. I’m not sure why I never did it before, but it was a relatively simple little bit of UI markup (although it did require a small macro change because of some oddities in whitespace handling).
Conclusion (For Now)
And that wraps up this latest installment with the blog_11 tag. We’ve now got some much better looking results for various searches, by using DisMax to search against various fields with different weighted importance.
One final note: It’s important to realize that there is nothing special about the weights I picked for these fields. They are not magic numbers, I did not put a lot of thought into them, and I didn’t rely on any particular wisdom or experience (that I didn’t share in this article) to decide what they should be. I just picked numbers that at first glance gave me good looking results. The scores produced don’t matter; what matters is that the weights used in the qf “play nicely” with one another, and with the multiplicative boost from the popularity. If the popularity numbers grow by a few orders of magnitude, then these numbers might not be useful anymore. In an ideal world, I would set up a suite of relevancy tests, and do click through analysis, and have a team of helper monkeys sanity checking popular searches, but for a one-man personal project, the results so far seem pretty good.
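For what it’s worth, a suite of relevancy tests doesn’t have to be elaborate. A minimal sketch might be a table of “this query should return this doc within the top N” expectations, checked against whatever function runs the search. Everything below is hypothetical scaffolding: the canned “sterling” results are the pre-DisMax ordering from earlier in this article, the “lion” ordering is invented to match the problem described above, and a real version would call Solr instead of canned_search.

```python
def rank_of(expected, ranked_names):
    """Return the 1-based position of `expected`, or None if it is missing."""
    for position, name in enumerate(ranked_names, start=1):
        if name == expected:
            return position
    return None


def check_expectations(search, expectations):
    """Return a list of (query, expected, actual_rank) for every miss."""
    failures = []
    for query, (expected, max_rank) in expectations.items():
        rank = rank_of(expected, search(query))
        if rank is None or rank > max_rank:
            failures.append((query, expected, rank))
    return failures


# Canned stand-in for a real search call (hypothetical).
CANNED = {
    "sterling": ["Ray Bradbury", "Bruce Sterling", "Gregory Benford"],
    "lion": ["Pete Lion", "The Lion, the Witch and the Wardrobe"],
}


def canned_search(query):
    return CANNED.get(query, [])


EXPECTATIONS = {
    "sterling": ("Bruce Sterling", 1),
    "lion": ("The Lion, the Witch and the Wardrobe", 1),
}

failures = check_expectations(canned_search, EXPECTATIONS)
for query, expected, rank in failures:
    print(f"{query!r}: expected {expected!r} at the top, found at rank {rank}")
```

Against the canned data both expectations fail, which is the point: re-running a list like this after every change to the qf weights or the boost function shows quickly whether a tweak fixed one search by breaking another.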