This is part #4 in a (never ending?) series of articles on Indexing and Searching the ISFDB.org data using Solr.

When we left last time, I had a nice index of “Title Centric” documents: one document for each title in the ISFDB, with multi-valued fields containing the basic data about each Author that worked on a title. This is the first week I didn’t have time to work on the project on Friday, so I squeezed in a bit of work today (Saturday) to just get the basics in place for Author Centric documents.

(If you are interested in following along at home, you can check out the code from GitHub. I’m starting at the blog_3 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_4 tag containing the end result of this article.)

Why Not Use Multiple Indexes?

A quick digression: Today I’m going to start putting multiple different types of documents all in one index. You might ask (and rightly so) “Why not just use multiple indexes? Wouldn’t using Multiple SolrCores make sense for this?” The answer is “it depends”. If I had completely different use cases for title searching and author searching, then it would certainly make sense to use different cores with different schemas — but ultimately I want to be able to support a single simple search box where you can just type in anything (a name, a title, a keyword) and get “good” results, with authors and titles intermixed in one set of results. At that point faceting can be used to say “no, i really just meant i was looking for titles…” or “actually i just want authors…”.

Now that we’ve cleared that up…

Supporting Multiple Document Types

The first thing to do to support multiple types of documents in a single index is to make our uniqueKey something that can be distinct across all types of documents, so we don’t risk collisions. So I’ve added a “doc_id” field to my schema, and made it the uniqueKey field. To populate this for each of my documents, I’m using the “TemplateTransformer” to construct an artificial field out of the “title_id” field, with a “TITLE_” prefix put in front of each value to make them unique (so title_id #4321 and author_id #4321 don’t overwrite each other in the index).
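As a rough illustration of the schema side of this change, the two relevant pieces of schema.xml might look something like the sketch below. The "string" field type and the indexed/stored/required attributes are my assumptions for illustration; the actual schema in the repository may differ.

```xml
<!-- Sketch only: the type and attribute choices here are assumptions -->

<!-- Declared alongside the other fields: one generated ID per document -->
<field name="doc_id" type="string" indexed="true" stored="true" required="true" />

<!-- Declared once per schema: tells Solr which field identifies a document -->
<uniqueKey>doc_id</uniqueKey>
```

A non-tokenized type such as "string" is the usual choice for a uniqueKey, since the value needs to match exactly when a document is replaced.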

This is the first time I’ve had to explicitly use DIH’s <field …> syntax to declare a field (because the field name isn’t exactly the same as the column name from the DB and I need to generate it using a transformer), and it exposes an eccentricity about how DIH deals with fields: it refers to the field name you want to use in Solr as a “column”…


<field column="doc_id" template="TITLE_${title.title_id}" />

This thoroughly confused me when I looked at the examples; I really wanted to write something like this…


<field name="doc_id" template="TITLE_${title.title_id}" />

…but as you can see from other examples in the DIH docs, the “column” form is how the <field …> tag is used, even when the source of the data is something that isn’t a DB. (It doesn’t make a lot of sense to me, but it is what it is.)

So now we’ve got a new uniqueKey field with a generated value for all of our title docs, we’ll also want a “doc_type” field that keeps track of (surprise!) the type of document we’re dealing with. As the DIH FAQ mentions, this is trivial to do by abusing the template transformer…

<field column="doc_type" template="TITLE" />
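Putting the two generated fields together, the title entity might look roughly like the sketch below. The entity name "title" follows from the ${title.title_id} reference above, but the query and the table name are assumptions for illustration, not the project's actual config.

```xml
<!-- Sketch only: the query and table name are assumptions -->
<entity name="title"
        transformer="TemplateTransformer"
        query="SELECT * FROM titles">
  <!-- "column" names the Solr field to populate; the template builds its value -->
  <field column="doc_id" template="TITLE_${title.title_id}" />
  <!-- A template with no variables acts as a constant for every row -->
  <field column="doc_type" template="TITLE" />
</entity>
```

Note that the transformer has to be listed on the entity itself; without transformer="TemplateTransformer" the template attributes would simply be ignored.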

Adding Authors

With the foundation in place, adding Author Centric documents is just a matter of adding a new “top level” entity for them in our DIH config. For now I’m just using simple fields from the Author table that I can expand on later. I’m also reusing overlapping fields in the schema for properties of an author that are already in my title centric docs. This is a shortcut to save time now, but ultimately I’ll want to better distinguish these fields for two reasons:

  • Schema Properties: Even though it will probably make sense to use the same field type for these fields in the author docs and the title docs, in a good document model, properties like “multiValued” should be different between them — because a title can have multiple author names, but an author can only have one canonical name.
  • Field Stats: Things like the IDF of a field span the entire index. They don’t know about “doc_type”, so for fields like “author_canonical” they will get seriously out of whack the way I’m reusing them. (i.e.: not a lot of authors have “asimov” in their name, but lots of title documents do.)
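To make the shape of the new top-level entity concrete, here is a minimal sketch of what an author entity could look like next to the title entity. The "AUTHOR_" prefix, the "AUTHOR" doc_type value, the table name, and the selected columns are all assumptions on my part, chosen to mirror the title conventions; check the commits leading up to the blog_4 tag for the real thing.

```xml
<!-- Sketch only: prefix, doc_type value, table and column names are assumptions -->
<entity name="author"
        transformer="TemplateTransformer"
        query="SELECT author_id, author_canonical FROM authors">
  <!-- A different prefix keeps author IDs from colliding with title IDs -->
  <field column="doc_id" template="AUTHOR_${author.author_id}" />
  <field column="doc_type" template="AUTHOR" />
  <!-- Columns whose names already match schema fields need no explicit <field> -->
</entity>
```

Because this sketch reuses “author_canonical” as-is, it inherits both of the problems listed above: the field stays multiValued in the schema, and its term statistics are shared with the title documents.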

Conclusion (For Now)

And that wraps up this latest installment with the blog_4 tag. Now we can not only query for “books by people named asimov” but also “people named asimov”.

Sorry this was such a short post — check back at the end of this week, when I should hopefully have some more time to clean up the schema a bit, and improve on the modeling of the “Author Centric” Documents.

About Hoss
