Solr Powered ISFDB – Part #4: Multiple Doc Types

By on February 12, 2011

This is part #4 in a (never ending?) series of articles on indexing and searching the ISFDB data using Solr.

When we left off last time, I had a nice index of “Title Centric” documents: one document for each title in the ISFDB, with multi-valued fields containing the basic data about each author that worked on a title. This is the first week I didn’t have time to work on the project on Friday, so I squeezed in a bit of work today (Saturday) to get just the basics in place for Author Centric documents.

(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_3 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_4 tag containing the end result of this article.)

Why Not Use Multiple Indexes?

A quick digression: Today I’m going to start putting multiple different types of documents all in one index. You might ask (and rightly so) “Why not just use multiple indexes? Wouldn’t using multiple SolrCores make sense for this?” The answer is “it depends”. If I had completely different use cases for title searching and author searching, then it would certainly make sense to use different cores with different schemas. But ultimately I want to be able to support a single simple search box where you can just type in anything (a name, a title, a keyword) and get “good” results, with authors and titles intermixed in one set of results. At that point faceting can be used to say “no, I really just meant I was looking for titles…” or “actually I just want authors…”.
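As a rough illustration of the faceting idea, here is a minimal sketch of search handler defaults in solrconfig.xml. It assumes a field that records each document’s type (called doc_type here, the name introduced later in this article); the handler name and the decision to facet on every request are assumptions for illustration, not something taken from the project’s actual config.

```xml
<!-- solrconfig.xml (sketch): count matches per document type on every search.
     The handler name "/select" is an assumption. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- turn faceting on -->
    <str name="facet">true</str>
    <!-- facet on the field that holds the document type -->
    <str name="facet.field">doc_type</str>
  </lst>
</requestHandler>
```

With counts per type in the response, a client could narrow a mixed result set by adding a filter query on the same field to the next request (for example fq=doc_type:TITLE).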

Now that we’ve cleared that up…

Supporting Multiple Document Types

The first thing to do to support multiple types of documents in a single index is to make our uniqueKey something that can be distinct across all types of documents, so we don’t risk collisions. So I’ve added a “doc_id” field to my schema, and made it the uniqueKey field. To populate this for each of my documents, I’m using the “TemplateTransformer” to construct an artificial field out of the “title_id” field, with a “TITLE_” prefix put in front of each value to make them unique (so title_id #4321 and author_id #4321 don’t overwrite each other in the index).
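The schema side of that change might look roughly like the sketch below. The field name doc_id comes from this article; the attribute values, and the existence of a fieldType named “string” (as in the stock example schema), are assumptions.

```xml
<!-- schema.xml (sketch): one string key shared by every document type -->
<fields>
  <!-- holds generated values such as TITLE_4321; must be single-valued -->
  <field name="doc_id" type="string" indexed="true" stored="true" required="true" />
  <!-- ... the rest of the fields ... -->
</fields>

<!-- tell Solr which field identifies a document -->
<uniqueKey>doc_id</uniqueKey>
```

Keeping the key a plain string (rather than a numeric type) is what makes the prefix trick possible.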

This is the first time I’ve had to explicitly use DIH’s <field …> syntax to declare a field (because the field name isn’t exactly the same as the column name from the DB, and I need to generate it using a transformer) and it exposes an eccentricity about how DIH deals with fields: it refers to the field name you want to use in Solr as a “column”…

<field column="doc_id" template="TITLE_${title.title_id}" />

This thoroughly confused me when I looked at the examples; I really wanted to write something like this…

<field name="doc_id" template="TITLE_${title.title_id}" />

…but as you can see from other examples in the DIH docs, this is how the <field …> tag is used, even when the source of the data isn’t a DB. (It doesn’t make a lot of sense to me, but it is what it is.)

Now that we’ve got a new uniqueKey field with a generated value for all of our title docs, we’ll also want a “doc_type” field that keeps track of (surprise!) the type of document we’re dealing with. As the DIH FAQ mentions, this is trivial to do by abusing the template transformer…

<field column="doc_type" template="TITLE" />
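Putting the two generated fields together, the title entity might look something like this sketch. The two <field> lines are the ones shown above; the table name and SQL query are assumptions, and the dataSource declaration is left out. Note that the transformer has to be named on the entity for the template attributes to have any effect.

```xml
<!-- DIH config (sketch); the query and table name are assumptions -->
<document>
  <entity name="title"
          transformer="TemplateTransformer"
          query="SELECT title_id, title_title FROM titles">
    <!-- generated key, e.g. TITLE_4321 -->
    <field column="doc_id" template="TITLE_${title.title_id}" />
    <!-- constant value: a template with no variables in it -->
    <field column="doc_type" template="TITLE" />
  </entity>
</document>
```

Columns whose names already match a schema field (title_id, title_title) need no explicit <field> declaration.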

Adding Authors

With the foundation in place, adding Author Centric documents is just a matter of adding a new “top level” entity for them in our DIH config. For now I’m just using simple fields from the Author table that I can expand on later. I’m also reusing overlapping fields in the schema for properties of an author that are already in my title centric docs. This is a shortcut to save time now, but ultimately I’ll want to better distinguish these fields for two reasons:

  • Schema Properties: Even though it will probably make sense to use the same field type for these fields in the author docs and the title docs, in a good document model, properties like “multiValued” should be different between them — because a title can have multiple author names, but an author can only have one canonical name.
  • Field Stats: Things like the IDF of a field span the entire index; they don’t know about “doc_type”, so for fields like “author_canonical” they will get seriously out of whack the way I’m reusing them. (i.e., not a lot of authors have “asimov” in their name, but lots of title documents do.)
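A second top-level entity along the lines described above could be sketched as follows. The AUTHOR_ prefix and AUTHOR type value simply mirror the title entity and are assumptions, as are the table name and query; only author_id and author_canonical are names taken from this article.

```xml
<!-- DIH config (sketch): a second top-level entity for author docs -->
<document>
  <!-- ... the existing title entity goes here ... -->
  <entity name="author"
          transformer="TemplateTransformer"
          query="SELECT author_id, author_canonical FROM authors">
    <!-- a different prefix keeps author #4321 from colliding with title #4321 -->
    <field column="doc_id" template="AUTHOR_${author.author_id}" />
    <field column="doc_type" template="AUTHOR" />
  </entity>
</document>
```

Because author_canonical is reused as-is here, this sketch inherits both of the drawbacks listed above; giving author docs their own single-valued name field would be one way to address them later.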

Conclusion (For Now)

And that wraps up this latest installment with the blog_4 tag. Now we can query not only for “books by people named asimov” but also for “people named asimov”.

Sorry this was such a short post — check back at the end of this week, when I should hopefully have some more time to clean up the schema a bit, and improve on the modeling of the “Author Centric” Documents.

Posted in SearchHub, Technical Article with tags #ISFDB

Thanks for this series – this is a very helpful hands-on intro to Solr.

I noticed that when I run a query without a field name, on – battle station – there are 0 hits.

But I know that doc_id TITLE_2 has title_title = Battle Station.

When I run the query on – battle station – without a field name, the catchall field is queried since it is the default search field. Since catchall is type string, there is no match because of the case difference. Can you confirm that this is true?

Thanks again for this series.


Thanks for your post. I tried this approach on my project but I ran into a strange error. The error says ‘org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: uid=[A_2, 4291fc550b900a71]’. I checked the error source in UpdateHandler and I found:

if( id.length > 1 )
  throw new SolrException( SolrException.ErrorCode.BAD_REQUEST, "Document contains multiple values for uniqueKey field: " + idField.getName());

return idFieldType.storedToIndexed( id[0] );

It seems TemplateTransformer doesn’t concatenate A.aid with the uniqueKey uid, so the array length becomes 2 and the corresponding error is thrown. I don’t know why.


naska: you definitely can’t have a document with multiple values in the unique key, and template transformer definitely shouldn’t be creating multiple values.

it’s possible that a bug was introduced in template transformer since i wrote this blog, but it’s also possible you just have a mistake in your configuration that is resulting in multiple fields when you mean to create a single concatenated field. (it’s hard to tell since you only provided the error message you are getting, and not any details about what your DIH config looks like)

blog comments aren’t a very good medium for troubleshooting bugs — so i suggest you start a thread on the solr-user mailing list showing your DIH config, sample data, and the details of your error so we can see if we can get to the bottom of things.