Book Review: Solr 1.4 Enterprise Search Server (Packt)
Summary
Solr 1.4 Enterprise Search Server, by David Smiley and Eric Pugh, is a must-have for all newcomers to Solr. It provides a comprehensive overview of the features needed for practically every project, from indexing common data sources to searching with faceting, highlighting, and spell checking, and then covers deployment and performance topics. One tough issue this book faced, as practically all books about open source software do, was timing publication to coincide with the release of the software. Solr 1.4 took a while to release, but a publisher needs to go to print, so this book went to print before Solr 1.4 was really finished. The authors, however, are maintaining an addendum wiki page covering the bits of Solr 1.4 that didn’t make it into print.
I give this book 4.5 out of 5 stars. Read on for my thorough review, chapter by chapter.
Disclaimer/caveat: Eric Pugh is a good friend of mine, and we have worked together on some projects. And he says some really nice things about me in his introduction. He thanks me for introducing him to the world of open source, and since then he’s gone a long way with his immersion as an Apache committer and member. And he even helped with some database examples for the original Ant book. That being said, I’m still going to give the book a fair review below.
A word from our sponsor: The Packt Solr book is a great general introduction to Solr, and complements our Lucidworks for Solr Certified Distribution Reference Guide nicely. Our reference guide goes into the nuts and bolts that a general consumer book simply cannot. And the great news is, you get two for the price of one! Our reference guide is free.
Book Structure
It’s tricky to structure a book on Solr. Which comes first? Indexing or searching? Gotta have data indexed, but there is a lot of minutiae involved in setting up a schema, and the real goal of a project is to focus on the searching side of things. This book does a good job with the structure, first providing an overview, pulling in a rich public data set and making it searchable without going into the details. I give a thumbs-up to the Table of Contents structure, though I do think some topics deserve a chapter of their own, especially faceting, which in this book is simply a section within the “Enhanced Searching” chapter. The coverage of some experimental patches to Solr (field collapsing and LocalSolr) in the “Search Components” chapter was interesting, and risky: both are undergoing extensive reworking now and are very different from what was written in this book. At least it whets the reader’s appetite for these features, but readers will have to do a lot of homework and likely jump to trunk plus patches or Solr 1.5 to get them.
Preface
“However, as this book went to print prior to Solr 1.4’s release, two new features were not incorporated into the book: search result clustering and trie-range numeric fields”. And, umm, Solritas was overlooked too (how could you, Eric? Velocity is one of our mutual favorite projects!!). Again, there are a number of features not covered in the book, which are quite thoroughly covered in the addendum wiki, including my beloved Solritas.
Chapter 1: Quick Starting Solr
One of my pet peeves is when terminology isn’t quite accurate, and in many parts of this book the disjunction-max (aka dismax) query parsing capability is called a “handler”. It’s a query parser. Yes, in the example configuration that ships with Solr a handler named “dismax” is provided, but it really needs to be clear that it’s a query parser that can be used with any search handler. Maybe this is just me being picky because I’m too close to the code, and this is perhaps a minor nit to most readers, if that. Also, slow commits, “which could take between seconds and a minute”, really could take much longer with large indexes and warming queries (it depends, but I wanted to clarify that even a “minute” could be a vast underestimate).
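To make the parser-versus-handler distinction concrete, here is a minimal sketch that builds a request to an ordinary search handler and selects dismax through the defType parameter. The host, port, handler path, and field names (name, text) are assumptions for illustration only, not values taken from the book.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def dismax_url(base, user_query, qf):
    """Build a search URL that asks an ordinary handler to use the dismax parser."""
    params = {
        "q": user_query,
        "defType": "dismax",  # selects the query parser; the handler is unchanged
        "qf": qf,             # fields (with optional boosts) the parser searches
    }
    return base + "?" + urlencode(params)

# Hypothetical local Solr instance and field names.
url = dismax_url("http://localhost:8983/solr/select", "smashing pumpkins", "name^2 text")
print(url)
```

The handler path stays the same; only the parameters decide which parser interprets q.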
On page 22 there is a light bulb callout warning not to hit return in the search box or you’ll get an error. I tried this with both the simple and full interfaces in Solr 1.4 and didn’t get an error (using *:* + return in the query box). No big deal here, but the issue I have is when a book on open source code calls out an issue and says “perhaps this will be fixed at some point”. I know, I know… it’s very difficult to write a book, much less deal with all the bugs you encounter as you write. It is certainly easier to document a bug than fix it. But I expect a lot from authors of open source technologies. I can’t even count the number of oddities Steve and I fixed in Ant while writing the original version of “Java Development with Ant”. It’s tough, bouncing back and forth between fixing and writing, but it is the responsible thing to do. I’m pointing this out here, but there are a few other places in the book that point out issues that probably could just as easily have been fixed. [sorry guys, I’m a tough customer!]
Chapter 2: Schema and Text Analysis
Page 33 – Prefix, wildcard, and fuzzy queries, do not “require scanning for all of the indexed terms used in a field to see if they match the queried term” necessarily; it depends on the prefix of the query term used and Solr 1.4 even comes with a slick ReversedWildcardFilter to streamline leading wildcard queries.
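The idea behind reversing terms can be sketched in a few lines. This is a simplified model, not Solr's implementation: it assumes a sorted in-memory list stands in for the term dictionary, and shows how a leading-wildcard query turns into a prefix lookup that does not scan every term.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return terms starting with prefix, located by binary search."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches

# Made-up sample terms.
terms = ["running", "runner", "singing", "song", "ring"]

# Index each term reversed so a leading wildcard becomes a trailing one.
reversed_index = sorted(t[::-1] for t in terms)

def leading_wildcard(suffix):
    """Answer a query like *ing by prefix-searching the reversed terms."""
    hits = prefix_matches(reversed_index, suffix[::-1])
    return sorted(h[::-1] for h in hits)

print(leading_wildcard("ing"))  # ['ring', 'running', 'singing']
```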
And finally a compliment and very good point – p. 34, the first paragraph to the “Schema design” section says something important: “the queries you need to support completely drive the schema design”. Well said!
p. 42 – there’s a bit of misspeak about positionIncrementGap – if “A” and “B” are indexed into a multiValued field, a query of A AND B would still match. However, an exact phrase match of “A B” (with quotes) would not, provided a sufficient positionIncrementGap value.
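A toy model of token positions shows why the gap affects phrase queries but not boolean ones. This is a simplified sketch, assuming whitespace tokenization and exact adjacent-position phrase matching; the function names are made up for illustration.

```python
def index_positions(values, gap):
    """Assign token positions across the values of a multiValued field."""
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra distance inserted between consecutive values
        for token in value.split():
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions

def phrase_matches(positions, first, second):
    """True if `second` occurs at the position immediately after `first`."""
    return any(p + 1 in positions.get(second, []) for p in positions.get(first, []))

def and_matches(positions, first, second):
    """True if both terms occur anywhere in the field."""
    return first in positions and second in positions

no_gap = index_positions(["A", "B"], gap=0)
big_gap = index_positions(["A", "B"], gap=100)

print(and_matches(big_gap, "A", "B"))     # True: both terms are present
print(phrase_matches(no_gap, "A", "B"))   # True: positions 0 and 1 are adjacent
print(phrase_matches(big_gap, "A", "B"))  # False: positions 0 and 101 are not
```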
p. 48 has a call out effectively lumping analyzers, tokenizers, and (token) filters all together as “analyzers”. This is being a bit too high-level for my pedantic tastes. An analyzer encapsulates the complete process of tokenization. A tokenizer is responsible for the first step in breaking down text into tokens. Filters take the rest of the processing in a pipeline fashion.
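The three roles can be sketched as plain functions. This is an illustrative model only, assuming a whitespace tokenizer, a lowercase filter, and a tiny made-up stop word list; it is not Solr or Lucene code.

```python
def whitespace_tokenizer(text):
    """Tokenizer: the first step, breaking raw text into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """Filter: transforms each token it receives."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a", "of"})):
    """Filter: drops tokens found in a (hypothetical) stop word list."""
    return [t for t in tokens if t not in stop_words]

def make_analyzer(tokenizer, *filters):
    """Analyzer: one tokenizer followed by a pipeline of filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
print(analyzer("The Smashing Pumpkins"))  # ['smashing', 'pumpkins']
```

Order matters here: lowercasing runs before the stop filter, so “The” is normalized and then removed.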
p. 53 – typo, “WorkDelimiterFilterFactory” -> “WordDelimiterFilterFactory”
p. 57 – synonym expansion, “For a variety of reasons, it is usually better to do this at index-time.”. Really? I disagree, or at least, again, “it depends”. But query-time expansion is the most flexible, allowing changes to the synonym expansion to be made without having to reindex everything.
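The flexibility argument is easy to see in a sketch. Assuming a made-up synonym map and a simple boolean query string as output (an illustrative representation, not what Solr's synonym filter produces), changing the map changes the very next query with no reindexing step:

```python
def expand_query(terms, synonyms):
    """Expand each query term with its synonyms at query time."""
    clauses = []
    for term in terms:
        group = [term] + synonyms.get(term, [])
        if len(group) > 1:
            clauses.append("(" + " OR ".join(group) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)

# Hypothetical synonym map.
synonyms = {"tv": ["television"]}
print(expand_query(["tv", "stand"], synonyms))  # (tv OR television) AND stand

# Editing the map takes effect immediately; indexed documents are untouched.
synonyms["tv"].append("telly")
print(expand_query(["tv", "stand"], synonyms))  # (tv OR television OR telly) AND stand
```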
Chapter 3: Indexing Data
p. 65 – kinda oddly, Solr Flare is mentioned as a client API (for indexing?). Anyway, it’s just a glorious hack of a search UI, not connected to indexing at all. In fact, it allows one to point at (practically) any Solr instance and search it, and if your facet fields are named with the pattern *_facet they’ll be autodiscovered and displayed and usable for constraining.
Top of p. 68 typo “solr.body” should be “stream.body”. p. 69, the remote streaming call out – I feel like it deserves a bit of explanation why it isn’t (or rather shouldn’t be) enabled by default, as it is worth mentioning the security implications of being able to pass in a local file path or remote URL for Solr to fetch. And actually it is enabled by default in Solr 1.4 (careful out there!), which is mentioned later in the book I believe.
p. 87 – the example curl command to Solr Cell (aka ExtractingRequestHandler) uses outdated parameters not used in an official release. Please see Solr’s wiki for the actual details.
Chapter 4: Basic Searching
p. 98: echoParams is mentioned as “not particularly useful”, however it actually can be with a client that wants to work with all the parameters that may be hidden via a default server-side request handler mapping. I’ve used it effectively in some cases.
p. 99: *:* for matching all documents “definitely has its uses”. Like? Like for q.alt using the dismax parser, for example. More on this below.
p. 103: “before adding slop, you may want to gauge its impact on query performance”. I don’t think phrase slop is going to impact performance, and definitely not for the majority of Solr-based applications. So this may be a bit of a premature optimization. Use phrase slop when you need it. But of course, a good general recommendation is always to measure performance regardless of what tweaks you’re making.
p. 106: date math – this is where it is worth noting the Trie field types that didn’t make it into print. These field types are very important for speeding up numeric and date range queries.
p. 108: Filtering, “and leave the query string blank”?? The example doesn’t leave it blank, and it’s an error to actually do so. Use *:* in this case (see above and note where it is useful).
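A filter-only request along these lines can be sketched as follows. The host, port, handler path, and the field name in the filter (type) are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def filter_only_url(base, filters):
    """Match all documents with q=*:* and narrow them with fq filters."""
    params = [("q", "*:*")] + [("fq", f) for f in filters]
    return base + "?" + urlencode(params)

# Hypothetical local Solr instance and filter field.
url = filter_only_url("http://localhost:8983/solr/select", ["type:Artist"])
print(url)
```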
p. 112: Scoring – “you will usually find that scores in the vicinity of 0.5 or better are decent matches” – this is entirely dependent on the data and queries. Scores can be all over the place, and their actual value isn’t necessarily useful on its own, but relative to other matching documents.
Chapter 5: Enhanced Searching
I feel at least faceting should have been a separate chapter, as it is one of the most prominent features of Solr.
The callout saying “function queries” are a poor name – seems unnecessary to state, and isn’t really misnamed is it? It’s a query that applies a function to score all documents.
“Dismax Solr request handler” section – again, misnamed. It’s a dismax query parser, not handler per se.
p. 132: “But remember, there is no query syntax to invoke Lucene’s DisjunctionMaxQuery…”. Oh but there is!
p. 133: “The [pf] syntax is identical to” qf (not “bf” as the book states). Likewise it’s a typo in the “reasons to vary” qf (not “bf”). And again in the callout on pf Tips on the following page (134)… should be qf not bf. Yeah, tricky stuff these parameter names for dismax, alas.
p. 148: Alphabetic range bucketing – I really like the trick they detail, of using PatternTokenizer to extract the first letter and then a SynonymFilter to bucket them into groups like A-C, D-F. There’s lots of cool little tricks like this that one can play, so this example will surely spark some imaginative uses. Nice!
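The bucketing logic itself is simple to sketch. The book does this inside the schema with PatternTokenizer and a SynonymFilter; the Python below only mirrors the two steps (extract the first letter, map it to a range) and uses made-up function names and a bucket size of three.

```python
import re
import string

def build_buckets(size=3):
    """Map each uppercase letter to a range label such as 'A-C'."""
    letters = string.ascii_uppercase
    buckets = {}
    for i in range(0, len(letters), size):
        chunk = letters[i:i + size]
        label = chunk[0] + "-" + chunk[-1] if len(chunk) > 1 else chunk
        for letter in chunk:
            buckets[letter] = label
    return buckets

def bucket_for(name, buckets):
    """Extract the first word character and look up its bucket."""
    match = re.match(r"\W*(\w)", name)
    if not match:
        return None
    return buckets.get(match.group(1).upper(), "other")

buckets = build_buckets()
print(bucket_for("Beck", buckets))   # A-C
print(bucket_for("elvis", buckets))  # D-F
```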
Chapter 6: Search Components
p. 167: Another great call out, this time about the Query Elevation component. It isn’t a general approach to fix queries that aren’t working as well as you’d like, relevancy-wise. It is for editorial boosting (or result removal).
p. 173: spell checking note, it is mentioned to take care to build the dictionary whenever a core gets loaded. But one of the better choices is to set spell check dictionary rebuild upon a commit, so it stays current with new documents. Later, p. 176, it is mentioned to rebuild on optimize. Not bad advice, but it really all depends on how an application is using commits and optimizes. Many projects eschew optimization altogether and simply set a low merge factor, and in those cases I’d recommend building the spell check dictionary on commit instead.
p. 180 – “it is possible for a client to do this [requery for a spelling suggestion when the results are zero] … but it would be much slower”. Much slower? Nah. It’s quite reasonable for clients to do these sorts of things based on business logic of the results returned. A query to Solr often responds within tens of milliseconds. Not much slower at all.
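Client-side requerying is only a few lines of business logic. In this sketch the search argument is a caller-supplied function (an assumption, standing in for whatever Solr client you use) that returns a list of documents plus an optional spelling suggestion.

```python
def search_with_fallback(search, query):
    """Run the query; if nothing matches, retry once with the suggestion.

    Returns (docs, suggestion_used), where suggestion_used is None
    when the original query was kept.
    """
    docs, suggestion = search(query)
    if not docs and suggestion and suggestion != query:
        docs, _ = search(suggestion)
        return docs, suggestion
    return docs, None

# A stand-in search function with canned results, for illustration only.
def fake_search(query):
    if query == "pumpkins":
        return ["doc1", "doc2"], None
    if query == "pumkins":
        return [], "pumpkins"
    return [], None

print(search_with_fallback(fake_search, "pumkins"))  # (['doc1', 'doc2'], 'pumpkins')
```

The cost is one extra round trip, and only for queries that returned nothing.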
p. 185 – and sometimes reading a book on a topic you know pretty well you even learn something new! I wasn’t aware of the mlt.qf parameter. Cool!
p. 189 – stats component, “computes… statistics of specified numeric fields in the result set” (not across the entire index). And the stats component bug call out: it was fixed in Solr 1.4.
Field Collapsing and LocalSolr – these sections are useful to point out the features coming to Solr, but the technical details are no longer applicable. These valuable features both should make it into Solr 1.5, though.
Chapter 7: Deployment
p. 200: typo on the “[solr.]solr.home” property in the text. Example code gets it right though.
p. 205: Logging JARmageddon – I just wanted to mention my objection to Solr’s logging switch away from JDK logging. What a headache logging is, and the pains one must go through to switch it out, even with SLF4J. ugh, my apologies, but I did state my objections.
p. 210: The “good use” of RELOAD doesn’t seem so good to me. If you need to have different configurations between indexing and searching, set up two servers and use replication.
p. 215: The benefits of JMX are a bit exaggerated. You can get access to pretty much all of Solr’s internal data (number of documents as this section mentions, cache stats, etc, etc) through Solr’s request handlers and stats.jsp (which outputs XML). But, yes, the JMX stuff is still useful in environments that have monitoring solutions already set up for it. If you’re rolling your own monitoring, though, maybe just grab the stats more directly.
Chapter 8: Integrating Solr
p. 228: Maven note – NOOO! Solr will move to a Maven-based build over my dead body!
p. 236: EmbeddedSolrServer – Streaming locally available content into Solr is not, IMO, an attractive use of embedded Solr. It misses being able to run an indexer on a separate machine and multiprocess/thread easily. I’ll grant that both the rich client application and upgrading from a pure Lucene-based solution make sense for embedded, but these are pretty exceptional use cases. And the advantages to using Solr over HTTP are pretty extensive (scaling, separation, replication, distributed search). Also, SolrMarc is not a good example, IMO, of using embedded Solr server. See my recent foray back into this world.
p. 245: SolrJS – another case of publishing a book too early. It’s been forked, improved and renamed to AJAX Solr and removed from Solr’s core. This has been noted in the addendum wiki.
p. 247: I’ve seen cases where PHP apps do best using the JSON response format. So leading readers down the path of the php or phps response formats may be unnecessary. Of course there are a number of PHP libraries that may be the best route to go and you can simply not worry about what response format is going on behind the scenes.
p. 253: It’s nice to see mention of Ruby on Rails, though the coverage is a bit spotty and confusing. Solr Flare and RSolr are mentioned in the same sentence, but are different beasts. Solr Flare is a hackish (critique of my own creation) RoR plugin for viewing search results. RSolr is a new and improved API, compared to solr-ruby, for accessing Solr from Ruby code. [Flare currently uses solr-ruby] It’s great to see mention of Blacklight! All of these technologies are very close to home for me. I really enjoyed the section showing the MusicBrainz Ruby indexer and Blacklight integration – sweet. This shows that Blacklight is not just for libraries, but is a good start for a general purpose Ruby on Rails front-end to any Solr index.
Chapter 9: Scaling Solr
Quite a nice chapter! No feedback other than kudos for laying out clearly how Solr addresses scalability, along with various tips and tricks for testing and monitoring a system to know when and how to tweak configurations and system architecture for scaling up.
Grand Finale
I spelled out a lot of fiddly feedback above, and I expect the great addendum wiki page will factor in any keepers from this review. Of course, most of the review points out mistakes or differences of opinion (that’s what a review is for), but this is a solid, useful book. So, if you’re considering using Solr, this book is for you. If you’re already using Solr, you’ll likely pick up a useful trick or three. Go get it! (and our complementary Lucidworks for Solr Certified Distribution Reference Guide)