During a past ecommerce webinar with Brian Doll of Sheetmusicplus.com, I posted a checklist of items that commonly occur in ecommerce applications. Due to time constraints, I waved my hands, said that Solr (and now Lucidworks) can do almost all of them out of the box, and left the rest as an exercise for the reader. Well, now I have some time, so let me fill in the blanks with some more concrete examples of how to do this.
For this example, I am using real estate data freely available from the NYC government. I am interested in this data because:
- It has product-like data in it: name, description, a bunch of metadata, and price.
- It's mostly real (I embellished it with descriptions and a few other pieces, and filled in some missing data; see the Indexer class in the source code). In fact, it's so real that when setting up the app, one quickly sees how noisy the data is in terms of things like missing values. For instance, 1804 records don't have the year built specified.
I have set up a Solr schema for this data as well as some tools for indexing it. To run the demo, you will need:
Once you have the prerequisites in place, take the following steps:
- Unzip the ecommerce.zip file into the directory of your choice
- cd lucid_ecom
- In a separate terminal window: cd solr
- java -jar start.jar (just as if you were running the Solr tutorial. Note, I am running a relatively recent version of the Solr 3.x branch)
- Point your web browser at http://localhost:8983/solr/nyc and take a moment to familiarize yourself with the interface.
(I used the VelocityResponseWriter built into Solr. It's nice for prototyping, but it "ain't" for production use.)
A pre-built index is included in the zip file, but if you wish to build the index yourself, run:
- ant delete-all (deletes the existing content)
- ant index
With the working application in place, let’s take a look at how to implement the various checklist items.
Implementing the Checklist
I’ve broken out each checklist item below and will cover each of them in more detail in the following subsections.
There really isn't much to be said here other than that Solr has built-in support for querying in all the "usual" ways one would expect of a search engine: keywords, phrases, wildcards, fielded search, and much, much more. For example, try:
- http://localhost:8983/solr/nyc?q=tottenville or just type tottenville in the search box.
- http://localhost:8983/solr/nyc?q=5+bedrooms+%22Staten+Island%22 (5 bedrooms “Staten Island”)
- http://localhost:8983/solr/nyc?q=5+bedrooms+borough_display%3ABro* (5 bedrooms borough_display:Bro* — Should match all 5 bedrooms in either the Bronx or Brooklyn)
Take some time and try out your own queries. Our example uses the Extended DisMax query parser, in case you want to learn more about how it works.
High-Quality Relevance (Precision @ <10)
In many search applications, and ecommerce is no exception, users often abandon a search when the first page of results (often the top 10) is not relevant to their query. Thus, it is important that a search engine return good results on the first page. While some guidance (more on this in the coming sections) can help alleviate the abandonment problem, a strong first showing is often the quickest way to more clickthroughs. Since Solr utilizes Lucene, which implements an industry-standard vector space approach to search, results are often quite good out of the box. Nevertheless, many ecommerce applications may need one or more of the tools that Solr/Lucene provides out of the box to tweak relevance, such as:
- Document, field, token boosting (i.e. matches in the title field are more important than matches in the description.)
- Query term boosting (provide weights for different terms, such as synonyms.)
- Disjunction Maximum Query scoring (aka the “dismax” parser or the extended dismax parser) for dealing with cross field matches.
- Automatic phrase generation from multiword queries even when the user did not explicitly quote the keywords.
- The ability to override low-level scoring information such as term frequency, document frequency, document length normalization and coordination factors.
- Function queries (more later) to allow values in fields (such as price) to be factors in scoring.
- Editorial Boosting/Sponsored Results (in Solr-speak it’s called the QueryElevationComponent — more later) to place specific results at the top.
Relevance tuning is a complex subject and one that is best viewed in the light of your data. In summary, make sure you are making decisions about relevance based on the big picture, and try to avoid local minima (i.e. tuning a specific query at the cost of breaking lots of other queries). In other words, make sure your top money-making queries aren't affected by your "fixing" of one or two bad queries. To learn more, see my articles on Improving Findability and Debugging Relevance. With the basics out of the way, it's time to take a look at faceting and discovery tools.
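Several of the relevance knobs above live in the request handler configuration. As a sketch only (the field names and boost weights here are hypothetical, not the demo's actual settings), an Extended DisMax handler with field and phrase boosts might look like:

```xml
<!-- Illustrative only: field names and boost weights are hypothetical -->
<requestHandler name="/nyc" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- matches in name count more than matches in description -->
    <str name="qf">name^5 description^2</str>
    <!-- implicit phrase matches against description get an extra boost -->
    <str name="pf">description^3</str>
  </lst>
</requestHandler>
```

Boost functions ("bf") and boost queries ("bq") slot into the same defaults list when you need them.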
One of Solr's most appealing features is its out-of-the-box support for faceting (sometimes called navigators, parametric search, or guided navigation) in a number of different ways (see Faceted-Search-Solr for a primer, as well as http://wiki.apache.org/solr/SimpleFacetParameters). In the example application, the left-hand nav area shows facets for things like borough (field-based faceting), sale price (numeric range faceting), sale date (date range faceting), and pet friendly (facet by query). Solr also supports "multi-select" faceting. And while there isn't support for true hierarchical faceting in Solr yet, there are ways to achieve it through intelligent modeling of your tokens. Last, but not least, you may find https://issues.apache.org/jira/browse/SOLR-792 useful for doing grouped faceting (color: red, size: large).
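To make the request side concrete, here is a small Python sketch that builds a faceted query URL along the lines of what the demo issues. Other than borough_display (which appears in the queries above), the field names (sale_price, pet_friendly) are guesses for illustration; check the demo's schema.xml for the real ones.

```python
# Build a Solr request URL demonstrating field, range, and query faceting.
# Only borough_display is known from the demo; the other field names are
# hypothetical placeholders.
from urllib.parse import urlencode

def facet_url(base="http://localhost:8983/solr/nyc", q="*:*"):
    params = [
        ("q", q),
        ("facet", "true"),
        ("facet.field", "borough_display"),      # field-based faceting
        ("facet.range", "sale_price"),           # numeric range faceting
        ("f.sale_price.facet.range.start", "0"),
        ("f.sale_price.facet.range.end", "1000000"),
        ("f.sale_price.facet.range.gap", "100000"),
        ("facet.query", "pet_friendly:true"),    # facet by query
    ]
    return base + "?" + urlencode(params)

print(facet_url())
```

Paste the resulting URL into a browser (with the demo running) to see the facet counts come back alongside the results.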
Additionally, helping customers discover items of interest goes well beyond facets. Features like Did You Mean, Related Items/Searches, collaborative filtering/recommenders (see Mahout for an open source solution), auto-suggest, and others can go a long way toward increasing the user's ability to purchase items from your store. I'll cover many of these features below.
Flexible language analysis tools
Solr contains support for most of the commonly spoken languages in the world, including English, Chinese, French, Spanish, Korean, German, Thai, and many more. Lucene and Solr are also Unicode compliant.
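For example, a stemming analysis chain for German can be declared in schema.xml along these lines (a sketch; the fieldType name is arbitrary, and the filters shown are just one reasonable combination):

```xml
<!-- Sketch: a German analysis chain; the fieldType name is arbitrary -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```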
Frequent Incremental Updates
Lucene, and thus Solr, has supported incremental updates since its inception, without the need to re-index the whole collection. It is also very fast at making new documents available for search. Additionally, with the combination of recent and upcoming work in Lucene, real-time search should be available soon. The one piece that is still missing is individual field update, but for certain types of fields (ratings, for instance), there may be easy workarounds.
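For instance, a single new listing can be added (and made searchable) without touching the rest of the index by POSTing a small update message to Solr's /update handler, followed by a commit. The field values below are made up for illustration:

```xml
<!-- POST to Solr's /update handler; field names and values are made up -->
<add>
  <doc>
    <field name="id">listing-00001</field>
    <field name="name">Sunny two-bedroom walk-up</field>
  </doc>
</add>
<!-- then POST <commit/> to make the new document visible to searches -->
```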
Ratings and Reviews
In working with many ecommerce customers on Solr, I usually hear questions about how to incorporate ratings and reviews into search results without skewing results or introducing too much noise. On the ratings side, app developers often want to incorporate the aggregate rating of an item as a boost factor in the overall score. I will discuss how to do this in detail in the section titled Editorial Relevance Controls below. Meanwhile, on the review side, including reviews "on par" with matches in the product title or description often introduces too much noise. For instance, if I'm selling "Widget X" and a review for a different product says something like "You should also check out Widget X", bringing back a match on that second product really isn't all that useful for a customer searching for "Widget X". To deal with this noise, people often take a couple of different approaches:
- They weight review matches lower than product matches via boosting (either at query time or indexing time)
- They only search reviews if they don’t feel they have high quality matches for the main product search
You could also do some type of post-processing analysis (NLP) of the review to see if it is on topic, but this approach likely isn't viable for most people in most situations, given the processing power required and the accuracy of such a solution. As for the second approach above, see my post on Fake and Invisible Queries for more insight.
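The first approach can be as simple as weighting a reviews field far below the product fields in the edismax "qf" parameter. The field names and weights here are hypothetical:

```xml
<!-- Hypothetical field names/weights: reviews still match, but count for little -->
<str name="qf">name^5 description^2 reviews^0.1</str>
```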
Auto-Suggest
Auto-suggest (aka auto-complete) is one of the cheapest (in terms of development cost) mechanisms available for enhancing the chance that users find what they are looking for. I've heard of vendors adding auto-suggest and having it add millions to their bottom line. Simply by providing a drop-down list of ways to complete what a user has typed so far, an application can do a number of things:
- Reduce spelling errors thus leading to lower frustration and better results sooner rather than later
- Seed the user with items that they may want but weren’t explicitly looking for. After all, an intelligent auto-suggest box can very easily not only give completions, but it can also hook in related items too.
- Short-circuit search altogether and go directly to a landing page for a specific search
For the demo, I implemented auto-suggest using SOLR-1316, which should be committed to trunk soon. Note that there are other ways of doing auto-suggest, too, including using the TermsComponent and faceting. Here are the steps I went through to make auto-suggest work:
- Applied the SOLR-1316 patch to the 3.x branch. This required a minor tweak to the HighFreqDictionary.java file. See patch below
- Add the necessary piece to the solrconfig.xml. See the /autosuggest SearchComponent in the solrconfig.xml in the appendix.
- Decide what fields to use in building the auto-suggest index (see schema.xml). I then copy-fielded these into a field named suggest. Note that I used a non-stemming analyzer, along with Solr's word-based n-gram (shingle) filter with a maximum shingle size of 5, so as to give phrase suggestions too. This is intended for demonstration purposes: you may not wish to use shingles, instead appending terms as the user types, or you may want a different value for n. Also note that I did not spend much time evaluating what went into the suggest field; you will want to validate it and make sure it is aligned with your business goals.
- Build the auto-suggest data structures via the Spell Checker build command (see the next section)
Hopefully, from here you will have enough information to build out your auto-suggest capabilities. If not, see our search site for more info, including alternate approaches to the SOLR-1316 patch.
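As a sketch of the schema side (the field and type names here are illustrative rather than copied from the demo's schema.xml, and the copyField sources are assumed), the suggest field and its shingling analyzer might look like:

```xml
<!-- Illustrative suggest field: non-stemming analysis plus 5-token shingles -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- word-based n-grams up to 5 tokens, so whole phrases can be suggested -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<field name="suggest" type="text_suggest" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="suggest"/>
<copyField source="description" dest="suggest"/>
```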
Did You Mean?
Just like auto-suggest, spell checking can be helpful to users in finding what they are looking for, especially given the propensity of manufacturers/product designers to use incorrectly spelled words in product names in order to better "brand" the product. Good spell checking goes beyond merely hooking up a dictionary of terms; it is also quite important to know when to suggest a term and when not to. Lucene/Solr covers the basics of spell checking via the SpellCheckComponent, but a good spell checking application will need to go beyond merely setting up the component in order to achieve good results. First things first, however: let's take a look at getting spell checking set up, and then we can examine what is needed to make it better.
First, we need to configure the SpellCheckComponent in the solrconfig.xml file. There is an example of this in the Solr tutorial, from which I changed the distance measure from the Levenshtein edit distance to the Jaro-Winkler distance. I did this based on past experience that users tend to misspell words towards the end of the word, not the beginning, which the Jaro-Winkler distance accounts for. My configuration looks like:
```xml
<searchComponent name="spellcheck">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  </lst>
  <!-- ... -->
</searchComponent>
```
The whole point of a SearchComponent such as the SpellCheckComponent is to hook it into the main Solr request processing instead of having to make a separate call. Thus, I hooked the SpellCheckComponent into the /nyc RequestHandler so that all queries submitted to the "main" RequestHandler are also spell checked. Once the configuration is set up, the spelling index must be built (and maintained). This is handled by issuing a spellcheck.build=true command to the spell checker, as in: http://localhost:8983/solr/nyc?q=anything&spellcheck.build=true
(Note, the &q param can be anything.)
Once the configuration is hooked up and the spell checking data structure is built, the last piece is to hook it into the UI. (Note, I set up the solrconfig.xml to automatically do spell checking on every query request.) To hook into the UI, I co-opted the suggest.vm file and spruced it up a bit to provide links, etc. Other than that, it is exactly the same as auto-suggest, since both are just different implementations of spell checking.
See the Solr wiki on the SpellCheckComponent for more information.
Related Items/Searches
In many ecommerce applications, stores position related items next to a particular item so as to inspire the user to either buy an additional item or consider an alternative. Naturally, the "relation" is determined by the store and might take on a variety of forms, such as accessories, enhanced versions, cheaper versions, alternatives from different manufacturers, or items popular with other users. Similarly, a store may wish to give users not only suggestions and spelling corrections, but also alternative search terms or other popular searches. For instance, if a user searches for TVs, a store may want to suggest they search for "LCD TVs" or "HD TVs", etc.
When it comes to related items, many Solr users hand-craft a second query (given an original query and a particular item) using the original query terms and some of the terms that describe the item. For instance, an application might combine the category of the item with some of its keywords, submit the resulting query to Solr, and display the first few results. This can also be done automatically using Solr's built-in More Like This (MLT) capability, though you may need to do some tuning to get the results you desire. For the sake of the example, I incorporated MLT into the application. You can see it on the left-hand side, just below the map, under the "Similar Properties" heading. The configuration of MLT was done in the solrconfig.xml file as part of the /nyc RequestHandler. Note that in a typical application you may not wish to generate MLT results for every search query, but instead provide them only once a user chooses a particular document, as MLT can add a fair amount of overhead to the process.
Other Solr applications will often calculate related items offline or through some type of collaborative filtering approach (see Apache Mahout's recommender capability for an open source library to do this) and either add the information to the document and re-index, or integrate it at the application level. In these cases, it's not hard to integrate, but it is beyond the scope of this article.
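The MLT configuration amounts to a handful of parameters in the handler's defaults. This fragment is a sketch only; the field list and thresholds are illustrative, not the demo's exact values:

```xml
<!-- Sketch: MLT defaults inside the /nyc RequestHandler (values illustrative) -->
<str name="mlt">true</str>
<str name="mlt.fl">name,description</str>   <!-- fields to compute similarity on -->
<int name="mlt.count">5</int>               <!-- similar docs to return per result -->
<int name="mlt.mintf">1</int>               <!-- min term frequency in the source doc -->
<int name="mlt.mindf">2</int>               <!-- min document frequency across the index -->
```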
As for the functionality to add related searches, there is not currently support built into Solr, but there is a JIRA issue open to track the idea. Related searches can often be determined through a combination of log analysis (look for patterns in a user session) and synonyms or via collaborative filtering/recommenders. Also, have a look at Mahout’s Frequent Pattern Mining capabilities. One could also index the queries into another index (Solr core) and simply issue fuzzy queries to it.
Editorial Relevance Controls
- The /nyc RequestHandler has the QueryElevationComponent hooked in, keyed off the elevate.xml file. In that file, I mapped the query "3 bedroom Brooklyn" to rank one specific document higher and exclude another. See http://localhost:8983/solr/admin/file/?file=elevate.xml for the mapping. To see the results without elevation, add &enableElevation=false to the query, as in: http://localhost:8983/solr/nyc?q=3+bedroom+Brooklyn&enableElevation=false
- I set up "phrase boosting" to generate phrases against the description field. See the /nyc RequestHandler (it's the "pf" setting in the solrconfig.xml).
- I added a “boost function” to rank documents higher based on the commission paid for selling the property (note, I randomly assigned a value to this field for pedagogical reasons). See the “bf” setting in the /nyc RequestHandler.
- Also, don't forget creative domain modeling: for instance, if you want to support landing pages and banners, why not create them as documents in your index (assign a type to them) and make sure they are at the top of the results? (Other possibilities include doing two queries: one for landing pages first, and then one for the results.)
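For reference, an elevate.xml mapping like the one described in the first bullet has this shape; the document ids below are placeholders, not the demo's actual ids:

```xml
<elevate>
  <query text="3 bedroom Brooklyn">
    <doc id="PLACEHOLDER-ID-1"/>                 <!-- forced to the top -->
    <doc id="PLACEHOLDER-ID-2" exclude="true"/>  <!-- removed from the results -->
  </query>
</elevate>
```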
Administration
Administration means many things to many people. To the IT department, it means easy setup, configuration, monitoring, maintenance, scalability, fault tolerance, etc., while to the business user it means tools for manipulating results, reporting search statistics, and following through on business goals. While the latter is important, I am going to focus on the IT department's needs for the sake of this article. Solr is very easy for an IT person to get set up with a baseline configuration in place. I've seen customers (without my help) be up and running and searching their data in non-trivial ways in as little as 30 minutes, sometimes less. As for monitoring, Solr comes with web pages that report status, as well as JMX integration. I've also seen Solr integrated nicely with Nagios, Cacti, and other tools. Lucid Imagination also partners with New Relic to offer Solr-specific monitoring tools.
As for the big questions about scalability and fault tolerance, the answer is an unequivocal yes. High-traffic ecommerce sites like Zappos, Netflix, CNET, AOL, and many others use Solr to serve their search needs. Solr can be set up to handle both large indexes and high query volumes. For more information on how to do this, see Mark Miller's excellent article on scaling Solr.
Recommendations (See Mahout)
For both online and offline recommendation calculations, see the Apache Mahout project, which has an excellent collaborative filtering library. While integration with Solr does not yet exist, Mahout does expose web services (as well as Java APIs) for its recommender engine, so it is feasible to integrate it within an application.
Analytics and other Business Tools
Analytics is probably Solr's weakest area, but that being said, we find that many customers already have platforms in place (like Omniture) into which they can easily integrate Solr. This often saves business users from having to learn yet another tool. As for other business tools (merchandising tools, for instance), Solr likely does not have them, but again, many people find it straightforward to integrate Solr into existing tools. This is also an area where Lucidworks, with its administrative UI, really can help: it has screens and tools for log analysis, showing popular queries, popular terms, and zero-result searches.
Solr is a very popular and capable search engine for ecommerce, and looking forward, it is only getting better. With new features (spatial search, for instance), the latest Lucene, and easier scalability, the next version of Solr promises to be even better.
Appendix items: schema.xml, solrconfig.xml, SOLR-1316 3.x branch patch