To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.

Introduction

Chances are, if you’re like me, you didn’t grow up dreaming of better ways to find text and data on a website or a hard drive. Heck, you probably didn’t even think about it once you were enrolled in college, even if you were a Computer Science student. Truth is, you probably are working on a project that requires you to search your content and now you’re wondering how to do just that. Or, perhaps, you already have search working, but your tests and/or your programming instinct tells you it could be better. Even worse, maybe your boss/QA dept./CEO/”Best Customer” is telling you it could be better. Thus, you have a findability problem and you aren’t sure what to do next. After all, a search library is supposed to just work, right?

Take for example a recent client of Lucid’s. They are an instantly recognizable household name using Lucene to power their online store. Their store serves millions of requests per day, and they have pretty sophisticated analytics to track the conversion of searches into purchases. Unfortunately, one of their top-selling products, let’s call it “widget X”, had a findability problem. When users typed “widget X” into the search box, all kinds of things related to widget X showed up, but widget X itself didn’t appear in the results until page 12. Needless to say, this was costing them a good chunk of money because widget X is a best seller via other distribution channels. After some analysis and working through some of the tips in my article on improving relevance, we discovered that one of the main fields being searched was empty for widget X. After tracking the issue back to their data entry system, a fix was tendered and rolled out with their next site update. Problem solved.

While most tools, like databases, will claim they do a good job finding your unstructured content, the truth is most of them take a one-size-fits-all approach to the problem, and your results suffer. In fact, even though a search library like Lucene does a good job out of the box, there are many things you, as a developer and Subject Matter Expert (SME), can do to make it even better.

Planning for Findability

First and foremost, consider yourself lucky if you are starting a new project instead of trying to fix an existing project. Good requirements gathering, design and specification never go out of style, so taking the time to plan for how to find your content will undoubtedly help you succeed. Of course, all is not lost if you aren’t starting fresh, as most of the techniques I describe will work fine; it’s just that they may require a bit more effort.

Knowing your Content

Before we begin thinking about specific techniques, I want you to think about the ideal search engine. Of course, it really shouldn’t be called a search engine, right? After all, it’s a found engine. In other words, you type in (or speak, or, in the future, think of) some description and this magical engine instantly finds the one exact thing you are looking for. That thing may be one word, sentence, paragraph, document or a whole set of documents. The key to the engine is the fact that every single item in the results is relevant to the search and no relevant documents were overlooked. Furthermore, with this engine, you could seamlessly search across all kinds of content with nary a thought of its structure or lack thereof. The engine would happily crunch away at your content without a peep, silently building data structures to match every search need from every user.

Note

In Information Retrieval (IR), my magical engine would be said to have both perfect precision and perfect recall.

Pretty cool engine. If only it existed. The fact of the matter is, no engine exists that can know all the ins and outs of your data. Moreover, in all but the most trivial of applications, not even you can know and synthesize your data so as to make it findable for your users. Realizing this, it is imperative to come up with a plan for understanding as much as you can about your content in as short a time frame as possible.

When I’m starting off a new application with new content, I work through the collection methodically from the top down, as described in the following sections. Keep in mind throughout, though, that the process is one of diminishing returns. You will learn a lot quickly, but then it will taper off and you should get on with the rest of your application development. Additionally, always keep in mind the users, which I’ll talk about later in the article, and your system goals.

Collection Level Knowledge

At the collection level, I try to gather aggregate information about the content. At this level, I’m interested in:

  • The location of the content. Is it on the web, a filesystem, a database or something else?
  • How many raw items (i.e. files, rows, etc.) are in the collection. I say “raw”, because it isn’t necessarily the case that there is a one-to-one match between the item and what gets put in the search engine.
  • If appropriate, the MIME types present. Common MIME types are Microsoft Office (TM), Adobe PDF (TM) files, HTML, XML, etc. If the files are XML or some other markup language, is there a DTD, XML Schema or the equivalent that describes the file structure? If you are not sure, Tika can be used to determine the MIME type (see the short sketch after this list).
  • The language(s) present in the content. Search the web for language identification software if you don’t know what languages are present.
  • If appropriate, the character encoding of the content. Tools exist that can be run on a file to determine this by sampling the front of the file.
  • The update frequency of the collection. Tradeoffs may need to be made if improvements require more processing time than your system can handle while still keeping up with newly arriving documents.
  • The response time needed. Again, tradeoffs may need to be made if advanced techniques require too much time.
  • Useful aggregate statistics like: average document length, total number of terms, most frequently occurring terms (and phrases).
  • Related resources like dictionaries, thesauri and the like.
  • Do the documents have any logical relationships between them? For instance, Internet search engines often benefit from link structure that relates one document to another.
  • As a whole, can I assign a few keywords to the whole collection that would easily let others know what’s in the collection? For example, a website on hockey players could be: sports, hockey, players, teams. If I have time, I may use a tool like Apache Mahout to cluster the collection to find related content.
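As mentioned in the MIME-type item above, Tika can do the detection for you. Below is a minimal sketch of surveying a directory of content; the directory path, class name and output format are all made up for illustration, and it assumes a reasonably recent Tika release on the classpath.

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class MimeTypeSurvey {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        File dir = new File(args.length > 0 ? args[0] : "content");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (File file : files) {
            if (file.isFile()) {
                // Detection uses the file name plus a sniff of the leading bytes.
                System.out.println(tika.detect(file) + "\t" + file.getName());
            }
        }
    }
}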

For collecting the statistics and exploring the collection as a whole, it is often possible to do a first iteration over the content without putting too much effort into overcoming errors and obstacles. Even if I’m using Lucene directly, I’ll often send the documents into Solr with a very simple schema that does basic analysis of the text and throws everything into a few fields, with a simple unique id based on the file name or the primary key. If the documents are in a common file format like PDF or Word, I’ll first run them through Tika and extract the text and metadata for indexing. During this process, I watch for files that fail to index and set them aside for closer examination. Once I have this index, I’ll open it up in Luke and explore at the document level, as I describe in the next section.
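Sketched below is roughly what such a crude first pass might look like using SolrJ and Tika. It is only an outline under several assumptions: a Solr core reachable at a hypothetical URL whose schema has id, filename and text fields, Tika’s parseToString() convenience method for extraction, and a recent SolrJ release (older releases named the client class differently). Failed files are simply logged so they can be examined later.

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class FirstPassIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/explore").build();
        Tika tika = new Tika();
        File dir = new File(args.length > 0 ? args[0] : "content");
        for (File file : dir.listFiles()) {
            try {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getName());              // crude unique id: the file name
                doc.addField("filename", file.getAbsolutePath());
                doc.addField("text", tika.parseToString(file));  // extracted body text
                solr.add(doc);
            } catch (Exception e) {
                // Set failures aside for closer examination later.
                System.err.println("FAILED: " + file + " (" + e.getMessage() + ")");
            }
        }
        solr.commit();
        solr.close();
    }
}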

Document Level Knowledge

After assessing the collection as a whole, I usually sample a small percentage of the documents for closer scrutiny. Don’t fall into the trap of just looking at the first few files that are listed in your directory view, as they may not be a good cross-section of the collection. As for picking the number of files to review, I don’t have a hard and fast rule, but it probably averages out to be somewhere around 30 to 50 documents. Within each document, I often look for:

  • Structure like titles, headers, prices, summaries, tables, lists, and images; even paragraph, sentence, section, and chapter information can be useful in some applications.
  • Important metadata like author, number of pages, summaries
  • Hints about the importance of a document and a way to identify it algorithmically. For instance, HTML documents contain links that can be used to determine which pages are important. Or, in an email application, perhaps it is possible to use your corporate address book to determine the roles of the people in the email, such that a corporate memo from the CEO gets higher priority than an email from an entry-level employee. Note, it is not good enough for you to simply “know” it’s more important, you have to be able to program it! Once you have determined the importance, you can boost those documents during indexing (a small index-time sketch follows this list).
  • How to identify words/tokens. Can I split on whitespace or do I need something that is aware of punctuation? In most cases, it is obvious what a word is without putting in too much thought.
  • Important words, like jargon, synonyms, acronyms, abbreviations, phrases, proper nouns, locations, etc. If these exist (of course they do!) is there a way for me to obtain them? Is it worth it for me to do so?
  • Anything else that catches my eye or seems odd. Oddities almost always result in errors in early passes over the content. Things like empty fields, weird characters, unclosed tags and so on are examples.
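For the boosting idea mentioned a couple of items up, the Lucene releases current when this article was written (2.x/3.x) let you attach an index-time boost directly to a document. The sketch below uses made-up field names and a made-up fromCeo flag; note that later Lucene versions removed document-level index-time boosts, so treat this as an era-specific illustration and prefer query-time or field-based boosting on newer releases.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class EmailIndexer {
    // 'fromCeo' would come from your own logic, e.g. looking the sender up
    // in the corporate address book.
    static void indexEmail(IndexWriter writer, String subject, String body,
                           boolean fromCeo) throws Exception {
        Document doc = new Document();
        doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        if (fromCeo) {
            doc.setBoost(2.0f);   // index-time boost: this document matters more
        }
        writer.addDocument(doc);
    }
}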

By this time, you should have a pretty good picture of what your content looks like. If I’m implementing a system using some higher-level Natural Language Processing (NLP) techniques, I may choose to examine things more closely at the word, sentence and paragraph level and think about things like part of speech, grammatical structure and other similar information. For now, however, I’m content to move on to take a look at how users factor into the process.

Knowing your Users

Believe it or not, users are not the enemy. I know, they break things. They overload the system. They don’t read instructions. Most of all, they don’t care about your excuses for why things don’t work the way they want them to work. Simply put, they will go somewhere else if you can’t deliver what they want. The only way to overcome all of these issues is to get inside their heads and figure out what they want.

For the sake of discussion, when I talk about knowing users, I’m going to focus on how users search, not how they interact with the user interface (UI). That is in no way meant to discount the importance of the user interface; it’s just a recognition of the fact that I’m not a UI person. Thankfully, there are many experts out there who can help you figure out what is the best UI for search. Additionally, even though this discussion logically appears after the content section, keep in mind the two are deeply intertwined. In all likelihood, you will do several iterations on the two topics, with one informing the other.

The first thing to do to understand users is to assess their level of search sophistication. Your average Internet searcher (I’ll call them “general searchers”) is going to have very different expectations from a well-trained information worker like a Librarian or an Intelligence Analyst. If dealing with general searchers, know that many of your choices for UI and query syntax have already been made for you by Google and Yahoo! based on their simple textbox input and basic query syntax. If you are thinking of introducing some new syntax or a different type of input mechanism for general search, you will need to consider the amount of effort needed to train your users to take advantage of the new feature. Expert searchers, on the other hand, are often willing to learn new features IF you can demonstrate they return superior results. Oftentimes, the best solution is to offer the simple input box and an option to switch to advanced search, since even expert users will prefer simplicity for many easy tasks.

Some other generally useful tips include:

  • Don’t be afraid to mark something as Beta and let people try it. Just make sure you generate detailed logs so you can track user interaction and get feedback about what works and what doesn’t.
  • If you are upgrading an existing system, make sure you harness the information contained in the system logs, especially the query inputs and the clickthroughs on results.
  • Focus groups and A/B testing (a certain number of users see one interface, while the rest see another) can be useful ways of determining what works best, so don’t be afraid to experiment.

At this point, it is useful to think about how users will interact with the system. On the input side, the main questions revolve around the query syntax to support and the options to allow. For example, some systems only allow simple keyword entries and phrases, while others allow full paragraphs or boolean logic. Options-wise, you may consider allowing users to restrict the results by collection, dates, locations or other features present in the content. You may also allow them to specify a sorting order.

On the output side, you will likely be returning a list of results sorted by relevance or some other criteria such as date. Additionally, things like facets, extractions, spelling suggestions, highlighting, related searches and other features can all add to the user experience. Remember, findability doesn’t always mean search; it often means navigating to the result as well. Tools like Solr and Lucene can provide sophisticated navigation capabilities, too.

There is obviously a lot more that could be said about understanding users. I would urge you to dig deeper into the literature and to look at what successful sites have done with their search to see what can be applied to your own site.

Garbage In, Garbage Out

Though it is often overlooked, the value of “clean” data should not be underestimated. This is true even when dealing with databases and pure text, but it is especially true when dealing with common office documents like Microsoft Word and Adobe PDF. For instance, standard writing practice for most scientific papers is a two-column format, but many extraction programs do not properly handle this and will return content as if sentences spanned across the columns. Other problems include files that are not well formed, scanned text extracted with Optical Character Recognition (OCR), mixed-language text, missing identifiers, and on and on. Most of these problems will be caught pretty early on in the process, but some, like OCR data and mixed-language text, may require a little extra time to solve, or at least work around.

If you’re lucky, the cost of dirty data will show up during indexing in the form of failed documents (which you can gracefully recover from). Some of the time, however, it will sneak through and end up causing bad results or no results. In these cases, the best you can do is keep a watchful eye on your logs and faithfully do relevance testing to try to track down the bad documents and fix them.

Finally, in some fields like Computer Forensics, hard-to-process data is simply part of the job. In these situations, you do the best you can with what you have and patiently try alternate approaches, hoping to get something of value out of the mish-mash of data. In any situation, there are analysis and querying techniques that can improve the situation. The next three sections outline many common IR techniques for improving findability.

Analyzing your Analysis

Over the years, much research has gone into the role of what Lucene calls analysis in the search process. Academics have studied stemming, case sensitivity, the use of phrases and other extractions like named entities, n-grams and many other techniques, which I’ll explain shortly. Some of these factors result in marked search improvement, while others are good for eking out very minor improvements that will never be noticed by most users. I hope to focus on the stuff that will really help, but know, too, that every situation is different. Not all techniques will help in all situations; in fact, some may actually do harm.

As a recap, in Lucene, analysis is the process that converts the input strings into tokens that are then indexed and made searchable. For instance, analysis is responsible for converting the sentence:

The quick red fox jumped over the lazy brown dogs.

into the tokens:

quick, red, fox, jump, over, lazi, brown, dog

How you choose to do analysis via Lucene (note that Solr shares Lucene’s analysis process) will have a very large impact on how good your system is at returning results. And, unlike the one-size-fits-all systems out there, Lucene gives you direct control over the process, if you want it.
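To see this concretely, here is a minimal sketch that runs the sentence through Lucene’s EnglishAnalyzer (which lowercases, removes stopwords and applies the Porter stemmer) and prints the resulting tokens. It is written against a recent Lucene release; class names and constructors have shifted between versions, so adjust it to whatever version you are running.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer();
        TokenStream stream = analyzer.tokenStream("body",
                new StringReader("The quick red fox jumped over the lazy brown dogs."));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print(term.toString() + " ");  // quick red fox jump over lazi brown dog
        }
        stream.end();
        stream.close();
        analyzer.close();
    }
}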

Tip

To learn more about the specifics of how analysis works in Lucene (i.e. the Analyzers, Tokenizers and TokenFilters), see my article on getting started with Lucene.

The table below spells out many common analysis techniques that have been shown, over the years, to improve search as a whole.

Table 1. Common Analysis Techniques and Tradeoffs

Case Sensitivity Transformations
  • Description: Converts all characters to their lower (or upper) case equivalent.
  • Pros: Users often don’t know the correct capitalization, and many languages capitalize the first word of a sentence, which should still result in matches.
  • Cons: Prevents exact matches and potentially makes acronyms and other words where case matters indistinguishable from the equivalent word.
  • Lucene/Solr implementations: LowerCaseFilter, LowerCaseTokenizer.

Intelligent Stopword Handling
  • Description: While stopwords are often removed, indexing them and using them intelligently during search can enhance results. Essentially, they are kept when constructing phrases but otherwise dropped.
  • Pros: Better phrase matching without the extra noise for keyword search. Prevents information loss, since the stopwords are no longer dropped during indexing.
  • Cons: No disk space savings. Requires a non-default QueryParser in Lucene and Solr.
  • Lucene/Solr implementations: For removing stopwords, StopFilter and StopAnalyzer; many other Lucene Analyzers employ the StopFilter as well.

Stemming
  • Description: Transforms the original word into some root form. For instance, removing plurals or suffixes is a common stemming technique.
  • Pros: Users searching for “bank” are likely also interested in documents containing “banks”.
  • Cons: Sometimes the root form of a word is not all that closely related to the original word. This is often true in languages like Arabic that build words from prefixes, infixes and suffixes (Arabic is an agglutinative language). Additionally, stemming prevents exact matches.
  • Lucene/Solr implementations (non-exhaustive): SnowballFilter and SnowballAnalyzer bring Dr. Martin Porter’s popular Snowball stemmers to Lucene and cover a wide variety of languages. The Lucene analysis contrib (contrib/analysis in the Lucene distribution) has stemmers for Arabic (trunk-only), Brazilian, Dutch, French, German and Russian. Lucid has an optimized version of the Krovetz stemmer available for download. Also try searching the Internet for “<LANGUAGE> Lucene stemmer”, where <LANGUAGE> is the name of the language you need a stemmer for.

Intelligent Compound Analysis
  • Description: Compound, hyphenated and other “mixed” words are split into smaller tokens. For instance, iPod may be split into “i” and “pod”, or a product SKU like XJ-7543 becomes “XJ” and “7543”.
  • Pros: Users don’t always know exactly how the terms are written, so they may not put in the correct punctuation or spacing.
  • Cons: Extra processing; also potentially creates false matches.
  • Lucene/Solr implementations: In Solr, the WordDelimiterFilter has numerous options for splitting within words.

Token Normalization
  • Description: Creates a single, common form of a token. For instance, stripping accents and other diacritics from words is quite common in languages that make use of them.
  • Pros: Users often don’t know the correct forms or aren’t at a keyboard where it is easy to enter them.
  • Cons: Potentially introduces false matches if the user explicitly wants to find the term with a certain markup.
  • Lucene/Solr implementations: ISOLatin1AccentFilter strips ISO Latin 1 accents; ElisionFilter removes elisions from tokens, e.g. “l’avion” becomes “avion”.

Phrase Identification
  • Description: Marks phrases and indexes/searches them as separate tokens.
  • Pros: Phrases can be boosted, often resulting in better matches.
  • Cons: Requires extra processing that may be more costly.
  • Lucene/Solr implementations: The PhraseQuery and SpanNearQuery can be used to create phrases, but currently Lucene doesn’t ship with anything that determines phrases automatically in text. See n-grams for creating phrase-like tokens.

n-grams
  • Description: A sub-sequence of n items from a sequence; can be character-based or token-based. For example, the character bigrams of “Lucene” are “Lu”, “uc”, “ce”, “en”, “ne”.
  • Pros: Often used with noisy data or languages that don’t use whitespace for word segmentation (e.g. Chinese). Token-based n-grams are useful for creating pseudo-phrases. Also useful in spell checking and language identification.
  • Cons: Extra processing step; how to pick n?
  • Lucene/Solr implementations: EdgeNGramTokenFilter and EdgeNGramTokenizer (character-based); NGramTokenizer and NGramTokenFilter (character-based); ShingleFilter and ShingleMatrixFilter (token-based).

Synonym Expansion
  • Description: Given a token, adds one or more synonyms to the token stream. Also used for expanding acronyms and abbreviations.
  • Pros: Languages have multiple ways of saying the same thing, but users typically only input the word they know.
  • Cons: Introduces extra processing. Many synonyms are not exact equivalents of the original word, so they may introduce ambiguity.
  • Lucene/Solr implementations: SynonymFilter (Solr); SynonymTokenFilter (Lucene); Lucene’s WordNet contribution (located in contrib/wordnet) uses Princeton’s WordNet to add synonym expansions. The latter is not done as part of an Analyzer, as it is an offline task.

Named Entity Recognition and Other Extraction Techniques
  • Description: Identifies proper nouns (people, places, companies, etc.) and other items of interest (e.g. currency).
  • Pros: Similar to phrase identification; extracted entities can be boosted to signify importance.
  • Cons: Requires extra processing. Can induce false matches if the user is not interested in the proper-noun meaning of the word or phrase.
  • Lucene/Solr implementations: Taming Text has code for hooking OpenNLP into Lucene and Solr, which will be freely available at some point (or just email me).

It is quite common to employ several of these techniques in and across fields. For instance, one field may be lowercased and stemmed while another is only lowercased and a third is only tokenized.
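As a sketch of what mixing techniques across fields can look like in code, Lucene’s PerFieldAnalyzerWrapper lets you choose a different analysis chain per field (in Solr you would express the same thing per field type in schema.xml). The field names here are made up and the packages reflect a recent Lucene release.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.index.IndexWriterConfig;

public class PerFieldSetup {
    public static Analyzer buildAnalyzer() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("title", new EnglishAnalyzer());   // lowercased, stopped and stemmed
        perField.put("summary", new SimpleAnalyzer());  // only tokenized and lowercased
        // Everything else (e.g. a SKU field) is merely split on whitespace.
        return new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(), perField);
    }

    public static IndexWriterConfig buildConfig() {
        // Use the same wrapper at query time so indexing and searching agree.
        return new IndexWriterConfig(buildAnalyzer());
    }
}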

Before I move on to how to construct better queries, I’m going to cover stemming in a bit more detail, as it is an important analysis technique that bears more discussion.

Stemming In Greater Detail

As Common Analysis Techniques and Tradeoffs discusses, stemming is the process of reducing a word down to some root form. Stemmers are often created by Computational Linguists or others who have a deep knowledge of how words are formed in the language of choice, but basic stemmers are pretty easy to write assuming a working knowledge of the language.

In practice, there are a wide variety of approaches to stemming ranging from lightweight to very aggressive. A lightweight stemmer is one that removes only the most obvious suffixes, such as removing “s” from “banks”, while a more aggressive stemmer may remove prefixes, infixes and suffixes. Also, note that the result of stemming need not even produce an actual word. For instance, in my example above, the word “lazy” was stemmed to “lazi”.

As with most analysis operations, stemming incurs costs, both in terms of speed and quality, when searching. On the speed side, some stemmers may use a complicated rule set to determine the proper stem, thus incurring extensive computing costs over a large collection. On the quality side, stemming generally improves recall, while potentially causing a loss in precision (see Kraaij for background). The loss of precision is more apt to happen when using an aggressive stemmer, whereby two words that are only distantly related (and thus judged not relevant by a user) end up being reduced to the same term. This is often the case in languages like Arabic (see Larkey) where many words come from the same root form.

The question thus becomes: how does one pick a stemmer, if at all? In most cases, when dealing with unstructured text such as news, web pages, email and other communication, you will want to use a stemmer. For things like product names, stemmers likely do not make sense, especially since the trend these days is toward product names that are made up anyway. Once a stemmer is decided upon, I recommend starting with a lightweight one, like the Krovetz stemmer, which does simple things like remove plurals. Then, over time, if your relevance testing shows that recall is suffering, you could try a more aggressive stemmer like the Snowball stemmers (often called the “Porter” stemmers after their creator) included with Lucene.
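If you want to see the difference in aggressiveness for yourself, the sketch below runs the same (already lowercased) words through Lucene’s PorterStemFilter and its KStemFilter, the bundled implementation of the Krovetz approach. It assumes a recent Lucene release with the analysis-common module on the classpath; run it on your own vocabulary and compare which forms each stemmer collapses together.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmerComparison {
    public static void main(String[] args) throws Exception {
        String text = "banks banking lazy policies generalization";
        System.out.println("Porter: " + stem(text, true));
        System.out.println("KStem:  " + stem(text, false));
    }

    // Runs the text through whitespace tokenization plus the chosen stemmer.
    private static String stem(String text, boolean porter) throws Exception {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader(text));
        TokenStream stream = porter ? new PorterStemFilter(tokenizer)
                                    : new KStemFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        StringBuilder out = new StringBuilder();
        stream.reset();
        while (stream.incrementToken()) {
            out.append(term).append(' ');
        }
        stream.end();
        stream.close();
        return out.toString().trim();
    }
}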

In the end, you may or may not need stemming or some other analysis technique I’ve talked about here. Just make sure you make an informed decision based on what your application needs and not what is the “default” based on someone else’s example or suggestion. As always, don’t be afraid to experiment. For now, though, it’s time to move on to discuss some tips for making better queries.

Query Techniques for Better Search

Up until now, I’ve focused mostly on the content side of the equation, but it is just as important to consider how queries are constructed when thinking about findability. On the query side, there are usually at least two types of queries to discuss: the user input query and the actual query that gets submitted to the system. This is because most systems usually modify the user’s query based on system defaults, user selected options and the results of parsing the original query. At the user level, education and persuasion are your best bets for getting better results. Educate your users about your supported syntax by having simple examples and worthwhile documentation. Persuade them to submit better queries by providing tools like auto-suggest and options that take the guesswork out of how to treat their input.

Unfortunately, user education and persuasion only go so far. To improve relevance, consider adopting one or more of the following suggestions:

  • If you are using the Lucene Query Parser, change the default operator to “AND” for most queries. If this results in too many zero-result queries, first run an AND query and then back off with an OR query. Alternatively, only use AND for queries with fewer than some number of terms, after which use OR.
  • Try using “sloppy” phrases. In the Lucene Query Parser syntax, this means using the tilde operator, e.g. “Minnesota Vikings”~10. The slop factor indicates how many tokens may occur between the terms of the phrase and still have a match. The closer together the tokens are, the higher the score. Thus, a really sloppy phrase query will often work just like an AND query, but documents where the terms occur closer together will rank higher (a small sketch after this list shows this alongside the default-AND change from the previous item).
  • If you’re using Solr’s Dismax Query Parser, make sure you explore the many options available to you related to phrase boosts, function queries and field boosting.
  • If you are not in a high-query-volume situation (on the order of hundreds of queries per second or more), consider using automatic relevance feedback. As background, relevance feedback is similar to Lucene’s “More Like This” functionality in that it uses feedback from the user as to what is relevant, constructs a new query from the most important terms in the documents the user selected, and returns the results of that new query. Automatic relevance feedback simply assumes, without user input, that the first five or ten documents (it’s usually configurable) are relevant and uses the top terms from those documents to construct a new query. The new query is then submitted and the results are returned to the user, who is unaware that two queries took place instead of one. Relevance feedback is almost always a win, as most search engines are good at getting good results in the first five or ten documents. The downside is you are executing two queries for every one the user submits, plus you have to load and process the top documents from the original result set. However, as I said, if throughput isn’t absolutely critical, it is often worth the cost.
  • Consider position-based matching with Lucene’s SpanQuery functionality. With the SpanQuery family, you can hone in on the specific part of a document that contains the match. This can be especially helpful in really large documents where query terms are far apart.
  • Try to determine if one field or one term is more important than others and boost those fields or terms.
  • If, during analysis, you can determine that particular words are of more importance, then you can mark those terms with a payload (a byte array that can be used for term-based storage). Then, at query time, you can use the BoostingTermQuery to boost those documents that contain the terms with payloads.
  • Just because a user enters keywords into a search box, does not mean you have to actually execute a search against your system. Caching whole result sets and altering them based on log analysis, popularity, etc. is perfectly fine. As a corollary, if you know a particular document for a common query is the best result, then make it the best result. Don’t waste one iota of your time thinking about why a document occurs in position two versus position three of your results. Do, however, spend time on documents that occur at a position greater than 10 or 20 that you (or your team) think belong in the top ten. See my article on relevance for assistance.
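To make a couple of the suggestions above concrete, here is a small sketch using Lucene’s classic QueryParser to default to AND and to issue a sloppy phrase query. The field name is made up, and the imports reflect a recent Lucene release (older releases keep QueryParser in a different package and require a Version argument).

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryBuilding {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("body", new EnglishAnalyzer());

        // Require all terms by default; if this yields zero results,
        // re-run with the default OR operator as a fallback.
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query strict = parser.parse("widget X");

        // A sloppy phrase: the terms must occur within 10 positions of each other,
        // and documents where they are closer together score higher.
        Query sloppy = parser.parse("\"Minnesota Vikings\"~10");

        System.out.println(strict);
        System.out.println(sloppy);
    }
}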

Of course, many of the analysis techniques I described above are not specific to content only. As a general rule, the analysis process needs to be the same for both indexing and searching, so keep that in mind when building your application. Next, I’ll have a quick discussion of some helpful display and navigation ideas, and then I’ll wrap up the article with some final thoughts and resources.

Navigation Hints

So far, we’ve covered a lot of information, ranging from analysis to query construction. The last big piece of the findability puzzle is navigation. There are many ways to enhance navigation, but I’ll focus on what you can do today using Solr and Lucene.

In recent years, faceting has gained popularity in a lot of online stores. Faceting is a technique of offering users a restricted set of choices, derived from the result set, with which they can refine their query; the key is that all choices are known to have valid results. Faceted displays often show the number of documents in each facet as well. For example, CNet Shopper (powered by Solr) makes extensive use of faceting on their site, an example of which can be seen in Figure 1, “Faceting Example”. Naturally, all of those cool things you extracted during analysis, like named entities and phrases, make for great facets. Right now, faceting in Solr works out of the box, but it is also possible in Lucene with some work.

Figure 1. Faceting Example

CNet Shopper (http://shopper.cnet.com) uses faceting extensively in their search results, as can be seen in this example.
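With Solr, turning faceting on is largely a matter of asking for it in the request. The SolrJ sketch below assumes a hypothetical products core with a manufacturer field and a recent SolrJ release (the client class has been renamed across versions).

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

        SolrQuery query = new SolrQuery("mp3 player");
        query.setFacet(true);
        query.addFacetField("manufacturer");   // offer manufacturer as a refinement
        query.setFacetMinCount(1);             // hide choices with no matching documents

        QueryResponse response = solr.query(query);
        FacetField manufacturers = response.getFacetField("manufacturer");
        for (FacetField.Count count : manufacturers.getValues()) {
            System.out.println(count.getName() + " (" + count.getCount() + ")");
        }
        solr.close();
    }
}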

Besides faceting, other aggregating techniques are also used, albeit by far fewer systems. Both document and search-results clustering can be useful in the right situation, as they can distill a lot of related documents down into a single cluster from which a representative result can be chosen. Solr has some preliminary (i.e., not yet committed) work in place to add support for the Carrot2 clustering project (search-results based), and Apache Mahout also provides many clustering algorithms. To learn more, see SOLR-769.

Other helpful navigation techniques include “More Like This”, alternate sorting options, filtering options and “Did You Mean?” functionality. In “More Like This”, the user chooses a document that is a good match and asks for more documents similar to it. For “Did You Mean?” functionality, providing spelling suggestions based on terms that occur more frequently in the collection can often help a user find better results. As for sorting, sometimes users want newer documents to appear first, or documents they haven’t seen yet. Likewise, if the user can narrow the result set ahead of time by providing filtering information (by date, price range, author, etc.), you can return a smaller result set that is easier to review. Finally, highlighting can at least help direct the user’s eye to the parts of the document that are pertinent to their query.
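As an example of the “More Like This” idea, Lucene ships a MoreLikeThis helper that mines a chosen document for its most interesting terms and builds a new query from them. The sketch below assumes you already have an IndexReader and IndexSearcher open, know the internal id of the document the user picked, and have title and body fields; in older Lucene releases the class lives in contrib/queries rather than the package shown.

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class RelatedDocuments {
    // Returns documents similar to the one the user selected.
    static TopDocs findSimilar(IndexReader reader, IndexSearcher searcher, int docId)
            throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"title", "body"});  // fields to mine for terms
        mlt.setAnalyzer(new EnglishAnalyzer());             // needed if term vectors are absent
        mlt.setMinTermFreq(1);                              // loosen defaults for small indexes
        mlt.setMinDocFreq(2);
        Query similar = mlt.like(docId);                    // build a query from the top terms
        return searcher.search(similar, 10);
    }
}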

Final Thoughts

At a much deeper, expert level, which I won’t cover in depth, you may want to consider overriding Lucene’s Similarity class with your own, especially the length normalization factor, if you find that Lucene, as a whole, tends to favor shorter or longer documents in the results. Moreover, you may consider implementing your own Query classes with their own scoring capabilities. In fact, this is how many of Lucene’s query classes came about, so please consider contributing yours back. If you go down this path, consult the Lucene source for examples of how to proceed and don’t be shy about asking questions on the appropriate mailing list.
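For completeness, here is roughly what the length-normalization tweak looked like against the Lucene 2.x/3.x API current when this article was written; in Lucene 4 and later the Similarity hierarchy was reworked, so the method to override and how you register the class both differ. The 100-term floor is an arbitrary illustration, not a recommendation.

import org.apache.lucene.search.DefaultSimilarity;

public class FlatterLengthSimilarity extends DefaultSimilarity {
    // The default is 1/sqrt(numTerms), which favors shorter documents.
    // Treating every document as at least 100 terms long flattens that bias
    // for short documents while leaving long ones alone.
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
    }
}
// Register it on both sides so indexing and searching agree:
//   indexWriter.setSimilarity(new FlatterLengthSimilarity());
//   indexSearcher.setSimilarity(new FlatterLengthSimilarity());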

Finally, as with most things in search, findability is not a fire-and-forget problem. I would encourage you to schedule routine checkups for your search application to make sure it is still performing as expected. Oftentimes, as the index grows with new documents, the collection statistics that factor into scoring will change, yielding different results from before, so your application may require a tune-up.

In this article, I covered many topics about how to use Lucene and Solr to enhance your content’s findability. I covered ways to better understand your content, your users and your analysis process. I also gave some tips and techniques for creating better queries. Hopefully, this article will give you some insight into how to make a better search system by thinking more about what makes something findable to begin with and then how to leverage that in Lucene and Solr. Finally, please feel free to add your own comments at the bottom of this article on how you improved your application’s relevance.

Resources

The following resources may help you to learn more about findability and how you can leverage Lucene and Solr to improve your application.

  • Read Wikipedia’s definition of Findability.
  • Learn more about the concept of findability at www.findability.org by Peter Morville.
  • Learn more about possible relevance problems in your application in my article titled Debugging Relevance Issues in Search.
  • Learn other techniques and libraries for enhancing your Lucene and Solr search applications in my book (co-authored with Tom Morton): Taming Text.
  • Utilize Andrzej Bialecki’s helpful Lucene utility named Luke.
  • Use Apache Mahout for useful machine learning tools like clustering.
  • Learn the basics of Lucene.
  • Use Apache Tika for MIME type identification and extraction.
  • Use OpenNLP to perform higher level NLP functions like phrase identification and Named Entity Recognition.
  • Agglutinative languages construct words extensively out of smaller units of words (called morphemes). Aggressive stemming may produce a lot of noise during search of these languages.
  • For some experimental clustering support in Solr, see SOLR-769.
  • Kraaij et al. Viewing stemming as recall enhancement. Proceedings of the 19th annual international ACM SIGIR … (1996)
  • Larkey et al. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. … on Research and development in information retrieval (2002)