“Search vs. discovery” is a common dichotomy in discussions about search technology: the former is about finding specific things that are either known or assumed to exist, and the latter is about using the search/browse interface to discover what content is available. A single user session may include both of these “agendas”, especially if a user’s assumption that a certain piece of content exists is not quickly verified by finding it. Findability is impaired when there are too many irrelevant or noise hits (false positives), which obscure or camouflage the intended results. This happens when metadata is poorly managed, when search relevance is poorly tuned, or when the user’s query is ambiguous and the application provides no feedback (such as autocomplete, recommendations or “did you mean”) to help improve it.
Content visibility is important because a document must first be included in the result set to be found (obviously), but it is also critical for discovery, especially with very large content sets. User experience has shown that faceted navigation is one of the best ways to provide this visualization, especially if it includes dimensions that focus on “aboutness” and “relatedness”. However, if a document is not appropriately tagged, it becomes invisible to the user as soon as the facet that it should belong to (but does not) is selected. Data quality really matters here! (My colleague Mark Bennett has authored a Data Quality Toolkit to help with this. The venerable Lucene Index Toolbox or “Luke”, which can be used to inspect the back-end Lucene index, is also very useful. The LukeRequestHandler is bundled with Solr.) Without appropriate metadata, the search engine has no way of knowing what is related to what. Search engines are not smart in this way – the intelligence of a search application is built into its index.
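To make the “invisible document” problem concrete, here is a minimal sketch in plain Python (no search engine involved). The documents, the `topic` field and the helper names `facet_counts` and `select_facet` are all made up for illustration; the point is only that a document missing its tag silently drops out as soon as a facet value is clicked.

```python
from collections import Counter

# Hypothetical documents; "topic" stands in for an "aboutness" facet field.
DOCS = [
    {"id": "doc1", "title": "Tuning Solr relevance", "topic": ["search"]},
    {"id": "doc2", "title": "Faceted navigation patterns", "topic": ["search", "ux"]},
    # doc3 is about search too, but nobody tagged it.
    {"id": "doc3", "title": "Relevance testing checklist", "topic": []},
]

def facet_counts(docs, field):
    """Count documents per facet value, plus documents with no value at all."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field) or []
        if not values:
            counts["(untagged)"] += 1
        for value in set(values):
            counts[value] += 1
    return dict(counts)

def select_facet(docs, field, value):
    """Return the ids that remain after the user clicks one facet value."""
    return [doc["id"] for doc in docs if value in (doc.get(field) or [])]

counts = facet_counts(DOCS, "topic")
remaining = select_facet(DOCS, "topic", "search")
print(counts)
print(remaining)  # doc3 is gone, although it is about search
```

Tracking an explicit “(untagged)” bucket is one cheap data quality check: if that number is large, a lot of content is at risk of disappearing behind the facets. Solr’s `facet.missing` parameter reports a comparable count for documents with no value in the facet field.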
Search and Content Curation
Findability and visibility are also very important when the search application is used as a tool for content curation within an organization. Sometimes, the search agenda is to see if something has been created before, as a “due diligence” activity before creating it. Here the phrase “out of sight, out of mind” applies: content that can’t be found tends to be re-created. This leads to unnecessary duplication, which is wasteful but also counter-productive to search, both because it adds to the repository size and because it increases the possibility of obfuscation by similarity. Applying “deduplication” processes after the fact is a band-aid – we should make it easier to find things in the first place so we don’t have to do more work later to clean up the mess. We also need to be confident in our search results, so that if we don’t find something, it is likely that it doesn’t exist – see my comments on this point in Introducing Query Autofiltering. Note that this is always a slippery slope. In science, absence of evidence does not equate to evidence of absence – hence “Finding Bigfoot”! (If they ever do find “Squatch” then no more show – or they have to change the title to “Bigfoot Found!” – which would be very popular but also couldn’t be a series! That’s OK, I only watched it once to discover that they don’t actually “find” Bigfoot – hence the ‘ing’ suffix. I suppose that “Searching for” sounds too futile to tune in even once.)
Auto-classification technology is a potential cure in all of the above cases, but can also exacerbate the problem if not properly managed. Machine learning approaches, or ontologies with associated rules, provide ways to enhance the relevance of important documents and to organize them in ways that improve both search and discovery. However, in the early phases of development it is likely that an auto-classification system will make two types of errors that, if not fixed, can lead to problems of both findability and visibility. First, it will tag documents erroneously, leading to the camouflage or noise problem, and second, it will fail to tag documents that it should, leading to a problem with content visibility. We call these “precision” and “recall” errors respectively. The recall error is especially insidious because, if not detected, it will cause documents to be dropped from consideration when a navigation facet is clicked. Errors of omission are also more difficult to detect, and require the input of people who understand the content set well enough to know what the autoclassifier “should” do. Manual tagging, while potentially more accurate, is simply not feasible in many cases because Subject Matter Experts are difficult to outsource. Data quality analysis/curation is the key here. Many times the problem is not the search engine’s fault. Garbage-In-Garbage-Out, as the saying goes.
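The two error types can be measured once someone who knows the content has marked up a sample by hand. The sketch below is illustrative only: the document ids and the `precision_recall` helper are invented, and it assumes you have, for one tag, the set of documents the autoclassifier tagged and the set a Subject Matter Expert says should carry that tag.

```python
def precision_recall(predicted, expected):
    """Compare the classifier's tagged set against the expert's set for one tag."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical sample for a single tag, e.g. "search".
expected = {"doc1", "doc2", "doc3", "doc4"}   # what the expert says
predicted = {"doc1", "doc2", "doc5"}          # what the autoclassifier tagged

precision, recall = precision_recall(predicted, expected)

# Precision errors: wrongly tagged documents that add noise to the facet.
false_positives = sorted(predicted - expected)
# Recall errors: missed documents that vanish when the facet is clicked.
false_negatives = sorted(expected - predicted)

print(f"precision={precision:.2f} recall={recall:.2f}")
print("noise:", false_positives)
print("invisible:", false_negatives)
```

Listing the false negatives by id, rather than just reporting a score, is what makes the omissions reviewable by the people who know the content.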
Data Visualization – Search-Driven Analytics
I think that one of the most exciting uses of search as a discovery tool is the combination of the search paradigm with analytics. This used to be the purview of the relational database model, which is at the core of what we call “Business Intelligence” or BI. Reports generated by analysts from relational data go under the rubric of OLAP (online analytical processing), which typically involves a Data Analyst who designs a set of relational queries, the output of which is then fed to a graphing engine to generate a set of charts. When the data changes, the OLAP “cube” is re-executed and a new report emerges. Generating new ways to look at the data requires the development, testing, etc. of new cubes. This process by its very nature leads to stagnation – cubes are expensive to create, and this may stifle new ideas since expert labor is required to bring them to fruition.
Search engines and relational databases are very different animals. Search engines are not as good as an RDBMS at several things – ACID transactions, relational joins, etc. – but they are much better at dealing with complex queries that include both structured and unstructured (textual) components. Search indexes like Lucene can include numerical, geospatial and temporal data alongside textual information. Using facets, they can also count things that are the output of these complex queries. This enables us to ask more interesting questions about data – questions that get to “why” something happened rather than just “what”. Furthermore, recent enhancements to Solr have added statistical analyses to the mix – we can now develop highly interactive data discovery/visualization applications which remove the data analyst from the loop. While there is still a case for traditional BI, search-driven discovery will fill the gap by allowing any user – technical or not – to ask the “what if” questions. Once an important analysis has been discovered, it can be encapsulated as an OLAP cube so that the intelligence of its questions can be productized/disseminated.
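As a rough sketch of what such a request can look like, the Python below builds the query string for a Solr select call that mixes a free-text query with a field facet, a pivot facet and per-bucket statistics. The field names (`region`, `product`, `price`), the tag name `piv` and the `build_analytics_query` helper are hypothetical, and the local-parameter syntax linking `stats.field` to `facet.pivot` follows the Solr documentation as I understand it, so check it against your Solr version before relying on it.

```python
from urllib.parse import urlencode

def build_analytics_query(text_query, facet_field, pivot_fields, stats_field):
    """Build a query string that asks for counts and statistics in one request."""
    params = [
        ("q", text_query),          # unstructured (textual) part of the question
        ("rows", "0"),              # only the aggregations are wanted, not documents
        ("facet", "true"),
        ("facet.field", facet_field),
        ("stats", "true"),
        # Tag the stats field so the pivot facet can refer to it (assumed syntax).
        ("stats.field", "{!tag=piv}" + stats_field),
        # Compute the tagged statistics for every pivot bucket.
        ("facet.pivot", "{!stats=piv}" + ",".join(pivot_fields)),
    ]
    return urlencode(params)

query_string = build_analytics_query(
    "battery overheating", "region", ["region", "product"], "price"
)
print(query_string)
```

The resulting string would be appended to a collection’s select URL. Because each click in the user interface only changes a few of these parameters, a new “what if” question costs one request rather than a new cube.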
Since this section is about visualization and there are no pictures in this post, you may want to “see” examples of what I am talking about. First, check out the blog post “Hey, You Got Your Facets in My Stats! You Got Your Stats In My Facets!!” by Chris Hostetter (aka “Hoss”), and his earlier post on pivot facets. Another way cool demonstration of this capability comes from Sam Mefford, from when he worked at Avalon Consulting – it is a very compelling demonstration of how faceted search can be used as a discovery/visualization tool. Bravo Sam! This is where the rubber meets the road, folks!