What’s wrong (or needs to be fixed) with Search and Why?

At Lucidworks, we like to say “Search is the Killer App”. In reality not all search applications kill, many of them suck, even though sad to say, many of these are built on the best search engine going anywhere – Apache Solr. Why? Because search is a complex problem and every search application is different. This blog is about how to create the killer search app. If you are already smart enough to use Solr – or better yet Lucidworks Fusion, you have the basic “bones” in place – all you need is a little special sauce to bring out the magic. Here are some ingredients.

Bag of Words Search and Us Humans – the Basic Disconnect

First, let me introduce what we search guys call the basic “bag of words” paradigm: all search engines work using the same basic algorithm – inverse token mapping and term density ranking. What this means is that text is first split into individual terms or “tokens” and then mapped back to documents. (BTW – Lucene does this in a truly kickass way). Although information about word order can be retained and used for things like phrase search or proximity boosting, search engines still like to “think” in terms of tokens. This is not, however, how people think of the search problem; they are looking for particular things or concepts and use words to describe what they are looking for. Without some help, the search engine treats the input as a big boolean OR, bringing back documents that have all or some of the words and ranking them according to how often these words (or their synonyms) are found in particular documents. We search wonks call this “bag-of-words” search because it is throwing one bag of words (the query) against a much larger bag (the index) and performing some fancy matrix math to find the “best” hits. Not perfect, but way better than the pure string pattern matching used in relational databases (as in “show me records LIKE ‘%foobar%’”), which sometimes can’t tell the difference between “search” and “research” (true story) and, compared to inverted index search, can be painfully slow.
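To make the token-to-document mapping concrete, here is a minimal Python sketch of a toy inverted index with a "big boolean OR" query ranked by raw term counts. It is an illustration only, not how Lucene does it: the whitespace tokenizer, the function names and the three sample documents are all made up for the example.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()

def build_index(docs):
    """Map each token to {doc_id: count}, i.e. a tiny inverted index."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index

def search_or(index, query):
    """Treat the query as one big boolean OR and rank by raw term counts."""
    scores = Counter()
    for token in tokenize(query):
        for doc_id, count in index.get(token, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]

# Hypothetical sample documents
docs = {
    "d1": "search is the killer app",
    "d2": "research on bats",
    "d3": "killer search search tips",
}
index = build_index(docs)
results = search_or(index, "killer search")
print(results)
```

Because the index is keyed on whole tokens, the document about "research" never matches a query for "search", which is exactly the mistake a substring pattern match would make.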

Which brings me to relevance ranking algorithms – the so-called “secret sauce” back in the day when Vendor Engines or “Vengines” ruled the world – and in case you haven’t been paying attention, those days are fading fast! “Buy my search engine because OUR relevance ranking algorithm is the BEST IN CLASS, PERIOD. How does it work you ask? Sorry, I could tell you that but then I’d have to shoot you!” (Which is almost always BS unless the sales guy has a PhD in Mathematics which most of them – trust me on this one – do not – so it was really just a smoke screen.) But sorry for the digression – in a word – Bullshit – they all worked roughly the same way: TF/IDF (Ah! but WE use a hyperbolic cosine function or haversine rather than the pedestrian cos() function that all of our competitors use – we think!). And sorry Google, PageRank only works well when you have humongous hyperlink matrices like the WWW to crunch (or should I say “Google-esque”? – since the name Google is derived from a “googol” which is 10 to the power of 100 – a honking mind-bogglingly HUGE number that deserves some serious ‘effin respect! – when Moore’s Law approaches THAT … holy shit – but I’ll most likely be dead by then but at the rate it’s going maybe not … and if I were a Googolaire – nobody else on the planet could get out of bed without my say-so).

But again, I digress. Once you guys got into the realm of enterprise search where linkage stats are basically nil, you had to slog it out with the rest of us – especially since you locked your IP into a nice-looking box and didn’t let us tinker with it (and we are going to seriously kick your butt on these Toasters because even you guys CANNOT shrink wrap search! – I mean since you don’t let anyone into your “appliances” does this include the recycling guys too?). Thanks for the occasional eye candy though – kudos to your graphics team – search may still suck but at least I get some edutainment out of it especially on holidays or on some historical day that I didn’t know about. (But just in case someone from Google sees this post and I ever want a job there – Google web search IS freaking awesome and continues to get better – it’s been my home page for well over a decade now – even before “Google” became a popular verb – but more on that later. I mean seriously, where would we be without Google, Wikipedia and smart phones today? I for one don’t want to go back. I used www.google.com more than five or six times while writing this blog just to make sure I didn’t F something up.) And to be more objective here, yes even Google search does occasionally suck and that is because at its core it uses a bag-of-words paradigm like everybody else and this problem is especially pervasive when you use more than one search term. And even the greatest relevance ranking algorithm can’t hide the basic disconnect that often surfaces with bag-of-words search – at best it’s putting lipstick on a pig (sorry pigs, maybe in the next life you’ll get more respect …).

So, search is a hard problem – but I think that too many people think that it is a solved problem (largely thanks to that really amazing Google web search – again, keep up the great work guys, no kidding, awesome! – and for the rest of the world, sorry if I use the term Google a lot in this blog– just trying to up my TF here, kinda like personalized job security SEO – gotta pay the mortgage you know). But time for a reality check: search isn’t a solved problem, and it probably never will be – because the search engine alone with the algorithms that it uses can’t solve the problem on its own. We need to give it some help. Google continues to get better and better largely because they can afford to give it LOTS of help (last Google reference, I swear). So if you have a budget like that, you probably don’t need to read the rest of this because it will focus on what can be done on a low(er?) budget scale to address the bag-of-words search problem. In other words, techniques that don’t require running MapReduce jobs on truckloads of server blades – or Hadoop for the rest of us peons that don’t work for Brin and Page Inc. (yet? I’m on LinkedIn by the way, not Facebook though – sorry Mark) – especially at that way cool Mountain View office (is this one of the Eight Wonders of the Modern World yet? I saw the movie trailers but maybe I could get a personal tour someday?).

But back to our no-longer-just-hobbyists-and-dreamers corner – thanks again Doug, you da’ man! – long live Hadoop – go Open Source whoo freakin’ hoo!! (but just to be fair – do I have to Mom? … OK, OK and oh yeah to keep my job prospects alive, right – there is a ton of good Open Source donated by the G-men – so thanks for playing at our house too! And they did publish how MapReduce and BigTable work which I’m sure was much appreciated by Doug while his kid was playing with that stuffed toy that we all know the name of now.)

Semantic Search – What is it?

The hype cycle on this one has been stuck in the Trough of Disillusionment for quite a while now. Not because it’s a bad idea (it isn’t) but because it is REALLY hard to do well. But at the end of the day it addresses the crux of the problem – search is a semantic problem and one of the reasons that search engines have such a low success baseline is that they do not operate on the semantic level – in fact, the basic search algorithms that we all know and love have absolutely no freaking clue about semantics! So given that their core understanding of meaning and nuance in language is basically NADA, a) we have a long way to go but b) injecting even a little bit of semantic awareness into the process can make a big difference (which is my elevator pitch for this blog in a nutshell – and speaking of elevator speeches, I use the ‘G’ word often to explain to people what I do – then they get it).

A (very) shallow dive into linguistics – syntax and semantics

Context-Context-Context – humans bring contextual understanding to the search interaction because we know what words and phrases mean and how they function in language (i.e., nouns, verbs and when a word is one thing and when it is another thing) and we “know stuff” that helps us disambiguate. So bottom line – humans and search engines do not speak the same “language” – we need an intelligent translation layer to help the search engine to react to what we want as opposed to what words we used to describe it.

A really great example of this is the different semantics used in boolean logic and street language. Thanks to the great mathematician George Boole for whom boolean is named, us computer geeks understand what the terms “AND” and “OR” mean – or do we really? It turns out that in the common, non-mathematical use of these words, they do not always follow boolean conventions – it depends on the context. If I want to look at shirts of a few colors at once, I don’t ask the salesperson to “show me red or blue shirts” because he or she will then ask – “OK which one do you want to see, red or blue?” But if I say “show me red and blue shirts”, he or she (or my online site if it understands this context) will bring out both. If I did that with a search engine, it would return the dreaded ZERO results because no shirts (at least of a solid color) are both red and blue. So “and” really means “or” here in a boolean sense. This is one reason why we normally punt on this one by making “and” and “or” into stop words because while they are unambiguous in a boolean sense, they are not so in language usage. It depends on whether the choices are mutually exclusive or not – solid colors are, product types are (“show me shirts and pants”), locations are (“show me hotels in Detroit and Ann Arbor”); other things are not, as in “show me big and fast and powerful and fuel-efficient cars” – this is a boolean usage pattern but it may return ZERO results unless we drop the last criterion (or include Tesla in our search index) but that is certainly not the search engine’s fault – sometimes ZERO results speaketh truth!
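The red-and-blue shirt problem is easy to reproduce. The sketch below is a toy illustration, assuming a made-up three-shirt catalog where each shirt has exactly one solid color; the helper names are hypothetical. It contrasts the strict boolean reading of "red and blue" with what the shopper actually meant.

```python
# Hypothetical catalog: each shirt has a single solid color.
shirts = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "green"},
]

def match_all(items, field, values):
    """Strict boolean AND: the field must equal every value at once."""
    return [i["id"] for i in items if all(i[field] == v for v in values)]

def match_any(items, field, values):
    """Boolean OR: the field may equal any one of the values."""
    return [i["id"] for i in items if i[field] in values]

wanted = ["red", "blue"]
strict = match_all(shirts, "color", wanted)   # boolean reading of "red and blue"
natural = match_any(shirts, "color", wanted)  # what the shopper meant
print(strict, natural)
```

A translation layer that knows color values are mutually exclusive could safely rewrite the "and" as an OR for that field, while leaving a true conjunction like "big and fast" alone.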

Traditional methods for addressing the “problem”:

I want to list the things that we all know about (or should) first to get them out of the way. These are tried and true ways to enhance bag-of-words search and I want to say a bit about each of them before I try to say something new and innovative – or failing that, some brilliant, exciting new derivations of these basic ideas.

Best Bets – Bag-Of-Words (BOW) search is never going to get this one right, so let’s just hard-code the answer and be done with it – I know exactly what you want so I’ll just give you that. Everybody’s favorite example is “Holiday Schedule” (Apache Solr calls this the QueryElevationComponent in case you were wondering). Basically we Java coding guys load a HashMap (or Dictionary if you are a Python, Perl, Ruby or C# guy or gal) that links search terms to URLs and call it a day. (Notice that when I say “guys” – as I did in my shoutout to Larry and Sergey’s minions – I am not worried about being sexist but if I use the singular, I have to make sure not to offend the “fairer sex” as they were once called because when you are one of the “guys” daintiness and charm are not valued assets).
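The "load a map and call it a day" approach looks roughly like the following Python sketch. To be clear, this is not Solr's QueryElevationComponent, just the general idea; the URL, the `BEST_BETS` table and the function names are all invented for illustration.

```python
# Hypothetical best-bets table: normalized query -> hand-picked URL.
BEST_BETS = {
    "holiday schedule": "https://intranet.example.com/hr/holidays",
}

def normalize(query):
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return " ".join(query.lower().split())

def search_with_best_bets(query, fallback_search):
    """Put the curated hit first, then whatever the engine returns."""
    organic = fallback_search(query)
    hit = BEST_BETS.get(normalize(query))
    if hit is None:
        return organic
    # Avoid listing the curated URL twice if the engine also found it.
    return [hit] + [url for url in organic if url != hit]
```

Here `fallback_search` stands in for the real engine call; any function that takes a query string and returns a list of URLs will do.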

Synonyms – Ah! Now we are starting to get down and dirty. Adding a synonym list – and continually curating it – can make a huge difference, but what is a “synonym”? Linguistically it means a word or phrase that has the same meaning as another word or phrase. I’m not going to get into the weeds here but suffice it to say that someone with a PhD in Linguistics can tell you a lot more about what “synonym” really means and doesn’t mean and would probably have some quibbles with my street definition. But let’s forge ahead anyway (you know of course what PhD really stands for don’t you?). We use synonym lists to solve search problems but sometimes the way that we use them is an even more egregious violation of the official piled higher and deeper definition – (look it up on Wikipedia if you are curious – and yes, I did donate at least once). I’ll say more about this when I discuss autophrasing later and that will get us to the heart of where syntax vs semantics matters.

Stemming, Lemmatization – This is another very valuable staple in our bag of tricks, but the devil is in the details here. Both of these work by identifying the root form of a word (the so called “stem” or “lemma”) and normalizing on that so that the index can be searched independent of plurality, possessiveness or tense. Stemming is an algorithmic approach that deals well with common word forms. The problem comes when we deal with irregular forms (mouse/mice – mongoose/mongeese?) – so we also need an approach for these edge cases. Another interesting twist on this is the relationship between which stemming algorithm you use (Lucene-Solr gives you a lot of options here), how “aggressive” it is and how that inversely affects the primary metrics of search goodness, precision and recall. A truly excellent discussion of this phenomenon can be found in Trey Grainger and Timothy Potter’s awesome book “Solr in Action” – (Check it out on Amazon.com – I am proud to say that I wrote the first and still most popular review and I meant every damn word – but Trey, Tim if you are reading this, where’s my kickback check? Geez guys at least offer to buy me a beer at Revolution for chrissake!)
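Here is a deliberately naive Python sketch of the aggressiveness trade-off. It is not one of the Lucene-Solr stemmers: the suffix lists, the `IRREGULAR` override table and the function name are made up for the example. The light setting only strips a plural "s"; the aggressive one strips more and so conflates more words, which tends to raise recall at the expense of precision.

```python
# Hypothetical override table for irregular forms an algorithm cannot guess.
IRREGULAR = {"mice": "mouse", "geese": "goose"}

LIGHT = ("s",)
AGGRESSIVE = ("ity", "ing", "es", "e", "s")

def naive_stem(token, suffixes=LIGHT):
    """Strip the first matching suffix, keeping at least three characters."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Light stemming keeps these two apart; aggressive stemming merges them.
light_pair = (naive_stem("universe"), naive_stem("university"))
aggressive_pair = (
    naive_stem("universe", AGGRESSIVE),
    naive_stem("university", AGGRESSIVE),
)
print(light_pair, aggressive_pair)
```

With the aggressive list, a search for "university" would also pull back documents about the "universe": more hits, but not all of them wanted.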

Taxonomy / Ontology – Now for the heavy stuff. This is where the Semantic Search vendors live because it’s really easy to sell, but really HARD to do – so a) I can convince you that you really need it and b) you have to pay me lots of money because – well it’s a lot of work that you don’t want to do. I call these things “Knowledge Graphs” because they are ways of representing knowledge in a data structure. When I talked earlier about how users “know stuff” about language and have collected a bunch of factoids that they can leverage when they approach search it’s because they (we?) have a built-in knowledge graph that we call our brain. A good example of knowledge context is a “fill-in-the-blanks” game on the phrase “BLANK was moonwalking”. If I substitute “Michael Jackson” (or actually James Brown who did this slick move before MJ was born) for BLANK I get one mental image, if I substitute “Neil Armstrong”, I get another. But what if I substitute “Harrison Schmitt”, “Dave Scott” or “Edgar Mitchell”? The image that you get depends on whether you are aware of the fact that these individuals were Apollo astronauts that walked on the moon after Neil Armstrong. If you didn’t know that, you are likely to get the more popular James Brown/Michael Jackson image (although the other guys are all white guys so it would probably be a comical one).

I’ll say more later about how even a little taxonomy can be a real help especially in eCommerce where we can seriously constrain the lexical context. And it’s really not hard to do if you have some good rules of thumb to follow; it’s just hard to do in a comprehensive way (maintenance is a REAL bitch here) – but sometimes you don’t need to be comprehensive – every improvement in precision is a win. As we say, search applications are never finished.

Machine-Learning – semi-automated classification: Here’s another technique that has been “around the block” for some time now. It’s used a lot and is even the basis of the name for one of the old search engine vendors Autonomy who after consuming Verity is now consumed within the HP metroplex alongside Ross Perot’s guys (Side note – one of the reasons that we are leaving the Vengine era and entering the era of Open Source Dominance of the search market is that all of the old search vendors have been acquired by the mega-companies, Microsoft, Oracle, IBM, Hewlett-Packard where they have to compete within their own company with all the other stuff that these companies sell, so it’s much harder to get tech support on the phone – not only that, Apache Solr just simply kicks all of their butts in a serious way – and at a much better price-point! In other words, when we deploy Solr – preferably with Lucidworks Fusion, instead of Endeca – and here’s the toll free number to call, operators are standing by – we are not helping to pay for the other-mega-rich-guy-named-Larry’s yachts! And no matter how much money these guys acquire, they can’t freakin’ buy our engine if just to shut it down – like Autonomy did with Verity – because it belongs to US, the people – no I’m not a communist but I did read a great bio on J. Robert Oppenheimer recently – who wasn’t one either.)

Phew! Sorry about that – So what was I talking about? … Oh Right. What machine-learning approaches basically do is to use mathematical “vector” crunching engines to find patterns in large amounts of text and to associate those patterns with categories or topics. It is used for both entity extraction and conceptual tagging which interestingly enough can be shown to be opposite sides of the same coin (more on this later). The process of “vectorization” is important here and to give you some context, the TF/IDF algorithm used for relevance ranking – term frequency times inverse document frequency if you must know – produces a type of vector – a quantity with both magnitude and direction in some lexical n-dimensional hyperspace (gulp! beam me up Scotty) – because it takes text data (tokens and their frequency within a document) and turns that into numerical matrices that can be used by the machine-learning algorithms to find interesting patterns. In the Lucene world, TF/IDF is called a similarity algorithm meaning that it finds the documents that are most similar to the query but similarity can be used in many creative ways as we shall see. Later, I will show a relatively inexpensive technique that was originally published in Ingersoll, Morton and Farris’ wonderful book “Taming Text” (Grant Ingersoll happens to be my boss at Lucidworks, but my shameless up-sucking here doesn’t change the fact that TT is a really great book …. but as I said, we gotta pay the mortgage).
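For the curious, here is a small pure-Python sketch of "vectorization": turning documents into TF/IDF-weighted vectors and comparing a query to them with cosine similarity. This is a textbook variant, not Lucene's actual scoring formula; the `log(N/df) + 1` smoothing, the function names and the sample sentences are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Inverse document frequency per token: log(N / df) + 1 (assumed variant)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(tokenize(doc)))
    return {t: math.log(n / count) + 1.0 for t, count in df.items()}

def vectorize(text, idf):
    """Sparse vector {token: tf * idf}; tokens unseen at fit time are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical mini-corpus
docs = [
    "hurricane katrina flood damage",
    "hurricane sandy storm damage",
    "bats roost in caves",
]
idf = fit_idf(docs)
doc_vectors = [vectorize(d, idf) for d in docs]
query_vector = vectorize("hurricane damage", idf)
sims = [cosine(query_vector, v) for v in doc_vectors]
print(sims)
```

The two hurricane documents score above zero and the bat document scores exactly zero, since it shares no tokens with the query. The same vectors could just as well be fed to a classifier or clustering routine, which is the "creative" use of similarity hinted at above.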

Machine-learning is useful because it can find out what happened and what is related to what but it doesn’t know too much about how or why things happen because again, at its core it is a statistical bag-of-words number cruncher. That is where knowledge bases come back into the picture (and other things like NLP). One way to grok this is to imagine crunching news and weather articles and coming to the conclusion that “Katrina” and “Sandy” have something to do with “hurricanes” and cost us tons of money. Using machine-learning techniques to build knowledge graphs is a really good use of this technology. The knowledge graphs can then be used to power semantic search – maybe with a little manual tweaking to remove the embarrassing stuff (like thinking that “Lincoln Junior High School” is a person – another true story). Come to think of it Watson (no not the Conan Doyle character) did just exactly this when it killed on Jeopardy (see below). It read or “ingested” everything they could shove into it (encyclopedias, etc. – but not that Wikipedia page that Stephen Colbert deliberately mis-edited no doubt) and created a knowledge graph that it could then use to give us humans a serious case of whup-ass.

NLP/AI: I lump natural language processing and artificial intelligence together – they are largely still a pipe dream but there are a few wins in this space (Siri? – let’s get a show of hands on this one …). Basically, the idea is for the computer to parse language as a human would do, understand the conceptual structures that are conveyed (whatever that means) and to then respond like a person would. Think of Majel Barrett as the Star Trek computer voice here fielding a search question posed by Mr. Data – no check that, Mr. Worf (and it’s pronounced “dayta” not “dahta” at least by all the real scientists that I know – thus my internal reaction to someone that calls themselves a “Dahta Scientist” is “I don’t think so” – also with Data talking to the Computer – isn’t that just a rather inefficient SOA mechanism? Sorry, I’ll stop now).

The most famous statement of this in the Artificial Intelligence world is the “Turing Test” named after Alan Turing – whose contributions to our science are as important as those of John von Neumann, Grace Hopper or anyone else I could try to impress you with – in which a person has to determine whether they are talking to another person or to a computer by asking it(?) a series of questions (i.e. is it Majel’s computer voice or Lwaxana Troi?). If they can’t tell, then we pass. For the most part, we computer scientists are still flunking out but the guys at IBM that built Watson and Deep Blue would probably kick up a fuss on this point. Do I need to say more about how powerful these techniques can be? But just try to buy Watson from IBM – I don’t think that there is a shrink wrap version on their download site yet (maybe that’s because you can’t find shit on their site but that’s another story – I’m picking too many fights here as it is – because one of the way cool jobs that was on my radar once is “Watson Fellow” so I’ll try not to burn any bridges in that direction as well).

Watson won Jeopardy with a lot of clever software like UIMA (Open SOURCE baby!) running on a few truckloads of server blades to get the response time down so that Watson could buzz in first (that part just doesn’t seem fair but you gotta admit, it does represent a real kickass demonstration of parallel computing chops – so OK I get it, it was a live demo, nobody got hurt).

So, if you are doing all of these things above, your search application is probably pretty damn good – i.e. it already kills. But if you want to do more or do it with less, stay tuned to this bat channel (I actually did research on bats once in my life so I’m allowed to reuse that hokey TV reference). Next bat time I’ll discuss some interesting techniques for playing tough with the big boys without having to have the resources of either of the guys named Larry, Mr. Bill, Mark or the company that the two guys named Steve founded (there is sadly only one surviving member of this dynamic duo but he is probably my biggest hero of them all and for whom I have only nice things to say – after leaving Academia, I cut my teeth in professional software writing Math games for kids on the original Mac OS – so here’s to you Woz, live long and prosper buddy!)

Da Segno al Fine

Well, thanks for bearing with me and getting to the end of this brain dump / diatribe – but now that I’ve got some of this off of my chest, maybe I’ll be more ah … well tempered next time, but on second thought … Naaahhhhh!!! Being a crusty old curmudgeon is just too much damn fun! Happy searching 🙂 But that said, it might not surprise you to learn that I prefer the earlier, bad Mr Scrooge. I thought he was way cooler, so I’ll let him have the last words – “Bah! Humbug!”