I just got back from another incredible Lucene/Solr Revolution, this year in Sin City (aka Las Vegas), Nevada. The problem is that there were so many good talks that I now can’t wait for the video tape to be put up on YouTube, because I routinely had to make very difficult choices about which one to see. I was also fortunate to be among those presenting, so my own attempt at cramming well over an hour’s worth of material into 40 minutes will be available for your amusement and hopefully edification as well. In the words of one of my favorite TV comedians from my childhood, Maxwell Smart, I “missed it by that much”. I ran 4 minutes, 41 seconds over the 40 minutes allotted, to be exact. I know this because I was running a stopwatch on my cell phone to keep me from doing just that. I had done far worse in my science career, cramming my entire Ph.D. thesis into a 15-minute slide talk at a Neurosciences convention in Cincinnati – but I was young and foolish then. I should be older and wiser now. You would think.
But it was in that week in Vegas that I reached the synthesis that I’m describing here – and I have since refined it a bit more, which is also why I am writing this blog post. When I conceived of the talk about a year ago, the idea was to do a sort of review of some interesting things that I had done and blogged about concerning facets. At the time, there must have been a “theme” somewhere in my head – because I remember having been excited about it, but by the time I got around to submitting the abstract four months later and finally putting the slide deck together nearly a year later, I couldn’t remember exactly what that was. I knew that I hadn’t wanted to do an “I did this cool thing, then I did this other cool thing, etc.” about stuff that I had mostly already blogged about, because that would have been a waste of everyone’s time. Fortunately the lens of pressure to get “something” interesting to say after my normal lengthy period of procrastination, plus the inspiration from being at Revolution and the previous days’ answers to “So Ted, what is your talk going to be about?” led to the light-bulb moment, just in the nick of time, that was an even better synthesis than I had had the year before (pretty sure, but again don’t remember, so maybe not – we’ll never know).
My talk was about some interesting things I had done with facets that go beyond traditional usages such as faceted navigation and dashboards. I started with these to get the talk revved up. I also threw in some stuff about the history of facet technologies both to show my age and vast search experience and to compare the terms different vendors used for faceting. At the time, I thought that this was merely interesting from a semantic standpoint, and it also contained an attempt at humor which I’ll get to later. But with my new post-talk improved synthesis – this facet vocabulary comparison is in fact even more interesting so I am now really glad that I started it off this way (more on this later). I was then planning to launch into my Monty Python “And Now for Something Completely Different” mad scientist section. I also wanted to talk about search and language, which is one of my more predictable soapbox issues. This led up to a live performance of some personal favorite tracks from my quartet of Query Autofilter blogs (1,2,3,4), featuring a new and improved implementation of QAF as a Fusion Query Pipeline Stage (coming soon to Lucidworks Labs) and some new semantic insights gleaned from my recent eCommerce work for a large home products retailer. I also showed an improved version of the “Who’s In The Who” demo that I had attempted 2 years prior in Austin, based on cleaner, slicker query patterns (formerly Verb Patterns). I used a screenshot for Vegas to avoid the ever-present demo gods who had bitten me 2 years earlier. I was not worried about the demo per se with my newly improved and more robust implementation, just boring networking issues and login timeouts and such in Fusion – I needed to be as nimble as I could be. But as I worked on the deck in the week leading up to Revolution – nothing was gelin’ yet.
I felt that the two most interesting things that I had done with facets were the dynamic boosting typeahead trick from what I like to call my “Jimi Hendrix Blog” and the newer stuff on Keyword Clustering in which I used facets to do some Word-2-Vec’ish things. But as I was preparing to explain these slides – I realized that in both cases, I was doing exactly the same thing at an abstract level!! I had always been talking about “context” as being important – remembering a slide from one of my webinars in which the word CONTEXT was the only word on the slide in bold italic 72 Pt font – a slide that my boss Grant Ingersoll would surely have liked (he had teased me about my well known tendency for extemporizing at lunch before my talk) – I mean, who could talk for more than 2 minutes about one word? As one of my other favorite TV comics from the 60’s and 70’s, Bob Newhart would say – “That … ah … that … would be me”. (but actually not in this case – I timed it – but I’m certainly capable of it) Also, I had always thought of facets as displaying some kind of global result-set context that the UI displayed.
I had also started the talk with a discussion about facets and metadata as being equivalent, but what I realized is that my “type the letter ‘J’ into typeahead, get back alphabetical stuff starting with ‘J’, then search for ‘Paul McCartney’, then type ‘J’ again and get back ‘John Lennon’ stuff on top” example and my heretically mad scientist-esque “facet on all the tokens in a big text field, compute some funky ratios of the returned 50,000 facet values for the ‘positive’ and ‘negative’ queries for each term and VOILA get back some cool Keyword Clusters” example were based ON THE SAME PRINCIPLE!!! You guessed it: “context”!!!
So, what do we actually mean by “context”?
Context is a word we search guys like to bandy around as if to say, “search is hard, because the answer that you get is dependent on context” – in other words it is often a hand-waving, i.e. B.S. term for “it’s very complicated”. But seriously, what is context? At the risk of getting too abstractly geeky – I would say that ‘context’ is some place or location within some kind of space. Googling the word got me this definition:
“the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed.”
Let me zoom in on “setting for an event” as being roughly equivalent to my original more abstract-mathematical PhD-ie (pronounced “fuddy”) “space” notion. In other words, there are different types of context – personal, interpersonal/social/cultural, temporal, personal-temporal (aka personal history), geospatial, subject/categorical – and you can think of each of them as some kind of “space” in which a “context” is somewhere within that larger space – i.e. some “subspace” as the math and Star Trek geeks would say (remember the “subspace continuum”, Trek fans?) – I love this geeky stuff of course, but I hope that it actually helps ‘splain stuff too … The last part, “in terms of which it can be fully understood and assessed”, is also key and resonates nicely with the Theorem that I am about to unfold.
In my initial discussion on facets as being equivalent to metadata, the totality of all of the facet fields and their values in a Solr collection constitutes some sort of global “meta-informational space”. This led to the recollection/realization that this was why Verity called this stuff “Parametric Search” and why Endeca called these facet things “Dimensions”. We are dealing with what Math/ML geeks would call an “N-Dimensional hyperspace” in which some dimensions are temporal, some numerical and some textual (whew!). Don’t try to get your head around this – again, just think of it as a “space” in which “context” represents some location or area within that space. Facets then represent vectors or pointers into this “meta-informational” subspace of a collection based on the current query and the collected facet values of the result set. You may want to stop now, get something to drink, watch some TV, take a nap, come back and read this paragraph a few more times before moving on. Or not. But to simplify this a bit (what me? – I usually obfuscate) – let’s call the set of facets and their values returned from a query the “meta-informational context” for that query. So that is what facets do, in a kinda-sorta geeky descriptive way. Works for me and hopefully for you too. In any case, we need to move on.
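To make “meta-informational context” a little less hand-wavy, here is a minimal Python sketch. It assumes a Solr JSON response in the default flat layout, where each facet field comes back as an alternating value, count, value, count list. The field names (`memberOfGroup_ss`, `genre_s`) and the sample response are made up for illustration, and `facet_context` is a hypothetical helper, not something from Solr or Fusion.

```python
from typing import Any, Dict

# A "context" here is simply {facet field: {facet value: count}}.
Context = Dict[str, Dict[str, int]]


def facet_context(response: Dict[str, Any]) -> Context:
    """Turn the facet_fields section of a Solr JSON response into a
    nested dict, dropping values whose count is zero."""
    fields = response.get("facet_counts", {}).get("facet_fields", {})
    context: Context = {}
    for field, flat in fields.items():
        # Default JSON layout alternates value, count, value, count, ...
        pairs = zip(flat[0::2], flat[1::2])
        context[field] = {value: count for value, count in pairs if count > 0}
    return context


# Hypothetical response for a query on "Paul McCartney".
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "memberOfGroup_ss": ["The Beatles", 12, "Wings", 7, "The Who", 0],
            "genre_s": ["Rock", 15, "Pop", 4],
        }
    }
}

paul_context = facet_context(sample_response)
```

The resulting dictionary is the thing this post keeps calling the query’s meta-informational context: a small map of which metadata values the result set touches, and how heavily.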
So, getting back to our example – throw in a query or two and for each, get this facet response which we are now calling the result set’s “meta-informational context” and take another look at the previous examples. In the first case, we were searching for “Paul McCartney” – storing this entity’s meta-informational context and then sending it back to the search engine as a boost query and getting back “John Lennon” related stuff. In the second case, we were searching for each term in the collection, getting back the meta-informational context for that term and then comparing that term’s context with that of all of the other terms that the two facet queries return and computing a ratio, in which related terms have more contextual overlap for the positive than the negative query – so that two terms with similar contexts have high ratios and those with little or no contextual overlap would have low ratio values hovering around 1.0.
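The first trick (store an entity’s context, then send it back as a boost) can be sketched roughly as follows. This is only an illustration under assumptions: the field name and the typeahead query `j*` are invented, the count-proportional weighting is my own choice, and the real implementation in the typeahead blog may well differ. The only Solr features relied on are the eDisMax `bq` boost-query parameter and the ordinary `field:"value"^boost` syntax.

```python
from typing import Dict

Context = Dict[str, Dict[str, int]]


def boost_query(context: Context, max_boost: float = 10.0) -> str:
    """Build a boost-query string from a stored facet context.
    Each field:value pair is boosted in proportion to its facet count."""
    top = max((c for values in context.values() for c in values.values()), default=0)
    if top == 0:
        return ""  # nothing to boost on
    clauses = []
    for field, values in sorted(context.items()):
        for value, count in sorted(values.items()):
            weight = round(max_boost * count / top, 1)
            # Escape backslashes and quotes so the value stays one phrase.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'{field}:"{escaped}"^{weight}')
    return " ".join(clauses)


# Hypothetical context stored after the user searched for "Paul McCartney".
stored = {"memberOfGroup_ss": {"The Beatles": 12, "Wings": 6}}
bq = boost_query(stored)

# Parameters for the next typeahead request when the user types "J".
typeahead_params = {"q": "j*", "defType": "edismax", "bq": bq}
```

With the boost in place, anything starting with “J” that shares Paul’s context (such as John Lennon, via The Beatles) gets pushed above the plain alphabetical matches.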
Paul McCartney and John Lennon are very similar entities in my Music Ontology and two words that are keywords in the same subject area also have very similar contexts in a subject-based “space” – so these two seemingly different tricks appear to be doing the same thing – finding similar things based on the similarity of their meta-informational contexts – courtesy of facets! Ohhhh Kaaaaay … Cool! – I think we’re on to something here!!
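The second trick, the positive/negative ratio, might look something like the sketch below. This is my reading of the description above rather than the exact formula from the clustering blog: for each facet value, compare its rate of occurrence in documents that match the term with its rate in documents that do not. The smoothing constant, the threshold of 2.0 and the toy counts are all assumptions.

```python
from typing import Dict


def context_ratios(
    pos_counts: Dict[str, int],
    pos_total: int,
    neg_counts: Dict[str, int],
    neg_total: int,
    smoothing: float = 1.0,
) -> Dict[str, float]:
    """For every facet value, compare how often it occurs in documents that
    match the term (positive query) versus documents that do not (negative
    query). Ratios near 1.0 mean no special association with the term."""
    ratios = {}
    for token in set(pos_counts) | set(neg_counts):
        # Smoothing keeps tokens missing from one side from dividing by zero.
        pos_rate = (pos_counts.get(token, 0) + smoothing) / (pos_total + smoothing)
        neg_rate = (neg_counts.get(token, 0) + smoothing) / (neg_total + smoothing)
        ratios[token] = pos_rate / neg_rate
    return ratios


# Toy numbers: 100 docs contain "guitar", 900 do not.
pos = {"amplifier": 40, "strings": 30, "the": 100}
neg = {"amplifier": 9, "strings": 18, "the": 900}

ratios = context_ratios(pos, 100, neg, 900)
related = sorted((t for t, r in ratios.items() if r > 2.0), key=lambda t: -ratios[t])
```

In this toy data “amplifier” and “strings” come out well above 1.0 while a stop word like “the” sits at 1.0, which is the behavior described above: shared context pushes the ratio up, no shared context leaves it hovering around 1.0.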
The Facet Theorem
So to boil all of this down to an elevator speech – a single takeaway slide – I started to think of it as a Theorem in Mathematics – a set of simple, hopefully self-evident assumptions or lemmas that when combined give a cool, and hopefully surprising result. So here goes.
Lemma 1: Similar things tend to occur in similar contexts
Nice. Kinda obvious, intuitive and I added the “tend to” part to cover any hopefully rare contrary edge cases but as this is a statistical thing we are building, that’s OK. Also, I want to start slow with something that seems self-evident to us like “the shortest distance between two points is a straight line” from Euclidean Geometry.
Lemma 2: Facets are a tool for exploring meta-informational contexts
OK, that is what we have just gone through space and time warp explanations to get to, so let’s put that in as our second axiom.
In laying out a Theorem we now go to the “it therefore follows that”:
Theorem: Facets can be used to find similar things.
Bingo, we have our Theorem and we already have some data points – we used Paul McCartney’s meta-informational context to find John Lennon, and we used facets to find related keywords that are all related to the same subject area (part 2 document clustering blog is coming soon, promise). So it seems to be workable. We may not have a “proof” yet, but we can use this initial evidence to keep digging for one. So let’s keep looking for more examples and in particular for examples that don’t seem to fit this model. I will if you will.
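One way to put a number on “similar contexts” is to treat each facet context as a sparse vector of field:value counts and take the cosine of the angle between two of them. To be clear, cosine similarity is my choice for illustration here, not something the examples above depend on, and the contexts below are invented.

```python
import math
from typing import Dict

Context = Dict[str, Dict[str, int]]


def flatten(context: Context) -> Dict[str, int]:
    """Collapse {field: {value: count}} into {"field:value": count}."""
    return {
        f"{field}:{value}": count
        for field, values in context.items()
        for value, count in values.items()
    }


def context_similarity(a: Context, b: Context) -> float:
    """Cosine similarity between two facet contexts (0.0 = nothing shared)."""
    va, vb = flatten(a), flatten(b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical contexts for three entities.
paul = {"group": {"The Beatles": 10, "Wings": 5}}
john = {"group": {"The Beatles": 10, "Plastic Ono Band": 5}}
roger = {"group": {"The Who": 10}}

paul_john = context_similarity(paul, john)
paul_roger = context_similarity(paul, roger)
```

Paul and John overlap heavily through The Beatles, so their score is high; Paul and Roger share no facet values, so theirs is zero. That is Lemma 1 and Lemma 2 doing their thing in about twenty lines.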
Getting to The Why
So this seems to be a good explanation for why all of the crazy but disparate-seeming stuff that I have been doing with facets works. To me, that’s pretty significant, because we all know that when you can explain “why” something is happening in your code, you’ve essentially got it nailed down, conceptually speaking. It also gets us to a point where we can start to see other use cases that will further test the Facet Theorem (remember, a Theorem is not a Proof – but it’s how you need to start to get to one). When I think of some more of them, I’ll let you know. Or maybe some optimizations to my iterative, hard-to-parallelize method.
Facets and UI – Navigation and Visualization
Returning to the synonyms search vendors used for facets – FAST ESP first called these things ‘Navigators’ which Microsoft cleverly renamed to ‘Refiners’. That makes perfect sense for my synthesis – you navigate through some space to get to your goal, or you refine that subspace which represents your search goal – in this case, a set of results. Clean, elegant, it works, I’ll take it. The “goal” though is your final metadata set which may represent some weird docs if your precision sucks – so the space is broken up like a bunch of isolated bubbles. Mathematicians have a word for this – a disconnected space. We call it sucky precision. I’ll try to keep these overly technical terms to a minimum from now on, sorry.
As to building way cool interactive dashboards, that is facet magic as well, where you can have lots of cool eye candy in the form of pie charts, bar charts, time-series histograms, scatter plots, tag clouds and the super way cool facet heat maps. One of the very clear advantages of Solr here is that all facet values are computed at query time and are computed wicked fast. Not only that, you can facet on anything, even stuff you didn’t think of when you designed your collection schema, through the magic of facet and function queries and ValueSource extensions. Endeca could do some of this too, but Solr is much better suited for this type of wizardry. This is “surfin’ the meta-informational universe” that is your Solr collection. “Universe” is apt here because you can put literally trillions of docs in Solr and it also looks like the committers are realizing Trey Grainger’s vision of autoscaling Solr to this order of magnitude, thus saving many intrepid DevOps guys and gals their nights and weekends! (Great talk as usual by our own Shalin Mangar on this one. Definitely a must-see on the Memorex versions of our talks if you didn’t see his excellent presentation live.) Surfin’ the Solr meta-verse rocks baby!
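As a small illustration of the “computed at query time” point, here is a sketch of a single dashboard-style request that asks for field facets, an ad hoc facet query and a date range facet all at once. The parameter names (`facet.field`, `facet.query`, `facet.range` and friends) are standard Solr request parameters; the host, the collection name `music` and the field names are assumptions made up for this example.

```python
from urllib.parse import urlencode


def dashboard_url(base_url: str, query: str) -> str:
    """Build a Solr select URL that returns only facet data (rows=0)."""
    params = [
        ("q", query),
        ("rows", "0"),  # we only want the facet counts, not the docs
        ("facet", "true"),
        ("facet.field", "genre_s"),  # e.g. pie chart
        ("facet.field", "memberOfGroup_ss"),  # e.g. tag cloud
        ("facet.query", "price_f:[0 TO 100]"),  # ad hoc bucket, no schema change
        ("facet.range", "release_dt"),  # e.g. time-series histogram
        ("facet.range.start", "NOW/YEAR-10YEARS"),
        ("facet.range.end", "NOW"),
        ("facet.range.gap", "+1YEAR"),
    ]
    return base_url + "?" + urlencode(params)


url = dashboard_url("http://localhost:8983/solr/music/select", "*:*")
```

Nothing in that request had to be planned at index time beyond the fields existing; the buckets, ranges and gaps are all decided per request, which is what makes the dashboard use case so flexible.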
Facets? Facets? We don’t need no stinkin’ Facets!
To round out my discussion of what my good friend the Search Curmudgeon calls the “Vengines” and their terms for facets, I ended that slide with an obvious reference to everyone’s favorite tag line from the John Huston/Humphrey Bogart classic The Treasure of the Sierra Madre, with the original subject noun replaced with “Facet”. As we all should know by now, Google uses Larry’s page ranking algorithm, also known as Larry Page’s ranking algorithm – to wit PageRank, which is a crowdsourcing algorithm that works very well with hyperlinked web pages but is totally useless for anything else. Google’s web search relevance ranking is so good (and continues to improve) that most of the time you just work from the first page so you don’t need no stinkin’ facets to drill in – you are most often already there and what’s the difference between one or two page clicks vs one or two navigator clicks?
I threw in Autonomy here because they also touted their relevance as being auto-magical (that’s why their name starts with ‘Auto’) and to be fair, it definitely is the best feature of that search engine (the configuration layer is tragic). This marketing was especially true before Autonomy acquired Verity, who did have facets, after which it was much more muddled/wishy-washy. One of the first things they did was to create the Fake News that was Verity K2 V7 in which they announced that the APIs would be “pin-for-pin compatible” with K2 V6 but that the core engine would now be IDOL. I now suspect that this hoax was never really possible anyway (nobody could get it to work) because IDOL could not support navigation, aka facet requests – ’cause it didn’t have them anywhere in the index!! Maybe if they had had Yonik … And speaking of relevance, like the now historical Google Search Appliance “Toaster”, relevance that is autonomous as well as locked down within an intellectual property protection safe is hard to tune/customize. Given that what is relevant is highly contextual – this makes closed systems such as Autonomy and GSA unattractive compared to Solr/Lucene.
But it is interesting that the two engines that consider relevance to be their best feature, eschew facets as unnecessary – and they certainly have a point – facets should not be used as a band-aid for poor relevance in my opinion. If you need facets to find what you are looking for, why search in the first place? Just browse. Yes Virginia, user queries are often vague to begin with and faceted navigation provides an excellent way to refine the search, but sacrificing too much precision for recall will lead to unhappy users. This is especially true for mobile apps where screen real estate issues preclude extensive use of facets. Just show me what I want to see, please! So sometimes we don’t want no stinkin’ facets but when we do, they can be awesome.
Finale – reprise of The Theorem
So I want to leave you with the take-home message of this rambling, yet hopefully enlightening blog post, by repeating the Facet Theorem I derived here: Facets can be used to find similar things. And the similarity “glue” is one of any good search geek’s favorite words: context. One obvious example that we have always known about, just as Dorothy instinctively knew how to get home from Oz, is faceted navigation itself – all of the documents that are arrived at by facet queries must share the metadata values that we clicked on – so they must therefore have overlapping meta-informational contexts along our facet clicks’ navigational axes! The more facet clicks we make, the smaller the “space” of remaining document context becomes and the greater the similarity of the documents left in it! We can now add this to our set of use cases that support the Theorem, along with the new ones I have begun to explore such as text mining, dynamic typeahead boosting and typeahead security trimming. Along these lines, a dashboard is just a way cooler visualization of this meta-informational context for the current query + facet query(ies) within the global collection meta-verse, with charts and histograms for numeric and date range data and tag clouds for text.
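The shrinking-space effect of facet clicks is easy to see in a toy model. The sketch below uses plain Python dictionaries in place of a Solr collection; the documents, field names and the `drill_down` helper are all invented for illustration, and in a real application each click would typically be sent as a filter query rather than filtered client-side.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]


def drill_down(docs: List[Doc], clicks: List[Tuple[str, str]]) -> List[int]:
    """Apply facet clicks one at a time and record how many documents
    remain after each one (the first entry is the starting size)."""
    sizes = [len(docs)]
    remaining = docs
    for field, value in clicks:
        # Every surviving doc must share the clicked metadata value.
        remaining = [d for d in remaining if d.get(field) == value]
        sizes.append(len(remaining))
    return sizes


docs = [
    {"genre": "Rock", "decade": "1960s", "country": "UK"},
    {"genre": "Rock", "decade": "1960s", "country": "US"},
    {"genre": "Rock", "decade": "1970s", "country": "UK"},
    {"genre": "Jazz", "decade": "1960s", "country": "US"},
]

sizes = drill_down(docs, [("genre", "Rock"), ("decade", "1960s"), ("country", "UK")])
```

Each click can only keep or shrink the set, and every document that survives agrees with the others on one more metadata axis, so the remaining documents are, by construction, more similar to each other than the set you started with.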
So to conclude, facets are fascinating, don’t you agree? And the possibilities for their use go well beyond navigation and visualization. Now to get the document clustering blog out there – darn day job!!!