Query Autofiltering IV: A Novel Approach to NLP

This is my fourth blog post on a technique that I call Query Autofiltering. The basic idea is that we can use meta information stored within the Solr/Lucene index itself (in the form of string or non-tokenized text fields) to generate a knowledge base from which we can parse user queries and map phrases within the query to metadata fields in the index. This enables us to re-write the user’s query to achieve better precision in the response.

Recent versions of Query Autofiltering, which uses the Lucene FieldCache as a knowledge store, are able to do this job rather well but still leave some unresolved ambiguities. This can happen when a given metadata value occurs in more than one field (some examples of this below), so the query autofilter will create a complex boolean query to handle all of the possible permutations. With multiple fields involved, some of the cross-field combinations don’t exist in the index (the autofilter can’t know that) and an additional filtering step happens serendipitously when the query is run. This often gives us exactly the right result but there is an element of luck involved, which means that there are bound to be situations where our luck runs out.

As I was developing demos for this approach using a music ontology I am working on, I discovered some of these use cases. As usual, once you see a problem and understand the root cause, you can then find other examples of it. I will discuss a biomedical / personal health use case below that I had long thought was difficult or impossible to solve with conventional search methods (not that query autofiltering is “conventional”). But I am getting ahead of myself. The problem crops up when users add verbs, adjectives or prepositions to their query to constrain the results, and these terms do not occur as field values in the index. Rather, they map to fields in the index. The user is telling us that they want to look for a key phrase in a certain metadata context, not all of the contexts in which the phrase can occur. It’s a Natural Language 101 problem! – Subject-Verb-Object stuff. We get the subject and object noun phrases from query autofiltering. We now need a way to capture the other key terms (often verbs) to do a better job of parsing these queries – to give the user the accuracy that they are asking for.

I think that a real world example is needed here to illustrate what I am talking about. In the Music ontology, I have entities like songs, the composers/songwriters/lyricists that wrote them and the artists that performed or recorded them. There is also the concept of a “group” or “band” which consists of group members who can be songwriters, performers or both.

One of my favorite artists (and I am sure that some, but maybe not all of my readers would agree) is Bob Dylan. Dylan wrote and recorded many songs and many of his songs were covered by other artists. One of the interesting verbs in this context is “covered”. A cover, in my definition, is a recording by an artist who is not one of the song’s composers. The verb form “to cover” is the act of recording or performing another artist’s composition. Dylan, like other artists, recorded both his own songs and songs of other musicians, but a cover can be a signature too. So for example, Elvis Presley covered many more songs than he wrote, but we still think of “Jailhouse Rock” as an Elvis Presley song even though he didn’t write it (Jerry Leiber and Mike Stoller did).

So if I search for “Bob Dylan Songs” – I mean songs that Dylan either wrote or recorded (i.e. both). However if I search for “Songs Bob Dylan covered”, I mean songs that Bob Dylan recorded but didn’t write and “covers of Bob Dylan songs” would mean recordings by other artists of songs that Dylan wrote – Jimi Hendrix’s amazing cover of “All Along The Watchtower” immediately comes to mind here. (There is another linguistic phenomenon besides verbs going on here that I will talk about in a bit.)

So how do we resolve these things? Well, we know that the phrase “Bob Dylan” can occur many places in our ontology/dataset. It is a value in the “composer” field, the “performer” field and in the title field of our record for Bob Dylan himself. It is also the value of an album entity since his first album was titled “Bob Dylan”. So given the query “Bob Dylan” we should get all of these things – and we do – the ambiguity of the query matches the ambiguities discovered by the autofilter, so we are good. “Bob Dylan Songs” gives us songs that he wrote or recorded – now the query is more specific and some ambiguity remains, but we are still good because we have value matches for the whole query. However, if we say “Songs Bob Dylan recorded” vs “Songs Bob Dylan wrote” we are asking for different subsets of “song” things. Without help, the autofilter misses this subtlety because there are no matching fields for the terms “recorded” or “wrote” so it treats them as filler words.
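To make the fan-out concrete, here is a minimal Python sketch of what happens when a phrase is a known value in several fields. It is only an illustration, not the Java code of the actual component: the PHRASE_FIELDS dictionary stands in for the lookup against the index, and the title_s field name is my assumption (composer_ss and performer_ss are the fields used in this post).

```python
# Hypothetical stand-in for the knowledge base: phrase -> fields in which
# that phrase occurs as a value. The real component gets this from the index.
PHRASE_FIELDS = {
    "bob dylan": ["composer_ss", "performer_ss", "title_s"],  # title_s is assumed
}

def fan_out(phrase):
    """Build an OR query over every field in which the phrase is a known value."""
    fields = PHRASE_FIELDS.get(phrase.lower(), [])
    if not fields:
        # Not a known metadata value: leave it as ordinary query text.
        return phrase
    clauses = ['{}:"{}"'.format(field, phrase) for field in fields]
    return "(" + " OR ".join(clauses) + ")"

print(fan_out("Bob Dylan"))
print(fan_out("recorded"))
```

The second call shows the gap: “recorded” is not a value in any field, so nothing in this step can use it to narrow the fields chosen for “Bob Dylan”.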

To make the query autofilter a bit “smarter” we can give it some rules. The rule states that if a term like “recorded” or “performed” is near an entity (detected by the standard query autofilter parsing step) like “Bob Dylan” that maps to the field “performer_ss” then just use that field by itself and don’t fan it out to the other fields that the phrase also maps to. We configure this like so:

performed,recorded,sang => performer_ss

and for songs composed or written:

composed,wrote,written by => composer_ss

Where the list of synonymous verb or adjective phrases is on the left and the field or fields that these should map to on the right. Now these queries work as expected! Nice.
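As a rough sketch of how such rules could be read and applied (in Python, and not the actual implementation), the snippet below parses the two lines above and narrows the candidate fields for an entity. The function names are mine, and for simplicity it only checks that the verb appears somewhere in the query rather than near the entity.

```python
def parse_verb_rules(lines):
    """Parse 'verb,verb phrase => field,field' lines into a verb -> fields dict."""
    rules = {}
    for line in lines:
        left, right = line.split("=>")
        fields = [field.strip() for field in right.split(",")]
        for verb in left.split(","):
            rules[verb.strip().lower()] = fields
    return rules

def restrict_fields(query, entity_fields, rules):
    """Narrow the entity's candidate fields if a mapped verb occurs in the query."""
    padded = " " + query.lower() + " "
    for verb, fields in rules.items():
        # Simplification: the verb may appear anywhere, not only near the entity.
        if " " + verb + " " in padded:
            narrowed = [f for f in entity_fields if f in fields]
            if narrowed:
                return narrowed
    # No verb matched: keep the full fan-out.
    return entity_fields

rules = parse_verb_rules([
    "performed,recorded,sang => performer_ss",
    "composed,wrote,written by => composer_ss",
])
dylan_fields = ["composer_ss", "performer_ss"]
print(restrict_fields("Songs Bob Dylan recorded", dylan_fields, rules))
print(restrict_fields("Songs Bob Dylan wrote", dylan_fields, rules))
print(restrict_fields("Bob Dylan Songs", dylan_fields, rules))
```

With no mapped verb in the query, the entity still fans out to all of its fields, so the plain “Bob Dylan Songs” behavior is unchanged.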

Another example is if we want to be able to answer questions about the bands that an artist was in or the members of a group. For questions like “Who’s in The Who?” or “Who were the members of Procol Harum?” we would map the verb or prepositional phrases “who’s in” and “members of” to the group_members_ss and member_of_group_ss fields in the index.

who’s in,was in,were in,member,members => group_members_ss, member_of_group_ss

Now, searching for “who’s in the who” brings back just Messrs. Daltrey, Entwistle, Moon and Townshend – cool!!!

Going Deeper – handling covers with noun-noun phrase ambiguities

The earlier example that I gave, “songs Bob Dylan covered” vs. “covers of Bob Dylan songs” contains additional complexities that the simple verb-to-field mapping doesn’t solve yet. Looking at this problem from a language perspective (rather than from a software hacking point of view) I was able to find an explanation and from that a solution. A side note here: my pre-processing of the ontology detects covers by finding the opposite relation, that is, when the performer of a song is also one of the composers. Index records of this type get tagged with an “original_performer_s” field and a “version_s:Original” to distinguish them from covers at query time (which are tagged “version_s:Cover”).
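The tagging step in that side note can be sketched roughly as follows. This is a simplified Python illustration, not my actual pre-processing code: the record layout (a dict with composer_ss and performer_ss lists) and the tag_version name are assumptions, and when several performers are also composers it just keeps the first one.

```python
def tag_version(record):
    """Tag a recording as Original or Cover based on performer/composer overlap."""
    tagged = dict(record)
    composers = set(record.get("composer_ss", []))
    # Performers who are also composers of the song.
    shared = [p for p in record.get("performer_ss", []) if p in composers]
    if shared:
        tagged["version_s"] = "Original"
        # Simplification: keep only the first performer who is also a composer.
        tagged["original_performer_s"] = shared[0]
    else:
        tagged["version_s"] = "Cover"
    return tagged

dylan = tag_version({
    "title_s": "All Along The Watchtower",
    "composer_ss": ["Bob Dylan"],
    "performer_ss": ["Bob Dylan"],
})
hendrix = tag_version({
    "title_s": "All Along The Watchtower",
    "composer_ss": ["Bob Dylan"],
    "performer_ss": ["Jimi Hendrix"],
})
print(dylan["version_s"], hendrix["version_s"])
```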

Getting back to the language thing, it turns out that in the phrase “Bob Dylan songs covered”, the subject noun phrase is “Bob Dylan songs”! That is, the noun entity is the plural form of song, and the noun phrase “Bob Dylan” qualifies that noun to specify songs by him – it’s what is known in linguistics as a “noun-noun phrase” meaning that one noun “Bob Dylan” serves as an adjective to another one, “song” in this case. Remember – language is tricky! However, in the phrase “Songs Bob Dylan covered”, now “Songs” is the object noun, “Bob Dylan” is the subject noun and “covered” is the verb. To get this one right, I devised an additional rule which I call a pattern rule: if an original_performer entity precedes a composition_type song entity, use that pattern for query autofiltering. This is expressed in the configuration like so:

covered,covers:performer_ss => version_s:Cover | original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_

To break this down, the first part does the mapping of ‘covered’ and ‘covers’ to the field performer_ss. The second part sets a static query parameter version_s:Cover and the third part:

original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_

Translates to: if an original performer is followed by a recording type of “song”, use original_performer_s as the field name.
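To show how the three parts of that line separate, here is a small Python sketch that splits the rule into its pieces. It only illustrates the syntax as described above; the parse_pattern_rule function and the dictionary keys are names I made up for this example, not part of the actual code.

```python
def parse_pattern_rule(rule):
    """Split 'verbs:field => static | pattern=>target' into its parts."""
    head, _, tail = rule.partition("|")
    trigger, _, static = head.partition("=>")
    # The verb list and its field are separated by the last colon.
    verbs, _, verb_field = trigger.strip().rpartition(":")
    pattern, _, target = tail.partition("=>")
    return {
        "verbs": [v.strip() for v in verbs.split(",")],
        "verb_field": verb_field.strip(),
        "static_filter": static.strip(),
        "pattern": [p.strip() for p in pattern.split(",")],
        "target": target.strip(),
    }

rule = ("covered,covers:performer_ss => version_s:Cover | "
        "original_performer_s:_ENTITY_,recording_type_ss:Song"
        "=>original_performer_s:_ENTITY_")
for key, value in parse_pattern_rule(rule).items():
    print(key, "->", value)
```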

We also want this pattern to be applied in a context sensitive manner – it is needed to disambiguate the bi-directional verb “cover” so we only use it in this situation. That is, this pattern rule is only triggered if the verb “cover” is encountered in the query. Again, these rules are use-case dependent and we can grow or refine them as needed. Rule-based approaches like this require curation and analysis of query logs but can be a very effective way to handle edge cases like this. Fortunately, the “just plug it in and forget it” part of the query autofiltering setup handles a large number of use cases without any help. That’s a good balance.

With this rule in place, I was able to get queries like “Beatles Songs covered by Joe Cocker” and “Smokey Robinson songs covered by the Beatles” to work as expected. (The answer to the second one is that great R&B classic “You’ve Really Got A Hold On Me”).

Healthcare concerns

Let’s examine another domain to see the generality of these techniques. In healthcare, there is a rich ontology that we can think of as relating diseases, symptoms, treatments and root biomedical causes. There are also healthcare providers of various specialties and pharmaceutical manufacturers in the picture among others. In this case, the ontologies are out there (like MeSH) courtesy of the National Library of Medicine and other affiliated agencies. So, imagine that we have a consumer healthcare site with pages that discuss these entities and provide ways to navigate between them. The pages would also have metadata that we can both facet and perform query autofiltering on.

Let’s take a concrete example. Suppose that you are suffering from abdominal pain (sorry about that). This is an example of a condition or symptom that may be benign (you ate or drank too much last night) or a sign of something more serious. Symptoms can be caused by diseases like appendicitis or gastroenteritis, can be treated with drugs or may even be caused by a drug side effect or adverse reaction. So if you are on this site, you may be asking questions like “what drugs can treat abdominal pain?” and maybe also “what drugs can cause abdominal pain?”. This is a hard problem for traditional search methods and the query autofilter, without the type of assistance I am discussing here, would not get it right either. For drugs, the metadata fields for the page would be “indication” for positive relationships (an indication is what the drug has been approved for by the FDA) and “side_effect” or “adverse_reaction” for the dark side of pharmaceuticals (don’t those disclaimers on TV ads just seem to go on and on and on?).

With our new query autofilter trick, we can now configure these verb and preposition phrases to map to the right fields:

treat,for,indicated => indication_ss

cause,produce => side_effect_ss,adverse_reaction_ss

Now these queries should work correctly: our search application is that much smarter – and our users will be much happier with us – because as we know, users asking questions like this are highly motivated to get good, usable answers and don’t have the time/patience to wade through noise hits (i.e. they may already be in pain).
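For illustration only, here is a Python sketch of the filter each question would end up with under the two rules above. The VERB_FIELDS table mirrors that configuration; the symptom_filter function is a hypothetical helper, and it assumes the symptom phrase has already been recognized by the standard autofilter step.

```python
# Mirrors the two healthcare rules above: verb or preposition -> fields.
VERB_FIELDS = {
    "treat": ["indication_ss"],
    "for": ["indication_ss"],
    "indicated": ["indication_ss"],
    "cause": ["side_effect_ss", "adverse_reaction_ss"],
    "produce": ["side_effect_ss", "adverse_reaction_ss"],
}

def symptom_filter(query, symptom):
    """Return a field-restricted filter for the symptom, or None if no rule fires."""
    words = query.lower().replace("?", " ").split()
    for word in words:
        if word in VERB_FIELDS:
            clauses = ['{}:"{}"'.format(f, symptom) for f in VERB_FIELDS[word]]
            return " OR ".join(clauses)
    return None

print(symptom_filter("what drugs can treat abdominal pain?", "abdominal pain"))
print(symptom_filter("what drugs can cause abdominal pain?", "abdominal pain"))
```

The two questions differ by a single verb, yet they end up against entirely different fields, which is the distinction a plain keyword search cannot make.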

You may be wondering at this point how many of these rules we will need. One thing to keep in mind, and the reason I used examples from two different domains, is the domain-specific nature of these problems. For general web search applications like Google, this list of rules might be very large (but then again so is Google). For domain specific applications as occur in enterprise search or eCommerce, the list can be much more manageable and use-case driven. That is, we will probably discover these fixes as we examine our query logs, but now we have another tool in our arsenal to tackle language problems like this.

Using Natural Language Processing techniques to detect and respond to User Intent

The general technique that I am illustrating here is something that I have been calling “Query Introspection”. A more plain-English way to say this is inferring user intent. That is, using techniques like this we can do a better job of figuring out what the user is looking for and then modifying the query to go get it if we can. It’s a natural language processing or NLP problem. There are other approaches that have been successful here, notably using parts of speech analysis on the query (POS) to get at the nouns, verbs and prepositions that I have been talking about. This can be based on machine learning or algorithmic approaches (rules based) and can be a good way of parsing the query into its linguistic component parts. IBM’s famous Watson program needed a pretty good one to parse Jeopardy questions. Machine learning approaches can also be applied directly to Q&A problems. A good discussion of this is in Ingersoll et al.’s great book Taming Text.

The user intent detection step, which the classical NLP techniques discussed above and now the query autofilter can do, represents phase one of the process. Translating this into an appropriately accurate query is the second phase. For POS tagged approaches, this usually involves a knowledge base that enables parts of speech phrases to be mapped to query fields. Obviously, the query autofilter does this natively because it can get the information from the “horse’s mouth” so to speak. The POS / knowledge base approach may be more appropriate when there is less metadata structure in the index itself as the KB can be the output of data mining operations. There were some excellent talks on this at the recent Lucene/Solr Revolution in Austin (see my blog post on this). However, if you already have tagged your data, manually or automagically, give query autofiltering a shot.

Source code is available

The Java source code for this is available on GitHub for both Solr 4.x and Solr 5 versions. Technical details about the code and how it works are available there. Download and use this code if you want to incorporate this feature into your search applications now. There is also a Solr JIRA submission (SOLR-7539).
