Grant Ingersoll talks to Karl Wettin, an Apache Lucene and Mahout committer and independent Lucene/Solr consultant. Karl talks about lexically complex European languages, using shingled terms to replace complex span queries with simple term queries, improving spellchecking with reinforcement learning, and machine learning with Mahout.
Interviewer:Today I’m speaking with Karl Wettin, an Apache Lucene and Mahout committer and independent Lucene/Solr consultant. Welcome, Karl, and thank you for taking the time to speak with me.
Karl Wettin:Thank you.

Interviewer:Why don’t we start off by you telling us a little bit about your background before you got involved with Lucene and Solr?
Karl Wettin:I’ve been a developer for about 15 years. I’ve been working with Java since 1.2. I did a lot of UML and code generation from UML before I got into I.R. (editor’s note: Information Retrieval), and I started working full-time on I.R. perhaps six or seven years ago.

I started with proprietary solutions such as Autonomy, FAST, etc., but I’ve always had an interest in Lucene and the open-source community, and I think my first post on the Lucene list was back in 2005 or so, regarding n-grams, actually.
Interviewer:So what got you interested in Lucene specifically? Do you remember anything that stood out?
Karl Wettin:Its open-source Apache license.
Interviewer:Were you seeing customers of yours asking for open-source, or you just thought this was a better way to deliver value to them, or it was purely just it’s open-source for me and I can go see what’s under the hood?
Karl Wettin:It’s open-source for me. It was selfish, really. I thought it was cool, and I got insight into how an inverted index works. And honestly, there are still many parts of Lucene that are pure black-box magic for me these days.
Interviewer:Yeah, I think we all have that. So then how did you make the transition? I know now that you’re an independent consultant and you do a lot of work in Lucene and Solr. How did you start to fit that in with, hey, I can make a living at this?
Interviewer:Can you tell us about some of the problems you’ve solved with Lucene and Solr? What kind of implementations have you done for people?
Karl Wettin:Well, I’ve done many small implementations, like a Facebook thingy or whatever, where people need a search engine to search general blog text, common personal information, blogs and so on. But lately I’ve been working a lot with large corpora that have small amounts of text per document, such as Yellow Pages information, where it’s a name and an address.

I’ve looked into an upcoming project involving movie titles for an online movie distribution site here in Sweden. So I’ve been looking at large corpora with small amounts of information per document.
Interviewer:By small you mean we’re talking a sentence. So is there anything special that you do to return better results or deal with that kind of sparseness of data?
Karl Wettin:Shingles are n-grams at the word level, so basically concatenated words. I combine the words into a single term, so I get the phrase functionality built into each term, and I usually add a lot of payload boosts, and I try to decompose terms, stem them, and put them back in, in case people make some common typos. I’ve got to find a good English example here; it’s usually Swedish.

Here’s the problem with Swedish. We have a rather lexically complex language. It’s a North Germanic language, and most North Germanic languages compose words. So where in English you have “T-shirt hanger” as two words, “T-shirthanger” is one word in Swedish. And you could also have stemming, and/or different grammatical forms of the first word. It could be 16 words composed into one rather long word. So you might want to look at, say, street names in the Yellow Pages, where I might want to decompose the words, if they’re already composed, into stemmed parts and then recompose them.

And then I want to do the same sort of analysis on the query, of course, so I can try to match as much as possible. So even if they forgot that it was “Danish Road”, which in Swedish would be a single word, it could be “Daneroad” instead of “Danish Road”. And people tend to forget. The biggest problem I’ve been facing is that users always forget, or they don’t really know what they’re searching for, so they type in something wrong.
So I try to break up all the composed words, stem them down, do a lot of analysis at index time, and fill up the index with lots of terms that are possible typos or human errors, because they forgot that it was this grammatical form of Danish or Denmark or Denmark Road, and then match it up again. And then I add lots of finely tuned payload boosts on all the terms in the index.
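(Editor’s note: the word-level shingle idea Karl describes can be sketched in a few lines of Python. This is not Lucene code; the helper name, separator, and window size are purely illustrative.)

```python
def shingles(tokens, max_size=2, sep="_"):
    """Build word-level n-grams ("shingles"): adjacent words are
    concatenated into single terms, so at query time a phrase match
    becomes a plain, cheap term match."""
    out = []
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(sep.join(tokens[i:i + size]))
    return out

print(shingles(["danish", "road"]))  # ['danish', 'road', 'danish_road']
```

Indexing the bigram `danish_road` as a single term is what lets a phrase-like query run as an ordinary term lookup.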
Interviewer:So just for a little background for our listeners, can you explain what a payload is in Lucene?
Karl Wettin:A payload is metadata at the term level, at the term position level, actually.
Interviewer:So you could store like an actual array of bytes right with a term.
Karl Wettin:A term position.
Interviewer:With a term position, right.
Karl Wettin:So a specific occurrence of a term in a specific place in a text can be associated with a bunch of bytes. I use the bytes to store a weight, a boost.
Interviewer:So another example would be, maybe if people had part-of-speech tags for terms, they could store the part-of-speech tag in a payload, and then they would have a way of tracking the part of speech. So how do you calculate that weight?
Karl Wettin:It’s layers and layers of filters. Usually the original term has a weight of one, and then I add ASCII folding on top of that, normalizing all upper-ASCII characters; a weird Swedish O with two dots over it turns into an O without dots. So for that token I lower the weight by a factor, .9 perhaps.
Karl Wettin:And then I add another layer of something, perhaps decomposition of that token without the upper-ASCII characters, and then I also add the decomposition and stemming with the wicked upper-ASCII letter kept.
Karl Wettin:So I get all these permutations of terms, and for each layer of filtering I do on the token, I lower the boost a little bit. So if the user actually types in a perfect match of the source data, they’ll match the original token, and that’ll boost them up really well. But I also do the same analysis, of course, with stemming and decomposition of everything in the query, so they will actually match everything else that I boosted down.
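(Editor’s note: the layered filtering Karl describes, where each derived token gets a slightly lower payload boost than the form it came from, might be sketched like this in Python. The folding table, the crude stemmer, and the 0.9 decay factor are placeholders, not what any real analyzer uses.)

```python
FOLD = str.maketrans("åäöÅÄÖ", "aaoAAO")  # toy ASCII-folding table

def variants(term, decay=0.9):
    """Return (term, boost) pairs: the original term at weight 1.0,
    then each filtered form (ASCII-folded, then crudely stemmed) at a
    progressively lower boost, so exact matches score highest."""
    results = [(term, 1.0)]
    weight = 1.0
    folded = term.translate(FOLD)
    if folded != term:
        weight *= decay
        results.append((folded, round(weight, 2)))
    stemmed = folded.rstrip("s")  # stand-in for a real stemmer
    if stemmed != folded:
        weight *= decay
        results.append((stemmed, round(weight, 2)))
    return results

print(variants("vägars"))  # [('vägars', 1.0), ('vagars', 0.9), ('vagar', 0.81)]
```

Each (term, boost) pair would then be written to the index, with the boost stored as a payload at the term position.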
Interviewer:So it’s kind of a way of making sure that you maintain your precision while still having good recall.
Karl Wettin:And what I get out of this, of course, is that I get pretty good matches even when users type things wrong. Let’s take another example here; decomposition is actually a major part of everything I work with.
Karl Wettin:Rembrandtsplein, that is a place in Amsterdam, but people might spell it like “Rembrandt plein” or “Rembrandts plein” as separate words, so I’ve got to decompose it, stem around, recompose, and add that to the index. Otherwise they could never match Rembrandtsplein, which is the original token, when someone searches for the decomposed form.
Karl Wettin:They are composed tokens.
Interviewer:So how do you manage all of these variations in Lucene? What kind of analysis tricks are you doing there?
Karl Wettin:First, I try to find some simple rules for what is a prefix and what is a suffix in the word. So in the example of Rembrandtsplein, I do simple suffix analysis of all the data and find out that “plein” is like “square”.
Karl Wettin:Like Trafalgar Square. So then I’ve got to find all the suffixes that might be street, road, avenue, whatever. Then I create a filter that decomposes by these suffixes. I mean, I start out simple, and if it works well enough, then it’s well enough, but sometimes you’ve got to start using a dictionary.

Somebody actually posted a decomposition filter, and I think it’s in Lucene. I had nothing to do with that, and I’ve never even used it; I just know it’s there. It’s a more general thing that works on dictionaries, while I always find my own rules.
Interviewer:Because you’re dealing with these small documents that are in a pretty specific genre. So you must have a pretty decent explosion in terms of index size with this, but the trade-off is worth it, is what you’re saying.
Karl Wettin:Oh yeah. I never do any complex queries. I mean, I don’t do any span queries. Well, actually, I do, because I use BoostingTermQuery, which is a span query, but basically I only do term queries. I never do phrases, I never do nears, I never do fuzzies. I don’t do anything like that. I do terms. It works out really well because it’s quick to execute these sorts of queries, and I get the phrase functionality by combining words into shingles. So I actually replace the advanced stuff with simpler stuff, and in the end I end up with a huge index with a mountain of terms.
Interviewer:You pay the cost at index time. Indexing can usually be done offline or as a background process, so that makes sense. So you use the, is it the ShingleMatrixFilter? Is that what it’s called?
Karl Wettin:That is what I use mostly.
Interviewer:That can output all these different permutations.
Karl Wettin:Even though that’s not quite true, you could say this shingle will contain three words, or three tokens. One example would be the test case where I use “hello world”, and a synonym for “hello” could be “greetings and salutations”, which actually is three tokens, and it could then combine them and output “hello world”, “greetings and salutations”, “salutations world”. And this is rather useful for, say, a Yellow Pages solution where you have names. You might have first names, last names, middle names, and you want to combine all the permutations, because you might not know that the first name is really the name they use. They might be using their middle name as their surname. The name they use, and then common names.
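(Editor’s note: a toy version of that matrix idea in Python. Here each position in the text is a column of single-token alternatives; the real ShingleMatrixFilter also handles multi-token synonyms like “greetings and salutations”, which this sketch does not.)

```python
from itertools import product

def matrix_shingles(columns, size=2, sep=" "):
    """Each position holds a column of alternative tokens (the original
    word plus synonyms); shingles are built by picking one token per
    column across a sliding window, yielding all permutations."""
    out = []
    for i in range(len(columns) - size + 1):
        for combo in product(*columns[i:i + size]):
            out.append(sep.join(combo))
    return out

cols = [["hello", "greetings", "salutations"], ["world"]]
print(matrix_shingles(cols))
# ['hello world', 'greetings world', 'salutations world']
```

The same permutation trick covers the names case: columns of first, middle, and last names produce shingles for every plausible ordering.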
Karl Wettin:My name is actually Johan but I use Karl.
Karl Wettin:But I also use the shingles for what I mentioned before, composition; that we compose words. So I actually compose the query with no spaces when I place the query, and that way I catch a common error, at least in Swedish, which is that people decompose words when they’re typing. So rather than typing “shirtholder” as one word, they type “shirt holder”, as in English. Then I can catch that by storing the shingle term without the spacer in between and still get a hit.
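(Editor’s note: the no-spacer trick might look like this in Python; index adjacent words both as a spaced shingle and as one concatenated token, so a decomposed query can still hit the compound form. Purely illustrative, not Lucene’s implementation.)

```python
def compound_terms(tokens):
    """Index each adjacent word pair both as a phrase-like shingle and
    as a single concatenated token with no spacer, so 'shirt holder'
    typed as two words can still match a stored compound 'shirtholder'."""
    terms = set(tokens)
    for a, b in zip(tokens, tokens[1:]):
        terms.add(a + " " + b)  # spaced shingle
        terms.add(a + b)        # compound form, spacer removed
    return terms

index_terms = compound_terms(["shirt", "holder"])
print("shirtholder" in index_terms)  # True
```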
Interviewer:So I know you’ve made some other contrib modules for Lucene. Can you tell us about those? What comes to mind is InstantiatedIndex, and I also know you’ve done some stuff with spellchecking that’s maybe in the issue tracker. Is it committed yet?
Karl Wettin:No, it’s not committed. It’s in there, but I still need more test data, and I need to test it on some different things. InstantiatedIndex is basically an alternative to the RAMDirectory solution. RAMDirectory is still the same thing as a file-system directory, but rather than seeing the file system as a byte stream, RAMDirectory keeps the byte stream in memory. When RAMDirectory loads a term, it has to unmarshal the term from its byte stream, and so on.

InstantiatedIndex, on the other hand, is just aggregated instances of objects in memory, so it always has the terms available in memory, it always has the term positions available in memory. It has everything as instances of objects.
Karl Wettin:So this makes it rather quick to navigate the data, and on small corpora I’ve noted something like a hundred-fold speed difference between RAMDirectory and InstantiatedIndex, although this falls off pretty steeply. The curve is pretty steep, so it’s on par with RAMDirectory after, say, 20,000 documents of 2K size.
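(Editor’s note: the contrast Karl draws can be illustrated with a toy inverted index held as live object instances, term mapped to postings list, so lookups need no unmarshalling from a byte stream. This is only the shape of the idea, not Lucene’s actual data structures.)

```python
from collections import defaultdict

class TinyInstantiatedIndex:
    """Postings kept as plain Python objects: term -> list of
    (doc_id, position) tuples, always resident in memory."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, tokens):
        # Record every occurrence of every token with its position.
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))

    def search(self, term):
        return self.postings.get(term, [])

idx = TinyInstantiatedIndex()
idx.add(0, ["hello", "world"])
idx.add(1, ["hello", "there"])
print(idx.search("hello"))  # [(0, 0), (1, 0)]
```

A byte-stream-backed directory would instead have to decode each term and posting on access, which is where the speed difference on small corpora comes from.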
Interviewer:So this is kind of also an alternative to the MemoryIndex, which can only hold one document in memory at a time.
Karl Wettin:Right, where this can hold an unlimited number of documents.
Interviewer:And then you can search and use it just as you can use RAMDirectory, with any of the Lucene APIs?
Karl Wettin:Yeah, it provides you with an IndexReader, so you can use it as you would any other directory implementation.
Interviewer:What was your use case for developing this?
Karl Wettin:I wanted to understand how the inverted index functioned, so I wrote it. Also, back then I worked with a persistence mechanism called Prevayler. The hypothesis behind Prevayler is that you can store all of your domain objects in RAM at all times. Rather than flushing anything to disk periodically or committing to a database, it creates a snapshot of your memory to disk. Everything you do against memory is passed through as a command pattern instance; it encapsulates a method in a class, and then they serialize that and put it in a transaction log.

And I found this really interesting because it was so speedy, if you can fit everything in memory. It turned out, using this for some of their demos, it was 5,000 times faster than running MySQL with some ORM or something. So I thought, this is sort of interesting, let’s see how that works with everything in memory at all times, and that’s why I built it. And it worked out pretty well, and then I found use cases for it later on; I actually didn’t have any real use cases for it when it was built.

But now it’s used for the spellchecker implementation you spoke about, which I’ve been working on. And the spellchecker is actually a bit more machine-learning oriented than the current spellchecker in the trunk.
Interviewer:So what does it do that’s different?
Karl Wettin:It uses a methodology called reinforcement learning, which here is the process of learning from users’ mistakes. If somebody types in a query and then changes it pretty quickly, and they didn’t get any hits in between, then I can pick up that, okay, they typed “helo” with one L, and then ten seconds later, or one second later, they typed “hello” with two Ls, and then they got a thousand hits. And then they clicked on something after typing “hello” with two Ls.

Then I’ve got an indication that they didn’t really mean the first thing; they probably meant “hello” with two Ls. This requires a lot of load, I mean, lots of user load, so you pick up everything.
Karl Wettin:I don’t know if Google does this, but I suspect they do something similar, where they learn from the mistakes users make. This works pretty well. I mean, I’ve tested it with one corpus, which is half a million queries or something, and I follow the user sessions, what they do, and how they change their queries. And it picks up stuff like acronyms. People might be searching for Lucene in Action, and then they search for L-I-A instead, and they still get sort of the same hits, and they click on sort of the same hits. And then the reinforcement algorithm can learn that, okay, L-I-A is probably related to Lucene in Action, so when someone searches for Lucene in Action, it would say, “Hey, did you perhaps mean L-I-A?”
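(Editor’s note: a very rough Python sketch of that query-log idea; when a zero-hit query is quickly reformulated into one that got hits and a click, count the pair as evidence of a correction. The event format and thresholds are invented for illustration.)

```python
from collections import Counter, defaultdict

def learn_corrections(sessions, max_gap=10.0):
    """Each session is a time-ordered list of (time, query, hits,
    clicked) events. A zero-hit query followed quickly by a query that
    got hits and a click votes for the second query as a correction."""
    votes = defaultdict(Counter)
    for events in sessions:
        for (t1, q1, h1, _), (t2, q2, h2, c2) in zip(events, events[1:]):
            if h1 == 0 and h2 > 0 and c2 and (t2 - t1) <= max_gap:
                votes[q1][q2] += 1
    # keep the most-voted suggestion per misspelled query
    return {q: c.most_common(1)[0][0] for q, c in votes.items()}

log = [[(0.0, "helo", 0, False), (2.0, "hello", 1000, True)]]
print(learn_corrections(log))  # {'helo': 'hello'}
```

With enough traffic, the vote counts would also capture acronym pairs like L-I-A and Lucene in Action, since both queries lead to the same clicks.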
Karl Wettin:And vice versa. It’s a cool technique. I’ve never really used it in commercial settings, but it’s interesting stuff to work with. And I use InstantiatedIndex here when looking at the order of tokens. Maybe someone searches for “world hello”, and then I can say, okay, it’s more probable that you’re searching for “hello world”, because of the occurrences of the tokens, the phrase “hello world” compared to “world hello”.
Interviewer:So does it also take into account what is in the index? I mean, is it layered on top of the Lucene spellchecker?
Karl Wettin:Yeah, you could use the Lucene spellchecker as a secondary suggestion scheme within it. When it doesn’t have any suggestions for you, because it hasn’t learned anything yet, it will go to second-level suggesters and pick up suggestions, and then it’ll go to algorithmic suggesters.
Interviewer:I mean, you still wanna give suggestions that are in the index, versus just suggesting things that other people happen to mistype as well, which aren’t in the index.
Karl Wettin:But actually, I prefer working on queries, and suggesting based on what other people search for, rather than searching in the index, because the index is usually so huge compared with just an index of queries. So I actually create an index of all the queries that people issue and use that as the a priori data set, rather than using the full Lucene index as the a priori data set, because it’s so much speedier to just search in the queries, and usually it’s good enough.
Interviewer:That kinda gives you a freshness feature, too, right, in that you.
Karl Wettin:Yeah, it does.
Interviewer:You’re favoring what other people are searching for most recently.
Karl Wettin:If you have enough user load, it won’t really. So if you have like 500,000 queries that you built the thing on and keep feeding it, you won’t notice any trends. I mean, you probably would, but not that much.
Interviewer:So speaking of machine learning, this is really the first chance I’ve had to talk with a fellow Mahouter.
Interviewer:The British pronunciation. So tell me how did you get involved in machine learning, and then what led you into Mahout?
Karl Wettin:I thought it was a very interesting field, and I was just curious about the technology. Maybe five or six years ago I started reading about it, and I kidnapped this guy here in Malmö, Håkan Källerstrand, as a mentor.
Interviewer:Not literally kidnapped, right?
Karl Wettin:Yeah, no, no.
Interviewer:We don’t need any international crime incidents reported here.
Karl Wettin:No, but he’s been working a bit on different sorts of machine learning, from item recommendation: people that bought this book also bought that book. And he’s been working on basic classification, like naïve Bayes, and has been doing stuff commercially, but mostly I think he does this for fun, just as I do. But he showed me WEKA, which is this GPL library out of somewhere in New Zealand, right?
Interviewer:…they’ve only looked at two or three searches, instead of looking at their logs and seeing what people are searching on and then deciding what to do.

I just wanted to thank you once again for your time, Karl, and good luck with your current Yellow Pages implementation.