Grant Ingersoll talks to Karl Wettin, an Apache Lucene and Mahout committer and independent Lucene and Solr consultant. Karl talks about lexically complex European languages, using shingled terms to simplify queries instead of complex span queries, improving spellchecking with reinforcement learning, and machine learning with Mahout.

Transcript

Interviewer: Today I'm speaking with Karl Wettin, an Apache Lucene and Mahout committer and independent Lucene and Solr consultant. Welcome, Karl, and thank you for taking the time to speak with me.
Karl Wettin: Thank you.
Interviewer: Why don't we start off by you telling us a little bit about your background before you got involved with Lucene and Solr?
Karl Wettin: I've been a developer for about 15 years, and I've been working with Java since 1.2. I did a lot of UML and code generation from UML before I got into I.R. (editor's note: Information Retrieval), and I started working full-time on I.R. perhaps six, seven years ago. I started with proprietary solutions such as Autonomy, FAST, etc., but I've always had an interest in Lucene and the open-source community, and I think my first post to the Lucene list was back in 2005 or so, regarding n-grams actually.
Interviewer: So what got you interested in Lucene specifically? Do you remember anything that stood out?
Karl Wettin: Its open-source Apache license.
Interviewer: Were you seeing customers of yours asking for open source, or did you just think this was a better way to deliver value to them, or was it purely "it's open source for me and I can go see what's under the hood"?
Karl Wettin: It's open source for me. It was egoistic: I thought it was cool, and I got an insight into how an inverted index works. And honestly, there are still many parts of Lucene that are pure black-box magic for me these days.
Interviewer: Yeah, I think we all have that. So then how did you make the transition? I know that you're now an independent consultant doing a lot of work in Lucene and Solr. How did you start to fit that in with, hey, I can make a living at this?
Karl Wettin: It pretty much just happened. I found an interest in it and played around with it, and I picked up a few jobs because of my interest in Lucene. Basically, people started contacting me and asking me to help them out, and that's the way it still is.
Interviewer: Can you tell us about some of the problems you've solved with Lucene and Solr? What kind of implementations have you done for people?
Karl Wettin: Well, I've done many small implementations, little Facebook-style things and the like, where people need a search engine over general text: personal information, blogs, whatever. But lately I've been working a lot with large corpora that contain small amounts of text per document, such as Yellow Pages data, where an entry is a name and an address. I've also looked into an upcoming project around movie titles for an online movie distribution site here in Sweden. So: large corpora with small amounts of information per document.
Interviewer: And by small you mean we're talking a sentence. So is there anything special that you do to return better results, to deal with that kind of sparseness of data?
Karl Wettin: Shingles are n-grams at the word level, basically concatenated words. I combine the words into a single term, so I get the phrase functionality built into each term, and I usually add a lot of payload boosts. I also try to decompose terms, stem them, and put them back in, in case people make common typos. I've got to find a good English example here; it's usually Swedish.

Here's the problem with Swedish: it's a rather lexically complex language. It's a North Germanic language, and most North Germanic languages compound words. So where English has "T-shirt hanger" as two words, "T-shirthanger" would be one word in Swedish. On top of that there's stemming, and the first word can appear in different grammatical forms; it could be 16 words compounded into one rather long word. So for, say, street names in the Yellow Pages, I might want to decompose the words, if they're already compounded, into stemmed parts and then recompose them.

And then I want to do the same sort of analysis on the query, of course, so I can match as much as possible. So even if the user forgets that, say, "Danish Road" would be a single compound word in Swedish and types "Daneroad" instead, it still matches. And people tend to forget. This is the biggest problem I've been facing: users always forget, or they don't really know what they're searching for, so they type in something wrong.

So I try to break up all the compound words, stem them down, do a lot of analysis at index time, and fill up the index with lots of terms that cover possible typos and human errors, because people forget that it was this grammatical form of Danish or Denmark or Denmark Road, and then match it up again. And then I add lots of finely tuned payload boosts on all the terms in the index.
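(Editor's note: to make the index-time side concrete, here is a minimal sketch of such a shingling analyzer. It targets a recent Lucene API, which postdates this interview and whose package names vary across versions; Karl's decomposition, stemming, and payload filters would be additional custom stages in this chain.)

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Tokenize, lowercase, then emit word-level shingles so short phrases
// become single terms in the index.
public class ShingleAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream stream = new LowerCaseFilter(source);
    // Keep single words and add 2- and 3-word shingles: "danish road"
    // yields "danish", "road", and the extra term "danish road".
    stream = new ShingleFilter(stream, 2, 3);
    return new TokenStreamComponents(source, stream);
  }
}
```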

Interviewer: So just for a little background for our listeners, can you explain what a payload is in Lucene?
Karl Wettin: A payload is metadata at the term level, at the term position level actually.
Interviewer: So you could store an actual array of bytes right with a term.
Karl Wettin: With a term position.
Interviewer: With a term position, right.
Karl Wettin: So a specific occurrence of a term in a specific place in a text can be associated with a bunch of bytes. I use those bytes to store a weight, a boost.
Interviewer: So another example would be part-of-speech tags: if people had part-of-speech tags for terms, they could store the tag in a payload and then they would have a way of tracking the part of speech. So how do you calculate that weight?
Karl Wettin: It's layers and layers of filters. Usually the original term has a weight of one, and then I add ASCII folding on top of that, normalizing all the upper-ASCII characters; a weird Swedish O with two dots over it turns into an O without dots. So for that token I lower the weight by a factor of 0.9, perhaps.
Interviewer: Okay.
Karl Wettin: And then I add another layer of something, perhaps decomposition of that token without the upper-ASCII character, and then I also add the decomposition and stemming with the wicked upper-ASCII letter.
Interviewer: Right.
Karl Wettin: So I get all these permutations of terms, and for each layer of filtering I apply to the token, I lower the boost a little bit. So if the user actually types a perfect match of the source data, they'll match the original token, and that will boost them up really well. But of course I do the same analysis on the query, with stemming and decomposition of everything, so they will also match everything else that I boosted down.
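(Editor's note: a rough sketch of how those per-layer weights can be stamped onto tokens as payloads. The class name and the idea of one fixed weight per filter stage are illustrative, not Karl's actual code; a payload-aware query or similarity would read the weight back at scoring time.)

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Each filter stage stamps its weight onto every token it emits,
// e.g. 1.0f for original tokens, 0.9f after ASCII folding, and so on.
public final class WeightPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final float weight;

  public WeightPayloadFilter(TokenStream input, float weight) {
    super(input);
    this.weight = weight;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    // Encode the weight as a 4-byte float payload on this term position.
    payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(weight)));
    return true;
  }
}
```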
Interviewer: So it's kinda a way of making sure that you maintain your precision while still having good recall.
Karl Wettin: And what I get out of this, of course, is pretty good matches even when users type wrong. Let's take another example here; decomposition is actually a major part of all the things I work with.
Interviewer: One of the things maybe we could do is put up some examples.
Karl Wettin: Rembrandtsplein, that's a place in Amsterdam: "Rembrandt's Plein." But people might spell it like "Rembrandt Plein" or "Rembrandt of Plein," so I've got to decompose it, stem around, recompose, and add that into the index. Otherwise they could never match the original token, Rembrandtsplein, when someone searches for "Rembrandt of Plein."
Interviewer: Right.
Karl Wettin: They are compound tokens.
Interviewer: So how do you manage all of these variations in Lucene? What kind of analysis tricks are you doing there?
Karl Wettin: First, I try to find some simple rules for what is a prefix and what is a suffix in a word. So in the example of Rembrandtsplein, I do simple suffix analysis of all the data and find out that "plein" means square.
Interviewer: Yeah.
Karl Wettin: Like Trafalgar Square. So then I've got to find all the suffixes, which might be street, road, avenue, whatever, and I create a filter that decomposes on these things. I mean, I start out simple, and if it works well enough, then it's well enough, but sometimes you've got to start using a dictionary. Somebody actually posted a decomposition filter, and I think it's in Lucene; I had nothing to do with that. I've never even used it; I just know it's there. It's a more general thing that works on dictionaries, while I always find my own rules.
Interviewer: Because you're dealing with these small documents that are in a pretty specific genre. So you must get a pretty decent explosion in index size with this, but you're saying the trade-off is worth it.
Karl Wettin: Oh yeah. And I never do any complex queries. I mean, I don't do any span queries. Well, actually I do, because I use BoostingTermQuery, which is a span query, but basically I only do term queries. I never do phrases, I never do nears, I never do fuzzies; I don't do anything like that. I do terms. It works out really well, because these sorts of queries are quick to execute, and I get the phrase functionality by combining words into shingles, so I've actually replaced the advanced stuff with simpler stuff. In the end I wind up with a huge index with a mountain of terms.
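(Editor's note: a sketch of what the query side of this looks like under the same assumptions; the field name and terms are invented. Because phrases were collapsed into single shingle terms at index time, a "phrase" search is just an OR over cheap term queries.)

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ShingleQueries {
  // "danish road" becomes three cheap term lookups; the shingle term
  // carries the phrase semantics, no PhraseQuery or SpanQuery needed.
  static Query build() {
    BooleanQuery.Builder q = new BooleanQuery.Builder();
    q.add(new TermQuery(new Term("name", "danish")), Occur.SHOULD);
    q.add(new TermQuery(new Term("name", "road")), Occur.SHOULD);
    q.add(new TermQuery(new Term("name", "danish road")), Occur.SHOULD);
    return q.build();
  }
}
```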
Interviewer: You pay the cost at index time, and indexing can usually be done offline or as a background process, so that makes sense. So you use the, is it the ShingleMatrixFilter? Is that what it's called?
Karl Wettin: That is what I use mostly.
Interviewer: That can output all these different permutations.
Karl Wettin: That's the difference between the ShingleFilter and the ShingleMatrixFilter: the shingle matrix filter is actually three-dimensional. Basically it creates permutations, and one could say that it allows for multiple-token synonyms. That's a pretty easy way to explain it.
Interviewer: Okay.
Karl Wettin: Even though that's not quite true. But you could say a shingle will contain three words, three tokens. One example would be the test case where I use "hello world," and a synonym for "hello" could be "greetings and salutations," which is actually three tokens, and it could then combine them and produce "hello world," "greetings and salutations," "salutations world." And this is rather useful for, say, a Yellow Pages solution where you have names. You might have first names, last names, middle names, and you want to combine all the permutations, because you might not know that the first name is really the name they go by. They might be using their middle name as their everyday name.
Interviewer: Oh.
Karl Wettin: My name is actually Johan, but I use Karl.
Interviewer: Right, gotcha.
Karl Wettin: Somebody else might be a Johan where that really is their middle name, so you create combinations and permutations of the names, or of whatever it might be they want permutations on.
Interviewer: Makes sense.
Karl Wettin: But I also use the shingles for what I mentioned before, compounding: that we compound words. So I actually compose the query with no spaces when I place it, and that way I handle a common error, at least in Swedish, which is that people decompose words when they're typing. Rather than typing "shirtholder" as one word, they type "shirt holder" as in English. I can catch that by also storing the shingle term without the separator in between, and still get a hit.
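(Editor's note: one way to get those glued-together terms is a shingle pass with an empty token separator; again an illustrative sketch, not Karl's actual filter chain.)

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;

// A shingle pass with an empty separator emits "shirtholder" for the
// input "shirt holder", so a query that compounds words differently
// from the index can still match on a single term.
public class GluedShingles {
  static TokenStream glue(TokenStream input) {
    ShingleFilter shingles = new ShingleFilter(input, 2, 2);
    shingles.setTokenSeparator(""); // "shirt" + "holder" -> "shirtholder"
    shingles.setOutputUnigrams(false); // emit only the glued pairs
    return shingles;
  }
}
```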
Interviewer: So I know you've made some other contrib modules for Lucene. Can you tell us about those? What comes to mind is the instantiated index, and I also know you've done some stuff with spellchecking that's maybe in the issue tracker. Is it committed yet?
Karl Wettin: No, it's not committed. It's in there, but I still need more test data, and I need to test it on some different things. InstantiatedIndex is basically an alternative to the RAMDirectory solution, where RAMDirectory is still the same thing as a file system directory: rather than seeing the file system as a byte stream, RAMDirectory keeps the byte stream in memory. So when RAMDirectory loads a term, it actually has to unmarshal the term from its byte stream and so on. InstantiatedIndex, on the other hand, is just aggregated instances of objects in memory, so it always has the terms available in memory, it always has the term positions available in memory; it has everything as instances of objects.
Interviewer: Right.
Karl Wettin: So this makes it rather quick to navigate the data, and on small corpora I've noted something like a hundred-fold speed difference between RAMDirectory and InstantiatedIndex, although that falls off pretty steeply; the curve is pretty steep, so it's about level with RAMDirectory after, say, 20,000 documents of 2K size.
Interviewer: So this is kind of also an alternative to the MemoryIndex, which can only hold...
Karl Wettin: Mm-hm.
Interviewer: ...one document in memory at a time.
Karl Wettin: Right, where this can hold an unlimited number of documents.
Interviewer: And then you can search it and use it just as you can use any of the other Lucene APIs?
Karl Wettin: Yeah, it provides you with an index reader, so you can use it like any other directory implementation, more or less.
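(Editor's note: usage looked roughly like this, reconstructed from memory of the old org.apache.lucene.store.instantiated contrib module, which has long since been removed from Lucene; treat the class and method names as approximate.)

```java
// Approximate sketch of the removed contrib API; not compilable
// against any current Lucene release.
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.instantiated.InstantiatedIndex;
import org.apache.lucene.store.instantiated.InstantiatedIndexReader;

// Copy an existing on-disk index into plain object instances in RAM,
// then search it through a normal IndexReader/IndexSearcher.
InstantiatedIndex ii = new InstantiatedIndex(diskReader);
InstantiatedIndexReader reader = ii.indexReaderFactory();
IndexSearcher searcher = new IndexSearcher(reader);
```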
Interviewer: What was your use case for developing this?
Karl Wettin: I wanted to understand how the inverted index functioned, so I wrote it. Also, back then I worked with a persistence mechanism called Prevayler. The hypothesis behind Prevayler is that you can keep all of your domain objects in RAM at all times; rather than flushing things to disk periodically or committing to a database, it creates a snapshot of your memory on disk. Everything you do against memory is passed through as a command pattern instance: it encapsulates a method in a class, serializes that, and puts it on a transaction log.

And I found this really interesting, because it was so speedy if you could fit everything in memory. It turned out, in some of their demos, it was 5,000 times faster than running on MySQL with some ORM or something. So I thought, this is sort of interesting; let's see how that works with everything in memory at all times, and that's why I built it. It worked out pretty well, and I found use cases for it later on; I actually didn't have any real use cases when I built it.

But now it's used for the spellchecker implementation you mentioned, the one I've been working on. And that spellchecker is actually a bit more machine-learning oriented than the current spellchecker in the trunk.
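(Editor's note: a tiny sketch of the command-pattern idea Karl describes, with illustrative names rather than Prevayler's actual API: every mutation is a serializable object that is journaled before being applied to the in-memory model, so replaying the journal over the last snapshot rebuilds the state.)

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Every mutation of the in-memory model is a serializable command.
interface Command<M> extends Serializable {
  void executeOn(M model);
}

// Journal the command first, then apply it; on restart, replaying the
// journal over the last snapshot rebuilds the model.
class Journal<M> {
  private final M model;
  private final List<Command<M>> log = new ArrayList<>(); // stands in for a disk log

  Journal(M model) { this.model = model; }

  void execute(Command<M> command) {
    log.add(command);         // 1. persist the command (serialize to disk in reality)
    command.executeOn(model); // 2. mutate the in-memory model
  }
}
```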

Interviewer: So what does it do that's different?
Karl Wettin: It uses a methodology called reinforcement learning, which here is the process of learning from the users' mistakes. If somebody types in a query, gets no hits, and then changes their query pretty quickly, I can pick that up: okay, they typed "helo" with one L, and then one second or ten seconds later they typed "hello" with two Ls and got a thousand hits. And then they clicked on something after typing "hello" with two Ls. That gives me an indication that they didn't really mean the first thing; they probably meant "hello" with two Ls. This requires a lot of user load, though, so that you pick everything up.
Interviewer: Right.
Karl Wettin: I don't know if Google does this, but I suspect they do something similar, learning from the mistakes their users make. It works pretty well. I've tested it with one corpus, which is half a million queries or something, and I follow the user sessions, what they do, and how they change their queries. And it picks up stuff like acronyms. People might be searching for "Lucene in Action," and then they search for "LIA" instead, and they still get sort of the same hits, and they click on sort of the same hits. Then the reinforcement algorithm can learn that, okay, "LIA" is probably related to "Lucene in Action," so when someone searches for "Lucene in Action," it will say, "Hey, did you perhaps mean LIA?"
Interviewer: Hmm.
Karl Wettin: And vice versa. It's a cool technique. I've never really used it in commercial settings, but it's interesting stuff to work with. And I use InstantiatedIndex here when looking at the order of tokens. Maybe someone searches for "world hello," and then I can say, okay, it's more probable that you're searching for "hello world," based on how often the phrase "hello world" occurs compared to "world hello."
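(Editor's note: a toy sketch of the session heuristic Karl describes; all names and thresholds are invented, not the actual patch's API. A failed query quickly followed in the same session by a successful, clicked query is counted as a correction pair.)

```java
import java.util.HashMap;
import java.util.Map;

public class QueryCorrectionLearner {
  // failed query -> (follow-up query -> how often it looked like a correction)
  private final Map<String, Map<String, Integer>> corrections = new HashMap<>();

  public void observe(String failedQuery, String followUpQuery,
                      long secondsBetween, int followUpHits, boolean clicked) {
    // Only count it when the rewrite came quickly, got hits, and was clicked.
    if (followUpHits > 0 && clicked && secondsBetween < 30) {
      corrections.computeIfAbsent(failedQuery, k -> new HashMap<>())
                 .merge(followUpQuery, 1, Integer::sum);
    }
  }

  // Suggest the correction seen most often for this query, if any.
  public String suggest(String query) {
    Map<String, Integer> m = corrections.get(query);
    if (m == null) return null;
    return m.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(null);
  }
}
```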
Interviewer: So does it also take into account what is in the index? I mean, is it layered on top of the Lucene spellcheck?
Karl Wettin: Yeah, you could use the Lucene spellcheck as a secondary suggestion scheme within it: when it doesn't have any suggestions for you, because it hasn't learned anything yet, it will go to second-level, algorithmic suggesters and pick up suggestions there.
Interviewer: I mean, you still wanna give suggestions that are in the index versus...
Karl Wettin: Mm-hm.
Interviewer: ...just suggesting things that other people happen to mistype as well, which aren't in the index.
Karl Wettin: But actually, I prefer working on queries, suggesting based on what other people search for rather than searching the index, because the index is usually so huge compared with just an index of queries. So I actually create an index of all the queries that people make and use that as the a priori data set, rather than using the full Lucene index as the a priori data set, because it's so much speedier to just search the queries, and usually it's good enough.
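(Editor's note: a minimal sketch of the query-log index idea against a recent Lucene API; the field name and in-memory directory choice are illustrative. Each logged query becomes a tiny document, and suggestion lookups run against this small index instead of the full content index.)

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class QueryLogIndex {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory(); // small enough to live in RAM
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // Each past user query becomes its own tiny document.
      for (String q : new String[] {"hello world", "lucene in action", "lia"}) {
        Document doc = new Document();
        doc.add(new TextField("query", q, Field.Store.YES));
        writer.addDocument(doc);
      }
    }
  }
}
```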
Interviewer: That kinda gives you a freshness feature, too, right, in that you...
Karl Wettin: Yeah, it does.
Interviewer: ...you're favoring what other people are searching for most recently.
Karl Wettin: If you have enough user load, it won't really...
Interviewer: ...matter that much.
Karl Wettin: So if you have like 500,000 queries that you built the thing on and keep feeding it, you won't notice any trends. I mean, you probably would, but not that much.
Interviewer: So speaking of machine learning, this is really the first chance I've had to talk with a fellow Mahouter.
Karl Wettin: Mahout.
Interviewer: The British pronunciation. So tell me, how did you get involved in machine learning, and then what led you into Mahout?
Karl Wettin: I thought it was a very interesting field, and I was just curious about the technology. Maybe five, six years ago I started reading about it, and I kidnapped this guy here in Malmö, Håkan Källerstrand, as a mentor.
Interviewer: Not literally kidnapped, right?
Karl Wettin: Yeah, no, no.
Interviewer: We don't need any international crime incidents reported here.
Karl Wettin: No. But he's been working a bit on different sorts of machine learning, from item recommendation (people that bought this book also bought that book) to basic classification like naive Bayes. He's done it commercially, but mostly I think he does this for fun, just like me. And he showed me WEKA, which is this GPL library out of somewhere in New Zealand, right?
Interviewer:
Yeah, the University of Waikato, I think it is.
Karl Wettin:
Yeah, it's a great library, and I've been working a lot with it. It has lots and lots of different algorithms, from clustering to classification and whatnot. I played around with it, and basically I thought it was cool, but I didn't really do anything with it beyond fooling around with data.

Nowadays I use it to clean up data sets. The raw data that's to be fed into Lucene is usually bad in one way or another, and in my data sets right now there are mixed-up fields: Field A should be a first name and Field B should be a last name, but sometimes they put the last name as the first name and the first name as the last name. And maybe I want to put a little more weight on the last name than on the first name, or whatnot. But if the data is dirty, that messes everything up. So you've got to clean the data sets, and machine learning, classification in particular, is a good way at least to get an indication of how dirty the data is, and maybe it can also automatically clean up the data for you.

Interviewer:
You would go through and train a learner, saying, "Okay, this data's clean; this data's dirty," have it learn from that, and then you could apply it to the larger corpus as a whole?
Karl Wettin:
Yeah. Or if you know for a fact that 98 percent of the data is good, then you can just train the classifier on all the data you have, and then cross-reference: classify the data with itself. Because you know it's not that plausible that someone is called Mike as a last name; there are maybe five people with Mike as their last name and six zillion people with Mike as their first name, and the classifier will pick up on that, on the probability that it is a last name. You're going to get some bad data back, but with some manual checking of the data it works out pretty well.
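(Editor's note: a toy version of this idea, with invented field handling and threshold: count how often each token occurs in the first-name and last-name fields across the corpus, then flag records whose last name is overwhelmingly used as a first name elsewhere.)

```java
import java.util.HashMap;
import java.util.Map;

public class SwappedNameDetector {
  private final Map<String, Integer> firstCounts = new HashMap<>();
  private final Map<String, Integer> lastCounts = new HashMap<>();

  // First pass: tally how each token is used across the whole corpus.
  public void observe(String first, String last) {
    firstCounts.merge(first.toLowerCase(), 1, Integer::sum);
    lastCounts.merge(last.toLowerCase(), 1, Integer::sum);
  }

  // Second pass: true if this record's "last name" is far more common
  // as a first name, i.e. the fields were probably swapped.
  public boolean looksSwapped(String last) {
    int asFirst = firstCounts.getOrDefault(last.toLowerCase(), 0);
    int asLast = lastCounts.getOrDefault(last.toLowerCase(), 0);
    return asFirst > 20 * (asLast + 1); // crude threshold, tune on real data
  }
}
```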
Interviewer:
Right, I think this is often the case with machine learning: you still need a human in the loop to go through and...
Karl Wettin:
Yeah, at least in the beginning.
Interviewer:
So how did that lead you into Mahout?
Karl Wettin:
I think it started a year and a half ago, when you sent me a mail pointing out that I'd been doing a bit of machine learning stuff that I'd posted in the Lucene JIRA, and asked if I wanted to join a new project for distributed machine learning. Which is very interesting, I must say, because there is no good library out there that does distributed computing for machine learning. We talked earlier about WEKA, and WEKA consumes so many resources; it's impossible to try to classify or cluster large data sets, you run out of RAM all the time. So I thought that would be interesting, and also Lucene is very related.

I mean, there are many related things: analysis of queries, analysis of the data to be indexed, clustering data. I have always found the "more like this" feature in Lucene interesting, even though I never used it, and I thought, hey, maybe one could start clustering on text and classifying on text. And there are lots of tools in Lucene for cutting up text and stemming and whatnot, which could help out for machine learning on text. That was my way into Mahout, why I thought it was interesting: hey, I could apply machine learning to my Lucene index.

And I still don't really know what I wanna do with it, just that it's cool.

Interviewer:
Mahout's still in pretty early stages; we're about to do a 0.1 release here, and I think that'll help spread the word, and it'll include more examples and more demos.
Karl Wettin:
But there are so many things one can do with machine learning. I don't know if I should start listing what's cool about machine learning, but...
Interviewer:
Well, you can give a few. Maybe that would be helpful for people who aren’t familiar.
Karl Wettin:
Well, clustering, perhaps, is something that we've been focusing on quite a bit in Mahout. That is the idea of splitting up data sets into bags, where each bag contains things that are similar to each other. Say you were doing biometrics and you wanted to do facial recognition, Big Brother stuff. You say, okay, split up all these items into two bags, and then hopefully you'd end up with one bag that contains males and another bag that contains females. But it might also be one bag with the people that have a mustache and one bag with the people that don't, so you don't really know how it will split; it will just cluster up your data.
Interviewer:
Right, it’s unsupervised.
Karl Wettin:
Right.
Interviewer:
You can’t tell it what to get. You get back what it thinks is appropriate.
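(Editor's note: a toy one-dimensional k-means run illustrating the "two bags" idea; the data and starting centroids are invented. The algorithm happily splits the points into two clusters, but nothing tells it which attribute the split will correspond to.)

```java
// Toy 1-D k-means: split six points into two "bags" around two
// centroids by alternating assignment and centroid-update steps.
public class TwoBags {
  public static void main(String[] args) {
    double[] points = {1.0, 1.2, 0.9, 5.0, 5.3, 4.8};
    double c1 = 0.0, c2 = 6.0; // arbitrary starting centroids
    for (int iter = 0; iter < 10; iter++) {
      double sum1 = 0, sum2 = 0;
      int n1 = 0, n2 = 0;
      for (double p : points) { // assign each point to its nearest centroid
        if (Math.abs(p - c1) <= Math.abs(p - c2)) { sum1 += p; n1++; }
        else { sum2 += p; n2++; }
      }
      if (n1 > 0) c1 = sum1 / n1; // move each centroid to the mean
      if (n2 > 0) c2 = sum2 / n2; // of the points assigned to it
    }
    System.out.printf("bag 1 around %.2f, bag 2 around %.2f%n", c1, c2);
  }
}
```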
Karl Wettin:
Mm-hm. Other cool stuff: I use classification a lot for language identification. I do analysis of the user queries in Lucene to figure out what language they're in. Once again, I work with these large, niche data sets where each piece of data is very small, but it could be in 16 different languages. And I have different rules for North Germanic languages, how to work on those, and I have rules for how to work on Latin languages, and I have one rule for how to work on whatever else.

And of course I want to match my stemming and everything else against what's in the index. And it's very easy to classify data when you have, once again, names or street addresses. It's rather simple to figure out that a street address is in a specific language, because there are common suffixes and prefixes, like "street."
Interviewer:
Right.
Karl Wettin:
If it contains "street," it probably is not German, although if it's "strasse," then it's rather probable that it is German. So that's something I use machine learning for rather often. There's actually a Bayesian classifier patch, LUCENE-1039 I think, which comes with an example that does language identification. And I've used that exact patch with InstantiatedIndex as the data matrix. It can run on any Lucene index, but of course InstantiatedIndex is perfect for this: I load it with 10,000 small documents, and it's 30 times speedier than a RAMDirectory. What else do I do with machine learning?
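(Editor's note: a toy version of the simple suffix rules Karl mentions, with invented rules; his actual setup learns rules from the data and can also use the Bayesian classifier patch above.)

```java
// Toy suffix rules for guessing an address's language; real rules
// would be derived from the corpus itself.
public class AddressLanguageGuesser {
  static String guess(String address) {
    String a = address.toLowerCase();
    if (a.endsWith("strasse") || a.endsWith("straße")) return "de";
    if (a.endsWith("gatan") || a.endsWith("vägen")) return "sv";
    if (a.endsWith("street") || a.endsWith("road")) return "en";
    return "unknown";
  }
}
```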
Interviewer:
I think that's plenty. I mean, there are other things: classification is also being able to take web pages and put them into certain buckets, say this web page is about sports, this one's about entertainment; that kind of thing is pretty traditional classification. We've also added some of the genetic programming stuff. And Taste is the collaborative filtering piece, item recommendations and user recommendations. People are probably familiar with that from Amazon, where it says, "Well, you're interested in this book; ten other people were also interested in these other books." So Taste is good stuff for Mahout, too. That's definitely true.

Any tips for people who are just getting started with machine learning? It sounds like you are pretty self-taught.
Karl Wettin:
No, I mean –
Interviewer:
Any tips for getting started?
Karl Wettin:
I mean, like, find the data. There's really good test data distributed with WEKA, simple things. I enjoy the weather data: you can predict weather using their data sets, and they're very small data sets. They have another data set of employment agreements that contains things like: this guy has been employed for so long, he has this many vacation days, he gets paid this much, and we think this is a good agreement or a bad agreement. And then, based on that, you try to classify the rest of the agreements, set up the classifier, twiddle it around, and try to get good results.

And really, I never know which algorithm to use or which one is best, so trial and error. And WEKA, again, is a great library. It contains an Experimenter, or I don't know exactly what it's called, but it has multiple parts, and one is a test suite where you can set up: okay, I wanna classify on these data sets, I wanna run these 62 algorithms, and I wanna have these permutations of settings. Then you just start it and come back after a week, and it will say, "Okay, this algorithm with these settings, that was the best for you." And then you choose that, 'cause I mean, I have no idea when it's better to use a support vector machine rather than a
Interviewer:
Right.
Karl Wettin:
naïve Bayesian or an R-Tree or whatever.
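(Editor's note: in code, a single run of what the WEKA Experimenter automates looks roughly like this, using the standard WEKA API and the weather data set that ships with WEKA; the file path is assumed.)

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Cross-validate one classifier on one data set; the Experimenter
// repeats this over many algorithms and settings and ranks them.
public class WeatherExperiment {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.nominal.arff");
    data.setClassIndex(data.numAttributes() - 1); // class attribute: "play"
    Classifier nb = new NaiveBayes();
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(nb, data, 10, new Random(1)); // 10-fold CV
    System.out.println(eval.toSummaryString());
  }
}
```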
Interviewer:
Right.
Karl Wettin:
So trial and error, play around, fool around. It’s a lot of fun. It is.
Interviewer:
Yeah, definitely. I mean, I'm in the same boat. That's kinda how I got started with machine learning: just trying it out, playing around with it in different scenarios I was looking at in a previous job, and it's just evolved from there.
Karl Wettin:
There are a few good books, too. One we've mentioned on the Mahout list a couple of times: Toby Segaran (is that his name?) wrote a book called "Programming Collective Intelligence."
Interviewer:
Right.
Karl Wettin:
And it covers how an inverted index works, how different classifiers and clusterers work, collaborative filtering. It's a super book to read, especially for me; I never took any math classes, I mean, I'm at high school algebra, and he explains everything with Python code, very simple to understand.
Interviewer:
And even if you're not a Python person (I don't know very much Python at all), I picked up that book and I was able to understand the examples.
Karl Wettin:
Mm-hm.
Interviewer:
So yeah, that's definitely a good book. There's also, though, the WEKA book; is it called Data Mining?
Karl Wettin:
Yeah, Data Mining. I think it’s just called Data Mining.
Interviewer:
Yeah.
Karl Wettin:
That's a heavy book, but it's a good book. It explains the stuff pretty well; still lots of Greek letters in there, though, but that's okay. I've learned a lot from that book, too, but for me it's more like a reference guide: how does this feature selection scheme work? Well, then I look it up in that book.
Interviewer:
Well, this has been really great, Karl. I don’t really have any more questions. Is there anything else you wanna talk about?
Karl Wettin:
I think we should talk about user behavior a little bit. There's one thing that I often run into: people write search engines and add all these features that they think are good for the user and will help the user.
Interviewer:
Right.
Karl Wettin:
But they don't really know. I'd like to point out that people should start analyzing their log files and finding patterns, see what users have problems with, and try to solve those things first, prioritize that. That is something people forget all the time.
Interviewer:
I mean, that's one of the things I talk a lot about when it comes to relevance tuning. People say, "Oh, my search isn't returning the right results," but they've only looked at two or three searches, instead of looking at their logs, seeing what people are searching on, and then deciding what to do.

I just wanted to thank you once again for your time, Karl, and good luck with your current Yellow Pages implementation.
Karl Wettin:
Thank you.