Grant Ingersoll speaks with Lucene/Solr committer Ian Holsman about the advantages of open source in the corporate context, performance optimizations in extracting database data, and a wide range of other search topics. Ian is CTO of Relegence, an AOL subsidiary specializing in search-driven real-time news feeds.

Transcript

Grant Ingersoll:
Today I’m speaking with Ian Holsman, the CTO of Relegence, an AOL subsidiary and a heavy Lucene and Solr user. Welcome, Ian.
Ian Holsman:
Hi. How are you?
Grant Ingersoll:
I’m doing well. Could we start off by having you introduce yourself and tell us a little bit about your background before you got involved with Lucene and Solr?
Ian Holsman:
Okay. Currently I’m the CTO of Relegence, as you mentioned. I’ve been on the Internet scene since ’98, when I worked at CNET initially. I had a performance background technically and a developer background. I did my MBA around two years ago, and I currently manage a group of 50 people in AOL. So I’m not really a hands-on developer anymore, but I still get dirty every once in a while.

So my background is more of a technical architect. I used to go around seeing other people’s problems and telling them how to fix them, basically, sometimes doing a bit of coding, but I’m done coding for like ten years now, but it’s more just going into architecture systems and operational things, and just trying to figure out what’s going wrong, and just designing systems for speed, basically, and for scale.

So a lot of the background I have is running large Web sites, large properties, similar stuff. At CNET it was news.com. At AOL we’ve got the AOL Front Door, which people look at, and just other Web sites and sites. So a million to ten million to one billion page views a day is stuff that I’ve got experience doing.
Grant Ingersoll:
Great. So can you tell us a little bit more about what Relegence does and your role there?
Ian Holsman:
Okay. So Relegence is a company which was acquired by AOL in 2007 or 2008. What they do is they provide real-time news about a given topic. So they started off in the financial service industry. So give me all the news about Time Warner or Microsoft and it would come up on a page. What it does is it crawls I think 20 to 30,000 different sources on the Web, from The New York Times to AlleyInsider, for example, and grabs the latest documents and just tries to figure out what the document is about.

So it takes the names of the people out. It tries to figure out the category of the topic and the location. So what you end up with is a document with a lot of metadata. So, for example, it might say, “This document is talking about New York. The newspaper where it’s from is based in Michigan. Obama was mentioned and it’s an election topic,” and stuff like that.

There’s more to it than that, obviously, but what it ends up with is Relegence can get a document from when it was published to a screen within I’d say five minutes.
Grant Ingersoll:
Nice.
Ian Holsman:
Yeah. It’s not just all documents. You can also filter them on different things and stuff like that. Obviously it was a big hit in the financial services stuff. So we could actually see that there was a storm in Hawaii destroying the pineapple crop before Reuters and stuff like that got hold of it. So the traders loved it.

I obviously wasn’t around in Relegence then. I’m an old AOL guy. I came in afterwards. So we’ve basically moved that onto the Web and webified it. So you’ll see it on Money.aol.com news, NewsRunner, Love.com, AOL music pages. It just has a nice supplement, ’cause you can go to one page and it has news from all the other pages. It’s an aggregator, but a fairly sophisticated one.
Grant Ingersoll:
Nice. So when did you first encounter Lucene and Solr?
Ian Holsman:
I encountered – well, it wasn’t called Solr back then. We came to Lucene back in the CNET days. So a guy called Yonik and Clay Webster at CNET were trying to build an alternative to AltaVista to use for unstructured data. Now we had two issues. AltaVista wasn’t supported on 64-bit Linux machines, so we had to come up with something. So we came up with two different things.

We came up with something called Atomics, which is MySQL over HTTP. The New York Times has something similar called DBSlayer now out in the open source. And we came up with Solr, which was Search on Lucene and Resin, which was our application server at the time.

So we were using those two alternatives, and you can actually see the roots where Solr came from based on a lot of the design decisions that they had. So Lucene was used mainly in shopping.cnet.com for the faceted search type things. That was the main use of it at the time. So you’ve done a search and HP matches five of that. There’s a price range of 300 to 700, and 700 to 900 and so on, and that’s where it was used heavily.

They also had, like, three hourly batch updates. So, you can see that Solr was the – the replication of Solr was set up to do rsync, because that’s what suited the environment at the time. So a lot of the things that CNET used have still followed through in Solr.

So I wasn’t really that much involved in the day-to-day development of it. I was more doing the benchmarking and the performance analysis of it, making sure it ran smoothly and stuff like that, so it could do stuff.
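The faceted counts Ian describes (matches per brand, matches per price range) can be sketched in a few lines of Python. This is an illustrative toy, not CNET's or Solr's implementation: the product records, field names, and the `search` and `facet_counts` helpers are all made up for the example.

```python
from collections import Counter

# Hypothetical product records; field names are illustrative only.
PRODUCTS = [
    {"name": "Laptop A", "brand": "HP", "price": 650},
    {"name": "Laptop B", "brand": "HP", "price": 820},
    {"name": "Laptop C", "brand": "Dell", "price": 450},
    {"name": "Laptop D", "brand": "HP", "price": 300},
]

# Half-open price buckets [low, high), mirroring the ranges mentioned above.
PRICE_RANGES = [(300, 700), (700, 900)]


def search(docs, term):
    """Toy text match: keep documents whose name contains the term."""
    return [d for d in docs if term.lower() in d["name"].lower()]


def facet_counts(docs):
    """Count matching documents per brand and per price bucket."""
    brands = Counter(doc["brand"] for doc in docs)
    prices = Counter()
    for doc in docs:
        for low, high in PRICE_RANGES:
            if low <= doc["price"] < high:
                prices[f"{low}-{high}"] += 1
    return dict(brands), dict(prices)
```

The facet counts are computed over the search results, not over the whole catalog, which is what lets a shopping page show "HP (3)" next to the current result list.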

Grant Ingersoll:
So do you recall, was there concern about bringing an open source from a business side, or were people pretty comfortable with that at CNET?
Ian Holsman:
It wasn’t the first open source product that we did, but it was the first open source product that we, I think, actually released out. So that was a big step. I mean CNET was a heavy user of open source. This was back in 2000, 2001, and we had a cost justification with it.

We were using WebLogic or BEA WebLogic at the time, and it was something like ten grand per CPU or something like that to get a WebLogic application server running, and we looked at Resin and it was $500.00 per machine. That was the first justification, which wasn’t 100 percent open source, but it was really close to it.

We benchmarked Tomcat as well, but at the time Tomcat just didn’t work as well as Resin. So for 500 bucks per server we just said, “Look, we’ll just go for it that way.”

We also started developing the Apache 2 Web server, which is where I came into Apache from. I’m an Apache 2 developer or I used to be anyway. We basically developed some of the Web servers and that introduced us into open source.

The main way I sold it to the business at the time was we had four developers on an ancient version of Apache, and they couldn’t keep up with the development, the new requests, because we had all these customized modules that we did by ourselves. So the business case was there’s nothing here proprietary or custom or giving us a strategic advantage. Let’s just go and open source it, the modules, contribute back into it, and basically we don’t have to have a team of five developers here. We can have a team of two who look after our interests on the project, and make sure that we still have knowledge about it, but all the other stuff is maintained by other people.

It was simple. It was a simple equation. And this was the time when the dot-com bubble burst and there wasn’t enough money for anybody. The stock price was at $0.50 and we weren’t sure if the company was gonna survive another quarter. It was really interesting times if you were there. You remember them.

So anything that saved money was appreciated. So we just looked at open source as a cost saving exercise. It wasn’t the fastest. It wasn’t the most full featured, and it wasn’t maybe the easiest to use, but it saved a heck of a lot of money for us and it enabled the company to stay afloat in a lot of cases.

Grant Ingersoll:
Right.
Ian Holsman:
So, I mean CNET is a very, very heavy user of open source now. We converted most of the stuff to MySQL. I think they were using Sybase at the time. They went from BEA and Resin to Tomcat eventually, and it enabled them to drive architecture decisions in different ways. So where before we were basically putting in license costs, which limited like the number of machines, and we needed these big, heavy irons because the license cost was so huge, as soon as we dropped that and went to open source we could go to the cheap Linux boxes.
Grant Ingersoll:
Right.
Ian Holsman:
And at the time we left, I mean we went from four 280Rs, which was the top of the line Sun equipment at the time, to 100 Linux boxes running Tomcat and Solr and stuff like that. So it just changed the model of open source. It just changed the equation so much, and then that led us to make use of cheap hardware, which was a dramatic savings at the time.
Grant Ingersoll:
Wow. Yeah, I think I see that a lot. Everybody is moving over. I’ve talked to a number of customers moving over to open source. So then after that you kind of found your way to AOL, and you were part of the AOL search group right there. Now were they using Solr already when you got there or did you bring it in, and how did that all work?
Ian Holsman:
Okay. So I came in I think February 2008, or maybe March. I came in two months after Ted Cahall, who is the EVP of platforms and technologies. He’s the one who hired me. He was the ex-CIO of CNET. So I came in on his coattails. So he brought in Solr and MySQL and Java basically, and started pushing us into that direction.

So one of AOL’s issues that we had, and I’m probably gonna get flak for this from the guys inside, was that the technology group wasn’t responsive enough to the business users. It wasn’t through any fault of their own. It was just that they were using ancient technology, which just took them a long time to do anything.

They wanted to put a new feature in, they had to go into their own search engine and modify the code, and have releases, QA releases, and it was just expensive because they were too busy fixing the guts of it to worry about the features that people want. We came in and we introduced Solr to them. Obviously it was a new product and it was different, and people objected to it because that’s what happens.

At the time, the current search product that they had was superior in certain ways and inferior in other ways, but the cost of it was just prohibitive. We couldn’t maintain it and be commercially viable. So while it was good for certain things, it just wasn’t keeping up with the Joneses, so to say. It couldn’t do faceted searches. It was just hard to deploy and there were just like only two people in the company who knew it, which is the other risk.

Grant Ingersoll:
Right. Oftentimes I say to people there’s really not much point in writing a low-level TF-IDF implementation or vector space model implementation of a search engine, because Lucene does it and problem solved, so that you can then focus on your application.
Ian Holsman:
Yeah. And exactly to that point, I mean we had some guys, half my team, the majority of my developers in Relegence actually are located in Tel Aviv, Israel. They were using Lucene. They had their own internal stuff, and they found Lucene and Solr and they looked at the TF-IDF, and they’re now using that in their algorithms.

So these are guys who have PhDs and they can write these kind of things in their sleep if they really want to, and they probably do, in a good way, guys, in a good way.
Grant Ingersoll:
Definitely.
Ian Holsman:
But they looked at that stuff and it just wasn’t feasible for them. They’d rather spend their time writing the neural net stuff, which sits on top of that stuff, and doing the higher level stuff that they’re doing now than worrying about these basic things. They have got Lucene. Yeah, sure, they probably tweaked a little bit of the code and they don’t like exactly how it works in certain areas, but it’s good enough.
Grant Ingersoll:
Right. Well and that’s the beauty of it being open source, is you can change it when you need to. And if you have those guys who have those skills, then it’s all that much easier.
Ian Holsman:
Yeah. I mean it took me a while to figure out what IDF actually stood for. In Israel IDF means the Army, so it’s a slight difference there.
Grant Ingersoll:
Okay.
Ian Holsman:
Yeah. I just actually found out that they’re using Solr to do concept detection. So for example, it just came up to me that they’re doing it that way. What they’re doing is they’re taking out the words or they circled high-quality words out of a document, and they’re using a Solr search to figure out what concepts best suit it. So that might be what we call – it’s like a topic, a high-level topic, for example, crime, fashion, I don’t know, softball. So we’ve got like a million of these things, and what they’ve done is they basically take these certain keywords and they shove it into the Solr query, and they say, “Well, that best suits these kinds of things.”

So Solr is just used in a myriad of different ways inside, which isn’t just a simple do a text search on a Web page and get results. But we do use it for that as well.
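The concept detection idea (take a document's strongest keywords, run them as a query against an index of concepts, and keep the best match) can be sketched as follows. This is only a toy under stated assumptions: the concept descriptions, the IDF weighting, and the `best_concepts` helper are invented for illustration and say nothing about how Relegence actually builds its queries or scores results.

```python
import math
from collections import Counter

# Hypothetical concept "documents": each concept is described by a bag of words.
CONCEPTS = {
    "crime": "police arrest court robbery suspect trial",
    "fashion": "designer runway model dress collection",
    "softball": "pitcher inning bat team league game",
}


def build_idf(concepts):
    """Inverse document frequency of each term across the concept descriptions."""
    n = len(concepts)
    df = Counter()
    for text in concepts.values():
        df.update(set(text.split()))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def best_concepts(keywords, concepts, top_n=1):
    """Rank concepts by the summed IDF weight of the keywords they contain."""
    idf = build_idf(concepts)
    scores = {}
    for name, text in concepts.items():
        terms = set(text.split())
        scores[name] = sum(idf.get(k, 0.0) for k in keywords if k in terms)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Drop concepts that matched none of the keywords.
    return [name for name, score in ranked[:top_n] if score > 0]
```

In a real deployment the search engine does the ranking; the point of the sketch is only that "which concept fits this document" can be phrased as an ordinary keyword query.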
Grant Ingersoll:
Right. So can you describe some of the properties that use Solr at AOL and at Relegence, and maybe just talk a little bit about how Solr powers those sites?
Ian Holsman:
Sure. So if you go to any of the search boxes in any of the AOL properties, I’d say you’ve got a 90 percent chance that it’s hitting a Solr search engine in part of it. Now that could be where you go like to music.aol.com and you type in a search. It goes and hits Solr to get the results. To the actual AOL search itself, it uses Solr in the background.

Obviously the Google results, and we call Google for what they call the organic links, but everything else on the page touches Solr. We just use it like that’s from trying to figure out what you typed to show you what the best – the ads on the page. So we’ve got something called Web offers on the top, to show you a recirculation link. So if you type in Madonna, we want you to go to AOL’s page about Madonna, or if you type in pizza in Denver, we want you to go to our local thing. Those two things are powered by some heavy lifting, but they are ending up being Solr searches.

We also, if you type in a query string, we can try to figure out what site or what kind of category it is, and that’s also using Solr as well under the covers. There’s other stuff obviously on top of that, but it’s based on Solr, ’cause it’s just flexible enough to do that stuff.

Some of the interesting things that we’re using Solr for, which is beyond just the standard stuff I went through in Relegence, there’s a Web site called Love.com which is 100 percent powered by Solr. What it is, is you type in a topic like Madonna.love.com or Prince.love.com, and we first go to MySQL to look up the query for Prince, what actually do you mean by that, and then we expand that into a Solr search. And then all the results come off that and get presented on the page. So that’s one database we have of news articles from Relegence.

We have another database of photos. So if you have a look you will see there’s a photo gallery. Most of that is powered off Solr as well.

We have a federated search engine, which I think was a 1.3 release, where we have a database which is too large to fit on a single machine. So we have to split it up. I think it’s on ten machines at the moment, and it just goes and queries. Each one of the machines contains what they call a shard or a part of the database, and we use that to query I think three months worth of news articles. That’s what’s powering NewsRunner.
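The sharded setup described here boils down to scatter and gather: send the query to every shard, take each shard's best hits, and merge them into one ranked list. The sketch below shows only that merge step in plain Python. The in-memory `SHARDS` data and the `query_shard` stand-in are hypothetical; a real deployment would make a network call per shard.

```python
import heapq

# Hypothetical shards: each maps a query to (score, doc_id) hits, best first.
SHARDS = [
    {"election": [(0.9, "a1"), (0.4, "a7")]},
    {"election": [(0.8, "b3"), (0.7, "b5")]},
    {"election": [(0.5, "c2")]},
]


def query_shard(shard, query, rows):
    """Stand-in for a network call to one shard; returns its local top hits."""
    return shard.get(query, [])[:rows]


def federated_search(shards, query, rows=3):
    """Ask every shard for its top `rows` hits, then merge into a global top list."""
    partials = [query_shard(shard, query, rows) for shard in shards]
    # Each partial list is already sorted by descending score, so a k-way merge works.
    merged = heapq.merge(*partials, key=lambda hit: hit[0], reverse=True)
    return [doc_id for _score, doc_id in list(merged)[:rows]]
```

Each shard must return at least `rows` hits (when it has them), because in the worst case the whole global top list lives on a single shard.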

We also use something called Local Lucene, which is written by a guy called Patrick O’Leary, who is a committer now in the Apache group, to do geographic-based or geo-based searches for local. So we first look up Denver, for example, the ZIP code. We can look at the latitude and longitude of that and we do a bounding box, and we use the Local Lucene algorithms, which are part of that now, to show us all the pizza places in that bounding box or that five-mile radius, which is kind of cool.

I mean I wrote an application once, where we had to do that in MySQL. That was just a pain in the neck, because we had to go and get the right bounding box calculation. You had to go and store all these pre-computed things in MySQL, and the query was just dog slow compared to what is now available in search.

It’s just unbelievable how easy it is to do local-based searching now. You just put the latitude and longitude in certain fields and insert the document, and that’s it. You don’t have to worry about the bounding box calculations. There’s no sine, cosine calculations in your code anymore. It’s just done and it works.
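For a sense of what the "sine, cosine calculations" look like when you do them yourself, here is a minimal sketch of a radius search: a cheap bounding-box prefilter followed by an exact great-circle distance check. It is not Local Lucene's algorithm. The place list and helper names are assumptions, and the box math ignores the poles and the 180° meridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Hypothetical places as (name, latitude, longitude).
PLACES = [
    ("Near Pizza", 39.7537, -104.9903),
    ("Far Pizza", 40.0150, -105.2705),
]


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the search circle."""
    ang = radius_miles / EARTH_RADIUS_MILES  # angular radius in radians
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(places, lat, lon, radius_miles):
    """Names of places inside the radius: box prefilter, then exact distance."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_miles)
    hits = []
    for name, plat, plon in places:
        if not (min_lat <= plat <= max_lat and min_lon <= plon <= max_lon):
            continue  # cheap rejection before the trigonometry
        if haversine_miles(lat, lon, plat, plon) <= radius_miles:
            hits.append(name)
    return hits
```

Calling `within_radius(PLACES, 39.7392, -104.9903, 5)` searches a five-mile radius around approximate coordinates for downtown Denver.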

Grant Ingersoll:
Yeah. Actually I just did another podcast interview with Patrick O’Leary. It’s not up on our Web site, but listeners to this one will be able to see that one or listen to that one as well. So that should be interesting to correlate.

So kind of going back to all these properties, what kind of traffic is Solr serving? I mean AOL is obviously a huge property and has a ton of traffic. I mean can you talk a little bit about how much Solr is actually serving then?
Ian Holsman:
I don’t have the exact numbers. I couldn’t give you the exact numbers directly ’cause I think I’ll get into trouble for that, but we’re definitely hitting – I’d probably say over 20 million to 40 million searches a day, and that’s not including Bebo, ’cause if we include Bebo you could probably – I think they’re using Solr. AOL is a very large company and there’s not one person who’s responsible for everything. I don’t think they’re using Solr, but I think 20 million would be ballpark of searches, and that’s just doing a search. That’s not displaying a regular page. It’s people just doing searches.
Grant Ingersoll:
And a lot of those probably include faceting, too, and the local search as well, right.
Ian Holsman:
Yeah. Local search is all faceting, and we actually talked to you guys to get your algorithm. You had a faster version of faceting. I think it saved us – well I think we were going to deploy on 18 machines initially. We talked to you guys and we can get it down to five because it just sped up so much.
Grant Ingersoll:
Right. And actually, that algorithm is now going to be in Solr 1.4. So that was a nice improvement to add back into Solr, too.
Ian Holsman:
Yeah.
Grant Ingersoll:
Wow. That’s a pretty decent size Solr installation there. You’ve got 20 to 40 million queries a day, and I imagine several million or more documents as well, right, across all of those properties.
Ian Holsman:
Yeah. I mean you have to remember AOL is just not one big property. It’s lots of little ones. So for example, Love.com sits on a document size of I think five million documents, which is a month’s worth of news. I’ll just check that. So for one of the ones that we just basically just power off, and that’s just using a very basic search. I’m just checking to see exactly how many documents. That one I can tell you.

But that’s just on a single machine just doing facet searches and regular. It’s based on a document size of 5.5 million.
Grant Ingersoll:
All on a single machine. I mean obviously you have some failover and all that stuff, right.
Ian Holsman:
Yeah. I mean we’ve obviously got that replicator for traffic.
Grant Ingersoll:
Right.
Ian Holsman:
So we’ve got a couple of machines as slaves doing that. But yeah, it’s a document database of 5.5 million things, and the response time for a single request is in milliseconds.
Grant Ingersoll:
Nice.
Ian Holsman:
Now obviously we’ve got caching in there as well, because you don’t want to have every user hitting the same search in the system and wasting CPU.
Grant Ingersoll:
Sure.
Ian Holsman:
But yeah, I mean just by putting it into a Solr engine. So we had, like I said, 5.5 million documents sitting in a Solr database or in a Solr engine. And then what that enabled us to do is say, “Well what if we just run these kind of queries or we did these kind of faceted searches and do this kind of stuff?” And from that we’re about to – well when I get my poor finger out – I can say that in a podcast I guess – we’re basically gonna use that to launch a series of different Web sites and I say 10 or 15 different Web sites based on that, based on faceted searches on that data.
Grant Ingersoll:
Nice.
Ian Holsman:
It’s just too easy. We combine that. Now each of these documents are tagged location-wise. We can then use Local Lucene to do boundary searches if we wanted to on top of that. So we can have like a local news type thing. So it’s just really, really simple to add these features in. I mean Love.com, as you see it, was an idea in December last year.
Grant Ingersoll:
Seriously?
Ian Holsman:
Yeah. Bill Wilson came up to us and said, “We want to do something like this,” and it took us I think three months to basically write the front-end. Most of the work was front-end related because we wanted to get it just right, and there were some issues with scalability.

But yeah, it started off in December. We worked over. And this is just one or two developers. We just got the ingestion engine ’cause, again, we have the document stream coming through, and just shoved it into Solr.
Grant Ingersoll:
And the scalability issue, just to clarify, those were on the front-end, not with Solr, right.
Ian Holsman:
There were some issues with Solr.
Grant Ingersoll:
Some, a little bit.
Ian Holsman:
Yeah. I mean it’s not a perfect thing. It’s not working out of the box, but that was also stuff that we had issues with. For example, because we’re real-time we were updating; we were doing a commit for every document, for example, for a while, which, when you talk to these guys, they’ll tell you that’s not what you’re supposed to do. So we changed it to commit every thousand, not running queries off the master, running off slaves, and things sped up.
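The change from committing per document to committing every thousand can be sketched like this. `FakeIndex` is a hypothetical stand-in that only counts commits; it is not a Solr client, and the batch size of 1,000 is simply the figure mentioned above.

```python
class FakeIndex:
    """Hypothetical stand-in for a search index client; counts commits."""

    def __init__(self):
        self.pending = []   # added but not yet searchable
        self.visible = []   # searchable after a commit
        self.commits = 0

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.visible.extend(self.pending)
        self.pending.clear()
        self.commits += 1


def ingest(index, docs, batch_size=1000):
    """Add documents, committing once per `batch_size` adds plus a final flush."""
    count = 0
    for doc in docs:
        index.add(doc)
        count += 1
        if count % batch_size == 0:
            index.commit()
    if index.pending:
        index.commit()  # make the last partial batch searchable
    return index.commits
```

With 2,500 documents, `batch_size=1` costs 2,500 commits while `batch_size=1000` costs three. The trade-off is freshness: a document is not searchable until the batch it belongs to is committed.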
Grant Ingersoll:
Yeah.
Ian Holsman:
And that’s where optimizers come in. I mean most of the queries, what we found is when we initially blamed Solr for everything, because it was a new thing and it wasn’t our code, we got an optimizer on there and look at that. It wasn’t Solr. It was because we were calling on MySQL database 50 times every document.
Grant Ingersoll:
Yeah, I’ve heard that one before, yeah.
Ian Holsman:
Yeah. So a good optimizer is worth its weight in gold. But then I come from a performance background, so I’m biased. But yeah, I mean we got it down to I think one page on Love.com was one Solr request and around 15 MySQL requests, ’cause we had other data coming out and we just didn’t want to stick it in Solr.

So I mean, look, if I had some time there’d be some stuff I’d like to see included in Solr.
Grant Ingersoll:
What?
Ian Holsman:
One of the issues we have is a lot of stuff we do is time-based. So we want to do a search with relevancy, but show the stuff in time-based order ’cause we’re a news site, so most people care about the most recent things. That’s difficult to get your head around on how to do that.
Grant Ingersoll:
So not just sorting, but actually have the time be a factor in relevance.
Ian Holsman:
Exactly.
Grant Ingersoll:
Yeah. So I mean you could do something like function queries. I mean that’s typically what people do, like one over the time. But you guys probably want something more sophisticated than just one over time, right.
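The "one over the time" idea can be written down as a reciprocal decay multiplied into the relevance score. Solr's function queries include a `recip(x, m, a, b)` function defined as `a / (m*x + b)`; the Python below mirrors that formula, but the constants, the hit tuples, and the way the boost is combined with relevance are illustrative assumptions rather than a recommended configuration.

```python
def recip(x, m, a, b):
    """Reciprocal decay a / (m*x + b): equals a/b at x = 0 and falls as x grows."""
    return a / (m * x + b)


def recency_score(relevance, age_hours, m=0.1, a=1.0, b=1.0):
    """Relevance score scaled down by the document's age in hours."""
    return relevance * recip(age_hours, m, a, b)


def rank(hits):
    """Sort (doc_id, relevance, age_hours) hits by the recency-weighted score."""
    ordered = sorted(hits, key=lambda h: recency_score(h[1], h[2]), reverse=True)
    return [doc_id for doc_id, _relevance, _age in ordered]
```

Unlike a plain sort by date, this keeps relevance in play: a very strong match from yesterday can still outrank a weak match from five minutes ago, and the constants control how quickly age wins.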
Ian Holsman:
Yeah. And the other stuff we have is popularity. So we know, a lot of the pages we look at we know what artists are popular. So for example, on music we use a data feed from AMG I think. There’s a lot of data and there’s a lot of musicians. Now some have the same names. This also comes up in sports teams. There’s a lot of people with the same names.

So if you go and type in – I’ll stick to music, let’s say Madonna, you want the Madonna that you know and love. There’s probably another two Madonnas who have got the name Madonna. Now if we can have popularity coming back into the search thing, that would be another really cool thing, because we know this Madonna page is the one that gets the most hits. It’s just a matter of including that into the scoring algorithm so it includes it.

It can be done today. It’s just hard to do. It’s doable. It’s just one of those kind of things which isn’t as easy to do as what you’d like to just open the box and, bang, it’s done.
Grant Ingersoll:
Right. So I’ve done this actually in Solr. Now there’s what’s called – I forget the exact name. It’s like external file source or something like that, where you can actually use an external source as part of a function query. So what you can do is maintain a list of document IDs and boost values, and Solr, when you do a function query can then go and look up what those doc IDs are and what the boost value is, and then factor that into the score. And since that file is not part of the index it’s a lot easier to update. You can handle updating that in different ways.
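Grant is probably referring to Solr's external file field feature. The general pattern (keep per-document boost values outside the index, look them up at query time, and fold them into the score) can be sketched in Python as below. The `doc_id=value` line format, the document IDs, and the multiplicative boost are assumptions made for the example, not a description of Solr's internals.

```python
def parse_boosts(lines):
    """Parse 'doc_id=value' lines into a dict, skipping blank or malformed lines."""
    boosts = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        doc_id, _, value = line.partition("=")
        boosts[doc_id.strip()] = float(value)
    return boosts


def apply_popularity(hits, boosts, default=1.0):
    """Multiply each (doc_id, score) hit by its external boost and re-rank."""
    rescored = [(doc_id, score * boosts.get(doc_id, default)) for doc_id, score in hits]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)
```

Because the boost table lives outside the index, refreshing popularity figures means reloading a small mapping rather than reindexing documents. Swapping the dict for another key-value lookup, as Ian goes on to suggest, would not change `apply_popularity`.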
Ian Holsman:
So you can use Memcache for that instead of a file?
Grant Ingersoll:
I haven’t done it with Memcache, but I imagine you could. I mean that would be an interesting question to pose on one of the Solr mailing lists or something along those lines. I mean as I understand it, I didn’t write that actual piece of code, Yonik did, but if we can do external files why not other things that are external, right.
Ian Holsman:
Exactly. I mean, yeah, if you can do Memcache then what you’ve got is you can keep your scores, your popularity updated in real time, and then all of a sudden this thing queries those kind of things in real-time. So that would be great. I’ll have to look at that.
Grant Ingersoll:
Yeah. And there’s more real-time functionality coming into Lucene underneath the hood as well. So that will obviously percolate up to Solr. I mean we’ve already seen Solr and Lucene have gotten a lot faster at simply reopening searchers. So if you’re up on a newer version of Lucene and Solr that’s a whole lot faster now than it used to be, so yeah.
Ian Holsman:
And there’s projects by the LinkedIn guys who were doing it like in real-time with Lucene. I mean we’ve got our updating now. It used to be once a minute and I think they’ve got it down to once a second. So they’re replicating it across at once a second, which it’s not real-time, but for most people that’s real-time enough.
Grant Ingersoll:
So you have an update come into your master and you replicate that out to all your workers every second.
Ian Holsman:
We can.
Grant Ingersoll:
And do you have custom Solr code to help with that or you’re doing that just through hardware?
Ian Holsman:
Okay. Now you’re getting a bit too technical. No, I’m just kidding. That’s not my team. That’s a guy called Vineet in the AOL search group, who’s doing that stuff. I’m just basically making use of it.
Grant Ingersoll:
Sure.
Ian Holsman:
The way AOL is structured is we haven’t got a search team who is responsible for Solr per se. We’ve got application developers who work in different business areas and they talk to each other. So I’m, for example, in Relegence and we do stuff. Vineet is in AOL search. We’ve got guys called Shalin and Noble, who are in Bangalore, and they work on the shopping side mainly I think. Those two guys are committers now in Solr. They’re responsible.

One of the other uses that we’re looking at is the multi core functionality. So we plan to use that for e-mail. Again, not me, that’s Noble and Shalin’s area, but each user of an e-mail has their own little Solr search. So, all the e-mails are just contained in that little core. So one of the issues we had with Solr was it wasn’t very fast at loading and unloading millions of these, and Shalin and Noble have been working to fix that.
Grant Ingersoll:
Nice.
Ian Holsman:
Yeah, and they’ve been doing benchmarks. That’s where contributing stuff back we found has been useful. I mean going back to the initial conversation about why open source: we’re a large organization. I think we’ve got – popular press reports us at 5,000 to 8,000 people in AOL, which is huge, and we’ve got people in multiple different countries. And if we wanted to keep our own version of Solr, we’d have to have a centralized group somewhere to do it.

Now we don’t need that, ’cause there’s a centralized group in Apache doing that for us. And at the end of the day there’s nothing proprietary or so-called which we’ll lose sleep over if you guys grabbed it and became a competitor of ours, ’cause that’s not where our strategic advantage is. It’s not writing slightly faster Solr queries. It’s building the sites up where the data comes from, which is a lot more of the competitive stuff.

So there’s no competitive advantage to keeping these changes to ourselves, and in fact it’s exactly opposite. I mean we put Local Lucene into Solr main tree six months ago, Patrick did, and then Yonik took a look at it, Chris and a lot of the other core developers in Solr and Lucene had a look at it and fixed a couple of different things and suggested alternatives. And if we didn’t have that in the open source we wouldn’t be getting the brightest minds on search around looking at our problems and thinking of it.

So that’s the other advantage, is we don’t have to hire all the rocket scientists. They all work for their own companies, doing their own thing, and they spend some time ’cause they’re interested in this stuff. It helps them.

Grant Ingersoll:
Yeah. I often say open source is just a really effective way to form partnerships with other companies, without having to sign legal agreements with all of them on an individual basis.
Ian Holsman:
Exactly. We’re all there to make money, and we all do in different ways. But it’s like you download Linux or, like, an operating system. Solr has become an appliance in a lot of ways. I think nothing of downloading a released version. I’m still a bit wary of downloading the trunk and putting it into production, but some people in our company aren’t. There’s a lot of politics.
Grant Ingersoll:
Right.
Ian Holsman:
We have a released version. We test it. We QA it and it comes in a box. And the only thing we really have to think about is the schema. You put the schema in and put a version 1 in there, and it kind of works and you update the schema. The thing is one of the other benefits of Solr is you can do changes to the schema, add new fields, drop fields and all this kind of stuff, and there’s no multi hour reorg you have to do.
Grant Ingersoll:
And a lot of times people, especially if they come from database land, where they live in this normalized view of the world, you come into search and you de-normalize everything, and it just kind of frees your mind to go and think about a lot of other things, I think, and do things a lot faster.
Ian Holsman:
Yeah. DMOZ, for example, DMOZ.org, it still exists. We’re looking at rewriting that, and we were using Solr to do a lot of the set-based operations. If you have a look at DMOZ, it’s a hierarchical tree. So we basically did some research and we found that it’s just simpler to write it with Solr, because the queries are just so much easier. Instead of having to do some type of nested set type algorithms, which are cool and it was a lot of fun writing, but they’re just hard to get correct, all the different edge cases. With Solr it’s just a very basic Solr query on a path, and it was fast enough to do it.
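One way to make "a very basic query on a path" work for a hierarchy is to index every ancestor path of each category, so that finding a subtree becomes a single exact-match lookup instead of nested-set arithmetic. The sketch below shows that idea with an in-memory dict. The category names and helper functions are hypothetical, and this is not necessarily how the rewrite Ian mentions was built.

```python
def ancestor_paths(path):
    """Expand '/Arts/Music/Pop' into ['/Arts', '/Arts/Music', '/Arts/Music/Pop']."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]


def build_index(categories):
    """Map every ancestor path to the set of categories at or below it."""
    index = {}
    for path in categories:
        for ancestor in ancestor_paths(path):
            index.setdefault(ancestor, set()).add(path)
    return index


def subtree(index, path):
    """All indexed categories at or below `path`, as one exact-match lookup."""
    return sorted(index.get(path, set()))
```

The cost is a larger index, since each category is stored once per ancestor, in exchange for queries with no edge cases around left and right bounds.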
Grant Ingersoll:
Nice.
Ian Holsman:
One other thing I’d love to see in Solr is name-value pairs and a multi value thing. So, for example, you have a document and you have attributes, and you might have internal IDs pointing to something else, like ID5 is Madonna; ID7 is Prince and so on.

There’s no real easy way. Well, you can do it if you write code to actually store a name-value pair against a document and have it search the value and stuff like that, but keep the keys associated with it. You have to have like two lists. So that’s one of the other little things which I’d like to see happen.
Grant Ingersoll:
Because you wouldn’t just – you know, a field contains the value. I mean that’s a name value. But you want to be able to search by the value, but get back the name of – or the key values. Am I understanding correctly?
Ian Holsman:
Yeah. So, for example, I’ve got a document and the document contains many people. Now each one of those people has an ID. So if you think about it, on the invoice we have stock items. So you’ve got stock name and you’ve got the items. I need to have that product ID, or the person ID, available in a document associated with this name.

So I might go and order, have an invoice for five widgets, product ID75. I want to have the widget and product ID75 on the same line, so I don’t have to store a list of ID numbers and a list of the product names in separate fields. It’s just a small thing. It’s more one of those kind of things I could write a class to do that in Solr, but it’s just one of those things which isn’t out of the box. It’s definitely easily doable. It’s just a matter of doing it.
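A common workaround for keeping a key and its value "on the same line" is to pack each pair into a single token in one multi-valued field and split it again when reading. The sketch below shows the idea in Python. The `|` separator and the helper names are assumptions for illustration, and it presumes keys never contain the separator.

```python
SEP = "|"  # assumed separator; keys must not contain it


def encode_pairs(pairs):
    """Pack (key, value) pairs into single tokens for one multi-valued field."""
    return [f"{key}{SEP}{value}" for key, value in pairs]


def find_keys(field_values, wanted_value):
    """Return the keys whose value matches `wanted_value`, ignoring case."""
    keys = []
    for token in field_values:
        key, _, value = token.partition(SEP)
        if value.lower() == wanted_value.lower():
            keys.append(key)
    return keys
```

This avoids maintaining two parallel lists that must stay in the same order. The search itself still has to match on the value part of the token, which is the piece Ian notes would need custom code.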
Grant Ingersoll:
Again, it sounds like kind of a function query type thing could probably handle it. But you’re right. It would involve some code.
Ian Holsman:
Yeah. So I mean I’m trying to phrase this. I mean a lot of stuff that Solr does just comes out of the box straightaway. You don’t need people involved. You just basically shove it in there and it works.

For the majority of cases that’s fine. Where you need the people is when you’re starting to look at relevancy. You want this document to appear higher than the other document. You’ll get a better score. And that’s when you start playing with boost values and scoring and stuff like that. That is what I loosely call a rat hole, because it’s very hard to do well for all cases.
Grant Ingersoll:
Yeah.
Ian Holsman:
That’s what I think a lot of people forget with Solr, when they’re doing searches, is that it’s very easy to create the index and to get something to search and in most cases it’s fine, but you have to look at some of the edge cases, and when you get to an edge case it can sometimes be very hard to get that kind of thing working, because it’s not the same thing to get it working algorithmically.

I think there are ways of fixing it on top of the actual engine. I think you’ve got document-level boosts and stuff like that that you can do.
Grant Ingersoll:
And actually, I mean you hit on something. That problem goes well beyond Solr. I mean that’s a general search problem, and any and all vendors kind of have that same thing. I mean one of the nice things you actually get with Solr is that you can see what’s happening underneath the hood. So you can at least know why a document scored the way it did, as opposed to it being some secret sauce that you could use to potentially reverse engineer what the engine is doing.
Ian Holsman:
Yeah.
Grant Ingersoll:
Yeah. That’s all great stuff, Ian. I think that’s about – I’ve definitely taken up more than enough of your time at this point. So why don’t we wrap it up there? I just wanted to say thank you for your time, and I’m looking forward to more good things out of Love.com and Relegence and AOL.
Ian Holsman:
Great. I’m sure we’re gonna do it. We’ve got a new CEO and it’s all guns firing. He’s definitely changing the place. That’s for sure.
Grant Ingersoll:
All right, very good.
Ian Holsman:
Thanks for your time, Grant.
Grant Ingersoll:
Thank you, Ian.
