Or, “Trunk can use about 1/3 of the memory required by 3.x”
Please note that these tests were created with an eye toward emphasizing the differences. For instance, I chose to sort on strings since I knew that was an area that would highlight the improvements. Even so, I’m quite impressed.
If you want to skip all the setup info, drop to the section “Measurement Results”.
Estimating hardware requirements
At Lucid, we often get asked “how much hardware do I need given X documents and Y QPS”? A similar question is “How many documents can I fit on a machine like Z?”.
These are perfectly reasonable questions, ones I’d like to have answered myself. Grant Ingersoll created a spreadsheet that tries to help estimate the memory requirements here: Grant’s Memory Estimator. Looking at that, you’ll see some of the problems. There are just too many variables you have to know up front. Often the hardware people are looking for some ballpark estimates before the search requirements are nailed down, which makes it even harder.
In some ways, this is like asking “how big is a ‘Java’ program?”. The answer is always “Well, what do you want it to _do_?”. Here at Lucid, we’ve gotten pretty good at replying “you have to prototype”. This isn’t a cop-out; I’ve personally tried at least three times to create guidelines for these questions for internal use. It usually takes me about half an hour to remember why I gave up my previous attempts: I quickly get estimates like “you’ll need between 1 laptop and 3 supercomputers”, which is totally useless. Hey, I’m getting better. It’s taking me less and less time to give up ;). By the time you make ranges out of several of the variables, the range becomes humongous. Variables like:
- Sorting (how many fields? How many unique terms each?)
- Faceting (how many fields? How many unique values each?)
- How big is your machine? Solr usually comes under memory pressure first.
- What is the QPS rate you expect? Peak? Average?
- How many fields do you intend to search? Using edismax?
- What kinds of filter queries do you expect to run?
- And on and on.
Unfortunately, it’s often not possible to know these ahead of time, especially if Solr is being used in a greenfield application. If Solr is replacing something else, you have some real-world data to go on when estimating query characteristics and QPS, since you can examine how your current search engine is being used. But even this has uncertainties. I guarantee that, as soon as you present your users with different capabilities (which you will probably do with Solr), how they use the search capabilities will change. But you have to start somewhere.
You have to prototype
The only reliable guide we’ve been able to give clients is “you really have to prototype with your documents and your queries to see”. We usually recommend you get two numbers from a “typical” piece of hardware. Note that we’re usually more concerned with query performance than indexing performance:
- Start with, say, 20M documents on your machine. Use something like SolrMeter or JMeter to fire queries at your hardware. Keep upping the rate at which queries are sent to the server until your response time lengthens. This gives you the theoretical max sustained QPS rate.
- Now add, say, 5M documents to the machine and hit it with, say, 80% of the QPS rate found in step 1. If your QPS is sustained, add 5M more docs. Repeat until your server falls over.
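The two steps above can be sketched as a pair of loops. This is only an illustration of the procedure, not a tool we ship: `CapacityProbe`, `LoadTest`, and the latency ceiling are hypothetical names and parameters, and “response time lengthens” and “falls over” are both approximated as “latency exceeds a ceiling you pick”. In real use the `LoadTest` hook would drive SolrMeter, JMeter, or whatever you use to fire queries; here a toy formula stands in for the server.

```java
/** Hypothetical sketch of the two-step prototyping procedure. */
final class CapacityProbe {

    /** Hypothetical hook: run a load test and report the observed response time. */
    interface LoadTest {
        double latencyMs(long docs, int qps);
    }

    /** Step 1: raise the query rate until latency exceeds the ceiling. */
    static int findMaxQps(LoadTest test, long docs, int stepQps,
                          int limitQps, double maxLatencyMs) {
        int best = 0;
        for (int qps = stepQps; qps <= limitQps; qps += stepQps) {
            if (test.latencyMs(docs, qps) > maxLatencyMs) {
                break; // response time has lengthened past what we tolerate
            }
            best = qps;
        }
        return best;
    }

    /** Step 2: hold the query rate fixed and add documents until latency exceeds the ceiling. */
    static long findMaxDocs(LoadTest test, long startDocs, long stepDocs,
                            long limitDocs, int qps, double maxLatencyMs) {
        long best = startDocs;
        for (long docs = startDocs + stepDocs; docs <= limitDocs; docs += stepDocs) {
            if (test.latencyMs(docs, qps) > maxLatencyMs) {
                break; // the server can no longer sustain this rate
            }
            best = docs;
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy stand-in for a real server: latency grows with documents and load.
        LoadTest simulated = (docs, qps) -> 50 + (docs / 1_000_000) * qps / 10.0;

        int maxQps = findMaxQps(simulated, 20_000_000L, 10, 1_000, 260.0);
        int sustainedQps = maxQps * 8 / 10; // 80% of the rate found in step 1
        long maxDocs = findMaxDocs(simulated, 20_000_000L, 5_000_000L,
                                   200_000_000L, sustainedQps, 260.0);

        System.out.println("max QPS at 20M docs: " + maxQps);
        System.out.println("max docs at " + sustainedQps + " QPS: " + maxDocs);
    }
}
```

The numbers the toy model produces mean nothing; the point is only the shape of the loop you would run against real hardware with your own documents and queries.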
Now that you have a max QPS rate and max doc count for your target hardware, you can set up various monitoring systems to alert you when you start to approach these numbers and expand capacity as appropriate. This is important to the operations people. These are the folks who are on the other end of the beeper and do not want a call at 3:00 AM saying their system just stopped. They really, really appreciate getting some advance warning that they’re nearing their capacity.
Using the above, and seeing how various clients use Solr, our first estimate is that a “reasonable” machine can handle 40-50M documents and give you a “reasonable” response rate. But already, you see part of the problem. Define “reasonable” in either dimension. Really, all we can offer is a place to start building out a test bed, with an order-of-magnitude estimate of the number of documents. At least telling a client to start with 20M documents (assuming they aren’t, say, volumes of the Encyclopedia Britannica) starts to define the scale expectations.
Still, we have some “rules of thumb”
Despite all the prevarication above, we are pretty comfortable saying “start at 20M documents for prototyping”. Often we see 40-50M documents on production hardware.
But in trunk everything we know is wrong
At Lucid, when asked why performance is not as high as a client would like, the first thing we look at is how much memory is being used by the application. Solr will typically come under pressure from memory constraints first.
The second thing we look at is garbage collection. See Mark Miller’s post for an excellent introduction. We’ve seen some situations in which the JVM couldn’t recover enough memory on a full GC cycle to finish the current operation and went into repeated full GC cycles for many minutes. Sometimes this has been tracked down to a JVM bug, sometimes to a problem in Solr, sometimes to simply needing more memory, etc.
So, we reflexively look at memory for “why is Solr slow” and have some “gut feelings” about when and how memory pressure will manifest itself.
But in Solr/Lucene 4.0 (a.k.a. “trunk”) there have been some very significant improvements in how memory is used. So some of what we “know from experience” will need to be re-examined.
I ran some simple tests just to get a feel for what the effects of this work were, and the results are surprising. Your mileage will vary, of course. The take-away here is that capacity probably has to be re-tested on 4.0. I’d be astonished if performance were worse, but your estimates of the number of documents your current hardware can comfortably accommodate are probably low if they’re based on pre-trunk testing.
I took an 11M document dump from Wikipedia and indexed it under 3x and trunk with code from late March, 2012. I created two sortable string fields, one for user and one for title, with 233K and 9.4M unique values, respectively. Then I started the server fresh and ran a query that sorted by both these fields. After forcing a collection with the jConsole and VisualVM “GC now” button, I looked at the memory consumption and object count.
Here are the actual numbers:
- Time to perform the first query with sorting (no warmup queries): 13 seconds on 3x, 6 seconds on trunk.
- Memory consumption: 1,040M on 3x, 366M on trunk. Yes, almost a 2/3 reduction in memory use. And that’s the entire program size, not counting memory used to just start Solr and Jetty running.
- Number of objects on the heap: 19.4M on 3x, 80K on trunk. No, that’s not a typo. There are over two orders of magnitude fewer objects on the heap in trunk!
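If you want to check the “almost 2/3” and “two orders of magnitude” claims, the arithmetic is short. The sketch below just plugs in the numbers from the list above; the class and method names are made up for illustration.

```java
/** Works the ratios from the measurements listed above. */
final class TrunkVs3x {

    /** Percentage by which a value shrank going from before to after. */
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    /** How many times larger the old value is than the new one. */
    static double ratio(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // Memory: 1,040M on 3x versus 366M on trunk.
        System.out.printf("memory reduction: %.1f%%%n", reductionPercent(1_040, 366));
        // Heap objects: 19.4M on 3x versus 80K on trunk.
        System.out.printf("object ratio: %.1fx%n", ratio(19_400_000, 80_000));
        // First sorted query: 13 seconds on 3x versus 6 seconds on trunk.
        System.out.printf("first-query speedup: %.2fx%n", ratio(13, 6));
    }
}
```

That works out to roughly a 65% drop in memory and about 240 times fewer objects on the heap.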
This reduction in raw memory has obvious implications for capacity since it’s usually the case that Solr comes under pressure for memory reasons. In particular, once you push Solr’s memory to the point where it starts swapping, you have to shard the index to get your performance back. This point is probably quite a bit further out now. I doubt that you can put 3x as many documents on a given piece of hardware, but you’ll have to test to see.
A somewhat subtler savings is at garbage collection time. I don’t care how efficient your garbage collection is, collecting 19.4M objects has got to be slower than doing the same operation on 80K objects. I suspect that memory fragmentation is reduced too, but that’s speculation.
OK, how did they do that?
Well, it helped that the coding gnomes were locked in a room with neither food nor beer. Actually, this represents the culmination of a bunch of different work by some very, very talented programmers who just couldn’t stand the inefficiencies they knew about once they realized they could make it better. It’s been something over a year coming I believe.
Somewhere, old ‘C’ programmers are smiling (yes, I resemble that remark). The biggest savings in both memory and objects comes from allocating big buffers to hold the string data, treating it as a big block of memory with string offset and length kept in control structures. So, instead of having zillions of small String objects lying around, there are a small number of really big arrays of bytes and some control structures pointing to them.
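Here’s a minimal sketch of that idea, assuming UTF-8 encoding and a single byte array. It is not the actual Lucene code (I had nothing to do with writing that); `PackedStrings` and its methods are hypothetical names. The “control structure” here is just an array of offsets with one extra entry at the end, so each string’s length falls out as the difference between neighboring offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: many strings stored in one byte array plus an offsets array. */
final class PackedStrings {
    private final byte[] data;   // all string bytes, back to back
    private final int[] offsets; // offsets[i] is where string i starts; offsets[n] is the end of data

    /** Built once from a fixed list of values; never modified afterwards. */
    PackedStrings(List<String> values) {
        byte[][] encoded = new byte[values.size()][];
        int total = 0;
        for (int i = 0; i < encoded.length; i++) {
            encoded[i] = values.get(i).getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        data = new byte[total];
        offsets = new int[encoded.length + 1];
        int pos = 0;
        for (int i = 0; i < encoded.length; i++) {
            offsets[i] = pos;
            System.arraycopy(encoded[i], 0, data, pos, encoded[i].length);
            pos += encoded[i].length;
        }
        offsets[encoded.length] = pos;
    }

    int size() {
        return offsets.length - 1;
    }

    /** Materializes a String only when a caller actually asks for one. */
    String get(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    /** Compares two entries byte by byte (unsigned) without creating any objects. */
    int compare(int a, int b) {
        int i = offsets[a], endA = offsets[a + 1];
        int j = offsets[b], endB = offsets[b + 1];
        while (i < endA && j < endB) {
            int diff = (data[i++] & 0xFF) - (data[j++] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // One is a prefix of the other (or they are equal): the shorter sorts first.
        return (endA - offsets[a]) - (endB - offsets[b]);
    }

    public static void main(String[] args) {
        PackedStrings titles = new PackedStrings(Arrays.asList("Apache Solr", "Lucene"));
        System.out.println(titles.get(1));               // prints "Lucene"
        System.out.println(titles.compare(0, 1) < 0);    // prints "true"
    }
}
```

However many strings you pack in, the long-lived heap footprint is two arrays rather than one or more objects per string, and sorting can compare entries in place.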
The write-once nature of a segment helps here. Consider how awkward this would be if you allowed the strings to change. This nice, tightly packed array of bytes would be subject to all the vagaries of changing data. The problem was already hard enough without re-inventing the heap manager! But since segments don’t change once written, these structures can be made when a searcher is opened without having to deal with the issues mutable data would entail. And they can be thrown away all at once when the searcher closes. These two features of Solr segments make this strategy much more viable than it might otherwise be. Oh, and it’s yet another reason that updating individual fields in a Solr document is “something we’ll think about”…
Note that for this test, both the fields I sort on are strings. I purposely chose string types because I was pretty sure it would show the improvements in memory usage clearly. I just wasn’t quite prepared for how clearly.
I’m simplifying here, and I had nothing to do with writing the code so it’s a high-level view. But the results speak for themselves. Couple that with the fact that Java String objects have about 40 bytes of overhead before they store the first byte of actual data and…well…
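To put a rough number on that, here’s a back-of-the-envelope sketch. It takes the roughly 40 bytes per String mentioned above and the unique-value counts from the test. It also makes two assumptions that were not measured: that the 3x sort cache holds one String per unique value, and that each String carries its own backing char array, so two heap objects per value. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope estimate; the per-String figures are assumptions, not measurements. */
final class StringOverheadEstimate {
    static final long USER_VALUES = 233_000L;        // unique values in the user field
    static final long TITLE_VALUES = 9_400_000L;     // unique values in the title field
    static final long OVERHEAD_BYTES_PER_STRING = 40L; // rough figure quoted in the post
    static final long OBJECTS_PER_STRING = 2L;       // assumption: String plus its char array

    /** Bytes spent on String bookkeeping before any character data is stored. */
    static long overheadBytes(long uniqueValues) {
        return uniqueValues * OVERHEAD_BYTES_PER_STRING;
    }

    /** Heap objects needed to hold that many separate Strings. */
    static long objectCount(long uniqueValues) {
        return uniqueValues * OBJECTS_PER_STRING;
    }

    public static void main(String[] args) {
        long unique = USER_VALUES + TITLE_VALUES;
        System.out.println("unique values: " + unique);
        System.out.println("overhead bytes: " + overheadBytes(unique));
        System.out.println("heap objects: " + objectCount(unique));
    }
}
```

Under those assumptions the two fields account for roughly 385M of pure overhead and about 19.3M objects, which is in the same neighborhood as the 19.4M objects seen on the 3x heap.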