Batching when indexing is good:

For quite some time it’s been part of the lore that one should batch updates when indexing from SolrJ (the post tool too, but I digress). I recently had the occasion to write a test that put some numbers to this general understanding. As usual, YMMV. The interesting bit isn’t that the absolute numbers, it’s the relative differences. I thought it might be useful to share the results.

Take-aways:

Well, the title says it all, batching when indexing is good. The biggest percentage jump is the first order of magnitude, i.e. batching 10 docs instead of 1. Thereafter, while the throughput increases, the jump from 10 -> 100 isn’t nearly as dramatic as the jump from 1 -> 10. And this is particularly acute with small numbers of threads.

I have heard anecdotal reports of incremental improvements when going to 10,000 document/packet, so I urge you to experiment. Just don’t send a single document at a time and wonder why “Indexing to Solr is sooooo slooooowwwww”.

Note that by just throwing a lot of client threads at the problem, one can make up for the inefficiencies of small batches. This illustrates that the majority of the time spent in the small-batch scenario is establishing the connection and sending the documents over the wire. For up to 20 threads in this experiment, though, throughput increases with the packet size. And I didn’t try more than 20 threads.

All these threads were run from a single program, it’s perfectly reasonable to run multiple client programs instead if the data can be partitioned amongst them and/or you’d rather not deal with multi-threading.

This was not SolrCloud. I’d expect these general results to hold though, especially if CloudSolrClient (CloudSolrServer in 4.x/5x) were used.

Minor rant:

Eventually, you can max out the CPUs on the Solr servers. At that point, you’ve got your maximum possible throughput. Your query response time will suffer if you’re indexing and querying at the same time of course. I had to slip this comment in here because it’s quite often the case that people on the Solr User’s list ask “Why is my indexing slow?”.  90+ percent of the time it’s because the client isn’t delivering the documents to Solr fast enough and Solr is just idling along using 10% of the CPU. And there’s a very simple way to figure that out… comment out the line in your program that sends docs to solr, usually a line like:

server.add(doclist);

Anyway, enough ranting, here are the results, I’ll talk about the environment afterward:

Nice tabular results:

As I mentioned, I stopped at 20 threads. You might increase throughput with more threads, but the general trend is clear enough that I stopped. The rough doubling from 1 to 2 threads indicates that Solr is simply idling along most of the time. Note that by the time we get to 20 threads, the increase is not linear with respect to the number of threads and eventually adding more threads will not increase throughput at all.

Threads     Packet Size        Docs/second

20 1 5,714
20 10 16,666
20 100 18,450
20 1,000 20,408
2 1 767
2 10 4,201
2 100 7,751
2 1,000  9,259
1 1 382
1 10 2,369
1 100 5,319
1 1,000 5,464

Test environment:

  • Solr is running a single node on a Mac Pro with 64G of memory, 16G is given to Solr. That said, indexing isn’t a very memory-heavy operation so the memory allocated to Solr is probably not much of an issue.
  • The files are being parsed locally on a Macbook Pro laptop, connected by a Thunderbolt cable to the Mac Pro.
  • The documents are very simple, there is only a single analyzed field. The rest of the fields are string or numeric types. There are 30 or so short string fields, a couple of integer fields and a date field or two. Hey, it’s the data I had available!
  • There are 200 files of 5,000 documents each for a total of 1M documents.
  • The index always started with no documents.
  • This is the result of a single run at each size.
  • There is a single HttpSolrServer being shared amongst all the threads on the indexing client.
  • There is no query load on this server.

How the program works:

There are two parameters that vary with each run,  number of threads to fire up simultaneously and number of Solr documents to put in each packet sent to Solr.

The program then recursively descends from a root directory and every time it finds a JSON file it passes that file to a thread in a FixedThreadPool that parses the documents out of the JSON file, packages them up in groups and sends them to Solr. After all files are found, it waits for all the threads to finish and reports throughput.

I felt the results were consistent enough that running a statistically valid number of tries and averaging across them all and, you know, doing a proper analysis wasn’t time well spent.

Conclusion:

Batch documents when using SolrJ ;). My purpose here was to give some justification to why updates should be batched, just saying “it’s better” has much less immediacy than seeing a 1,400% increase in throughput (1 thread, the difference between 1 doc/packet and 1,000 docs/packet).

The gains would be less dramatic if Solr was doing more work I’m sure. For instance, if instead of a bunch of un-analyzed fields you threw in 6 long text fields with complex regex analysis chains that used back-references, the results would be quite different. Even so, batching is still recommended if at all possible.

And I want to emphasize that this was on a single, non SolrCloud node since I wanted to concentrate entirely on the effects of batching. On a properly set-up SolrCloud system, I’d expect the aggregate indexing process to scale nearly linearly with the number of shards in the system when using the CloudSolrClient (CloudSolrServer in 4x).