High throughput indexing in Solr

Sometimes you need to index a bunch of documents really, really fast. Even with Solr 4.0 and soft commits, if you send one document at a time, every update pays a full network round trip and your throughput will be limited by the network rather than by Solr. The solution is two-fold: batching and multi-threading. Batching amortizes network latency over many documents, and multi-threading helps you fully saturate Solr with updates. This is especially true when using CloudSolrServer or LBHttpSolrServer, which will route your update to one of many Solr servers.

To start, let’s assume that you have a queue of strings that will become the “text” field of Solr docs (I leave populating the queue as an exercise to the reader). To save on memory and garbage collection, we can reuse a single SolrInputDocument and UpdateRequest per thread.

[gist id=4627882]

The above code will run forever: BlockingQueue#take blocks until an element becomes available, so the worker threads never exit on their own.
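For readers without access to the gist, here is a minimal sketch of the kind of one-document-per-request worker described above. It is not the original code. To keep it runnable without a Solr instance, the hypothetical `DocSender` interface stands in for the SolrJ call that would build a document with a "text" field and send it; `SimpleIndexer`, `Updater`, and `indexAll` are likewise names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class SimpleIndexer {
    /** Hypothetical stand-in for the SolrJ call that sends a single document. */
    interface DocSender {
        void send(String text);
    }

    /** One instance per thread: take a string off the queue and send it as one document. */
    static class Updater implements Runnable {
        private final BlockingQueue<String> queue;
        private final DocSender sender;

        Updater(BlockingQueue<String> queue, DocSender sender) {
            this.queue = queue;
            this.sender = sender;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    // take() blocks until an element is available, so this loop
                    // ends only when the thread is interrupted.
                    sender.send(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
            }
        }
    }

    /** Starts nThreads workers, waits for the queue to drain, then interrupts them. */
    static List<String> indexAll(List<String> texts, int nThreads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(texts);
        List<String> sent = new CopyOnWriteArrayList<>(); // records what was "sent"
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            Thread t = new Thread(new Updater(queue, sent::add));
            t.start();
            workers.add(t);
        }
        try {
            while (!queue.isEmpty()) {
                Thread.sleep(10);
            }
            for (Thread t : workers) t.interrupt();
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> sent = indexAll(Arrays.asList("first doc", "second doc", "third doc"), 2);
        System.out.println("sent " + sent.size() + " documents");
    }
}
```

In a real updater, the `DocSender` implementation would wrap your SolrServer instance. The interrupt-based shutdown in `indexAll` is only there so the sketch terminates.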

Now that we have simple multi-threading, the next step is to batch the updates. This requires a very small modification to our SolrUpdater class. Batching is quite simple since UpdateRequest#add allows you to add documents one by one (it only issues the actual request when you call the process method).

[gist id=4627896]

We can no longer reuse a single SolrInputDocument since we are batching them up in the UpdateRequest, but we can still reuse the UpdateRequest. Once we meet our batch size, we actually send off the request. You will of course need to tune your batch size and number of threads depending on the size of your documents, number of Solr servers, cost of indexing, etc.
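The batching change can be sketched the same way. This is an illustration under assumptions, not the gist itself: the hypothetical `BatchSender` interface stands in for sending one UpdateRequest that holds many documents, and the reusable `batch` list plays the role of the reused UpdateRequest. All class and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class BatchingIndexer {
    /** Hypothetical stand-in for sending one batched update request to Solr. */
    interface BatchSender {
        void send(List<String> batch);
    }

    /** One instance per thread: buffer documents and send them batchSize at a time. */
    static class Updater implements Runnable {
        private final BlockingQueue<String> queue;
        private final BatchSender sender;
        private final int batchSize;

        Updater(BlockingQueue<String> queue, BatchSender sender, int batchSize) {
            this.queue = queue;
            this.sender = sender;
            this.batchSize = batchSize;
        }

        @Override
        public void run() {
            List<String> batch = new ArrayList<>(batchSize); // reused across batches
            try {
                while (true) {
                    batch.add(queue.take());
                    if (batch.size() >= batchSize) {
                        sender.send(new ArrayList<>(batch)); // hand off a copy
                        batch.clear();                       // reuse the buffer
                    }
                }
            } catch (InterruptedException e) {
                // As in the approach described in the post, a partial batch
                // is simply dropped when the thread is stopped.
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs nThreads workers until the queue is drained and returns the batches sent. */
    static List<List<String>> indexAll(List<String> texts, int nThreads, int batchSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(texts);
        List<List<String>> batches = new CopyOnWriteArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            Thread t = new Thread(new Updater(queue, batches::add, batchSize));
            t.start();
            workers.add(t);
        }
        try {
            while (!queue.isEmpty()) {
                Thread.sleep(10);
            }
            for (Thread t : workers) t.interrupt();
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<List<String>> batches = indexAll(docs, 1, 3);
        System.out.println("sent " + batches.size() + " batches");
    }
}
```

Both `nThreads` and `batchSize` are the tuning knobs mentioned above. With more than one thread, each worker fills its own batch, so any leftovers that do not reach `batchSize` are never sent in this version.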

Be warned, this implementation is far from complete. There is no thread management, error handling, logging, monitoring, or any of the other things you need for a robust multi-threaded application. In fact, there is a severe data loss bug: if you terminate these threads, any documents that have been batched but not yet sent to Solr will simply be lost.
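One possible way to narrow that data loss window, sketched here under the same assumptions as the rest of this post's approach, is to flush whatever is left in the batch when a worker is interrupted. The `BatchSink` interface is a hypothetical stand-in for the Solr call, and every name below is invented for illustration. A production version would also need retries and error handling around the send.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class FlushingIndexer {
    /** Hypothetical stand-in for sending one batched update request to Solr. */
    interface BatchSink {
        void send(List<String> batch);
    }

    static class Updater implements Runnable {
        private final BlockingQueue<String> queue;
        private final BatchSink sink;
        private final int batchSize;

        Updater(BlockingQueue<String> queue, BatchSink sink, int batchSize) {
            this.queue = queue;
            this.sink = sink;
            this.batchSize = batchSize;
        }

        @Override
        public void run() {
            List<String> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    batch.add(queue.take());
                    if (batch.size() >= batchSize) {
                        List<String> full = new ArrayList<>(batch);
                        batch.clear(); // clear first so a failed send is not resent below
                        sink.send(full);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // Flush the partial batch instead of dropping it on shutdown.
                if (!batch.isEmpty()) {
                    sink.send(new ArrayList<>(batch));
                }
            }
        }
    }

    /** Runs nThreads workers until the queue is drained and returns the batches sent. */
    static List<List<String>> indexAll(List<String> texts, int nThreads, int batchSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(texts);
        List<List<String>> batches = new CopyOnWriteArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            Thread t = new Thread(new Updater(queue, batches::add, batchSize));
            t.start();
            workers.add(t);
        }
        try {
            while (!queue.isEmpty()) {
                Thread.sleep(10);
            }
            for (Thread t : workers) t.interrupt();
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        int total = 0;
        for (List<String> b : indexAll(docs, 2, 3)) total += b.size();
        System.out.println("sent " + total + " documents");
    }
}
```

This only covers a cooperative shutdown via interrupt. Documents are still lost if the JVM is killed outright, and the flush runs with the interrupt flag set, which a real HTTP client may not tolerate.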

At Lucidworks, we utilized a similar (albeit more complete) implementation to increase our indexing throughput by an order of magnitude.
