Sometimes you need to index a bunch of documents really, really fast. Even with Solr 4.0 and soft commits, if you send one document at a time you will be limited by the network. The solution is twofold: batching and multi-threading. Batching amortizes network latency over many documents, and multi-threading helps you fully saturate Solr with updates. This is especially true when using LBHttpSolrServer, which will route each update to one of many Solr servers.
To start, let’s assume that you have a queue of strings that will become the “text” field of Solr docs (I leave populating the queue as an exercise for the reader). To save on memory and garbage collection, we can reuse a single UpdateRequest per thread.
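A minimal sketch of such a per-thread updater might look like this. To keep it runnable without SolrJ on the classpath, a hypothetical UpdateSink interface stands in for the reused UpdateRequest: in real code, add would presumably set the “text” field on a SolrInputDocument and process would send the request to the Solr server. The interface and its method names are assumptions for illustration, not part of SolrJ.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for one reusable SolrJ UpdateRequest (not a SolrJ type). */
interface UpdateSink {
    void add(String text);               // buffer one document's "text" field
    void process() throws IOException;   // send everything buffered so far
    void clear();                        // empty the buffer so the sink can be reused
}

/** One instance per thread; all instances share the same queue. */
class SolrUpdater implements Runnable {
    private final BlockingQueue<String> queue;
    private final UpdateSink sink;

    SolrUpdater(BlockingQueue<String> queue, UpdateSink sink) {
        this.queue = queue;
        this.sink = sink;
    }

    @Override
    public void run() {
        while (true) {
            String text;
            try {
                text = queue.take(); // blocks until a string is available
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return;
            }
            try {
                sink.add(text);
                sink.process(); // one request per document: simple, but network-bound
            } catch (IOException e) {
                // No real error handling in this sketch: report and move on.
                System.err.println("update failed: " + e);
            } finally {
                sink.clear(); // reuse the same sink for the next document
            }
        }
    }
}
```

Each thread would get its own SolrUpdater and its own sink, all sharing one queue, for example by submitting several instances to an ExecutorService.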
The above code will run forever since we are using
Now that we have simple multi-threading, the next step is to batch the updates. This requires a very small modification to our SolrUpdater class. Batching is quite simple since UpdateRequest#add allows you to add documents one by one (it only issues the actual request when you call the process method).
We can no longer reuse a single SolrInputDocument since we are batching them up in the UpdateRequest, but we can still reuse the UpdateRequest. Once we meet our batch size, we actually send off the request. You will of course need to tune your batch size and number of threads depending on the size of your documents, number of Solr servers, cost of indexing, etc.
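The batching change can be sketched roughly as follows. As an assumption made so the example runs without SolrJ, a hypothetical UpdateSink interface again stands in for the reused UpdateRequest; the class name BatchingSolrUpdater and the batchSize parameter are likewise illustrative names only. The idea is simply to keep adding documents and call process once batchSize of them have accumulated.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for one reusable SolrJ UpdateRequest (not a SolrJ type). */
interface UpdateSink {
    void add(String text);               // buffer one document's "text" field
    void process() throws IOException;   // send everything buffered so far
    void clear();                        // empty the buffer so the sink can be reused
}

/** One instance per thread; sends a request only once batchSize documents are buffered. */
class BatchingSolrUpdater implements Runnable {
    private final BlockingQueue<String> queue;
    private final UpdateSink sink;
    private final int batchSize;
    private int pending = 0; // documents buffered since the last request

    BatchingSolrUpdater(BlockingQueue<String> queue, UpdateSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.queue = queue;
        this.sink = sink;
        this.batchSize = batchSize;
    }

    @Override
    public void run() {
        while (true) {
            try {
                sink.add(queue.take()); // blocks until a string is available
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // anything still buffered in the sink is dropped here
            }
            if (++pending >= batchSize) {
                flush();
            }
        }
    }

    private void flush() {
        try {
            sink.process(); // one network round trip for the whole batch
        } catch (IOException e) {
            // No real error handling in this sketch: the failed batch is discarded.
            System.err.println("batch update failed: " + e);
        } finally {
            sink.clear();
            pending = 0;
        }
    }
}
```

A batch size of 1 degenerates to one request per document, so it makes a convenient baseline when tuning.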
Be warned, this implementation is far from complete. There is no thread management, error handling, logging, monitoring, or any of the other things you need for a robust multi-threaded application. In fact, there is a severe data loss bug: if you terminate these threads, any uncommitted documents will simply be lost.
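One possible way to narrow the data loss window, sketched here under the same assumptions as before (a hypothetical UpdateSink interface standing in for the reused UpdateRequest, and an illustrative class name), is to treat an interrupt as a shutdown request: drain whatever is left in the queue without blocking, then send the final partial batch before exiting. This is not a complete fix. A hard kill or a crash would still lose buffered documents, and it assumes producers have stopped adding to the queue.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for one reusable SolrJ UpdateRequest (not a SolrJ type). */
interface UpdateSink {
    void add(String text);               // buffer one document's "text" field
    void process() throws IOException;   // send everything buffered so far
    void clear();                        // empty the buffer so the sink can be reused
}

/** Batching updater that flushes leftover documents when its thread is interrupted. */
class DrainingSolrUpdater implements Runnable {
    private final BlockingQueue<String> queue;
    private final UpdateSink sink;
    private final int batchSize;
    private int pending = 0; // documents buffered since the last request

    DrainingSolrUpdater(BlockingQueue<String> queue, UpdateSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.queue = queue;
        this.sink = sink;
        this.batchSize = batchSize;
    }

    @Override
    public void run() {
        try {
            while (true) {
                buffer(queue.take()); // blocks until a string is available
            }
        } catch (InterruptedException e) {
            // Shutdown requested: poll() never blocks, so this loop ends
            // as soon as the queue is empty.
            String text;
            while ((text = queue.poll()) != null) {
                buffer(text);
            }
            if (pending > 0) {
                flush(); // send the final, partial batch
            }
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    private void buffer(String text) {
        sink.add(text);
        if (++pending >= batchSize) {
            flush();
        }
    }

    private void flush() {
        try {
            sink.process();
        } catch (IOException e) {
            // Still no retry logic: a failed batch is reported and discarded.
            System.err.println("batch update failed: " + e);
        } finally {
            sink.clear();
            pending = 0;
        }
    }
}
```

With this shape, stopping the workers through ExecutorService#shutdownNow (which interrupts them) gives each thread a chance to send what it is still holding.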
At Lucidworks, we utilized a similar (albeit more complete) implementation to increase our indexing throughput by an order of magnitude.