
High throughput indexing in Solr

by David Arthur
January 30, 2013

Sometimes you need to index a bunch of documents really, really fast. Even with Solr 4.0 and soft commits, if you send one document at a time, your throughput will be limited by network round trips. The solution is twofold: batching and multi-threading. Batching spreads the network latency of each request over many documents, and multi-threading helps you fully saturate Solr with updates. This is especially true when using CloudSolrServer or LBHttpSolrServer, which will route your update to one of many Solr servers.

To start, let’s assume that you have a queue of strings that will become the “text” field of Solr documents (I leave populating the queue as an exercise for the reader). To save on memory and garbage collection, we can reuse a single SolrInputDocument and UpdateRequest per thread.

[gist id=4627882]

The above code will run forever since we are using BlockingQueue#take.

Now that we have simple multi-threading, the next step is to batch the updates. This requires a very small modification to our SolrUpdater class. Batching is quite simple since UpdateRequest#add allows you to add documents one by one (it only issues the actual request when you call the process method).

[gist id=4627896]

We can no longer reuse a single SolrInputDocument since we are batching them up in the UpdateRequest, but we can still reuse the UpdateRequest. Once the batch reaches the configured size, we send off the request. You will of course need to tune your batch size and number of threads depending on the size of your documents, the number of Solr servers, the cost of indexing, etc.
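
As a rough illustration of the batching logic, here is a minimal sketch that uses only the JDK. The class name `BatchingUpdater` and its `sender` callback are hypothetical and are not taken from the gist. The callback stands in for the point where real code would send the batched UpdateRequest to Solr, and a plain list stands in for the UpdateRequest itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers documents and hands them off in fixed-size batches.
// The sender callback is an assumption; it stands in for the place where real
// code would send the batched UpdateRequest to Solr.
class BatchingUpdater {
    private final int batchSize;
    private final Consumer<List<String>> sender;
    private final List<String> pending = new ArrayList<>();

    BatchingUpdater(int batchSize, Consumer<List<String>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Buffer one document and send the whole batch once it is full.
    void add(String text) {
        pending.add(text);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered, then reset the buffer so it can be reused.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(pending)); // copy, because the buffer is reused
        pending.clear();
    }

    // Number of documents buffered but not yet sent.
    int pendingCount() {
        return pending.size();
    }
}
```

The sketch is not thread-safe, so each worker thread would own its own instance, in the same way that each thread owns its own UpdateRequest. Note that documents stay in the buffer until the batch fills up or `flush` is called explicitly.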

Be warned, this implementation is far from complete. There is no thread management, error handling, logging, monitoring, or any of the other things you need for a robust multi-threaded application. In fact, there is a severe data loss bug: if you terminate these threads, any documents that have been added to a batch but not yet sent to Solr will simply be lost.
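
One possible way to avoid losing a partial batch is to let the worker notice a stop request and flush before it exits. The following is a minimal sketch under stated assumptions, not the implementation used in the gist: `FlushingWorker`, its `stop` method, and the `sender` callback are hypothetical names, and the callback again stands in for sending a batch to Solr. It uses BlockingQueue#poll with a timeout instead of BlockingQueue#take so that the stop flag is re-checked periodically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical worker: drains a queue in batches and flushes the partial
// batch before exiting, so stopping it does not drop buffered documents.
class FlushingWorker implements Runnable {
    private final BlockingQueue<String> queue;
    private final int batchSize;
    private final Consumer<List<String>> sender; // assumption: stand-in for sending a batch to Solr
    private volatile boolean running = true;

    FlushingWorker(BlockingQueue<String> queue, int batchSize, Consumer<List<String>> sender) {
        this.queue = queue;
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Ask the worker to finish; it still drains what is already queued.
    void stop() {
        running = false;
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try {
            while (running || !queue.isEmpty()) {
                // poll with a timeout (unlike take) returns null when nothing
                // arrives in time, which lets the loop re-check the stop flag
                String text = queue.poll(100, TimeUnit.MILLISECONDS);
                if (text == null) {
                    continue;
                }
                batch.add(text);
                if (batch.size() >= batchSize) {
                    sender.accept(new ArrayList<>(batch));
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        } finally {
            if (!batch.isEmpty()) {
                sender.accept(new ArrayList<>(batch)); // flush the partial batch
            }
        }
    }
}
```

In a real application you would call `stop` on every worker and then wait for the threads to finish before exiting. The sketch still omits retries and error handling for failed sends.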

At Lucidworks, we utilized a similar (albeit more complete) implementation to increase our indexing throughput by an order of magnitude.
