Lucidworks is happy to announce that several of our connectors for indexing content from Hadoop to Solr are now open source.
We have six of them, with support for Spark, Hive, Pig, HBase, Storm, and HDFS, all available on GitHub. All of them work with Solr 5.x and include options for Kerberos-secured environments if required.
HDFS for Solr
This is a job jar for Hadoop which uses MapReduce to prepare content for indexing and push documents to Solr. It supports Solr running in standalone mode or SolrCloud mode.
It can connect to standard Hadoop HDFS or MapR’s MapR-FS.
A key feature of this connector is the ingest mapper, which converts content from various original formats into Solr-ready documents. CSV files, ZIP archives, SequenceFiles, and WARC archives are supported. Grok and regular expressions can also be used to parse content. If there are others you’d like to see, let us know!
Repo address: https://github.com/Lucidworks/hadoop-solr.
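As a sketch of how an ingest job is launched, the invocation below indexes a directory of CSV files into a SolrCloud collection. The jar name, main class, and flags here are illustrative assumptions based on the repo’s README conventions; verify the exact names and options against the release you download.

```shell
# Illustrative only -- check the hadoop-solr README for the exact
# jar name, mapper classes, and flags in your release.
hadoop jar solr-hadoop-job-*.jar com.lucidworks.hadoop.ingest.IngestJob \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c my_collection \
  -i /data/csv/*.csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -zk zkhost1:2181,zkhost2:2181/solr
```

Swapping the ingest mapper class is how you switch input formats (ZIP, SequenceFile, WARC, Grok, and so on).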
Hive for Solr
This is a Hive SerDe which can index content from a Hive table to Solr or read content from Solr to populate a Hive table.
Repo address: https://github.com/Lucidworks/hive-solr.
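To give a flavor of how the SerDe is used, a table definition might look roughly like the following. The storage handler class and property names here are assumptions modeled on the repo’s README; check it for the exact identifiers in your version.

```sql
-- Illustrative only: class and property names may differ by release.
CREATE EXTERNAL TABLE solr_items (id string, title string, price float)
  STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
  LOCATION '/tmp/solr'
  TBLPROPERTIES ('solr.zkhost'     = 'zkhost1:2181/solr',
                 'solr.collection' = 'my_collection',
                 'solr.query'      = '*:*');

-- Writing to the table indexes rows into Solr; querying it reads from Solr.
INSERT INTO TABLE solr_items SELECT id, title, price FROM hive_items;
SELECT COUNT(*) FROM solr_items;
```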
Pig for Solr
These are Pig functions which can output the result of a Pig script to Solr (standalone or SolrCloud).
Repo address: https://github.com/Lucidworks/pig-solr.
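For illustration, a Pig script that stores its output to Solr might look roughly like this. The jar name, property keys, and store function class are assumptions; see the repo README for the real ones.

```pig
-- Illustrative only: register the connector jar and point it at SolrCloud.
REGISTER 'pig-solr.jar';
set solr.zkhost 'zkhost1:2181/solr';
set solr.collection 'my_collection';

logs = LOAD '/data/logs' USING PigStorage('\t') AS (id:chararray, msg:chararray);
-- Each tuple becomes a Solr document.
STORE logs INTO 'solr' USING com.lucidworks.hadoop.pig.SolrStoreFunc();
```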
HBase for Solr
The hbase-indexer is a service which uses the HBase replication feature to intercept content streaming to HBase and replicate it to a Solr index.
Our work is a fork of an NGDATA project, updated for Solr 5.x and HBase 1.1. It also supports HBase 0.98 with Solr 5.x. (Note: HBase versions earlier than 0.98 have not been tested with our changes.)
We plan to contribute this work back upstream, but while we put that patch together, you can use our code with Solr 5.x.
Repo address: https://github.com/Lucidworks/hbase-indexer.
Storm for Solr
My colleague Tim Potter developed this integration, and discussed it back in May 2015 in the blog post Integrating Storm and Solr. This is an SDK to develop Storm topologies that index content to Solr.
As an SDK, it includes a test framework and tools to help you prepare your topology for use in a production cluster. The README has a nice example using Twitter which can be adapted for your own use case.
Repo address: https://github.com/Lucidworks/storm-solr.
Spark for Solr
This is another Tim Potter project, released in August 2015 and discussed in the blog post Solr as an Apache Spark SQL DataSource. Again, this is an SDK for developing Spark applications, including a test framework and a detailed example that uses Twitter.
Repo address: https://github.com/Lucidworks/spark-solr.
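As a taste of the DataSource integration, reading a collection into a DataFrame looks roughly like the sketch below. The option names follow the README at the time of writing; verify them against your release, and note this assumes a running SolrCloud and a Spark shell with the spark-solr jar on the classpath.

```scala
// Illustrative sketch: option names ("zkhost", "collection") are taken
// from the spark-solr README and may differ by release.
val options = Map(
  "zkhost"     -> "zkhost1:2181/solr",  // ZooKeeper connect string
  "collection" -> "my_collection"
)

// Load the collection as a DataFrame via the "solr" DataSource...
val df = sqlContext.read.format("solr").options(options).load()

// ...and query it with Spark SQL.
df.registerTempTable("tweets")
sqlContext.sql("SELECT COUNT(*) FROM tweets").show()
```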
Image from the book cover of Jean de Brunhoff’s “Babar and Father Christmas”.