Open Source Hadoop Connectors for Solr

by Cassandra Targett
December 17, 2015

Lucidworks is happy to announce that several of our connectors for indexing content from Hadoop to Solr are now open source.

We have six of them, with support for Spark, Hive, Pig, HBase, Storm and HDFS, all available on GitHub. All of them work with Solr 5.x and include options for Kerberos-secured environments if required.

HDFS for Solr

This is a job jar for Hadoop which uses MapReduce to prepare content for indexing and push documents to Solr. It supports Solr running in standalone mode or SolrCloud mode.

It can connect to standard Hadoop HDFS or MapR’s MapR-FS.

A key feature of this connector is the ingest mapper, which converts content from various original formats to Solr-ready documents. CSV files, ZIP archives, SequenceFiles, and WARC files are supported. Grok and regular expressions can also be used to parse content. If there are other formats you’d like to see, let us know!
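To make the ingest mapper idea concrete, here is a minimal Python sketch of the kind of transformation involved: CSV rows and regex-parsed log lines become flat field/value documents of the sort Solr can index. This is not the connector's code (the connector itself is a Hadoop job jar); the function names, the field names, and the log pattern are all hypothetical and chosen only for illustration.

```python
import csv
import io
import re

# Hypothetical log layout for illustration: "<timestamp> <LEVEL> <message>".
LOG_PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)")


def csv_to_docs(text, id_field):
    """Turn CSV text with a header row into a list of flat documents.

    The column named by id_field is renamed to "id", the unique key
    field that Solr schemas conventionally use.
    """
    docs = []
    for row in csv.DictReader(io.StringIO(text)):
        doc = dict(row)
        doc["id"] = doc.pop(id_field)
        docs.append(doc)
    return docs


def log_line_to_doc(line, doc_id):
    """Parse one log line with a regular expression.

    Returns a flat document, or None when the line does not match.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["id"] = doc_id
    return doc
```

Calling `csv_to_docs("sku,name\n1,widget\n", "sku")` yields one document with `id` and `name` fields. In the real connector this kind of per-record conversion happens inside the MapReduce job before documents are sent to Solr.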

Repo address: https://github.com/Lucidworks/hadoop-solr.

Hive for Solr

This is a Hive SerDe which can index content from a Hive table to Solr or read content from Solr to populate a Hive table.

Repo address: https://github.com/Lucidworks/hive-solr.

Pig for Solr

These are Pig functions that can output the result of a Pig script to Solr (standalone or SolrCloud).

Repo address: https://github.com/Lucidworks/pig-solr.

HBase Indexer

The hbase-indexer is a service which uses the HBase replication feature to intercept content streaming to HBase and replicate it to a Solr index.
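Conceptually, each replicated HBase change becomes an update to the Solr index. The Python sketch below illustrates that mapping using the `add` and `delete` command shapes from Solr's JSON update format. It is not the hbase-indexer's code: the event dictionaries, the `event_to_solr_command` function, and the `family_qualifier` field naming are assumptions made for illustration.

```python
def event_to_solr_command(event):
    """Map a simplified, hypothetical HBase change event to a Solr update command.

    event is a dict with:
      "type":  "put" or "delete"
      "row":   the HBase row key, reused here as the Solr document id
      "cells": for puts, a dict of {(family, qualifier): value}
    """
    row_id = event["row"]
    if event["type"] == "delete":
        # A deleted row removes the matching document from the index.
        return {"delete": {"id": row_id}}

    # A put becomes a document with one field per HBase cell.
    doc = {"id": row_id}
    for (family, qualifier), value in event["cells"].items():
        doc[f"{family}_{qualifier}"] = value
    return {"add": {"doc": doc}}
```

For example, a put on row `user42` with the cell `("info", "name")` set to `"Ada"` maps to an `add` command whose document has the fields `id` and `info_name`. The real service also lets you configure how columns map to Solr fields, which this sketch leaves out.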

Our work is a fork of an NGDATA project, updated for Solr 5.x and HBase 1.1. It also supports HBase 0.98 with Solr 5.x. (Note: HBase versions earlier than 0.98 have not been tested with our changes.)

We’re going to contribute this back, but while we get that patch together, you can use our code with Solr 5.x.

Repo address: https://github.com/Lucidworks/hbase-indexer.

Storm for Solr

My colleague Tim Potter developed this integration and discussed it back in May 2015 in the blog post “Integrating Storm and Solr”. This is an SDK for developing Storm topologies that index content to Solr.

As an SDK, it includes a test framework and tools to help you prepare your topology for use in a production cluster. The README has a nice example using Twitter which can be adapted for your own use case.

Repo address: https://github.com/Lucidworks/storm-solr.

Spark for Solr

This is another Tim Potter project, released in August 2015 and discussed in the blog post “Solr as an Apache Spark SQL DataSource”. Again, this is an SDK for developing Spark applications, including a test framework and a detailed example that uses Twitter.

Repo address: https://github.com/Lucidworks/spark-solr.

Image from book cover for Jean de Brunhoff’s “Babar and Father Christmas”.
