Lucidworks is happy to announce that several of our connectors for indexing content from Hadoop to Solr are now open source.
There are six of them, with support for Spark, Hive, Pig, HBase, Storm, and HDFS, all available on GitHub. All of them work with Solr 5.x and include options for Kerberos-secured environments if required.
HDFS for Solr
This is a job jar for Hadoop which uses MapReduce to prepare content for indexing and push documents to Solr. It supports Solr running in standalone mode or SolrCloud mode.
It can connect to standard Hadoop HDFS or MapR’s MapR-FS.
A key feature of this connector is the ingest mapper, which converts content from various original formats to Solr-ready documents. CSV files, ZIP archives, SequenceFiles, and WARC files are supported. Grok and regular expressions can also be used to parse content. If there are other formats you’d like to see, let us know!
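To make the idea of an ingest mapper concrete, here is a minimal Python sketch of the kind of conversion it performs: turning a raw log line (via a regular expression) or a CSV file into flat, Solr-ready documents. This is only an illustration and not the connector's code (the connector itself is a Hadoop job jar). The log pattern, the field names, and the helper functions `log_line_to_doc` and `csv_to_docs` are all assumptions made for this example.

```python
import csv
import io
import re

# Hypothetical pattern for an Apache-style access log line (illustrative only;
# the field names are assumptions, not a schema the connector requires).
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def log_line_to_doc(line, doc_id):
    """Parse one log line into a flat dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()          # one entry per named group
    doc["id"] = doc_id               # every document gets a unique key
    doc["status"] = int(doc["status"])
    return doc


def csv_to_docs(text, id_field):
    """Turn CSV text with a header row into one dict per row, keyed by column name."""
    docs = []
    for row in csv.DictReader(io.StringIO(text)):
        doc = dict(row)
        doc["id"] = doc.pop(id_field)  # rename the chosen column to "id"
        docs.append(doc)
    return docs


if __name__ == "__main__":
    sample = '127.0.0.1 - - [10/Oct/2015:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
    print(log_line_to_doc(sample, "log-1"))
    print(csv_to_docs("sku,name\n1,widget\n2,gadget\n", "sku"))
```

Each resulting dictionary is a flat set of field names and values, which is the general shape of a document that Solr indexes. Lines that do not match the pattern return `None` so the caller can decide whether to skip or log them.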
Repo address: https://github.com/LucidWorks/hadoop-solr.
Hive for Solr
This is a Hive SerDe which can index content from a Hive table to Solr or read content from Solr to populate a Hive table.
Repo address: https://github.com/LucidWorks/hive-solr.
Pig for Solr
These are Pig Functions which can output the result of a Pig script to Solr (standalone or SolrCloud).
Repo address: https://github.com/LucidWorks/pig-solr.
HBase for Solr
The hbase-indexer is a service which uses the HBase replication feature to intercept content streaming to HBase and replicate it to a Solr index.
Our work is a fork of an NGDATA project, updated for Solr 5.x and HBase 1.1. It also supports HBase 0.98 with Solr 5.x. (Note: HBase versions earlier than 0.98 have not been tested with our changes.)
We plan to contribute these changes back to the NGDATA project, but while we get that patch together, you can use our code with Solr 5.x.
Repo address: https://github.com/LucidWorks/hbase-indexer.
Storm for Solr
My colleague Tim Potter developed this integration, and discussed it back in May 2015 in the blog post Integrating Storm and Solr. This is an SDK to develop Storm topologies that index content to Solr.
As an SDK, it includes a test framework and tools to help you prepare your topology for use in a production cluster. The README has a nice example using Twitter which can be adapted for your own use case.
Repo address: https://github.com/LucidWorks/storm-solr.
Spark for Solr
This is another Tim Potter project, which we released in August 2015 and discussed in the blog post Solr as an Apache Spark SQL DataSource. Again, this is an SDK for developing Spark applications, including a test framework and a detailed example that uses Twitter.
Repo address: https://github.com/LucidWorks/spark-solr.
Image from the book cover of Jean de Brunhoff’s “Babar and Father Christmas”.