Posting Rich Documents to Apache Solr using SolrJ and Solr Cell (Apache Tika)

Solr Cell, a new feature in the soon-to-be-released Solr 1.4, lets users send rich documents such as MS Word and Adobe PDF files directly to Solr and have them indexed for search. All of the examples on the Solr Cell wiki page, however, only demonstrate how to send documents using the curl command-line utility, while many Solr users rely on SolrJ, Solr’s Java-based client. So I thought I would post a quick SolrJ example here (and I’ll update the Wiki) demonstrating how to do this.

For this example, I’m using the standard Solr example and this morning’s Solr trunk, which I checked out using SVN:

svn co https://svn.apache.org/repos/asf/lucene/solr/trunk apache-solr

Next, after changing into the checkout directory, I built the example using Apache Ant:

ant clean example   # slight overkill, but I did it nonetheless

I then changed into the example directory (trunk/example) and ran:

java -jar start.jar

Solr is now up and running. See the Solr Tutorial for more info on these steps.

On the code side, I created a SolrJ SolrServer, constructed a request containing a ContentStream (essentially a wrapper around a file), and sent it to Solr:

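import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;
public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // the example server started above
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract"); // the Solr Cell handler
    req.addFile(new File("apache-solr/site/features.pdf")); // wraps the file in a ContentStream
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true"); // return the extracted content instead of indexing it
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}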

In this SolrJ code I did an extraction only, but you can easily substitute your own parameters based on the descriptions on the Solr Cell Wiki page. The key class is ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly; SolrJ takes care of the rest.
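To make the parameter substitution a bit more concrete, here is a rough JDK-only sketch of how Solr Cell parameters end up on the request URL, much as they do in the curl examples on the Wiki. It is not how SolrJ builds requests internally. The class name ExtractUrlSketch and the helper buildUrl are made up for illustration; literal.id, commit, and extractOnly are Solr Cell parameter names from the Wiki page, and the id value is just an assumed example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ExtractUrlSketch {

  // Hypothetical helper (not part of SolrJ): joins a base URL, a handler path,
  // and URL-encoded parameters into a single request URL.
  static String buildUrl(String baseUrl, String handler, Map<String, String> params) {
    StringJoiner query = new StringJoiner("&");
    for (Map.Entry<String, String> e : params.entrySet()) {
      query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
          + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
    }
    return baseUrl + handler + "?" + query;
  }

  public static void main(String[] args) {
    // Index the document instead of only extracting it: assign an id and commit.
    // LinkedHashMap keeps the parameters in insertion order.
    Map<String, String> params = new LinkedHashMap<>();
    params.put("literal.id", "features.pdf"); // assumed example id
    params.put("commit", "true");
    System.out.println(buildUrl("http://localhost:8983/solr", "/update/extract", params));
  }
}
```

With SolrJ itself, the same idea amounts to replacing the setParam call in the demo above with the parameters you want; the file is still attached through addFile.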

I hope that gives people a quick idea of how to send files to Solr Cell via SolrJ. Also note that ContentStreamUpdateRequest is not specific to Solr Cell: you can use it to send CSV to the CSV Update handler, or to any other Request Handler that works with Content Streams for updates.

For completeness, the output from the code above looks like this (some results reformatted for screen width):

Result: {responseHeader={status=0,QTime=1692},null=

Introduction to The Solr Enterprise Search Server Table of contents 1 Solr in a Nutshell… 2 2 Solr Uses the Lucene Search Library and Extends it!… 2 3 Detailed Features..2 3.1 Schema… 2 3.2 Query… 3 3.3 Core…. 3 3.4 Caching…3 3.5 Replication…4 3.6 Admin Interface…4 Copyright © 2007 The Apache Software Foundation. All rights reserved.

1. Solr in a Nutshell Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called “indexing”) via XML over HTTP. You query it via HTTP GET and receive XML results. • Advanced Full-Text Search Capabilities • Optimized for High Volume Web Traffic • Standards Based Open Interfaces – XML and HTTP • Comprehensive HTML Administration Interfaces • Server statistics exposed over JMX for monitoring • Scalability – Efficient Replication to other Solr Search Servers • Flexible and Adaptable with XML configuration • Extensible Plugin Architecture 2. Solr Uses the Lucene Search Library and Extends it! • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys • Powerful Extensions to the Lucene Query Language • Support for Dynamic Faceted Browsing and Filtering • Advanced, Configurable Text Analysis • Highly Configurable and User Extensible Caching • Performance Optimizations • External Configuration via XML • An Administration Interface • Monitorable Logging • Fast Incremental Updates and Snapshot Distribution • Distributed search with sharded index on multiple hosts • XML and CSV/delimited-text update formats • Easy ways to pull in data from databases and XML files from local disk and HTTP sources • Multiple search indices 3. Detailed Features 3.1. Schema • Defines the field types and fields of documents • Can drive more intelligent processing • Declarative Lucene Analyzer specification Introduction to The Solr Enterprise Search Server Page 2 Copyright © 2007 The Apache Software Foundation. All rights reserved.

• Dynamic Fields enables on-the-fly addition of new fields • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field • Explicit types eliminates the need for guessing types of fields • External file-based configuration of stopword lists, synonym lists, and protected word lists • Many additional text analysis components including word splitting, regex and sounds-like filters 3.2. Query • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby) • Sort by any number of fields • Advanced DisMax query parser for high relevancy results from user-entered queries • Highlighted context snippets • Faceted Searching based on unique field values and explicit queries • Spelling suggestions for user queries • More Like This suggestions for given document • Constant scoring range and prefix queries – no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches. • Function Query – influence the score by a function of a field’s numeric value or ordinal • Date Math – specify dates relative to “NOW” in queries and updates • Performance Optimizations 3.3. Core • Pluggable query handlers and extensible XML data format • Document uniqueness enforcement based on unique key field • Batches updates and deletes for high performance • User configurable commands triggered on index changes • Searcher concurrency control • Correct handling of numeric types for both sorting and range queries • Ability to control where docs with the sort field missing will be placed • “Luke” request handler for corpus information 3.4. Caching • Configurable Query Result, Filter, and Document cache instances • Pluggable Cache implementations • Cache warming in background • When a new searcher is opened, configurable searches are run against it in order to Introduction to The Solr Enterprise Search Server Page 3 Copyright © 2007 The Apache Software Foundation. All rights reserved.

warm it up to avoid slow first hits. During warming, the current searcher handles live requests. • Autowarming in background • The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabing high cache hit rates across index/searcher changes. • Fast/small filter implementation • User level caching with autowarming support 3.5. Replication • Efficient distribution of index parts that have changed via rsync transport • Pull strategy allows for easy addition of searchers • Configurable distribution interval allows tradeoff between timeliness and cache utilization 3.6. Admin Interface • Comprehensive statistics on cache utilization, updates, and queries • Text analysis debugger, showing result of every stage in an analyzer • Web Query Interface w/ debugging output • parsed query output • Lucene explain() document score detailing • explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher. Introduction to The Solr Enterprise Search Server Page 4 Copyright © 2007 The Apache Software Foundation. All rights reserved.

,null_metadata={stream_source_info=[null],stream_content_type=[null],stream_size=[13242], producer=[FOP 0.20.5],stream_name=[null],Content-Type=[application/pdf]}}
