Tika is a content extraction framework that builds on best-of-breed open source content extraction libraries such as Apache PDFBox and Apache POI, while providing a single, easy-to-use API for detecting content type (MIME type) and then extracting full text and metadata. Combined with the Apache Solr Content Extraction Library (Solr Cell), it makes searching rich content types straightforward. In this article, Tika committer and Lucene PMC member Sami Siren introduces the Tika API and then demonstrates its integration into Apache Solr via the Solr Cell module.

Introduction

In this article, I will go through a basic introduction to Apache Tika, its components, its API and a simple content extraction example. I will also take a look at the recently committed Content Extraction Component built on top of Apache Solr (a.k.a. Solr Cell).

Full source code for the included examples and links to more information about the subject are available in the resources section at the end of this article.

What is Apache Tika?

Apache Tika is a content type detection and content extraction framework. Tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual content and metadata from several document formats. Tika does not try to understand the full variety of different document formats by itself but instead delegates the real work to various existing parser libraries such as Apache POI for Microsoft formats, PDFBox for Adobe PDF, Neko HTML for HTML etc.

The grand idea behind Tika is that it offers a generic interface for parsing multiple formats. The Tika API hides the technical differences of the various parser implementations. This means that you don’t have to learn and consume one API for every format you use but can instead use a single API: the Tika API. Internally Tika usually delegates the parsing work to existing parsing libraries and adapts the parse result so that client applications can easily manage a variety of formats.

Tika aims to be efficient in using available resources (mainly RAM) while parsing. The Tika API is stream oriented so that the parsed source document does not need to be loaded into memory all at once but only as it is needed. Ultimately, however, the amount of resources consumed is mandated by the parser libraries that Tika uses.

At the time of writing, Tika directly supports around 30 document formats (see Table 1). The set of supported formats is not limited by Tika in any way: in the simplest case you can add support for a new document format by writing a thin adapter that implements the Parser interface for that format.

Table 1. Supported Document formats

Package formats [a]: .tar, .jar, .zip, .bzip2, .gz, .tgz
Text Document Formats: .doc (MS Word Document), .xls (MS Excel Document), .ppt (MS PowerPoint Document), .rtf (Rich Text Format), .pdf (Adobe Portable Document Format), .html, .xhtml, OpenDocument, .txt (Plain text)
Image Formats: .bmp, .gif, .png, .jpeg, .tiff
Audio Formats: .mp3, .aiff, .au, .midi, .wav
Misc Formats: .pst (Outlook mail), .xml, .class (Java class files)

[a] Package formats can contain multiple separate documents inside one file. In such a case the content extracted by Tika will include content from all of the included documents.

Tika functionalities

The two main functionalities Tika offers are MIME type detection and content parsing. MIME type detection is useful for discovering the type of a file when it is not known beforehand. Tika contains a class named AutoDetectParser that uses MIME type detection to find out the type of a file and then dispatches the parsing task to a parser that understands the format. By using the AutoDetectParser you don’t have to think about different parsers at all; Tika takes care of that for you.

MIME type detection in Tika can draw on several different pieces of available information when it tries to detect the format of a file. These hints include a submitted MIME type string, the resource name (file name extension) and finally the raw bytes of the document. The data structures used for MIME type detection are configurable, so you can easily add new detection capabilities together with new parser adapters.
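To make the three kinds of hints more concrete, here is a minimal sketch of hint-based detection using only the JDK. The class TypeSniffer, its method names, the handful of signatures it knows and the order in which the hints are consulted are all assumptions made for illustration; this is not how Tika implements detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch of hint-based type detection; not Tika's implementation. */
class TypeSniffer {

    /** Consults the raw bytes first, then the file name, then the submitted type (assumed order). */
    static String detect(byte[] prefix, String resourceName, String submittedType) {
        String fromBytes = fromMagic(prefix);
        if (fromBytes != null) {
            return fromBytes;
        }
        String fromName = fromExtension(resourceName);
        if (fromName != null) {
            return fromName;
        }
        return submittedType != null ? submittedType : "application/octet-stream";
    }

    /** Matches a few well-known leading byte signatures. */
    static String fromMagic(byte[] data) {
        if (data == null) {
            return null;
        }
        if (startsWith(data, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(data, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        return null;
    }

    /** Falls back to the file name extension, ignoring case. */
    static String fromExtension(String name) {
        if (name == null) {
            return null;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        if (lower.endsWith(".html")) {
            return "text/html";
        }
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch the content of the file wins over a misleading file name, and the submitted type is only used when nothing else matches. A real detector may weigh the hints differently.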

The most important capability of Tika is parsing content. Tika provides a thin wrapper/adapter on top of existing parsers, defined by the Parser interface. The Parser interface can be seen in the following example:


Example 1. Tika Parser Interface

public interface Parser {

    /**
     * Parses a document stream into a sequence of XHTML SAX events.
     * Fills in related document metadata in the given metadata object.
     *
     * The given document stream is consumed but not closed by this method.
     * The responsibility to close the stream remains on the caller.
     *
     * @param stream the document stream (input)
     * @param handler handler for the XHTML SAX events (output)
     * @param metadata document metadata (input and output)
     * @throws IOException if the document stream could not be read
     * @throws SAXException if the SAX events could not be processed
     * @throws TikaException if the document could not be parsed
    */
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException;

}

As you can see, the interface to the Tika parser is extremely simple. It takes just three parameters. The (input) parameter stream lets the parser read the raw data of the document. The (output) parameter handler is used to send callback notifications about the logical content of the document back to your application. The handler is of type org.xml.sax.ContentHandler, exactly the same interface that is used in the Java SAX 2.0 API.

 

Note

Tika provides some ready-made ContentHandler implementations that you might find useful while parsing content with Tika.

Finally, the metadata (input/output) parameter provides additional data to the parser as input and can return additional metadata out from the document. Examples of metadata include things like author name, number of pages, creation date, etc.
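Because the handler is a plain SAX ContentHandler, you can experiment with the callback model without Tika at all. The following sketch uses only the JDK: it feeds a small XHTML string through the JDK's own SAX parser and collects the text found inside the body element. The class name TextCollector and the helper extractBody are hypothetical and exist only for this illustration; they are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that keeps only the character data inside <body>. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than zero while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Drives the handler with the JDK SAX parser; wraps checked exceptions for brevity. */
    static String extractBody(String xhtml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not process SAX events", e);
        }
        return handler.toString();
    }
}
```

A handler of this kind never sees the whole document at once; it only reacts to events as they arrive, which is what makes the stream-oriented style possible.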

Parsing content with Tika

In the following example I will show you how to parse a PDF document. I will use Tika to extract the title, the author and the document body as plain text. The full source code, including everything you need to build and run the example, is available; see the resources section for a link.


Example 2. Parsing content and metadata from PDF file

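// Open the document; the caller is responsible for closing this stream.
InputStream input = new FileInputStream(new File(resourceLocation));
// BodyContentHandler collects the document body as plain text.
ContentHandler textHandler = new BodyContentHandler();
// The parser fills this object with the document's metadata.
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
// The parser does not close the stream, so close it here.
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());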

In the above example I first create a FileInputStream containing the document to parse. Then I use a Tika content handler called BodyContentHandler that internally constructs a content handler decorator of type XHTMLToTextContentHandler. The decorator is responsible for actually forming the plain text output from the SAX events that the parser emits.

Next I instantiate a PDFParser directly, call the parse method and close the stream. The explicit call to the close method of InputStream is required since it is not the responsibility of a parser to call it for you.

In real life you should probably use AutoDetectParser or the getParser(String mimeType) method of TikaConfig instead of constructing and calling the individual Tika parsers directly.
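The idea of looking up a parser by MIME type instead of hard-coding it can be sketched with the JDK alone. Everything in the following block is hypothetical: DocParser is a simplified stand-in for a parser abstraction (it is not the Tika Parser interface), PlainTextAdapter is a toy adapter, and ParserRegistry only illustrates the lookup step. It does not show how TikaConfig or AutoDetectParser are implemented.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser abstraction; not the Tika Parser interface. */
interface DocParser {
    void parse(InputStream stream, ContentHandler handler) throws IOException, SAXException;
}

/** Toy adapter: reports UTF-8 text as a single XHTML paragraph. */
class PlainTextAdapter implements DocParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler) throws IOException, SAXException {
        // Reads everything at once for brevity; a real adapter would stream.
        char[] chars = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
    }
}

/** Maps MIME types to parsers so callers never name a concrete parser class. */
class ParserRegistry {
    private final Map<String, DocParser> parsers = new HashMap<>();

    void register(String mimeType, DocParser parser) {
        parsers.put(mimeType, parser);
    }

    DocParser forType(String mimeType) {
        DocParser parser = parsers.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mimeType);
        }
        return parser;
    }

    /** Parses the stream with the parser registered for the type and returns the text. */
    String extractText(String mimeType, InputStream stream) {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            forType(mimeType).parse(stream, collector);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }
}
```

With a registry like this, supporting a new format means registering one more adapter; the calling code stays the same.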

Run the example:

  1. Download and extract the example source package.
  2. Compile the example: mvn clean install
  3. Execute: mvn exec:java -Dexec.mainClass=com.lucidimagination.article.tika.TikaParsePdf -Dexec.args=src/test/resources/SampleDocument.pdf

Now that I have shown you how to parse plain text content from a PDF document using Tika, the next natural thing to do is to allow searching through this data. Grant Ingersoll recently added a new contributed module for Apache Solr that enables us to do just this very easily.

Introduction to Solr Cell (ExtractingRequestHandler)

The Extracting Request Handler is a Solr contrib module that allows users to post binary documents to Solr. The handler uses Tika to detect the content type and to extract indexable content, both text and metadata, from documents submitted to it. The Extracting Request Handler is quite flexible in how it maps content to different fields, how it boosts certain fields, etc. For more detailed information about all available options, look at Solr’s Extracting Request Handler Wiki page.

Extracting Request Handler example

In the following example, I will download and install Solr, configure the Extracting Request Handler and send a PDF document to Solr to be indexed.

Procedure 1. Run the example

  1. Download and extract a nightly version of Solr (or release 1.4 or later).
  2. Download and extract the example source package.
  3. Set up the index schema.

    Tip

    You can use the one provided in the example package. It contains a very simple schema where documents contain just three fields: id, title and text.

    Example 3. Document structure of provided schema.xml

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  4. Set up and configure the Extracting Request Handler.

    Tip

    You can use the one provided in the example package. The file where you configure the handler is called solrconfig.xml and it is under the solr/conf directory.


    Example 4. Extracting Request Handler configuration in solrconfig.xml

    <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
        <lst name="defaults">
            <str name="ext.map.Last-Modified">last_modified</str>
            <bool name="ext.ignore.und.fl">true</bool>
        </lst>
    </requestHandler>
  5. Start Solr

    Note

    Once you (re)start your Solr instance it should be ready to consume all document formats that Tika supports.

  6. Send the document to be indexed to Solr

    Tip

    You can use the curl command line utility (available on most *NIX systems) to send the document:

    curl "http://localhost:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" -F "myfile=@src/test/resources/SampleDocument.pdf"

    Any other tool or program that can HTTP POST the file using multipart/form-data encoding should work too.

  7. Send a commit message to Solr.

    Tip

    You need to issue a commit so that the document becomes searchable. You can use curl to do the commit:

    curl "http://localhost:8983/solr/update/" -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>'

  8. Open the Solr admin interface and verify that the document we indexed is indeed searchable.

    Tip

    The Solr admin console is, by default, available at address http://localhost:8983/solr/admin.
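The two curl calls from steps 6 and 7 can also be issued from Java. The following sketch uses the java.net.http client that ships with Java 11 and later, and builds the multipart/form-data body by hand. The class SolrCellClient, its method names and the boundary string are hypothetical choices for this illustration; the URLs, the form field name myfile and the commit message are the ones used in the curl commands.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch for posting a file to Solr Cell and committing. */
class SolrCellClient {

    /** Builds a multipart/form-data body with a single file part. */
    static byte[] multipartBody(String boundary, String field, String fileName, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** POST request that carries the multipart body to the extract handler. */
    static HttpRequest extractRequest(String url, String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    /** POST request that carries the same commit message as the curl example. */
    static HttpRequest commitRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString("<commit waitFlush=\"false\"/>"))
                .build();
    }

    /** Sends a request and returns the response body; needs a running Solr instance. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

To index the sample document you would read its bytes (for example with Files.readAllBytes), pass them to multipartBody with the field name myfile, send the result of extractRequest for http://localhost:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text, and then send commitRequest for http://localhost:8983/solr/update/.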

Conclusion

Apache Solr together with the Extracting Request Handler is powerful, easy to set up and has a lot of potential. The main benefit of having the parsing and extraction component on the server side is that the client does not have to understand all of the formats or even be Java-based. By using Tika, it is also quite easy to add parsers for new file types as they become available, without having to change or add code to Solr.

In some cases, you may need content extraction outside of search or with an existing Lucene implementation, in which case you can use Tika standalone, as I have shown in the first part of this article. Additionally, there is at least one use case with Solr where it makes sense to use Tika on the client side. If you only want to index and search document metadata (i.e. not the full text), and your files are quite large, then parsing your documents on the client side and sending just the metadata to Solr makes the most sense, so as to avoid having to send large files over the wire to Solr.
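For the metadata-only case, the client would extract the metadata locally and post a small XML update message instead of the original file. The sketch below builds such a message in Solr's XML add format using only the JDK. The class MetadataDoc and its methods are hypothetical, and the field names id and title are taken from the example schema shown earlier; how the metadata values are obtained is left out.

```java
import java.util.Map;

/** Hypothetical helper that turns extracted metadata into a Solr XML add message. */
class MetadataDoc {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one document; the fields appear in the iteration order of the map. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                    .append(escape(field.getValue()))
                    .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }
}
```

The resulting string can be posted to the regular update handler with Content-Type text/xml, followed by a commit, so only a few hundred bytes travel over the wire regardless of the size of the source file.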

Resources

MS Office and related applications are trademarks of the Microsoft Corporation. PDF is a trademark of Adobe, Inc.