Extending Apache Tika Capabilities

by Sami Siren
June 18, 2010

Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika implements parsers for some document formats itself and relies on external libraries (such as Apache PDFBox and Apache POI) for many more.

Tika provides a uniform Java API for all of the supported document formats to make life easier for the user. Additionally, Tika provides functionality for detecting document type and content language.

In my earlier article, Content Extraction with Tika, I looked into using Tika either with Solr or in standalone mode. In this post I will go through some of the aspects involved in implementing support for new document formats. I will also provide a couple of example parsers and a full Maven project to get you up to speed quickly.

Extension mechanisms

The basic principle of adding support for more document formats in Tika is very simple. All you need to do is write a Java class that implements the Tika Parser interface and let Tika know about your extension. If your parser handles one of the more than 900 file formats that Tika knows by filename extension, or one of the roughly 300 formats that Tika can recognize from the leading bytes of the file content, this is all you need to do.
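
To make the two kinds of recognition concrete, here is a minimal, self-contained sketch. It is not Tika's detection code: the class DetectionSketch, its tiny rule table and the MIME type strings are assumptions chosen only for illustration. It checks the leading content bytes first and falls back to the filename extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical mini-detector; illustrates the idea only, not Tika's implementation. */
class DetectionSketch {

    static final String UNKNOWN = "application/octet-stream";

    /** Returns true if data starts with the ASCII bytes of prefix. */
    static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    /** Content bytes are checked first; the filename extension is the fallback. */
    static String detect(String fileName, byte[] head) {
        if (startsWith(head, "%PDF-")) {
            return "application/pdf";
        }
        if (startsWith(head, "BEGIN:VCARD")) {
            return "text/x-vcard";
        }
        String lower = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        if (lower.endsWith(".vcf")) {
            return "text/x-vcard";
        }
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] head = "BEGIN:VCARD\r\nVERSION:3.0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("contact.dat", head));      // recognized from content
        System.out.println(detect("notes.txt", new byte[0])); // recognized from the name
    }
}
```

A parser for a type that is already recognized in one of these ways needs no extra detection work; only the parser itself has to be registered.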

There are at least three different ways to let Tika know about the new parser. The first (and most flexible) way is to wire it up in Java code. If you wanted to use Tika's AutoDetectParser, you would call the setParsers method on AutoDetectParser with the parsers of your choice, or add entries to the map returned by getParsers as in the helper below. Wiring things up in Java code also lets you customize the detection logic easily, just by calling the setDetector method.

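  // Registers the given parser for every media type it reports as supported.
  // The parser map is keyed by the media type string.
  public static void registerParser(AutoDetectParser autodetectParser,
      Parser parser) {
    for (MediaType type : parser.getSupportedTypes(new ParseContext())) {
      autodetectParser.getParsers().put(type.toString(), parser);
    }
  }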

The next way to customize the set of available parsers is to use an external XML file and construct a TikaConfig object with that configuration file.

The last of the three methods explained here is the mechanism that was added in version 0.7 of Tika: the standard Java service provider mechanism. In practice this means that, as a provider of new parsing functionality, you list your parser implementation classes, one per line, in the file META-INF/services/org.apache.tika.parser.Parser (the file is named after the fully qualified name of the Parser interface) inside the .jar that contains your parsers.

  com.lucid.tika.MyTXTParser
  com.lucid.tika.VCardParser
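
The file format itself is the one documented for the JDK's java.util.ServiceLoader: one fully qualified class name per line, with blank lines, surrounding whitespace and everything after a # character ignored. The following sketch reads content in that format; the class ProviderFileSketch is hypothetical and only illustrates the format, it is not the code Tika uses to load parsers.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for the provider-configuration file format; not Tika's loader. */
class ProviderFileSketch {

    /** Returns the class names listed in the file content, in file order. */
    static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "# parsers from this article\n"
                + "com.lucid.tika.MyTXTParser\n"
                + "com.lucid.tika.VCardParser\n";
        System.out.println(providerNames(content));
    }
}
```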

Extending the capabilities of Tika through the service provider mechanism is straightforward. There are, however, a couple of details you should pay attention to when using it.

If you need to replace a parser implementation provided by Tika with your own, you need to make sure that your .jar file is loaded after the tika-parsers.jar file. This is because, in the current implementation, the last parser registered for a given MIME type is the one used to parse content of that type.
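
The effect of load order can be pictured as a plain map that is filled in registration order. The sketch below is only a model of the "last one wins" rule described above; the class RegistrationOrderSketch is hypothetical and parsers are represented by their class names rather than by real Parser instances.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of "last registered parser wins"; not Tika code. */
class RegistrationOrderSketch {

    /** Each entry is {mimeType, parserClassName}; later entries replace earlier ones. */
    static Map<String, String> register(String[][] registrations) {
        Map<String, String> parsersByType = new HashMap<>();
        for (String[] registration : registrations) {
            parsersByType.put(registration[0], registration[1]);
        }
        return parsersByType;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = register(new String[][] {
            {"text/plain", "org.apache.tika.parser.txt.TXTParser"}, // loaded first
            {"text/plain", "com.lucid.tika.MyTXTParser"}            // loaded later
        });
        System.out.println(parsers.get("text/plain"));
    }
}
```

With the registrations reversed, the parser shipped with Tika would end up handling text/plain, which is why the order of the .jar files matters.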

To support completely new document types (ones that Tika knows nothing about) you need to customize the detection process of AutoDetectParser manually, because there is no mechanism for extending the detection step comparable to the one for adding new parsers. One way to do this is to use CompositeDetector to add your own “overlay” detector and rely on the default Detector for detecting the other types.

  public static void setOverlayDetector(Detector overlay, AutoDetectParser parser) {
    // Keep the parser's current detector and add the overlay detector alongside it.
    List<Detector> detectors = new ArrayList<Detector>();
    detectors.add(parser.getDetector());
    detectors.add(overlay);
    parser.setDetector(new CompositeDetector(detectors));
  }
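
The overlay idea can also be shown without any Tika classes. The sketch below is a simplified stand-in: detectors are modeled as functions from a file name to a MIME type string, and the rule "use the overlay's answer unless it is the generic type" is an assumption made for this example. Tika's CompositeDetector applies its own rules for choosing between the answers of its detectors, and the type application/x-foo is invented.

```java
import java.util.function.Function;

/** Hypothetical overlay-plus-default detector; a simplified model, not Tika's CompositeDetector. */
class OverlayDetectorSketch {

    static final String UNKNOWN = "application/octet-stream";

    /** Asks the overlay first and falls back to the default detector for everything else. */
    static Function<String, String> withOverlay(Function<String, String> overlay,
                                                Function<String, String> fallback) {
        return fileName -> {
            String type = overlay.apply(fileName);
            return UNKNOWN.equals(type) ? fallback.apply(fileName) : type;
        };
    }

    public static void main(String[] args) {
        Function<String, String> defaults =
            name -> name.endsWith(".txt") ? "text/plain" : UNKNOWN;
        Function<String, String> overlay =
            name -> name.endsWith(".foo") ? "application/x-foo" : UNKNOWN;

        Function<String, String> detector = withOverlay(overlay, defaults);
        System.out.println(detector.apply("report.foo")); // answered by the overlay
        System.out.println(detector.apply("notes.txt"));  // answered by the default
    }
}
```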

In this blog post I have demonstrated some ways to extend Tika's parsing and detection capabilities when your environment needs it. Process-wise, the best way to add new capabilities to Tika is to contribute your new parser integrations and enhancements back to the Tika project. This way the community as a whole will benefit from the results.

Running the provided example

1. Download the project.
2. Compile the project:
  mvn clean install
3. Copy the dependencies to the directory target/dependency:
  mvn dependency:copy-dependencies
4. Run the default TikaGUI with our additions (the “enhanced” .txt parser and the vCard parser):
  java -cp target/dependency/tika-app-0.7.jar:target/extending-tika-post-1.0-SNAPSHOT.jar:target/dependency/ical4j-vcard-0.9.2.jar:target/dependency/ical4j-1.0-rc2.jar:target/dependency/commons-codec-1.3.jar:target/dependency/commons-lang-2.4.jar org.apache.tika.gui.TikaGUI
