Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika implements parsers for some document formats itself and relies on external libraries (such as Apache PDFBox and Apache POI) to parse many more.

Tika provides a uniform Java API for all of the supported document formats to make life easier for the user.  Additionally, Tika provides functionality for detecting document type and content language.

In my earlier article, Content Extraction with Tika, I looked into using Tika either with Solr or in standalone mode. In this post I will go through some of the aspects involved in implementing support for new document formats. I will also provide a couple of example parsers and a full Maven project to get you up to speed quickly.

Extension mechanisms

The basic principle of adding support for more document formats in Tika is very simple. All you need to do is write a Java class that implements the Tika Parser interface and let Tika know about your extension. If your parser handles one of the >900 file formats Tika knows by filename extension, or one of the ~300 formats that Tika can recognize from the leading bytes of the file content, this is all you need to do.
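To make the idea of recognizing a format from its content more concrete, here is a minimal sketch of signature-based detection using only the JDK. SignatureSniffer is a hypothetical helper written for this post, not a Tika class, and Tika's own detection is driven by a much larger signature database; the only assumption about a real format is that PDF files start with the ASCII bytes "%PDF-".

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: SignatureSniffer is a hypothetical helper, not part of Tika.
class SignatureSniffer {

    // PDF files begin with the ASCII characters "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the data begins with the given signature bytes. */
    static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a MIME type from the leading bytes, falling back to a generic type. */
    static String sniff(byte[] head) {
        return startsWith(head, PDF_MAGIC) ? "application/pdf" : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head));
    }
}
```

When the type is already recognized this way, a new parser only has to declare that it supports that type.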

There are at least three different ways to let Tika know about the new parser. The first way (and the most flexible one) is to wire it up with Java code. If you wanted to use Tika's AutoDetectParser, you’d call the setParsers method on AutoDetectParser with the parsers of your choice, or add entries to the map returned by getParsers as in the example below. Wiring things up in Java code also lets you customize the detection logic easily by calling the setDetector method.

  public static void registerParser(AutoDetectParser autodetectParser,
      Parser parser) {
    // Register the parser under the string form of every media type it supports.
    for (MediaType type : parser.getSupportedTypes(new ParseContext())) {
      autodetectParser.getParsers().put(type.toString(), parser);
    }
  }

The next way to customize the set of available parsers is to use an external XML file and construct a TikaConfig object with that configuration file.

The last of the three methods explained here is to use the new mechanism that was added in version 0.7 of Tika: the standard Java ServiceProvider API. In practice this means that, as a provider of new parsing functionality, you need to list your parser implementation classes in the file META-INF/services/org.apache.tika.parser.Parser inside the .jar that you ship. The file is named after the fully qualified name of the Parser interface and contains one implementation class name per line:


com.lucid.tika.MyTXTParser
com.lucid.tika.VCardParser
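For readers who have not used the Java service provider mechanism before, the consumer side can be sketched with java.util.ServiceLoader from the JDK. This is only an illustration of the general mechanism: DocumentParser and ServiceDiscoveryModel are hypothetical names, and this is not the discovery code Tika itself uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Illustrative only: DocumentParser is a hypothetical stand-in for a parser
// interface, and this is not the discovery code Tika itself uses.
class ServiceDiscoveryModel {

    /** Hypothetical service interface that providers would implement. */
    public interface DocumentParser {
        String supportedType();
    }

    /**
     * Returns the class names of all providers listed in a
     * META-INF/services/<binary name of the service interface> file on the classpath.
     */
    static <T> List<String> providerNames(Class<T> service) {
        List<String> names = new ArrayList<>();
        for (T provider : ServiceLoader.load(service)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints an empty list unless a provider file for DocumentParser is on the classpath.
        System.out.println(providerNames(DocumentParser.class));
    }
}
```

Every .jar on the classpath can contribute its own provider file, which is what makes it possible to add parsers without touching any configuration.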

Extending Tika through the ServiceProvider API is straightforward. There are, however, a couple of details you should pay attention to when using this mechanism.

If you need to replace a parser implementation provided by Tika with your own, you need to make sure that your .jar file is loaded after the tika-parsers.jar file. This is because, in the current implementation, the last parser registered for a given MIME type is the one used to parse content of that type.
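The effect of registration order can be illustrated with a plain map. This is a simplified model with hypothetical names (ParserRegistryModel is not a Tika class), under the assumption stated above that a later registration for the same MIME type replaces an earlier one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of a "last registration wins" parser registry.
// ParserRegistryModel is hypothetical; this is not Tika code.
class ParserRegistryModel {

    /**
     * Registers parser class names per MIME type in the order given.
     * Each entry is {mimeType, parserClassName}; later entries overwrite earlier ones.
     */
    static Map<String, String> register(List<String[]> registrations) {
        Map<String, String> byMimeType = new LinkedHashMap<>();
        for (String[] r : registrations) {
            byMimeType.put(r[0], r[1]);
        }
        return byMimeType;
    }

    public static void main(String[] args) {
        // The custom parser is registered after the stock one, so it is the one that is kept.
        List<String[]> loadOrder = Arrays.asList(
            new String[] {"text/plain", "org.apache.tika.parser.txt.TXTParser"},
            new String[] {"text/plain", "com.lucid.tika.MyTXTParser"});
        System.out.println(register(loadOrder).get("text/plain"));
    }
}
```

Reversing the two entries would leave the stock parser in place, which is the situation you end up in if your .jar is loaded first.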

To support completely new document types (that Tika knows nothing about) you need to customize the detection process of AutoDetectParser manually. This is because there is no mechanism for extending the detection step comparable to the one for adding new parsers. One way to do this is to use CompositeDetector to add your “overlay” detector and rely on the default Detector for all the other types.

  public static void setOverlayDetector(Detector overlay, AutoDetectParser parser) {
    // Keep the parser's current detector and add the overlay after it.
    ArrayList<Detector> detectors = new ArrayList<Detector>();
    detectors.add(parser.getDetector());
    detectors.add(overlay);
    parser.setDetector(new CompositeDetector(detectors));
  }
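The overlay idea can be sketched without Tika on the classpath. The model below is a deliberate simplification with hypothetical names (OverlayDetectionModel, NameDetector); it assumes that a detector answers application/octet-stream when it does not recognize the input and that a later, more specific answer replaces an earlier one. Tika's actual CompositeDetector may combine results differently, and the text/x-vcard type is used here only as an example.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of overlay detection. The names are hypothetical and only
// mirror the idea behind combining a default detector with an overlay.
class OverlayDetectionModel {

    static final String UNKNOWN = "application/octet-stream";

    /** Hypothetical stand-in for a detector: maps a file name to a MIME type. */
    interface NameDetector {
        String detect(String fileName);
    }

    /** Runs every detector in order; a later non-generic answer replaces an earlier one. */
    static String detect(List<NameDetector> detectors, String fileName) {
        String type = UNKNOWN;
        for (NameDetector d : detectors) {
            String detected = d.detect(fileName);
            if (!UNKNOWN.equals(detected)) {
                type = detected;
            }
        }
        return type;
    }

    /** Builds the default detector followed by the overlay. */
    static List<NameDetector> defaultPlusOverlay() {
        NameDetector defaults = name -> name.endsWith(".txt") ? "text/plain" : UNKNOWN;
        NameDetector overlay = name -> name.endsWith(".vcf") ? "text/x-vcard" : UNKNOWN;
        return Arrays.asList(defaults, overlay);
    }

    public static void main(String[] args) {
        List<NameDetector> detectors = defaultPlusOverlay();
        System.out.println(detect(detectors, "contacts.vcf"));
        System.out.println(detect(detectors, "notes.txt"));
        System.out.println(detect(detectors, "image.xyz"));
    }
}
```

The overlay only has to know about the new type; everything else still falls through to the default detection.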

In this blog post I have demonstrated some ways to extend Tika's parsing and detection capabilities if your custom environment needs it. Process-wise, the best way to add new capabilities to Tika is to contribute your new parser integrations and enhancements back to the Tika project. This way the community as a whole will benefit from the results.

Running the provided example

  • Download project
  • Compile project
    mvn clean install
  • Copy dependencies to directory target/dependency
    mvn dependency:copy-dependencies
  • Execute the default TikaGUI with our additions (“enhanced” .txt parser, vCard parser)

    java -cp target/dependency/tika-app-0.7.jar:target/extending-tika-post-1.0-SNAPSHOT.jar:target/dependency/ical4j-vcard-0.9.2.jar:target/dependency/ical4j-1.0-rc2.jar:target/dependency/commons-codec-1.3.jar:target/dependency/commons-lang-2.4.jar org.apache.tika.gui.TikaGUI

About Sami Siren
