Typewriter

Despite the broad acceptance of Apache Lucene and Solr technologies across a wide range of environments and content types, we’ve concluded that there is one area where community development and commercial extensions have fallen short: searching printed, physical paper.

As a result, today we at Lucid Imagination are announcing our intent to develop Lucene Paper Distribution, a new set of libraries for libraries, private collections, government organizations, museums, courts, lawyers and office supply companies who still have a lot of the stuff of dead trees, covered with ink and set into discrete, bound volumes with page numbers and all that good stuff.

Lucene Paper Distribution, which we expect to release exactly one year from today, will help that small part of the population that has to conduct searches without networks, software, browsers, or any of those other commonplace conveniences that are already dominated by modern Lucene/Solr applications. Users of Lucene Paper Distribution will benefit from new analyzers that can parse 100% cotton and traditional wood-pulp papers; serif and sans-serif fonts; page numbers at both the top and bottom of the page; and both thought balloons and speech balloons in comic books and graphic novels from 1937 forward. We also expect to have file extractors for both good and bad penmanship, and for 8.5×11 inch, legal size, and A4 paper for the European marketplace.

There are many important choices ahead in the development of Lucene Paper Distribution. One large computer company has offered to develop the libraries in COBOL and donate the code; however, we also have contribs offered by teams at Georgia-Pacific, Hammermill, and Smith Corona. It will take a while to sort these out, but by April 1, 2011, we should have a complete package.

Lucene Paper Distribution will do many things to expand the frontiers of search, but there are some limitations. We won’t be able to tell you in which issue of Spider-Man Peter Parker originally asked Mary Jane out on their first date, how lemmatization could help with the missing 18 minutes of the Watergate tape transcripts, why anyone memorizes Madonna or Elton John lyrics, the meaning of Fibonacci numbers in the Köchel catalog of Mozart’s works, how to read obscure t-shirts from 1990s developer conventions, why the originators of BSD abandoned their plan to call it BFD, or what the last three words of Carol’s inscription in your high school yearbook were.

One final note: there are many additional edge use cases you may be familiar with where Lucene Paper Distribution may be fruitfully applied; please submit these as comments to this blog post. Deadline: April 1, 2010.