By now, many of you have had the opportunity to use the online, searchable version of the Lucidworks Reference Guide for Solr 1.4. In this post, I’ll describe how we took the original document version of the Reference Guide, and transformed it into an online resource searched by Solr.

I hope that you might find this useful if you are faced with creating a similar, online searchable service from existing documents.

The Content

The Reference Guide itself was composed and edited in OpenOffice Writer (OOW), chosen not least for its open source provenance. While the primary design goal was to create a single downloadable PDF of the full Reference Guide, using OOW gave us a helpful starting point for the transformation into its searchable online counterpart.

Each chapter in the guide was written as a single OOW document (e.g., Chapter1.sxw, Chapter2.sxw, etc.). To create a truly useful index of the book, and to simplify navigation of the content, we decided that rather than indexing each chapter as a single document, we would index each section as an independent text document. Thus, Chapter 2 was to be indexed as the following 12 documents:

  • 2 Getting Started
  • 2.1 Installing Lucidworks for Solr
  • 2.1.1 Got Java?
  • 2.1.2 Downloading the Lucidworks for Solr Installer
  • 2.1.3 Running the Installer
  • 2.2 Running Lucidworks for Solr
  • 2.2.1 Fire Up the Server
  • 2.2.2 Add Documents
  • 2.2.3 Ask Questions
  • 2.2.4 Clean Up
  • 2.3 A Quick Overview
  • 2.4 A Step Closer
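A minimal sketch of what "one document per section" looks like in practice. The field names (`id`, `chapter`, `title`, `body`) are our own illustration here, not the actual LucidFind schema:

```python
# Sketch: model each section of a chapter as its own indexable document.
# Field names ("id", "chapter", "title", "body") are illustrative only.

def section_docs(chapter, sections):
    """Build one small document per section, keyed by its section number."""
    docs = []
    for number, title in sections:
        docs.append({
            "id": number,       # e.g. "2.1.1" -- chapter plus section number
            "chapter": chapter,
            "title": title,
            "body": "",         # section text filled in later from the XHTML
        })
    return docs

chapter2 = section_docs(2, [
    ("2", "Getting Started"),
    ("2.1", "Installing Lucidworks for Solr"),
    ("2.1.1", "Got Java?"),
])
print(len(chapter2), chapter2[2]["id"])  # -> 3 2.1.1
```

Keying each document by its section number is what later lets the application walk back up to parent sections (2.1.1 → 2.1 → 2) for context.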

Document pipeline

The first step of the conversion sequence was to create HTML files. We used Writer2LaTeX, an open source program, to generate XHTML files. Despite its name, Writer2LaTeX can turn an OpenOffice document into either LaTeX or, as we used it, a navigable sequence of web pages. The pages include HTML page titles, previous/next/up buttons, and basic formatting. We created a separate XHTML page for every section in the book.

Next, we had to spend some time fooling with the configuration of Writer2LaTeX to match our purpose; it makes various default assumptions about its output, based on its mission of creating web pageflows, that didn't square with our intent. In the end, we had to post-process each page to match our purposes. (More on the final output format later.)
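As one example of the kind of post-processing involved, here is a sketch of stripping the previous/next/up navigation links from a generated page so that only the section content gets indexed. The markup pattern is an assumption for illustration, not Writer2LaTeX's exact output:

```python
import re

# Sketch: strip the previous/next/up navigation anchors from a generated
# XHTML page before indexing. The link markup shown here is an assumed
# pattern, not Writer2LaTeX's literal output.

NAV_LINK = re.compile(r'<a href="[^"]*">(?:previous|next|up)</a>', re.IGNORECASE)

def strip_nav(xhtml):
    """Remove navigation anchors so only the section content remains."""
    return NAV_LINK.sub("", xhtml)

page = ('<p><a href="sec21.html">previous</a> '
        '<a href="sec23.html">next</a></p><h2>2.2 Running</h2>')
print(strip_nav(page))
```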

With the post-processed output in hand, we indexed it using basic Solr. We now had a simple Solr index that could accurately return the desired documents in response to standard Solr queries.
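Indexing a section with Solr 1.4 means posting an XML update message to the `/solr/update` handler (followed by a `<commit/>`). A sketch of building that message, with illustrative field names:

```python
import xml.etree.ElementTree as ET

# Sketch: build a Solr XML update message for one section document.
# Field names are illustrative; the real schema may differ. The resulting
# XML would be POSTed to http://localhost:8983/solr/update, followed by
# a <commit/> message, per Solr's standard XML update format.

def to_solr_add(doc):
    """Wrap a dict of field values in Solr's <add><doc> update XML."""
    add = ET.Element("add")
    node = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(node, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml = to_solr_add({"id": "2.1.1", "title": "Got Java?"})
print(xml)
```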

The Search Application

Next, we had some work to do to turn the Solr application into a full-fledged web application that would fetch the indexed HTML and render it within the presentation layer you now see on the search display page (our internal name for this application is LucidFind). And, since LucidFind already included faceting, we used Solr's basic faceting capabilities to add facets for the LWCDRG along with our website content and blog posts.
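The faceting itself is just a matter of query parameters. A sketch of the kind of faceted query LucidFind might issue, where the `source` field (distinguishing the Reference Guide from website content and blog posts) is our illustration; the `facet`, `facet.field`, and `fq` parameters are standard Solr:

```python
from urllib.parse import urlencode

# Sketch: a faceted Solr query. The "source" field is an assumed name for
# the facet that separates Reference Guide, website, and blog content;
# facet, facet.field, and fq are standard Solr request parameters.

params = urlencode({
    "q": "dismax parser",
    "facet": "true",
    "facet.field": "source",          # facet counts per content source
    "fq": "source:reference-guide",   # narrow results to the guide
})
print("/solr/select?" + params)
```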

Now we ran into some challenges with the formatting of the documents, which required further post-processing. Some heavy scripting and HTML tweaking were required to get the desired, visually consistent behavior for table formatting, image framing, and CSS creation; problems included paragraph spacing, font sizing, and similar HTML presentation issues. We also encountered a bug in Writer2LaTeX, in which the tool didn't properly capture chapter numbers from the OOW metadata; when we contacted the author, he fixed it for us (gotta love open source!).

Some Content Management Challenges

We also placed the book's images in a static content directory. This took a little extra work, as the original LWCDRG design did not account for this centralized approach: for editing the book and creating the PDF, it is more convenient to bind the images into each chapter, though in retrospect we might have been able to accomplish this with some mechanism using external links.

The id for each document includes the chapter number and section number. The actual document text is the XHTML body element text, without the surrounding

<body></body>

This makes it easy to pull in the parent sections and include them for context, wrapping the sections in one large HTML page. For testing, an XSL script did exactly this.
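A sketch of that wrapping step: extract each section's stored text (the body contents, minus the tags) and concatenate parent and child sections into one large HTML page. Our real version used XSL; this Python equivalent is only an illustration.

```python
import re

# Sketch: pull the contents of the <body> element (without the tags, as
# stored in the index) and wrap several section fragments into one HTML
# page. Illustrative stand-in for the XSL script described above.

BODY = re.compile(r"<body[^>]*>(.*)</body>", re.DOTALL)

def body_text(xhtml):
    """Return the contents of the <body> element without the tags themselves."""
    match = BODY.search(xhtml)
    return match.group(1).strip() if match else ""

def wrap_sections(sections):
    """Concatenate stored section fragments into a single HTML page."""
    return "<html><body>\n%s\n</body></html>" % "\n".join(sections)

page = body_text("<html><body><h2>2.1 Installing</h2><p>Steps.</p></body></html>")
print(wrap_sections([page]))
```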

Lance Norskog is a search engineer at Lucid Imagination.