By Jonck van der Kogel

Introduction

As companies gather more and more data, the ability to search that data is becoming increasingly important. Especially with legacy systems, this can sometimes be quite a challenge. One situation you might encounter is where documents in rich formats such as PDF or MS Word/Excel/PowerPoint are stored as BLOBs in a SQL database. Your first reaction might be that indexing these would be a lot of work, since Solr does not support such an import natively. But by using Solr's DataImportHandler together with a custom Transformer, it actually becomes pretty straightforward.

To tell Solr that we are working with the DataImportHandler and a custom Transformer, we have to do some configuration first. The three files that need to be modified are:

  • data-config.xml
  • solrconfig.xml
  • schema.xml

data-config.xml

The first step is to set up your data-config.xml. This file will contain your datasource definition, the fields that you want indexed and the query for how to retrieve the data from the database. Here is an example data configuration:

<dataConfig>
	<dataSource driver="oracle.jdbc.driver.OracleDriver"
	 url="jdbc:oracle:thin:@myServer:1521:myDb"
	 user="username"
	 password="password" />
	<document name="distributionDoc">
		<entity name="distribution" query="SELECT distr.id, distr.auhor, distr.titel, distr.summary, distr.publishdate FROM myDistrubutions distr">

		<field column="ID" name="id" />
		<field column="AUTHOR" name="author" />
		<field column="TITEL" name="titel" />
		<field column="SUMMARY" name="summary" />
		<field column="PUBLISHDATE" name="publishDate" />
		<entity name="attachment" query="SELECT attach.ID, attach.file FROM attachments attach
			WHERE attach.ID_DISTRIBUTION='${distribution.ID}'"
			transformer="com.hinttech.solr.BlobTransformer">

			<field column="ID" name="fileId" />
			<field column="FILE" name="file" blob="true" />
		</entity>
		</entity>
	</document>
</dataConfig>

The dataSource section speaks for itself; here you set up how to connect to your data source. Remember to put the JDBC driver that you specify here on Solr's classpath! In my case I ran Solr under Tomcat, so I had to add the Oracle driver to the WEB-INF/lib directory of the Solr webapp.

The document section of data-config.xml describes how you want Solr to index your data. In my example, I have documents called "distributions" that need to be searchable on fields like author, title, summary and publication date. Each distribution can have several PDF files attached to it. To set this up, you define a child entity within the entity tag. The distribution table has a primary key "id", which is used to join the attachments. Notice how we use Expression Language in the child entity's query. The transformer attribute of the child entity tells Solr which Transformer implementation to use to turn this entity into something Solr can index. The field containing the actual BLOB has an extra attribute blob="true"; we'll get to that shortly.

solrconfig.xml

With the data-config.xml correctly configured, it is now time to enable the DataImportHandler in Solr and tell Solr where the data-config.xml resides. This is done by adding the following to the solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
	<lst name="defaults">
		<str name="config">/home/username/data-config.xml</str>
	</lst>
</requestHandler>

schema.xml

The last step in configuring Solr is to describe the fields from your data-config.xml in schema.xml. In my example, it looks like this:

<fields>
	<field name="id" type="integer" indexed="true" stored="true" required="true" />
	<field name="author" type="string" indexed="true" stored="true" required="false" />
	<field name="titel" type="text" indexed="true" stored="true" required="false" />
	<field name="summary" type="text" indexed="true" stored="true" required="false" />
	<field name="publishDate" type="date" indexed="true" stored="true" required="false" />
	<field name="file" type="text" indexed="true" stored="false" required="false" multiValued="true" />
	<field name="fileId" type="long" indexed="true" stored="true" required="false" multiValued="true" />
</fields>

As you can see, the field definitions match what you set up in data-config.xml verbatim. The only thing that caused me some headaches was that I initially neglected to set the multiValued="true" attribute on the fields of the child entity. Without this attribute, it turned out that only the first attachment was being indexed, so a lot of queries did not find what I expected. After a lot of debugging and delving through the DataImportHandler source code, I discovered it was simply a matter of setting the multiValued attribute. After that, all attachments were indexed properly.
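For example, a distribution with three attached PDFs should end up as a single Solr document with three values in its file field (and three corresponding values in fileId); without multiValued="true", only the first attachment's text makes it into the index.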

Custom Transformer

Now let's code up our custom Transformer. I used PDFBox 0.7.3 to parse the PDF files. Although Apache has taken over the PDFBox project, at the time of writing there has not yet been an official Apache release of PDFBox, so I chose to go with the last stable version. PDFBox has a dependency on FontBox, and if your PDFs have security restrictions on them PDFBox will also need some BouncyCastle jars. Below is the Maven 2 POM for this project:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<!-- POM version for Maven 2 -->
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.hinttech</groupId>
	<artifactId>customSolr</artifactId>
	<packaging>jar</packaging>
	<version>1.0</version>
	<name>Custom Solr</name>
	<build>
		<sourceDirectory>src</sourceDirectory>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.6</source>
					<target>1.6</target>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-jar-plugin</artifactId>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-dependency-plugin</artifactId>
				<executions>
					<execution>
						<id>copy-extra-dependencies</id>
						<goals>
							<goal>copy-dependencies</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build>
	<dependencies>
		<dependency>
			<groupId>org.pdfbox</groupId>
			<artifactId>PDFBox</artifactId>
			<version>0.7.3</version>
		</dependency>
		<dependency>
			<groupId>org.fontbox</groupId>
			<artifactId>FontBox</artifactId>
			<version>0.1.0</version>
		</dependency>
		<dependency>
			<groupId>org.apache.solr</groupId>
			<artifactId>solr-dataimporthandler</artifactId>
			<version>1.3.0</version>
		</dependency>
		<dependency>
			<groupId>commons-logging</groupId>
			<artifactId>commons-logging</artifactId>
 			<version>1.1.1</version>
		</dependency>
		<dependency>
			<groupId>org.bouncycastle</groupId>
			<artifactId>bcprov-jdk16</artifactId>
			<version>143</version>
		</dependency>
		<dependency>
			<groupId>org.bouncycastle</groupId>
			<artifactId>bcmail-jdk16</artifactId>
			<version>143</version>
		</dependency>
	</dependencies>
</project>

Note that I needed to download the PDFBox and FontBox jars manually and put them in my local Maven repository.
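Before wiring PDFBox into Solr, it can help to verify that the jars are set up correctly with a small standalone extraction test. This is only a sketch and not part of the Solr setup; the "sample.pdf" path is a placeholder, and the parsing code mirrors what the transformer below does.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Standalone sanity check for the PDFBox 0.7.3 setup; "sample.pdf" is a placeholder path.
public class PdfExtractionCheck {

	public static void main(String[] args) throws IOException {
		InputStream inputStream = new FileInputStream("sample.pdf");
		COSDocument cd = null;
		try {
			// Parse the PDF the same way the transformer does and print the extracted text.
			PDFParser parser = new PDFParser(inputStream);
			parser.parse();
			cd = parser.getDocument();
			String text = new PDFTextStripper().getText(new PDDocument(cd));
			System.out.println(text);
		} finally {
			if (cd != null) {
				cd.close();
			}
			inputStream.close();
		}
	}
}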

The custom Transformer needs to extend the org.apache.solr.handler.dataimport.Transformer class and implement its transformRow method. This method is passed the current row of the child entity's query result set as a Map&lt;String, Object&gt;, plus a Context object from which you can retrieve all the fields of the child entity. By looping over those fields you can test whether a field has the "blob" attribute. If it does, we know we are dealing with a BLOB, so we retrieve it from the current row, read the text from the PDF using PDFBox, and put the parsed text back into the row using the column name (in our case "FILE") as the key. The BLOB has now been transformed into text, which is passed back to Solr so it can unleash its indexing magic.

import java.io.IOException;
import java.io.InputStream;
import java.math.BigDecimal;
import java.sql.Blob;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class BlobTransformer extends Transformer {
	private static final Log LOGGER = LogFactory.getLog(BlobTransformer.class);

	@Override
	public Object transformRow(Map<String, Object> row, Context context) {
		// All field definitions of the child entity, including our custom "blob" attribute.
		List<Map<String, String>> fields = context.getAllEntityFields();

		BigDecimal id = null;
		for (Map<String, String> field : fields) {
			String name = field.get("name");
			if ("fileId".equals(name)) {
				// Remember the attachment id, purely for logging purposes.
				String columnName = field.get("column");
				id = (BigDecimal) row.get(columnName);
				LOGGER.info("Processing file with ID: " + id);
			}
			String isBlob = field.get("blob");
			if ("true".equals(isBlob)) {
				String columnName = field.get("column");
				Blob blob = (Blob) row.get(columnName);
				COSDocument cd = null;
				InputStream inputStream = null;
				try {
					// Parse the BLOB as a PDF and extract its text with PDFBox.
					inputStream = blob.getBinaryStream();
					PDFParser parser = new PDFParser(inputStream);
					parser.parse();
					cd = parser.getDocument();
					PDFTextStripper stripper = new PDFTextStripper();
					String text = stripper.getText(new PDDocument(cd));

					// Replace the BLOB in the row with the extracted text so Solr can index it.
					row.put(columnName, text);
					LOGGER.info("Processed file with ID: " + id);
				} catch (SQLException sqle) {
					LOGGER.error(sqle);
					row.put(columnName, "");
				} catch (IOException ioe) {
					LOGGER.error(ioe);
					row.put(columnName, "");
				} finally {
					try {
						if (cd != null) {
							cd.close();
						}
						if (inputStream != null) {
							inputStream.close();
						}
					} catch (IOException ioe) {
						LOGGER.error(ioe);
					}
				}
			}
		}
		return row;
	}
}
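Note that the transformer swallows parse failures: if a BLOB cannot be read or parsed, it logs the error and puts an empty string in the row, so a single corrupt PDF does not abort the entire import; that distribution is simply indexed without attachment text.

If you build with the POM above, running mvn package should produce the jar, and the copy-dependencies execution gathers the dependency jars (by default under target/dependency), which makes it easy to collect everything that has to go onto Solr's classpath.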

Create a jar that contains the above class and put it, together with its dependencies, on the classpath of Solr, then restart Solr. If you are running Solr locally, you can now start a full import by calling this URL:

http://localhost:8080/solr/dataimport?command=full-import
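While the import is running, the same handler can report on its progress: calling it with command=status instead of command=full-import should show how many rows and documents have been processed so far, and command=reload-config re-reads data-config.xml after you change it.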

In our case indexing took about 20 minutes. Once it’s ready, go to

http://localhost:8080/solr/admin/

which brings up the Solr admin page, where you can run some test queries. If all is well you should now be able to search through the PDF files stored as BLOBs in your SQL database.
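For example, assuming the schema above, a query like http://localhost:8080/solr/select?q=file:contract should return every distribution that has an attached PDF containing the word "contract" (the search term is just an illustration).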

Jonck van der Kogel is a software engineer and manager of the Java Competence Center at HintTech in the Netherlands. He has over 10 years of experience in software development, focused on J2EE web applications and webservices.
