Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure (if they use a structure at all). Your data may be incomplete, the same information is represented in different ways by different sources, and it’s often vague. Oh, and if a user can’t find the correct result using a simple Google-like search, someone may literally get away with murder.

Welcome to Ronald Mayer’s world. In his talk on Day 2 of Lucene Revolution, he described how Forensic Logic aggregates data from local police departments, courts, and even federal agencies, so that law enforcement officers can get information on crimes and suspects not only in their own jurisdictions, but also in surrounding areas.

He’s got lots of challenges, not the least of which is the variety of data formats. You’d think there’d be a standard for this, and as he points out, there is. In fact, there are lots of them. In reality, the standards are so large that most agencies wind up using some subset, plus their own extensions. And that doesn’t count the agencies that have all of their data in Word files in a folder on someone’s computer. (It also doesn’t count data from external sources. Apparently gangs are big into MySpace. Who knew?)

Once they’ve created a transformation for adding a new agency’s data, they have a whole host of practical issues to deal with. For example, “at dusk” and “near any elementary school in the school district” are perfectly valid search criteria, and their system has to accommodate them. They’ve also got to be able to take a statement such as “suspect is a caucasian male, approximately 6’4″, wearing a red ball cap and black jacket, leather” and have it come up for “tall caucasian baseball cap black leather jacket”. Entity extraction, with help from Basis Technology products, associates each adjective with its noun, which makes things a bit easier.

Interestingly, Forensic also has to work in the other direction: if all of that information were encoded only by field, it wouldn’t be searchable in a simple text search. So another task they’ve had to perform is to de-normalize the data back into a searchable narrative.
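To make the de-normalization step concrete, here is a minimal sketch of flattening a structured record into one free-text string that a simple search box can match. The field names, the sample record, and the to_narrative helper are all hypothetical; the talk summary does not describe Forensic Logic's actual schema or code.

```python
def to_narrative(record):
    """Flatten a structured record into a single free-text string.

    Hypothetical helper for illustration only; not Forensic Logic's code.
    """
    parts = []
    for field, value in record.items():
        if value is None or value == "":
            continue  # skip missing values; source data is often incomplete
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        # Turn "suspect_race" into "suspect race" so the label is searchable too.
        parts.append(f"{field.replace('_', ' ')}: {value}")
    return ". ".join(parts)


# Sample record with made-up field names.
record = {
    "suspect_race": "caucasian",
    "suspect_sex": "male",
    "height": "6'4\"",
    "clothing": ["red ball cap", "black leather jacket"],
    "weapon": None,
}

narrative = to_narrative(record)
print(narrative)
```

The resulting string can be indexed in a single text field alongside the structured fields, so both fielded and free-text queries can reach the same record.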

Forensic gets good use from Lucene, and in fact has been contributing back. They have been making heavy use of the phrase field (pf) and phrase slop (ps) parameters in the new Extended Dismax parser, but what they really needed was to be able to combine several sets of them in a single query. So SOLR-2058 proposes (and implements) a new query syntax, field~slop^boost, that allows for independent pf and ps settings, such as:

pf2=important_text^10~10&pf=important_text^100&pf=important_text^100~10
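As a rough sketch of how a client might send those repeated parameters, the snippet below builds the query string with Python's standard library. The pf and pf2 values are copied from the example above; the q and qf values are assumptions added for illustration, and defType=edismax is the standard way to select the Extended Dismax parser.

```python
from urllib.parse import urlencode, parse_qsl

# A list of pairs (not a dict) so that "pf" can appear more than once.
params = [
    ("q", "tall caucasian baseball cap black leather jacket"),  # assumed query
    ("defType", "edismax"),
    ("qf", "important_text"),  # assumed query field
    ("pf2", "important_text^10~10"),
    ("pf", "important_text^100"),
    ("pf", "important_text^100~10"),
]

query_string = urlencode(params)
print(query_string)

# Decoding the string gives back the same ordered pairs.
assert parse_qsl(query_string) == params
```

Using a list of pairs matters here: a dictionary would keep only one pf entry, which would defeat the purpose of giving each phrase field its own slop.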

Mayer says that although searches might take a second or three to return data, relevance is much more important than speed in this case.

There are still problems to overcome. For example, relative boosts are tricky. How far away does an event have to happen before it’s as irrelevant as something that happened two years ago? Forensic continually refines the process, tags, synonyms, and other parameters, so at any given moment they’re re-indexing the oldest documents that haven’t yet been through the latest version.
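The distance-versus-age question can be made concrete with a toy model. The sketch below assumes both boosts decay exponentially, with made-up scales of 25 km and 365 days; neither the decay shape nor the numbers come from the talk, and the function names are hypothetical.

```python
import math

# Hypothetical decay scales, chosen only for illustration.
DISTANCE_SCALE_KM = 25.0
AGE_SCALE_DAYS = 365.0


def distance_boost(km):
    """Boost multiplier that shrinks as the event gets farther away."""
    return math.exp(-km / DISTANCE_SCALE_KM)


def age_boost(days):
    """Boost multiplier that shrinks as the event gets older."""
    return math.exp(-days / AGE_SCALE_DAYS)


def equivalent_distance_km(days):
    """Distance at which distance_boost equals age_boost(days).

    Solving exp(-km / D) = exp(-days / T) gives km = days * D / T.
    """
    return days * DISTANCE_SCALE_KM / AGE_SCALE_DAYS


two_years = 730
km = equivalent_distance_km(two_years)
print(km, distance_boost(km), age_boost(two_years))
```

Under these assumed scales, the answer depends entirely on the ratio of the two scales, which is exactly the kind of parameter that has to be tuned against real searches.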

Remember that the next time you get pulled over.

Cross-posted with Lucene Revolution Blog. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.