Solr and law enforcement: highly relevant results can be a crime

by Lucidworks
June 1, 2011

Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure (if they use a structure at all). Your data may be incomplete, the same information is represented in different ways by different sources, and it’s often vague. Oh, and if a user can’t find the correct result using a simple Google-like search, someone may literally get away with murder.

Welcome to Ronald Mayer’s world. In his talk on Day 2 of Lucene Revolution, he described how Forensic Logic aggregates data from local police departments, courts, and even federal agencies, so that law enforcement officers can get information on crimes and suspects not only in their own jurisdictions, but also in surrounding areas.


He’s got lots of challenges, not the least of which are the differences in data formats. You’d think there’d be a standard for this, and as he points out, there is. In fact, there are lots of them. In reality, the standards are so large that most agencies wind up using some subset, plus their own extensions. And that doesn’t count the agencies that keep all of their data in Word files in a folder on someone’s computer. (It also doesn’t count data from external sources. Apparently gangs are big into MySpace. Who knew?)

Once they’ve created a transformation for adding a new agency’s data, they have a whole host of practical issues to deal with. For example, “at dusk” and “near any elementary school in the school district” are perfectly valid requests, and their system has to accommodate them. They’ve also got to be able to take a statement such as “suspect is a caucasian male, approximately 6’4″, wearing a red ball cap and black jacket, leather” and have it come up for “tall caucasian baseball cap black leather jacket”. Entity extraction, with help from Basis Technology products, associates each adjective with its noun, which makes things a bit easier.
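To make the “6’4″ should match tall” problem concrete, here is a minimal sketch of index-time expansion in Python. It is not Forensic Logic's pipeline (the talk credits entity extraction for this work). The height threshold, the synonym table, and the `expand_description` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical values for illustration only; none of these come from the talk.
TALL_INCHES = 74                            # heights at or above this get the token "tall"
SYNONYMS = {"ball cap": "baseball cap"}     # phrase found in text -> extra phrase to index

# Matches heights written like 6'4 or 6’4 (feet, apostrophe, inches).
HEIGHT_RE = re.compile(r"(\d)\s*['’]\s*(\d{1,2})")


def expand_description(text: str) -> str:
    """Return the original text followed by extra searchable tokens."""
    extras = []
    match = HEIGHT_RE.search(text)
    if match:
        inches = int(match.group(1)) * 12 + int(match.group(2))
        if inches >= TALL_INCHES:
            extras.append("tall")
    lowered = text.lower()
    for phrase, synonym in SYNONYMS.items():
        if phrase in lowered:
            extras.append(synonym)
    return " ".join([text] + extras)


expanded = expand_description(
    "suspect is a caucasian male, approximately 6’4″, wearing a red ball cap"
)
print(expanded)
```

With the extra tokens appended, a plain text search for “tall” or “baseball cap” has something to match against.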

Interestingly, Forensic also has to work in the other direction; if all of that information were encoded only by field, it wouldn’t be searchable in a simple text search. So another task they’ve had to perform is to de-normalize the data back into a searchable narrative.
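One way to picture that de-normalization step is a function that flattens a fielded record into a single text string to index alongside the structured fields. This is only a sketch under assumed names: the record layout and the `to_narrative` helper are hypothetical, not taken from the talk.

```python
def to_narrative(record: dict) -> str:
    """Flatten a fielded record into one free-text string for indexing."""
    parts = []
    for field, value in record.items():
        # Skip fields the source agency left empty.
        if value is None or value == "" or value == []:
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        label = field.replace("_", " ")
        parts.append(f"{label}: {value}")
    return ". ".join(parts)


# Hypothetical record; the field names are assumptions for illustration.
record = {
    "race": "caucasian",
    "sex": "male",
    "clothing": ["red ball cap", "black leather jacket"],
    "vehicle": None,
}
narrative = to_narrative(record)
print(narrative)
```

The resulting string can go into a catch-all text field, so a Google-like query can hit it without the user knowing which field holds what.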

Forensic gets good use from Lucene, and in fact has been contributing back. They have been making heavy use of the phrase field (pf) and phrase slop (ps) parameters in the new Extended Dismax parser, but what they really needed was to combine several sets of them in a single query. So SOLR-2058 proposes (and implements) a new query syntax, field~slop^boost, that allows for independent pf and ps settings, such as:

pf2=important_text~10^10&pf=important_text^100&pf=important_text~10^100
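As a rough sketch of how a client might assemble such a request, the Python below builds the query string with the standard library. The pf and pf2 values are written in the field~slop^boost order named above. The `qf` setting and the `build_edismax_query` helper are assumptions for illustration, and nothing here contacts a Solr server.

```python
from urllib.parse import urlencode, parse_qs


def build_edismax_query(user_query: str) -> str:
    """Build a URL query string with several independent phrase-field settings."""
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        ("qf", "important_text"),            # assumed query field, not from the article
        ("pf2", "important_text~10^10"),     # word pairs, slop 10, boost 10
        ("pf", "important_text^100"),        # whole phrase, default slop, boost 100
        ("pf", "important_text~10^100"),     # whole phrase, slop 10, boost 100
    ]
    # A list of tuples (not a dict) keeps both pf entries in the output.
    return urlencode(params)


query_string = build_edismax_query("tall caucasian baseball cap black leather jacket")
print(query_string)
print(parse_qs(query_string)["pf"])
```

Passing a list of pairs rather than a dict matters here, because the point of the syntax is that pf can appear more than once with different slop values.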

Mayer says that although searches might take a second or three to return data, relevance is much more important in this case.

There are still problems to overcome. For example, relative boosts are tricky. How far away does an event have to happen before it’s as irrelevant as something that happened two years ago? Forensic continually works on refining the process, tags, synonyms, and other parameters, so at any given moment they’re reindexing the oldest (not yet reindexed) documents.
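To see why the distance-versus-age question is a tuning problem, consider a toy model in which relevance halves over a fixed distance and over a fixed age. This is purely an illustrative assumption, not how Forensic Logic scores results, and both half-life constants are invented. Under this model the trade-off reduces to a simple ratio.

```python
import math

# Hypothetical tuning constants, invented for this sketch.
DISTANCE_HALF_KM = 10.0    # relevance halves for every 10 km of distance
AGE_HALF_DAYS = 180.0      # relevance halves for every 180 days of age


def distance_factor(km: float) -> float:
    """Multiplicative boost for an event `km` kilometres away."""
    return 0.5 ** (km / DISTANCE_HALF_KM)


def age_factor(days: float) -> float:
    """Multiplicative boost for an event `days` days old."""
    return 0.5 ** (days / AGE_HALF_DAYS)


def equivalent_distance_km(days: float) -> float:
    """Distance whose penalty equals the penalty for an event `days` old."""
    return DISTANCE_HALF_KM * days / AGE_HALF_DAYS


two_years = 730
km = equivalent_distance_km(two_years)
same_penalty = math.isclose(distance_factor(km), age_factor(two_years))
print(round(km, 1), same_penalty)
```

Change either constant and the answer shifts, which is why the boosts need continual refinement against real searches.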

Remember that the next time you get pulled over.

Cross-posted with Lucene Revolution Blog. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.
