A Short Introduction to Indexing / Search using Lucene

At times I need an indexing tool that does something akin to an embedded database: an embedded index.  The need comes up when running filters over data in a large visual table, or over some other visualization.

From a coding point of view, the initial attempt at filtering might look something like this:

public List<String> filter(String userFilterText) {
        List<String> ret = new LinkedList<String>();
        for( Entity e : entities ) {
                if( e.containsText(userFilterText) ) {
                         ret.add(e.getEntityId());
                }
        }
        return ret;
}

At some point the number of rows, or data elements, grows beyond what this loop can scan while still responding to the user in a timely manner.  Collecting the data into some in-memory structure only postpones the problem; eventually that breaks down as well.

The solution is to build an index, embedded into the application, which manages the filtering.  This means filtering becomes:

public List<String> filter(String userFilterText) {
        List<String> ret = index.query(userFilterText);
        return ret;
}
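
To give a feel for why an index avoids the per-row scan, here is a minimal sketch of an inverted index in plain Java.  It is not Lucene and not the code from the example project; the class name TinyIndex and the row ids are made up for illustration.  It also simplifies matching: it finds whole words rather than substrings, and it requires every word of the query to be present.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (hypothetical, not Lucene): maps each lowercased word to the ids containing it. */
class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the text into lowercase words and records the entity id under each word. */
    void add(String entityId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(entityId);
            }
        }
    }

    /** Returns the ids of entities that contain every word of the query. */
    List<String> query(String userFilterText) {
        Set<String> result = null;
        for (String token : userFilterText.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> ids = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new LinkedHashSet<>(ids);   // first word: start from its postings
            } else {
                result.retainAll(ids);               // later words: intersect
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("row-1", "Lorem ipsum dolor sit amet");
        index.add("row-2", "ut enim ad minim veniam");
        index.add("row-3", "dolor in reprehenderit ut labore");
        System.out.println(index.query("dolor"));     // prints [row-1, row-3]
        System.out.println(index.query("ut dolor"));  // prints [row-3]
    }
}
```

The work of reading every row moves to add(), which runs once per row.  A query then only touches the entries for the words it mentions, which is why it stays responsive as the table grows.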

Although this is a bit slower for a small number of items, it is never slow enough to hurt the user's experience.  When there is a lot of data, users expect things to be slightly slower, and that is acceptable.  In addition, the filter can now match by field instead of relying on something like String.contains() or regular expressions.

Building one of these indexes is quite simple.  You add data to a document with Document.add(Field) and hand the document to an IndexWriter with addDocument().  You query with searcher.search(Query, Collector).  A fairly useful module can be had for less than 1000 lines of code.

The class IndexProvider.java is at the heart of the example.  You call IndexProvider.index(data) for every object you have to index.  You can then call IndexProvider.search(String) to query the built-up index.

The entry point is Example.main(), and it has one artificial requirement: it must be run twice.  The first time the example is run, it creates a directory named index and indexes the file example.csv into it.  The second time it is run, it runs a query for ‘the’ over the content.

Other, more complicated, queries are possible.  To get all of the Lorem text:

ut eu

To match on a specific field:

+ut +f1:two
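
To make the meaning of these two queries concrete, here is a small plain-Java sketch that mimics how they are read: a bare term is optional, a leading + makes a clause required, and field:term restricts the match to one field.  It is only an imitation of the query syntax, not Lucene's parser.  The class TinyQuery, the default field name "content", and the sample documents are assumptions made for illustration; only the field name f1 comes from the query above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Toy evaluator (hypothetical, not Lucene) for clauses: term, +term, field:term, +field:term. */
class TinyQuery {
    // Assumed name of the field searched when a clause names no field.
    static final String DEFAULT_FIELD = "content";

    /** True if the clause's field contains the clause's term as a whole word. */
    static boolean clauseMatches(Map<String, String> doc, String clause) {
        int colon = clause.indexOf(':');
        String field = colon < 0 ? DEFAULT_FIELD : clause.substring(0, colon);
        String term = clause.substring(colon + 1).toLowerCase();
        String text = doc.getOrDefault(field, "").toLowerCase();
        return Arrays.asList(text.split("\\W+")).contains(term);
    }

    /** Required (+) clauses must all match; with no required clauses, one optional match is enough. */
    static boolean matches(Map<String, String> doc, String query) {
        boolean hasRequired = false;
        boolean anyOptional = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!clauseMatches(doc, clause.substring(1))) {
                    return false;
                }
            } else if (clauseMatches(doc, clause)) {
                anyOptional = true;
            }
        }
        return hasRequired || anyOptional;
    }

    /** Returns the "id" value of every document that matches the query. */
    static List<String> search(List<Map<String, String>> docs, String query) {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (matches(doc, query)) {
                ids.add(doc.get("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the real example.csv is not reproduced here.
        List<Map<String, String>> docs = List.of(
            Map.of("id", "a", "content", "ut labore et dolore", "f1", "one"),
            Map.of("id", "b", "content", "eu fugiat nulla", "f1", "two"),
            Map.of("id", "c", "content", "ut enim ad minim", "f1", "two"),
            Map.of("id", "d", "content", "sed do eiusmod", "f1", "two"));
        System.out.println(search(docs, "ut eu"));        // prints [a, b, c]
        System.out.println(search(docs, "+ut +f1:two"));  // prints [c]
    }
}
```

The first query matches any row containing either word, while the second narrows the result to rows that contain ut and whose f1 field is two.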

This allows the visualization filtering to be as rich as any query.  And, more importantly, the filtering can be tied to whatever the data happens to be, without any code changes.

The example is available for download as lucene-starter, a .tgz file with a pom and sources.
