A Short Introduction to Indexing / Search using Lucene

At times I need an indexing tool that does for search what an embedded database does for storage: an embedded index. The need comes up when running filters over the data in a large visual table, or over some other visualization.

From a coding point of view, the initial attempt at filtering might look something like this:

// Naive approach: scan every entity on every call and keep the ones that match.
public List<String> filter(String userFilterText) {
        List<String> ret = new LinkedList<String>();
        for( Entity e : entities ) {
                if( e.containsText(userFilterText) ) {
                        ret.add(e.getEntityId());
                }
        }
        return ret;
}

At some point the number of rows, or data elements, grows beyond what this loop can scan while still answering the user in a timely manner. Collecting the data into some in-memory structure only postpones the problem; eventually that breaks down as well.

The solution is to build an index, embedded into the application, which manages the filtering. This means filtering becomes:

public List<String> filter(String userFilterText) {
        List<String> ret = index.query(userFilterText);
        return ret;
}

Although this is a bit slower for a small number of items, it is never slow enough to hurt the user's perception of responsiveness. When there is a lot of data, users expect things to be slightly slower, and that is acceptable. In addition, the filter can now match by field instead of relying on something like String.contains() or regular expressions.
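To make the idea behind index.query() concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene and not the code from the download; the class name ToyIndex and its two methods are hypothetical, and the tokenizing is deliberately crude. It only shows why a lookup by token avoids scanning every entity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy in-memory inverted index. Illustrative only; Lucene does far more. */
class ToyIndex {
    // token -> ids of the entities whose text contains that token
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the text into lower-case tokens and records the id under each one. */
    public void index(String entityId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(entityId);
            }
        }
    }

    /** Returns the ids of the entities that contain every token of the filter text. */
    public List<String> query(String userFilterText) {
        Set<String> result = null;
        for (String token : userFilterText.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> ids = postings.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new LinkedHashSet<>(ids);   // first token seeds the result
            } else {
                result.retainAll(ids);               // later tokens narrow it down
            }
        }
        return result == null ? new ArrayList<String>() : new ArrayList<>(result);
    }
}
```

Each query costs a few map lookups and a set intersection rather than a pass over all entities. The price is the up-front work of indexing, which is why the index version can be slightly slower when there are only a handful of items.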

Building one of these indexes is quite simple. You add data with Document.add(Field), and you query with searcher.search(Query, Collector). A fairly useful module can be had in fewer than 1000 lines of code.

The class IndexProvider.java is at the heart of the example. You call IndexProvider.index(data) for every object you have to index, and then call IndexProvider.search(String) to query the built-up index.

The entry point is Example.main(), and it has one artificial requirement: it must be run twice. The first time the example is run, it creates a directory named index and indexes example.csv. The second time it is run, it runs a query for ‘the’ over the content.

Other, more complicated queries are possible. To get all of the Lorem text, search for either of two words:

ut eu

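To restrict a term to a specific field, prefix it with the field name. Here both clauses are required, and the second must match in field f1: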

+ut +f1:two

This allows the visualization filtering to be as rich as any query. More importantly, the filtering can be tied to whatever the data happens to be, without any code changes.
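The meaning of the two queries above can be sketched in plain Java. This is a toy matcher, not Lucene's query parser: the names ToyQuery, Clause, parse and matches are hypothetical, and it assumes that a term without a field prefix may match in any field, whereas Lucene searches a single default field. It mirrors two rules only: every clause marked with + must match, and when no clause is marked, at least one clause must match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy matcher for "+term" and "field:term" clauses. Illustrative only. */
class ToyQuery {
    /** One clause: an optional field, a term, and whether the clause is required. */
    static final class Clause {
        final String field;      // null means "any field" (a simplification)
        final String term;
        final boolean required;

        Clause(String field, String term, boolean required) {
            this.field = field;
            this.term = term;
            this.required = required;
        }
    }

    /** Splits a query such as "+ut +f1:two" into clauses. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            boolean required = raw.startsWith("+");
            String body = required ? raw.substring(1) : raw;
            if (body.isEmpty()) {
                continue;
            }
            int colon = body.indexOf(':');
            String field = colon > 0 ? body.substring(0, colon) : null;
            String term = colon > 0 ? body.substring(colon + 1) : body;
            clauses.add(new Clause(field, term.toLowerCase(), required));
        }
        return clauses;
    }

    /** True if the document (field name -> text) satisfies the clauses. */
    static boolean matches(Map<String, String> doc, List<Clause> clauses) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (Clause c : clauses) {
            boolean hit = hits(doc, c);
            if (c.required) {
                hasRequired = true;
                if (!hit) {
                    return false;        // a required clause failed
                }
            } else if (hit) {
                anyOptionalHit = true;
            }
        }
        return hasRequired || anyOptionalHit;
    }

    /** True if the clause's term appears as a whole token in a permitted field. */
    private static boolean hits(Map<String, String> doc, Clause c) {
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (c.field != null && !c.field.equals(e.getKey())) {
                continue;
            }
            for (String token : e.getValue().toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && token.equals(c.term)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Under these assumptions, a row with f1 set to "two" and text containing "ut" satisfies both queries, while a row with f1 set to "one" satisfies only ut eu. The field name f1 comes from the query above; any other field names are up to the data being indexed.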

Download lucene-starter, a .tgz file with a pom and sources.
