At times I need an indexing tool that behaves something like an embedded database: an embedded index.  The need comes up when running filters over the data in a large visual table, or over some other visualization.

From a coding point of view, the initial attempt at filtering might look something like this:

// Naive filter: scans every entity on each request.
public List<String> filter(String userFilterText) {
        List<String> ret = new LinkedList<String>();
        for( Entity e : entities ) {
                if( e.containsText(userFilterText) ) {
                        ret.add(e.getEntityId());
                }
        }
        return ret;
}

At some point the number of rows, or data elements, grows past the point where a linear scan can answer the user's request in a timely manner.  Even collecting the data into some in-memory structure only postpones the problem; eventually that breaks down too.

The solution is to build an index, embedded into the application, which manages the filtering.  This means filtering becomes:

public List<String> filter(String userFilterText) {
        List<String> ret = index.query(userFilterText);
        return ret;
}

Although this is a bit slower for a small number of items, it is never slow enough to affect the user's perception.  And when there is a lot of data, users expect things to be slightly slower, which is acceptable.  In addition, the filter can now match by field instead of relying on something like String.contains() or even regular expressions.
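To make the idea of an index concrete, here is a minimal sketch of a field-aware inverted index in plain Java. It is not the library's implementation, and the names TinyIndex, add and query are hypothetical; it only illustrates why a lookup by field and term avoids scanning every entity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (hypothetical): maps "field:term" keys to the ids of entities containing that term. */
class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the text on non-alphanumeric characters and records the entity id under each term. */
    void add(String entityId, String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new LinkedHashSet<>())
                        .add(entityId);
            }
        }
    }

    /** Looks up one term in one field: a hash lookup rather than a scan over all entities. */
    List<String> query(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        Set<String> ids = postings.getOrDefault(key, Collections.emptySet());
        return new ArrayList<>(ids);
    }
}
```

The work of tokenizing is paid once, when data is added. Each filter request then costs roughly one map lookup, which is why the response time stays acceptable as the data grows.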

Building one of these indexes is quite simple.  You add data with Document.add(Field) and query with searcher.search(Query, Collector).  A fairly useful module can be had in less than 1000 lines of code.

The class IndexProvider.java is at the heart of the example.  Call IndexProvider.index(data) for every object you want to index, then call IndexProvider.search(String) to query the built-up index.

The entry point is Example.main(), which has one artificial requirement: it must be run twice.  The first run creates a directory named index and indexes example.csv.  The second run executes a query for ‘the’ over the content.
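One way such a two-run entry point could decide what to do is to check whether the index directory already exists. The sketch below is an assumption about the approach, not the code in the download; RunMode and choose are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: picks the action for a run based on whether the index directory exists. */
class RunMode {
    /** Returns "build" when no index directory exists yet, otherwise "query". */
    static String choose(Path indexDir) {
        return Files.isDirectory(indexDir) ? "query" : "build";
    }
}
```

Calling RunMode.choose(Path.of("index")) before anything is indexed would select the build step; once the directory exists, the same call selects the query step.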

Other, more complicated, queries are possible.  To get all of the Lorem text:

ut eu

To get a specific field:

+ut +f1:two

This allows the visualization filtering to be as rich as any query.  More importantly, the filtering can be tied to whatever the data happens to be, without any code changes.
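To show how a query string like the ones above reads, here is a small sketch that splits it into clauses: a leading + marks a term as required, and a field: prefix restricts the term to that field. This is not the library's query parser, and it handles only this subset of the syntax. Clause, QuerySketch and the default field name "content" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** One clause of a query string: the field it targets, the term, and whether it is required. */
class Clause {
    final String field;
    final String term;
    final boolean required;

    Clause(String field, String term, boolean required) {
        this.field = field;
        this.term = term;
        this.required = required;
    }
}

/** Hypothetical reader for whitespace-separated clauses such as "+ut +f1:two". */
class QuerySketch {
    static List<Clause> parse(String query, String defaultField) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields no clauses
            }
            boolean required = token.startsWith("+");
            String body = required ? token.substring(1) : token;
            int colon = body.indexOf(':');
            // No "field:" prefix means the term applies to the default field.
            String field = colon > 0 ? body.substring(0, colon) : defaultField;
            String term = colon > 0 ? body.substring(colon + 1) : body;
            clauses.add(new Clause(field, term, required));
        }
        return clauses;
    }
}
```

With a default field of "content", parsing +ut +f1:two gives two required clauses: ut in content, and two in f1. Parsing ut eu gives two optional clauses in content.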

Download lucene-starter, a .tgz file with a pom and sources.

About Tim Casey
