
Exploring Lucene’s Indexing Code: Part 1

by Mark Miller
March 4, 2009

Next: Exploring Lucene’s Indexing Code: Part 2

While I have mucked around quite a bit in the search-side code of Lucene, I am much less familiar with the hardcore indexing side (I’m talking the hardcore code; casual users need not apply, unless you’re interested). I’d like to learn more about Lucene’s indexing code, but it’s not so easy to wrap my mind around at first glance. I’m not looking to be an expert right away, but to have an overview understanding of the lower-level details involved in constructing a Lucene index. In instances like this, I find it’s best to start from a high level and work my way in, hopefully understanding the overall process, and then each of the pieces.

From our use of Lucene, we know that the indexing code must center around the IndexWriter class. To help me get a handle on what IndexWriter does, I am going to trace a few key methods from a very simple Lucene test application that simply adds one small document to an index with an IndexWriter and then closes that IndexWriter. I’m going to limit my trace to IndexWriter methods, just so the info is digestible, and only key methods to start with. We don’t want to get too bogged down in the details – methods will change over time anyway. The idea is to get somewhat of an overview of the underlying process.

Also, we should remember our user-level knowledge of a Lucene index. The index is made up of one or more segments. Each segment contains a number of Documents. A Lucene Document contains a number of Fields, each of which is just a field name, a value, and some attributes. New segments are written as we add Documents to the index, and segments are merged over time based on certain criteria. The fewer segments you have to search over, the better the performance. A search on the whole index visits each segment in turn, and an optimize will merge all segments down to one.
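To make that mental model concrete, here is a minimal sketch in plain Java. It is a toy, not Lucene's actual data structures: the ToyIndex class, its method names, and the idea of treating a document as a single string are all my own assumptions for illustration. It only shows segments accumulating as documents are flushed, a search visiting every segment, and an optimize collapsing them into one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an index as a list of segments; not Lucene's real data structures. */
class ToyIndex {
    // Each segment is modeled as a plain list of document strings.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last flush.
    private List<String> buffered = new ArrayList<>();

    /** Buffer a document in memory until the next flush. */
    void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Write the buffered documents out as one new segment. */
    void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        segments.add(buffered);
        buffered = new ArrayList<>();
    }

    /** Merge every segment into a single one, as an optimize would. */
    void optimize() {
        flush();
        if (segments.size() <= 1) {
            return;
        }
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        segments.add(merged);
    }

    /** A search visits each segment in turn and counts matching documents. */
    int countMatches(String term) {
        int hits = 0;
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(term)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("a b c");
        index.flush();
        index.addDocument("c d e");
        index.flush();
        System.out.println("segments before optimize: " + index.segmentCount());
        index.optimize();
        System.out.println("segments after optimize: " + index.segmentCount());
    }
}
```

The search result is the same before and after the optimize; only the number of segments that have to be visited changes, which is where the performance difference comes from.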

The Test Code

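// Keep the whole index in memory
Directory directory = new RAMDirectory();
// SimpleAnalyzer splits text at non-letters and lowercases it
Analyzer analyzer = new SimpleAnalyzer();
// create = true starts a new index; LIMITED caps the number of terms indexed per field
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

// One small document with a single stored, analyzed field
String doc = "a b c d e";
Document d = new Document();
d.add(new Field("contents", doc, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
writer.close();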

And Now The Trace:

After static variable initialization, the IndexWriter instance is initialized:

 *Enter:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1
 *Exit:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1

Then we add the document and close the IndexWriter.

 *Enter:void IndexWriter.addDocument(Document) :1
 *Exit:void IndexWriter.addDocument(Document) :1

We certainly want to dig deeper into IndexWriter.addDocument, because we know a lot of the interesting stuff happens there. But before that, we can get a nice idea of the close process, which begins with the call to IndexWriter.close() in the trace below.

 *Enter:void IndexWriter.close() :1
  *Enter:void IndexWriter.close(boolean) :2
   // another thread may be closing
   *Enter:boolean IndexWriter.shouldClose() :3
   *Exit:boolean IndexWriter.shouldClose() :3
   // close doc writer, flush / maybe merge / commit / close everything
   *Enter:void IndexWriter.closeInternal(boolean) :3
    *Enter:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Exit:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Enter:void IndexWriter.maybeMerge() :4
    *Exit:void IndexWriter.maybeMerge() :4
    // either abort merges or wait for merges
    *Enter:void IndexWriter.finishMerges(boolean) :4
    *Exit:void IndexWriter.finishMerges(boolean) :4
    // commit all pending adds and deletes, sync index files
    *Enter:void IndexWriter.commit(long) :4
     *Enter:void IndexWriter.startCommit(long, String) :5
     *Exit:void IndexWriter.startCommit(long, String) :5
     *Enter:void IndexWriter.finishCommit() :5
      *Enter:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
      *Exit:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
     *Exit:void IndexWriter.finishCommit() :5
    *Exit:void IndexWriter.commit(long) :4
   *Exit:void IndexWriter.closeInternal(boolean) :3
  *Exit:void IndexWriter.close(boolean) :2
 *Exit:void IndexWriter.close() :1

So we haven’t gotten far, but at the same time we are at the end of our test code. We see that IndexWriter needs a bit of init, and will flush, maybe merge, and then commit upon closing if we add some documents. We have a sense for the overall process that we are inspecting, and in Part 2, we can start to dig into what happens in the IndexWriter.addDocument call.
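The close sequence from the trace can be restated as a small plain-Java sketch. This is only a toy that mirrors the order of calls we saw, not IndexWriter's real code: the ToyWriter class is hypothetical, the method arguments from the trace are mostly dropped, the waitForMerges parameter name is my assumption, and each step just records its name instead of doing any work.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy writer that records the order of the close steps seen in the trace. */
class ToyWriter {
    // Names of the steps, in the order they ran.
    final List<String> steps = new ArrayList<>();
    private boolean closed = false;

    void close() {
        close(true);
    }

    void close(boolean waitForMerges) {
        // Only the first caller gets to run the close logic.
        if (shouldClose()) {
            closeInternal(waitForMerges);
        }
    }

    /** Returns true exactly once, so a second close is a no-op. */
    private synchronized boolean shouldClose() {
        if (closed) {
            return false;
        }
        closed = true;
        return true;
    }

    /** Flush, maybe merge, finish merges, then commit: the order from the trace. */
    private void closeInternal(boolean waitForMerges) {
        doFlush();
        maybeMerge();
        finishMerges(waitForMerges);
        commit();
    }

    private void doFlush() {
        steps.add("doFlush");
    }

    private void maybeMerge() {
        steps.add("maybeMerge");
    }

    /** Either wait for running merges or abort them. */
    private void finishMerges(boolean waitForMerges) {
        steps.add(waitForMerges ? "finishMerges(wait)" : "finishMerges(abort)");
    }

    /** The commit is split into a start phase and a finish phase. */
    private void commit() {
        steps.add("startCommit");
        steps.add("finishCommit");
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.close();
        writer.close(); // second call does nothing
        System.out.println(writer.steps);
    }
}
```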


Update:

Got a request for the AspectJ code that was used to make the traces, so here is a sample aspect. Keep in mind that this particular sample is not thread safe (it uses a shared StringBuilder), so if you are tracing multiple threads, it needs a bit of work. This will get you started though. To play around with it: install the Eclipse AspectJ module, add the AspectJ nature to the Lucene project in Eclipse (right click on the project and look at the menu for it), put the aspect somewhere with the Lucene source files (call it Trace.aj; you can also put the aspect directly in your Java file), and then build and run. There are also command line programs and Ant tasks if you prefer to go that route; it’s just a bit more setup.

// the following aspect will print entry/exit stamps for all public methods
// under org.apache.lucene and all private methods of IndexWriter (just to give an example)
public aspect Trace {

  private StringBuilder sb = new StringBuilder();

  pointcut traceableCalls() : traceMethods();

  pointcut traceMethods() :
    execution(public * org.apache.lucene..*(..) ) || execution(private * org.apache.lucene.index.IndexWriter.*(..) );

  /**
   * log method entry.
   */
  before() : traceableCalls() {
    sb.append(" ");
    int indent = sb.length();
    String enterLine = sb.toString() + "*Enter:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent;
    System.out.println(enterLine + "\n");
  }

  /**
   * log method exit.
   */
  after() : traceableCalls(){
    int indent = sb.length();
    String exitLine = sb.toString() + "*Exit:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent
        + " thread:" + Thread.currentThread().getId();
    sb.setLength(sb.length() - 1);
    System.out.println(exitLine + "\n");
  }
}
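As for the thread safety caveat, one way to do that "bit of work" is to keep the nesting depth per thread instead of in one shared StringBuilder. Here is a minimal sketch in plain Java, assuming a hypothetical ThreadSafeTrace helper (the class and its enter/exit methods are my own names, not part of AspectJ or Lucene) that the before() and after() advice could call with the join point's signature string.

```java
/** Per-thread indentation tracking for Enter/Exit trace lines. */
final class ThreadSafeTrace {
    // Each thread gets its own nesting depth, so threads cannot corrupt each other's indentation.
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

    private ThreadSafeTrace() {
    }

    /** Build the line for a method entry and increase this thread's depth. */
    static String enter(String signature) {
        int depth = DEPTH.get() + 1;
        DEPTH.set(depth);
        return indent(depth) + "*Enter:" + signature + " :" + depth;
    }

    /** Build the line for a method exit and decrease this thread's depth. */
    static String exit(String signature) {
        int depth = DEPTH.get();
        DEPTH.set(depth - 1);
        return indent(depth) + "*Exit:" + signature + " :" + depth;
    }

    /** One space of indentation per nesting level. */
    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(enter("void IndexWriter.close()"));
        System.out.println(enter("void IndexWriter.close(boolean)"));
        System.out.println(exit("void IndexWriter.close(boolean)"));
        System.out.println(exit("void IndexWriter.close()"));
    }
}
```

Lines from different threads can still interleave on System.out, so including the thread id in each line (as the exit advice above already does) remains useful when reading the output.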
