Exploring Lucene’s Indexing Code: Part 1

Next: Exploring Lucene’s Indexing Code: Part 2

While I have mucked around quite a bit in the search-side code of Lucene, I am much less familiar with the indexing side (I’m talking about the hardcore code here; casual users need not apply, unless you’re interested). I’d like to learn more about Lucene’s indexing code, but it’s not so easy to wrap my mind around at first glance. I’m not looking to become an expert right away, just to get an overview of the lower-level details involved in constructing a Lucene index. In instances like this, I find it’s best to start from a high level and work my way in, hopefully understanding the overall process first, and then each of the pieces.

From our use of Lucene, we know that the indexing code must center around the IndexWriter class. To help me get a handle on what IndexWriter does, I am going to trace a few key methods from a very simple Lucene test application that simply adds one small document to an index with an IndexWriter and then closes that IndexWriter. I’m going to limit my trace to IndexWriter methods, just so the info is digestible, and only key methods to start with. We don’t want to get too bogged down in the details – methods will change over time anyway. The idea is to get somewhat of an overview of the underlying process.

Also, we should remember our user knowledge of a Lucene index. The index is made up of one or more segments. Each segment contains a number of Documents. A Lucene Document contains a number of Fields, each of which is just a field name, a value, and some attributes. New segments are written as we add Documents to the index, and segments are merged over time based on certain criteria. The fewer segments you have to search over, the better the performance. Searches on the whole index roll over each segment, and an optimize will merge all segments down to one.
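To make that picture concrete, here is a toy model in plain Java. It is only a sketch of the user-level view described above: ToyIndex and its methods are names I made up, the "documents" are plain strings, the matching is a naive substring check, and none of this is how Lucene actually stores or searches an index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a hypothetical class, not Lucene's real data structures.
class ToyIndex {
    // Each segment is modeled as a plain list of document strings.
    private final List<List<String>> segments = new ArrayList<>();

    // Adding a batch of documents writes a new segment.
    void addSegment(List<String> docs) {
        segments.add(new ArrayList<>(docs));
    }

    int segmentCount() {
        return segments.size();
    }

    // A search over the whole index rolls over every segment in turn.
    // Matching here is a naive substring check, purely for illustration.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    // "Optimize" merges all segments down to a single one.
    void optimize() {
        if (segments.size() <= 1) {
            return; // nothing to merge
        }
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        segments.add(merged);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addSegment(Arrays.asList("a b c", "d e"));
        index.addSegment(Arrays.asList("c d"));
        System.out.println(index.segmentCount() + " segments, hits: " + index.search("c"));
        index.optimize();
        System.out.println(index.segmentCount() + " segment, hits: " + index.search("c"));
    }
}
```

The search results are the same before and after optimize; the only thing that changes is how many segments the search has to walk.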

The Test Code

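import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
Analyzer analyzer = new SimpleAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

String doc = "a b c d e";
Document d = new Document();
d.add(new Field("contents", doc, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
writer.close();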

And Now The Trace:

After static variable initialization, the IndexWriter instance is initialized (in each trace line, the trailing number is the call nesting depth):

 *Enter:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1
 *Exit:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1

Then we add the document and close the IndexWriter.

 *Enter:void IndexWriter.addDocument(Document) :1
 *Exit:void IndexWriter.addDocument(Document) :1

We certainly want to dig deeper into IndexWriter.addDocument, because we know a lot of the interesting stuff happens there. But before that, we can get a nice idea of the close process, which is traced next.

 *Enter:void IndexWriter.close() :1
  *Enter:void IndexWriter.close(boolean) :2
   // another thread may be closing
   *Enter:boolean IndexWriter.shouldClose() :3
   *Exit:boolean IndexWriter.shouldClose() :3
   // close doc writer, flush / maybe merge / commit / close everything
   *Enter:void IndexWriter.closeInternal(boolean) :3
    *Enter:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Exit:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Enter:void IndexWriter.maybeMerge() :4
    *Exit:void IndexWriter.maybeMerge() :4
    // either abort merges or wait for merges
    *Enter:void IndexWriter.finishMerges(boolean) :4
    *Exit:void IndexWriter.finishMerges(boolean) :4
    // commit all pending adds and deletes, sync index files
    *Enter:void IndexWriter.commit(long) :4
     *Enter:void IndexWriter.startCommit(long, String) :5
     *Exit:void IndexWriter.startCommit(long, String) :5
     *Enter:void IndexWriter.finishCommit() :5
      *Enter:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
      *Exit:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
     *Exit:void IndexWriter.finishCommit() :5
    *Exit:void IndexWriter.commit(long) :4
   *Exit:void IndexWriter.closeInternal(boolean) :3
  *Exit:void IndexWriter.close(boolean) :2
 *Exit:void IndexWriter.close() :1

So we haven’t gotten far, but at the same time we are at the end of our test code. We see that IndexWriter needs a bit of init, and will flush, maybe merge, and then commit upon closing if we add some documents. We have a sense for the overall process that we are inspecting, and in Part 2, we can start to dig into what happens in the IndexWriter.addDocument call.
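The close sequence from the trace can be written down as a toy class, just to make the call order easier to see. This is a sketch and not Lucene’s code: ToyWriter is a made-up name, the method names and their order are copied from the trace, the parameters are dropped, and the bodies are placeholders that only record that they were called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only. The method names and their
// order come from the trace; the bodies just record the call.
class ToyWriter {
    final List<String> steps = new ArrayList<>();
    private boolean closing = false;

    void close() {
        // Another caller may already be closing, so check first.
        if (shouldClose()) {
            closeInternal();
        }
    }

    private boolean shouldClose() {
        steps.add("shouldClose");
        if (closing) {
            return false;
        }
        closing = true;
        return true;
    }

    private void closeInternal() {
        steps.add("closeInternal");
        doFlush();
        maybeMerge();
        finishMerges(); // either abort merges or wait for merges
        commit();       // commit pending adds and deletes, sync index files
    }

    private void doFlush() { steps.add("doFlush"); }

    private void maybeMerge() { steps.add("maybeMerge"); }

    private void finishMerges() { steps.add("finishMerges"); }

    private void commit() {
        steps.add("commit");
        steps.add("startCommit");
        steps.add("finishCommit");
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.close();
        System.out.println(writer.steps);
    }
}
```

Calling close() a second time only records another shouldClose, which mirrors the "another thread may be closing" check in the trace.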


Update:

I got a request for the AspectJ code that was used to make the traces, so here is a sample aspect. Keep in mind that this particular sample is not thread safe (it uses a shared StringBuilder), so if you are tracing multiple threads, it needs a bit of work. This will get you started, though. To try it, install the Eclipse AspectJ module, add the AspectJ nature to the Lucene project in Eclipse (right-click on the project and look at the menu for it), and put the aspect somewhere with the Lucene source files (call it Trace.aj; you can also put the aspect directly in your Java file). Then build and run, and you should be all set to play around. There are also command-line programs and Ant tasks if you prefer to go that route; it’s just a bit more setup.

// the following aspect will print entry/exit stamps for all public methods
// that begin with org.apache.lucene and all private methods of IndexWriter (just to give an example)
public aspect Trace {

  private StringBuilder sb = new StringBuilder();

  pointcut traceableCalls() : traceMethods();

  pointcut traceMethods() :
    execution(public * org.apache.lucene..*(..) ) || execution(private * org.apache.lucene.index.IndexWriter.*(..) );

  /**
   * log method entry.
   */
  before() : traceableCalls() {
    sb.append(" ");
    int indent = sb.length();
    String enterLine = sb.toString() + "*Enter:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent;
    System.out.println(enterLine + "\n");
  }

  /**
   * log method exit.
   */
  after() : traceableCalls(){
    int indent = sb.length();
    String exitLine = sb.toString() + "*Exit:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent
        + " thread:" + Thread.currentThread().getId();
    sb.setLength(sb.length() - 1);
    System.out.println(exitLine + "\n");
  }
}
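
If you do need to trace multiple threads, one possible fix for the shared StringBuilder is to keep a separate depth counter per thread. Here is a minimal sketch of that idea in plain Java rather than AspectJ, so it can be run on its own. DepthTracer, enter, and exit are hypothetical names of mine; in the aspect, the before() and after() advice would call something like them and print the result.

```java
// Hypothetical helper, not part of Lucene or AspectJ: it keeps one call
// depth per thread so concurrent traces cannot corrupt each other's indent.
class DepthTracer {
    // Each thread sees its own counter, starting at zero.
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

    // Call on method entry (roughly what the before() advice does).
    static String enter(String signature) {
        int depth = DEPTH.get() + 1;
        DEPTH.set(depth);
        return indent(depth) + "*Enter:" + signature + " :" + depth;
    }

    // Call on method exit (roughly what the after() advice does).
    static String exit(String signature) {
        int depth = DEPTH.get();
        DEPTH.set(depth - 1);
        return indent(depth) + "*Exit:" + signature + " :" + depth;
    }

    // One space of indentation per level of nesting.
    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(enter("void IndexWriter.close()"));
        System.out.println(enter("void IndexWriter.close(boolean)"));
        System.out.println(exit("void IndexWriter.close(boolean)"));
        System.out.println(exit("void IndexWriter.close()"));
    }
}
```

Because the counter lives in a ThreadLocal, each thread's indentation is independent. Lines from different threads can still interleave in the output, so adding the thread id to each line, as the exit advice above already does, remains useful.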
