With Apache Lucene Eurocon in Barcelona approaching in October, and ApacheCon NA 2011 in Vancouver in November, it’s no surprise that the open source movers and shakers are out in force talking up some pretty cool new things in Lucene/Solr. This past week’s OStatic features a guest post entitled “Under the Hood in Apache Lucene 4.0” by Simon Willnauer, recently installed chair of the Lucene PMC, on the latest technical breakthroughs coming in 4.0. Featured innovations discussed there include:

  • The full switch to using bytes (UTF-8) in place of text strings for indexing within the search engine library, which means the ‘term dictionary’ can load up to 30 times faster using one-tenth the memory.
  • ‘Flexible indexing’: the data structure for the index format can now be chosen and loaded as a pluggable codec, so codecs can be optimized for individual datasets or even individual fields.
  • ‘Concurrent flushing’: when an individual thread’s buffer fills up, it can flush its memory to persistent storage independently while other threads continue working.
  • Improvements to ‘fuzzy matching’: a sophisticated Levenshtein Automaton computes the set of terms within a given edit distance of the search term, and only those terms are then matched against the index. No mere ‘simple’ change: it yielded a “mere 20,000% increase in fuzzy matching performance”.
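
To get a feel for the fuzzy matching idea, here is a minimal Python sketch. It is not Lucene's implementation: Lucene builds a Levenshtein automaton and intersects it with the term dictionary, whereas this toy version swaps in a simpler stand-in, walking a trie of terms while carrying one row of the classic edit-distance table and abandoning any branch that can no longer come within the allowed number of edits. The names `build_trie`, `fuzzy_terms`, and the `$end` marker are made up for illustration.

```python
def build_trie(terms):
    """Build a nested-dict trie; the multi-character key "$end" marks a complete term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$end"] = term
    return root


def fuzzy_terms(trie, query, max_edits):
    """Return sorted (term, distance) pairs for terms within max_edits of query."""
    matches = []
    # Distance from the empty prefix to each prefix of the query.
    first_row = list(range(len(query) + 1))

    def walk(node, row):
        # row[-1] is the edit distance between the current trie prefix and the full query.
        if "$end" in node and row[-1] <= max_edits:
            matches.append((node["$end"], row[-1]))
        for ch, child in node.items():
            if ch == "$end":
                continue
            # Extend the edit-distance table by one row for the new character.
            next_row = [row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                next_row.append(min(next_row[i - 1] + 1,  # insertion
                                    row[i] + 1,           # deletion
                                    row[i - 1] + cost))   # match or substitution
            # Prune: if every cell exceeds the budget, no longer term can recover.
            if min(next_row) <= max_edits:
                walk(child, next_row)

    walk(trie, first_row)
    return sorted(matches)


# Toy term dictionary, assumed purely for illustration.
trie = build_trie(["lucene", "lucent", "lucid", "solr", "luke", "license"])
print(fuzzy_terms(trie, "lucene", 1))
```

The point of the pruning step is the same one the bullet makes: most of the term dictionary is never examined, because whole branches are ruled out as soon as they drift too far from the search term.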

No mean feat: read the details of putting the theoretical work on the Levenshtein Automaton into practice, with pivotal efforts contributed by Lucid’s own Mark Miller and Robert Muir, written up earlier this year by veteran committer Mike McCandless.

Simon Willnauer, Mark Miller, and Robert Muir are among a dozen or so Lucene/Solr committers speaking at Apache Lucene Eurocon, October 19-20 in Barcelona. And don’t miss the most complete curriculum of Lucene, Solr, and Big Data/Hadoop search training in one place, with a wide range of workshop intensives. Save a fistful of Euros (up to €200!) and register today: early bird pricing is good through September 6th.