Short Version

The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like SharePoint, Documentum, etc., and will make it easy to hook into Lucene, Solr, Nutch, Mahout, and Tika while, of course, remaining agnostic of the final destination of the data.  See the Connectors website and the original proposal for more info.  Help wanted!

Long Version

Background

A while back, MetaCarta, a spatial search company, approached us about open sourcing their internally developed Connector Framework at the Apache Software Foundation.  After several discussions and a whole bunch of legwork getting a proposal together, LCF is now officially launched in the Apache Incubator!  We’ve already got a great roster of committers lined up and are working to incorporate the software grant from MetaCarta, from which we can build out a first release, so stay tuned!  Lucid Imagination, of course, is a big supporter of this project and we look forward to its success!

What is a Connector Framework?

To quote the proposal:

[The Lucene] Connector Framework is an extendible [sic] incremental crawler, which uses a database to manage configuration and crawl history, and provides reasonably high performance in accessing content in multiple repositories for the main purpose of search engine indexing. Connector Framework also establishes a repository-specific security model which can be used to limit search user access to repository content based on a user’s identity. Connector Framework also includes existing connectors and authorities for:

  • File system
  • Windows shares
  • JDBC-supported databases
  • RSS feeds
  • General websites
  • LiveLink [from OpenText]
  • Documentum [from EMC]
  • SharePoint [from Microsoft]
  • Meridio [from Meridio]
  • Memex [from Memex]
  • FileNet [from IBM]

There are two pieces in particular to highlight in the quote.  First of all, it’s an extensible framework, meaning new connectors can be added without application developers having to write “one-off” code just for that connector.  If you’ve lived that pain, you know firsthand what I mean.  In fact, I’ve already heard from others who are thinking of contributing their connectors for other data stores as well!  Second, the framework accounts for repository-specific security.  In corporate environments, this is vital to making sure that the right people, and only the right people, have access to the right information at the right time.
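
To make those two points a bit more concrete, here is a rough sketch of the general shape such a framework might take.  To be clear, this is not LCF’s API: every name in it (Connector, Document, IncrementalCrawler, allowed_tokens, and so on) is made up for illustration, it is written in Python only to keep it short, and plain dictionaries stand in for both the database that would hold the crawl history and the search index itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Document:
    """Hypothetical unit of content handed from a connector to the crawler."""
    doc_id: str
    version: str               # assumed to change whenever the content changes
    content: str
    allowed_tokens: frozenset  # hypothetical repository-specific access tokens


class Connector(ABC):
    """Hypothetical plug-in point: one subclass per repository type."""

    @abstractmethod
    def list_documents(self) -> Iterable[Document]:
        ...


class InMemoryConnector(Connector):
    """Toy connector standing in for a real repository."""

    def __init__(self, docs: List[Document]) -> None:
        self.docs = docs

    def list_documents(self) -> Iterable[Document]:
        return list(self.docs)


class IncrementalCrawler:
    """Indexes only documents whose version differs from the crawl history."""

    def __init__(self) -> None:
        self.history: Dict[str, str] = {}     # doc_id -> last indexed version
        self.index: Dict[str, Document] = {}  # stand-in for a search index

    def crawl(self, connector: Connector) -> List[str]:
        """Index new or changed documents and return their ids."""
        changed = []
        for doc in connector.list_documents():
            if self.history.get(doc.doc_id) != doc.version:
                self.index[doc.doc_id] = doc
                self.history[doc.doc_id] = doc.version
                changed.append(doc.doc_id)
        return changed

    def search(self, term: str, user_tokens: Set[str]) -> List[str]:
        """Return matching ids the user holds at least one token for."""
        return sorted(
            d.doc_id for d in self.index.values()
            if term in d.content and d.allowed_tokens & user_tokens
        )
```

The thing to notice is that supporting a new repository would mean writing one more Connector subclass, while the crawling and the security filtering stay put.  A second crawl over unchanged content indexes nothing, and a search only returns documents whose tokens overlap with the ones the user holds.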

Why is this important?

Many, many search engines, not to mention many other applications, have either rolled their own connectors or bought a company that provides them.  Connectors, in some situations, are the cost of entry into certain markets, but are rarely the feature that seals the deal.  By making these open source, we can all share the cost of maintaining them while increasing the quality of the software well beyond what any one company can achieve.  Beyond that, we hope the repository companies will also step up and contribute (some are already quite open), as making it easier to access these repositories will no doubt lead to more applications, which of course should mean more sales for said companies.

How can you contribute?

For starters, subscribe to the mailing lists.  Then check out the How To Contribute page on the Wiki.  Beyond that, chip in with your connector use cases on the mailing lists and be a part of the community.

What’s next?

First off, the community will have to process the software grant from MetaCarta and then commit the code to LCF’s Subversion repository.  From there, we’ll do what any Apache project does and look to build out not only the code, but also the community, all on the path to graduating from the Incubator and taking our place as a full-fledged Lucene subproject.  Keep your eyes here and on the mailing lists and websites for more information in the future!