I’ve long had a passion for improving findability within libraries. The richness of the cultural artifacts that one can find with a bit of foraging staggers the imagination. I had the pleasure of working with the Applied Research in Patacriticism group at the University of Virginia. While building the first version of Collex (collect/exhibit) for NINES, I was approached by Bess Sadler, who asked about the viability of using Solr for searching and faceting on library data. The library world was just seeing scalable faceting take the stage with NCSU’s Endeca installation, but the price tag put it out of reach of most other institutions. With Bess’s prodding, I learned a bit about MARC, created some Ruby scripts, invented Solr Flare, and was able to pretty much match what NCSU was doing with only a handful of evenings of hacking. I presented this work at the 2007 code4lib conference, in an all-day preconference class on Solr and a keynote.

A lot has happened since, and in large part because of, this initial work. Solr Flare spun off into Blacklight, a Ruby on Rails front end used by Stanford’s SearchWorks effort, UVa’s “VIRGO Beta”, and a number of other institutions. VUFind, a PHP-based front end, is also popular, and there are several other OPACs (online public access catalog, fancy name for “website with a search box”) that reside on top of Solr.
VUFind and Blacklight share a common indexer, SolrMarc. SolrMarc provides a flexible, extensible tool for mapping the complex MARC format, the standard in the library world, into Solr documents.
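To give a feel for the kind of mapping an indexer like SolrMarc performs, here’s a minimal Python sketch. The mapping spec and field names below are invented for illustration; they are not SolrMarc’s actual properties-file format, just the general idea of pulling MARC subfields into a flat Solr document.

```python
# Illustrative sketch of MARC-to-Solr field mapping (not SolrMarc's
# real configuration syntax). Map a Solr field name to a MARC tag
# plus the subfield codes to join.
MAPPING = {
    "id":     ("001", None),  # control field: take the whole value
    "title":  ("245", "ab"),  # join subfields $a and $b
    "author": ("100", "a"),
}

def to_solr_doc(fields):
    """fields: list of (tag, value) for control fields, or
    (tag, {subfield_code: value, ...}) for data fields."""
    doc = {}
    for solr_field, (tag, codes) in MAPPING.items():
        for ftag, data in fields:
            if ftag != tag:
                continue
            if codes is None:
                doc[solr_field] = data
            else:
                parts = [data[c] for c in codes if c in data]
                if parts:
                    doc[solr_field] = " ".join(parts)
            break  # first matching field wins in this sketch
    return doc

record = [
    ("001", "ocm12345"),
    ("100", {"a": "Melville, Herman."}),
    ("245", {"a": "Moby Dick :", "b": "or, the whale /"}),
]
print(to_solr_doc(record))
# -> {'id': 'ocm12345', 'title': 'Moby Dick : or, the whale /',
#     'author': 'Melville, Herman.'}
```

The real tool handles far more: repeated fields, translation maps, custom indexing methods, and so on; this only shows the shape of the problem.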
Recently it was reported that SolrMarc indexing performance needed help (Stanford reported 12 hours to index 6M records). I couldn’t help but want to help. So I grabbed the latest SolrMarc (version 2.1, in development) and a publicly available MARC file containing 5.7M records, and gave it a try. First I ran SolrMarc against the file, and killed the job after 9 hours. Rather than looking too deeply into the code to see what might be wrong, I decided to get a baseline on how fast indexing MARC could be, using the simplest thing that could possibly work. I created a custom MarcEntityProcessor, a hook into Solr’s DataImportHandler. Using the MARC4J library directly, and indexing only the id and a toString() of the entire MARC record, I was able to index the same dataset in 55 minutes. It went from 22 docs/s to 1,745 docs/s! To be fair, my implementation didn’t do the fancy mappings needed in the real world, and there is still a bit of work to do in order to fully flesh out a DataImportHandler refactoring, but hopefully this new approach will be embraced by the library Solr community.
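For context on what any MARC indexer has to chew through: a binary MARC (ISO 2709) record is a 24-byte leader, a directory of 12-byte entries, and terminator-delimited field data. Here’s a minimal pure-Python sketch of that layout, pulling out the 001 control number (the id my quick test indexed). This is illustrative only; it is not MARC4J’s API, and a real indexer should use a proper MARC library.

```python
# Minimal sketch of the binary MARC (ISO 2709) record layout that
# libraries like MARC4J parse. Illustration only, not production code.

def parse_marc(record: bytes):
    """Return (leader, [(tag, data_bytes), ...]) for one MARC record."""
    leader = record[:24].decode("ascii")
    base = int(leader[12:17])           # base address of the data area
    directory = record[24:base - 1]     # 12-byte entries, minus terminator
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3].decode("ascii")
        flen = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        # field data runs flen bytes from base+start, ending in a
        # field terminator (0x1E), which we strip:
        data = record[base + start:base + start + flen - 1]
        fields.append((tag, data))
    return leader, fields

def control_number(record: bytes):
    """The 001 field: the record id a minimal index needs."""
    _, fields = parse_marc(record)
    for tag, data in fields:
        if tag == "001":
            return data.decode("utf-8")
    return None

# Hand-built 44-byte record with a single 001 field holding "12345":
sample = (b"00044nam a2200037   4500"   # leader: length 44, base address 37
          b"001" b"0006" b"00000" b"\x1e"  # directory: tag 001, len 6, start 0
          b"12345" b"\x1e" b"\x1d")     # field data, field + record terminators
print(control_number(sample))  # -> 12345
```

Even this toy version makes the performance question concrete: reading records is cheap sequential byte-slicing, so the bottleneck in a slow indexer is almost certainly in the mapping or the Solr submission path, not in MARC parsing itself.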
This was a long-winded way of saying… I’m devoting a chunk of my next couple of months to the needs of the Solr-using library community. My favorite conference of all time, the code4lib conference, is coming up, and I’m getting ramped up. Naomi Dushay (of Stanford) and I are leading a Solr Blackbelt preconference, where we’ll be going through heavy topics like query parsing and improving relevancy.
Stay tuned for lots more light being shed on Solr in the Library!