Case Study:
Europeana was chartered by the European Commission in 2007. Its overarching goal is to create an online environment that builds on Europe’s rich history, combining multicultural and multilingual environments with technological advances and new business models. Ultimately, this will bring together various cultural heritage domains – museums, libraries, archives, and audio-visual archives – from across Europe and create a single, unified portal to showcase their collections.

Highlights

  • Solr helps users find the cultural treasure they are looking for, searching through millions of objects across thousands of years, in 26 European languages.
  • Solr provides sophisticated browsing and searching capabilities to find paintings, photographs, objects, books, newspapers, archival records, films and sound that have been digitized by Europe’s heritage organizations.
  • Open source technology enables contribution by hundreds of cultural institutions, with hundreds more in the queue.

Introduction

The European Union is an established political and economic fact, made remarkable by its tremendous cultural and linguistic diversity. This diversity is rooted in centuries of history and recorded in millions of documents and artifacts, in dozens of languages, scattered across hundreds of institutions. With world-famous institutions – including the Rijksmuseum in Amsterdam, the British Library in London, and the Louvre in Paris – now taking their collections digital, you could browse your way through Europe’s cultural heritage without a Eurail pass, a GPS device or a guidebook, one museum at a time.

The European Union has established Europeana as its digital library, museum and archive. This collection of collections is a single, unifying web portal connecting users with millions of digital objects, including film material, photos, paintings, sounds, maps, manuscripts, books, newspapers, and archival papers. Bringing together digitized collections and information from libraries, museums, universities, and other national institutions, Europeana provides unparalleled online access to Europe’s cultural and scientific heritage. In pursuit of open, powerful search, the Europeana development team chose the Solr open source search platform, using its capabilities to help users in any of the member states – and around the world – traverse these vast collections, reaching across time and distance through the Internet.

Historic Challenges

Enabling users from many different cultures, using many different languages, to find the document, audio, or image resource they are looking for is a challenging requirement. In addition, the search solution had to be sustainable and extensible, encompassing not only the millions of digital objects that are part of the current project, but also the significantly larger number of objects, users, and contributors yet to come.

A core team based in the national library of the Netherlands, the Koninklijke Bibliotheek, runs the project. After a year of planning and prototyping components, a team of three developers had a matter of weeks to build and deploy the final prototype portal. They created an organizing pipeline to convert all the metadata (in 26 languages) from dozens of constituent institutions into a custom, unifying format that they could manage and control. The project uses Solr and its copyField directive to separate different languages into different indexes. The team then configured the schema.xml file to create custom processing pipelines for each field type, by language.
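The per-language separation described above can be illustrated with a minimal Python sketch. In Solr itself this kind of routing is declared in schema.xml; the sketch below only mimics the idea on the client side. The field-naming scheme ("title_fr", "description_nl"), the record layout, and the language subset are assumptions made for illustration, not Europeana's actual schema.

```python
# Hypothetical naming scheme: "<field>_<lang>", e.g. "title_fr".
# The language set is an illustrative subset, not the project's full list of 26.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "nl"}


def to_language_fields(record, default_lang="en"):
    """Return a flat document whose text fields carry a language suffix."""
    lang = record.get("language", default_lang)
    if lang not in SUPPORTED_LANGUAGES:
        lang = default_lang  # fall back rather than create an unknown field

    doc = {"id": record["id"], "language": lang}
    for field in ("title", "description"):
        value = record.get(field)
        if value:
            # Each language gets its own field, so each can have its own
            # analysis pipeline (stemming, stop words) on the search side.
            doc[f"{field}_{lang}"] = value
    return doc
```

Keeping one field per language means a French title is never analyzed with Dutch stemming rules, which is the practical reason for separating languages in the first place.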

The prototype showed that harvesting data in all the different languages and custom metadata formats required a normalizing pipeline to convert everything into a general format. Data is first aggregated into a PostgreSQL database, in XML that conforms to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), an application-independent interoperability framework based on metadata harvesting. The PostgreSQL database is used to associate the artifact data with user tags, proposed search terms, and other related items. A database-to-Solr indexer transforms the internal index format into Solr format and prepares it for indexing. All searches are performed using Solr.
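As a rough illustration of the normalizing step, the following sketch parses one OAI-PMH record carrying Dublin Core metadata and flattens it into a dictionary of the kind an indexer might send to Solr. The namespaces are the standard OAI-PMH and Dublin Core ones; the sample record, the output field names, and the function name are hypothetical, and the real pipeline (which passes through PostgreSQL) is not shown in the source.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:item-1</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example map</dc:title>
      <dc:language>nl</dc:language>
      <dc:type>image</dc:type>
    </oai_dc:dc>
  </metadata>
</record>"""


def record_to_solr_doc(xml_text):
    """Flatten one OAI-PMH record into a dict with hypothetical field names."""
    root = ET.fromstring(xml_text)
    dc = root.find("oai:metadata/oai_dc:dc", NS)
    if dc is None:
        raise ValueError("record has no oai_dc metadata block")

    def text(tag):
        element = dc.find(f"dc:{tag}", NS)
        return element.text if element is not None else None

    return {
        "id": root.findtext("oai:header/oai:identifier", namespaces=NS),
        "title": text("title"),
        "language": text("language"),
        "type": text("type"),
    }
```

Calling record_to_solr_doc(SAMPLE_RECORD) yields a flat dictionary with the identifier, title, language, and type of the sample item.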

The project was built with a team of three senior developers. Project Lead Sjoerd Siebinga, a historical and computer linguist, specializes in cross-lingual search. A few years ago, he used Solr in a prototype that aligned various thesauri and made them searchable. Three years ago, he joined The European Library project, which developed Europeana as a separate service. Using Solr and other open source tools was a natural outgrowth of his prior experience.

Sjoerd explained, “Because we had to do faceting-related item searches, preferably with auto-completion, I immediately thought of Solr. I knew that on my own development machine with Solr, I could scale it up and do sharding, so I thought it was the best way to meet the demands of scalability and redundancy. Also, licensing costs for a packaged solution would have been too high.”
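The faceting and sharding that Sjoerd mentions are both driven by request parameters in Solr. The sketch below builds such a request URL using the standard Solr parameters q, rows, wt, facet, facet.field, and shards. The host names, core path, and facet field names are placeholders, not Europeana's configuration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text, facet_fields, shards=None, rows=10):
    """Build a Solr /select URL with faceting and optional distributed search."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    if shards:
        # Solr expects a comma-separated list of shard addresses.
        params.append(("shards", ",".join(shards)))
    return f"{base_url}/select?{urlencode(params)}"


# Example with placeholder hosts and hypothetical field names.
url = build_search_url(
    "http://localhost:8983/solr",
    "rembrandt",
    facet_fields=["language", "type"],
    shards=["host1:8983/solr", "host2:8983/solr"],
)
```

Because sharding is expressed as a request parameter, the same query code works against a single development machine (omit shards) and a distributed deployment.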

To prove its scalability before beginning the project, the technical team benchmarked Solr. In a load-balanced environment, Solr was able to handle 8,000 concurrent users before the test machines were unable to handle the load. Sjoerd observed, “I loaded about 10 million items into Solr and saw that it was still pretty fast after putting significant loads on it, so we decided to proceed with it.”

Sjoerd found Solr very straightforward to use. “With Solr, you can do so many things without writing a lick of code. I hadn’t realized how easy it is to extend our custom request, response writer, and update handler. Just move it all to Solr and let it do the heavy lifting. I tell other developers, ‘See how much you can get for free with Solr! There are just two configuration files, nothing else. Read them and you’re good to go,'” he said.

The Europeana site today

The current Europeana site is a prototype that provides links to four million digital items, categorized as:

  • Images: Paintings, drawings, maps, photos, and pictures of museum objects
  • Texts: Books, newspapers, letters, diaries, and archival papers
  • Sounds: Music and spoken word from cylinders, tapes, discs, and radio broadcasts
  • Videos: Films, newsreels, and television broadcasts

The initial prototype site, www.europeana.eu, is now up and running with four million items in the archive, handling 5,000 concurrent users. It receives anything from 200,000 to a million hits every day.

Future plans

The project’s architects aim to deploy a production version with 10 million items by June 2010, handling 20,000 concurrent connections. However, it may well hold 30 million items by then, judging by the number of items in the queue already, and the increasing number of aggregators that Europeana is working with. Aggregators harvest content from large numbers of providers, homogenize the metadata and channel it directly into Europeana – the prime example being culture.fr, which aggregates content from 480 French museums and archives and delivers it to Europeana.

Most of the current projects that are contributing content to Europeana are domain aggregators, including the European Film Gateway, the Archives Portal for Europe, Europeana Connect for sound material and Athena for museum collections.

Looking ahead, Sjoerd said, “We’re constantly reviewing what is the easiest way for people to add data and manage all the mapping themselves, so we don’t have to do it. We are testing the idea of creating an online wizard so institutions can upload data, analyze it, go into a sandbox and test the search. If they are happy with it, they can hit ‘submit.’ We will then review it and move it into production.”

After the launch of the Europeana prototype, the project’s final task is to recommend a business model that will ensure the Web site’s sustainability. Moving from the project’s current XML-based format, the next version will rely on a Resource Description Framework (RDF)-based internal schema built on the Simple Knowledge Organization System (SKOS), a W3C standard.

This will align different thesauri and ontologies, enabling contextual groupings with enriched data and including actual digital objects, like movies and images. The team is also working to expand the discoverability of all of the elements of a cultural artifact. Using new metadata standards, the record for a single artifact will also be able to carry key parameters such as location in time. For example, if a query asks ‘What is the location of the Rosetta Stone?’, there are several answers: the stone was found in Egypt by the French, relocated to France, then to England. All of these answers are part of a complete history of the Rosetta Stone. In addition, free-range queries and mobile asset management with geographic search will allow people to access the site via a mobile device such as a cell phone, input their location, and find out what historical resources are available nearby.
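To make the thesaurus alignment concrete, here is a small sketch that represents SKOS statements as plain Python tuples and follows skos:exactMatch links to collect equivalent concepts across vocabularies. The SKOS namespace and the prefLabel and exactMatch properties are part of the W3C standard; the concept identifiers, labels, and function names are invented for illustration, and a real system would more likely use an RDF library and store.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
PREF_LABEL = SKOS + "prefLabel"
EXACT_MATCH = SKOS + "exactMatch"

# (subject, predicate, object) triples. Concept identifiers are hypothetical.
TRIPLES = [
    ("ex:a/painting", PREF_LABEL, ("painting", "en")),
    ("ex:b/schilderij", PREF_LABEL, ("schilderij", "nl")),
    ("ex:c/peinture", PREF_LABEL, ("peinture", "fr")),
    ("ex:a/painting", EXACT_MATCH, "ex:b/schilderij"),
    ("ex:b/schilderij", EXACT_MATCH, "ex:c/peinture"),
]


def equivalent_concepts(start, triples):
    """Follow exactMatch links in both directions, starting from `start`."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if predicate != EXACT_MATCH:
                continue
            if subject == current:
                neighbor = obj
            elif obj == current:
                neighbor = subject
            else:
                continue
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen


def labels_for(concept, triples):
    """Return the sorted (label, language) pairs of all equivalent concepts."""
    concepts = equivalent_concepts(concept, triples)
    return sorted(
        obj for subject, predicate, obj in triples
        if predicate == PREF_LABEL and subject in concepts
    )
```

With the sample data, a search that starts from the French concept also reaches the English and Dutch labels, which is the kind of cross-lingual grouping the aligned thesauri are meant to enable.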

Core requirements of the project are sustainability and the use of open source wherever possible. In September 2009, the project itself will be open-sourced, to enable smaller institutions to implement it.

Hardware

In this high-profile deployment, everything is redundant.

  • 1 master and 2 slave machines running Solr, each with 8 cores and 16GB RAM
  • 2 machines running ImageMagick to generate thumbnails, each with 8 cores and 16GB RAM
  • 2 database machines: 1 virtual, 1 hardware, with 32GB RAM
  • 4 portal servers, completely stateless, with round-robin load balancing
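Round-robin balancing across stateless servers, as in the last item above, simply hands each new request to the next server in a fixed rotation. The sketch below shows the idea; the class name and server names are hypothetical, and the actual deployment presumably uses a dedicated load balancer rather than application code.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in a fixed, repeating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def next_server(self):
        # Because the portal servers hold no session state, any server
        # can handle any request, so a plain rotation is sufficient.
        return next(self._rotation)


# Placeholder names for the four portal servers.
balancer = RoundRobinBalancer(["portal-1", "portal-2", "portal-3", "portal-4"])
picks = [balancer.next_server() for _ in range(6)]
```

Statelessness is what makes this scheme safe: a user's consecutive requests can land on different servers without losing anything.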

Software

Everything runs on Linux and the Apache Tomcat servlet container, except Solr, which runs in a Jetty servlet container.

  • Red Hat Enterprise Linux
  • Apache Tomcat
  • PostgreSQL database
  • Spring and Hibernate frameworks, used to write the applications