With or without the influence of social media, enterprise data is taking on some of the same characteristics as social media data. It is a mix of structured, semi-structured and unstructured content; it arrives at ever-accelerating rates, in near real time; demand is growing for access in real time; and it continues to span multiple, incompatible repositories. A common example: sales or service reps correspond with customers and each other by email while also filling in fields on a fixed form in CRM apps.
There is more data, in more variety, and more of it arriving in real time. More often than not, search technology provides the only way to put this data to good use, whether inside or outside the enterprise. As fast as companies, communities and consumers are producing data about each other and everything else, they need faster, more versatile search capabilities to find the information they want.
The good news for the enterprise is that many of the tools and techniques needed to tackle enterprise data search problems are up and running: they power Internet-based consumer social media today. An overwhelming proportion of social media sites are built on Apache Lucene/Solr open source technology. Lucene is a leading open source search programming library; Solr is the search server built on Lucene. Now, the availability of commercial-grade solutions makes Lucene/Solr ready for enterprise search applications, much as Red Hat made Linux ready for the enterprise.
Let’s begin by contrasting social data with traditional database-driven search and retrieval, still in use at most companies. These systems begin with building a data model. Users choose exclusively from a well-controlled, well-normalized set of attribute fields. Getting relevant results requires a fair amount of user expertise in both the underlying data and the query syntax. Data is typically updated in batch mode. Tightly coupled in this way, database search matches the data well enough, but with little flexibility.
Along comes the Internet. Now, the data model is more flat than hierarchical or structured. Anyone can create data. The expertise required for running search queries? Kindergartners put keywords in Google, with pretty good results most of the time.
But Google search is far from the end state, because competitive advantage in search results means going beyond “pretty good.” Search today must contend with “semi-normalized” data; it must match free text with attributes about documents (time, date, origin), users (profile, demographic, authority), location (address, distance) and many others. Data lives in multiple, siloed repositories, and is piling up in real time. No longer can we say that a particular result is definitively relevant or not; relevance is in the eyes of whoever issued the query.
The good news is that these are not theoretical problems: they are the real-life use cases successfully deployed at leading social media sites built on Lucene/Solr open source search.
Let’s look first at Technorati, the blog search and discovery engine. It indexes about 300,000 new blog posts a day, often in real time. Once built on a relational database, Technorati moved to Lucene, and then to Solr. One key metric: how many results on the first page are under a minute old? Technorati maintains an innovative indexing structure, using two separate indexes: one for the posts, and one for metadata about the blogs, such as their origin, topic, posting frequency and other information about the shape and size of the blogs beyond the content. Using the dual-index structure, Technorati builds the search results page and pre-stages content browsing pages on one topic or another.
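To make the dual-index idea concrete, here is a minimal sketch in Python. It is a toy illustration, not Technorati's implementation: the class, field names and sample data are all hypothetical, and plain dictionaries stand in for Lucene indexes. One index maps terms to posts, a second holds metadata per blog, and the two are joined at query time, with an optional freshness cutoff in the spirit of the "under a minute old" metric.

```python
from collections import defaultdict


class DualIndex:
    """Toy stand-in for two separate indexes: one for posts, one for blog metadata."""

    def __init__(self):
        self.post_terms = defaultdict(set)  # term -> ids of posts containing it
        self.posts = {}                     # post id -> (blog id, timestamp, text)
        self.blogs = {}                     # blog id -> metadata about the blog

    def add_blog(self, blog_id, topic, origin):
        self.blogs[blog_id] = {"topic": topic, "origin": origin}

    def add_post(self, post_id, blog_id, timestamp, text):
        self.posts[post_id] = (blog_id, timestamp, text)
        for term in text.lower().split():
            self.post_terms[term].add(post_id)

    def search(self, term, now, max_age_seconds=None):
        """Return matching posts, newest first, joined with their blog's metadata."""
        hits = []
        for post_id in self.post_terms.get(term.lower(), set()):
            blog_id, timestamp, _text = self.posts[post_id]
            age = now - timestamp
            if max_age_seconds is not None and age > max_age_seconds:
                continue  # too old for a "fresh results" view
            hits.append({"post": post_id, "age": age,
                         "topic": self.blogs[blog_id]["topic"]})
        return sorted(hits, key=lambda hit: hit["age"])


# Hypothetical sample data; timestamps are plain seconds.
index = DualIndex()
index.add_blog("b1", topic="search", origin="US")
index.add_blog("b2", topic="cooking", origin="UK")
index.add_post("p1", "b1", timestamp=1000, text="Solr faceting tips")
index.add_post("p2", "b2", timestamp=1090, text="Solr powered recipe search")

# Only posts at most 60 seconds old at time 1100.
fresh = index.search("solr", now=1100, max_age_seconds=60)
```

Keeping blog metadata apart from post content means a blog's topic or origin can change without re-indexing every post it has published.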
LinkedIn, no doubt familiar to anyone reading this article, is a social media site driven by Lucene/Solr search. Lucene drives a powerful faceting system, allowing you to pivot and navigate by user and company attributes extracted from user-generated data. Where LinkedIn really shines is in ranking results by their relationship to you. That value is not fixed in the data: it’s computed by Lucene in real time, calculating who you will want to see based on the arithmetic of your personal relationships.
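One simple way to picture ranking by relationship is to count degrees of separation at query time and sort the matches by that count. The sketch below assumes a small in-memory connection graph and uses breadth-first search; the function names and sample members are hypothetical, and it is not a description of how LinkedIn or Lucene actually computes the score.

```python
from collections import deque


def degrees_from(graph, start):
    """Breadth-first search: hops from start to every reachable member."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        member = queue.popleft()
        for neighbor in graph.get(member, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[member] + 1
                queue.append(neighbor)
    return distance


def rank_by_relationship(graph, viewer, matches):
    """Order matches by closeness to the viewer; unreachable members go last."""
    distance = degrees_from(graph, viewer)
    return sorted(matches, key=lambda m: (distance.get(m, float("inf")), m))


# Hypothetical connection graph (each member lists direct connections).
connections = {
    "ana": ["ben", "cy"],
    "ben": ["ana", "dee"],
    "cy": ["ana"],
    "dee": ["ben", "eli"],
    "eli": ["dee"],
    "fay": [],
}

# The same four matches would be ordered differently for a different viewer.
ranked = rank_by_relationship(connections, "ana", ["eli", "fay", "dee", "cy"])
```

The point of the sketch is that the ordering depends on who is asking: the same result set produces a different ranking for each viewer, so the score cannot be stored in the index ahead of time.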
The phone directory is a classic, legacy search model forever changed by social media. Working with Lucid Imagination, AT&T Interactive uses Lucene/Solr for its www.yp.com website. Structured listings are combined with user reviews of the businesses and Lucene-driven computations of location and distance. When user queries are imprecise or vague, Solr’s “did you mean” function provides accurate results even for one-word queries like “pizza.”
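As a rough illustration of combining structured listings with a distance computation, the following Python sketch filters listings by category and sorts them by great-circle distance from the user. The haversine formula is one common way to compute such distances and is used here as an assumption; the listing names, coordinates and function names are invented for the example and say nothing about how yp.com is implemented.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(listings, keyword, lat, lon):
    """Listings whose category contains the keyword, nearest first."""
    matches = [item for item in listings
               if keyword.lower() in item["category"].lower()]
    return sorted(matches,
                  key=lambda item: haversine_km(lat, lon, item["lat"], item["lon"]))


# Hypothetical listings around San Francisco.
listings = [
    {"name": "Tony's", "category": "Pizza", "lat": 37.8044, "lon": -122.2712},
    {"name": "Slice House", "category": "Pizza", "lat": 37.7793, "lon": -122.4193},
    {"name": "Bolt Hardware", "category": "Hardware", "lat": 37.7750, "lon": -122.4190},
]

# A one-word query from a user standing in downtown San Francisco.
results = nearby(listings, "pizza", lat=37.7749, lon=-122.4194)
names = [item["name"] for item in results]
```

A production system would blend distance with review scores and text relevance rather than sorting on distance alone.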
Finally, let’s look at MySpace. Not only an iconic social media business built on Lucene, it’s one of the world’s largest search sites: 125 million active users on the site, with several hundred million profiles; 250,000 new users per day; 41 million emails a day; 33 million videos with 62,000 added daily. More than 800 billion rows of data and 8 billion friend relationships are searched by Lucene across a terabyte of new data per week, all in near real time.
Custom relevancy is at the heart of MySpace. Sifting through content to separate valuable items from noise gives users what they’re looking for. Lucene’s scoring model includes item metadata, customizing results for every user and every search. Searches for “Madonna videos” will find uploads from the pop artist, and ignore fakes or home movies about a dog named Madonna. MySpace also uses Lucene’s algorithmic power to identify underage users: for example, when an “18-year-old” has most of her friends in 6th grade.
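The underage example boils down to comparing what a profile claims with what its friend network suggests. A very simple heuristic of that kind might look like the Python sketch below. It is an assumption-laden illustration, not MySpace's method: the function name, the minimum friend count and the age-gap threshold are all hypothetical.

```python
from statistics import median


def looks_misreported(stated_age, friend_ages, min_friends=5, max_gap=4):
    """Flag a profile whose stated age is far above the median age of its friends.

    min_friends and max_gap are illustrative thresholds, not tuned values.
    """
    if len(friend_ages) < min_friends:
        return False  # too little evidence to judge
    return stated_age - median(friend_ages) > max_gap


# An "18-year-old" whose friends are mostly 11 and 12 gets flagged.
suspicious = looks_misreported(18, [11, 12, 11, 12, 12, 19])

# An 18-year-old with friends of similar age does not.
plausible = looks_misreported(18, [17, 18, 19, 18, 20])
```

Using the median rather than the mean keeps one older friend, such as a parent or teacher, from masking the pattern.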
The range of these social media use cases, in scale and scope, has laid the groundwork for the next generation of enterprise search. Collaboration, business intelligence analytics, security and purchase transactions all increasingly face data of scope, scale and update rates similar to social media. With the flexibility of the Lucene/Solr platform, search application developers can achieve a precise fit between the torrent of heterogeneous data and search applications that deliver competitive advantage.
Originally published in KMWorld.