This morning we were reminded of a recent study by Frost & Sullivan, a global consulting firm that has been studying the evolution of enterprise search. There was a time, not too long ago, when enterprise search was an interesting but rather small category, quietly growing in the shadow of the much larger consumer search category. If Frost & Sullivan is right, the category will soon make more noise. The firm projects that the global market for enterprise search will reach $4.68 billion by 2019.
What’s driving the growth? We generally agree with the consulting firm’s analysis, and have a view of our own. We see three principal drivers, each adding to the rapid growth of the market.
First, there’s the big trend identified by Frost & Sullivan: big data. The trend is both an opportunity and a challenge for the enterprise. It’s an opportunity because the volume of data might enable the enterprise to learn more and do more in any number of areas. But it’s a challenge because the volume of data requires new technology and new practices. In the meantime, it looks like both the opportunity and the challenge are helping to drive growth in search. As Grant Ingersoll, CTO of Lucidworks, recently told Forbes, “search is the UI for data today.”
Second, there’s the sheer complexity of the data. Organizations need to sort through not just mountains of data, but data of many different kinds, most of it unstructured. We’ve been having this discussion for years, and the early conversations around unstructured data helped the enterprise search category evolve. What may be different today is that the conversation about big data has thrown that earlier conversation into relief. It’s far clearer today to the multitudes that the complexity of data requires different approaches.
Finally, there’s a different kind of complexity that many organizations are just beginning to grapple with: the complexity of use cases. In an article about the Frost & Sullivan study, the author notes how search vendors “are looking beyond meta-data tagging and towards statistical and probabilistic algorithms, deep-content analytics, and pattern detection.” Why? Because search has evolved to meet the needs of a different, sophisticated enterprise user. And here’s where the power of open source can make a big difference. With a larger pool of innovators, and a method for sharing technologies and practices, the open source community is well positioned to meet the complexity challenge in the enterprise search market. At this year’s Revolution event, we will showcase some of the ways the community is already meeting that challenge. Here’s a peek at what day one looks like for the big data crowd. Tomorrow we’ll take a closer look at day two.
DAY ONE BIG DATA SESSIONS
Building a Real-time, Big Data Analytics Platform with Solr
- Presented by Trey Grainger, Search Technology Development Manager, CareerBuilder
- Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning-fast, real-time knowledge discovery.
- At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics.
You will walk away from this talk with an advanced understanding of faceting, including pivot faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you’ll never see Solr as just a text search engine again.
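To make the pivot-faceting idea concrete, here is a minimal Python sketch of what a pivot facet computes: hierarchical counts over two fields. It runs over an in-memory list of toy documents rather than Solr itself, and the field names ("state", "category") are illustrative, not taken from CareerBuilder's actual schema:

```python
from collections import Counter, defaultdict

# Toy documents standing in for an indexed corpus of job postings.
# Field names are hypothetical, chosen only to illustrate the concept.
docs = [
    {"state": "GA", "category": "engineering"},
    {"state": "GA", "category": "engineering"},
    {"state": "GA", "category": "sales"},
    {"state": "IL", "category": "engineering"},
]

def pivot_facet(docs, outer_field, inner_field):
    """Count documents per outer value, then per inner value within each
    outer bucket -- the same shape of result a pivot facet returns."""
    pivot = defaultdict(Counter)
    for d in docs:
        pivot[d[outer_field]][d[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in pivot.items()}

print(pivot_facet(docs, "state", "category"))
# → {'GA': {'engineering': 2, 'sales': 1}, 'IL': {'engineering': 1}}
```

In Solr the same nested-count result is requested declaratively with facet parameters; the sketch just shows the shape of the computation, not Solr's implementation.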
Analytics in OLAP with Lucene and Hadoop
- Presented by Dragan Milosevic, Senior Architect, zanox
- Analytics powered by Hadoop is a powerful tool, and this talk addresses its application in an OLAP system built on top of Lucene.
Many applications also use Lucene indexes for storing data, to alleviate the challenges of depending on external data sources. Analysis of queries can reveal stored fields that are, in most cases, accessed together. If one compressed binary field replaces those fields, the amount of data to be loaded is reduced and query processing is faster. Furthermore, documents that are frequently loaded together can be identified. If those documents are saved at nearly successive positions in Lucene’s stored files, file-system caches are used more effectively and loading documents is noticeably faster.
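The field-consolidation idea above can be sketched in a few lines of Python: serialize the fields that query analysis shows are co-accessed into a single compressed binary blob, which is then stored as one field instead of several. This is a hedged illustration of the general pattern, not the talk's actual implementation; the field names and JSON/zlib choices are assumptions made for the example:

```python
import json
import zlib

# Hypothetical fields that query-log analysis shows are almost always
# loaded together for the same document.
fields = {"title": "Senior Engineer", "city": "Atlanta", "salary": 95000}

def pack_fields(field_dict):
    """Serialize co-accessed fields into one compressed binary blob --
    the kind of value you could keep in a single binary stored field."""
    return zlib.compress(json.dumps(field_dict).encode("utf-8"))

def unpack_fields(blob):
    """Restore the original field dictionary from the packed blob."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

blob = pack_fields(fields)
assert unpack_fields(blob) == fields  # round-trips losslessly
```

Loading one blob instead of several separately stored fields reduces both the bytes read and the per-field decoding overhead, which is the benefit the abstract describes.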
- Large-scale searching applications typically deploy sharding and partition documents by hashing.
The implemented OLAP system has shown that such hash-based partitioning is not always optimal. An alternative partitioning scheme, supported by analytics, has been developed. It places documents that are frequently used together in the same shards, which maximizes the amount of work that can be done locally and reduces the communication overhead among searchers. As a bonus, it also identifies slow queries, which typically point to emerging trends, and suggests adding optimized searchers to handle similar queries.
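The contrast between the two partitioning strategies can be sketched as follows. This is a simplified Python illustration under assumed inputs (the co-access groups would come from query-log analytics), not the zanox implementation:

```python
import hashlib

def hash_shard(doc_id, num_shards):
    """Default strategy: spread documents evenly by hashing the id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def coaccess_shard(doc_id, coaccess_groups, num_shards):
    """Analytics-driven strategy: documents observed to be loaded
    together get the same shard, so more of each query runs locally."""
    for shard, group in enumerate(coaccess_groups):
        if doc_id in group:
            return shard % num_shards
    # Fall back to hashing for documents with no co-access data.
    return hash_shard(doc_id, num_shards)

# Hypothetical groups derived from query-log analysis.
groups = [{"d1", "d2", "d3"}, {"d4", "d5"}]

# Co-accessed documents land on the same shard...
assert coaccess_shard("d1", groups, 4) == coaccess_shard("d2", groups, 4)
# ...while hash partitioning gives no such guarantee.
```

Keeping co-accessed documents in one shard trades perfectly even load for locality, which is why the abstract frames it as an analytics-supported alternative rather than a universal replacement for hashing.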
Batch Indexing and Near Real Time, Keeping Things Fast
- Presented by Marc Sturlese, Architect and Backend Engineer, Trovit
- In this talk I will explain how we combine a mixed architecture using Hadoop for batch indexing with Storm, HBase, and ZooKeeper to keep our indexes updated in near real time. I will discuss why we didn’t choose a default SolrCloud setup and its real-time features (mainly to avoid hitting merges while serving queries on the slaves), and the advantages and complexities of having a mixed architecture. Both parts of the infrastructure, and how they are coordinated, will be explained in detail. Finally, I will mention future directions, including how we plan to use Lucene’s real-time features.
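The read path of such a mixed architecture can be sketched as an overlay: a periodically rebuilt batch snapshot, with a small store of recent updates consulted first. This is a hedged toy model of the general pattern only; the document names, fields, and stores are invented for illustration and are not Trovit's actual design:

```python
# Batch index, rebuilt periodically by an offline job (Hadoop-style).
batch_index = {"doc1": {"price": 100}, "doc2": {"price": 200}}

# Near-real-time updates held in a fast mutable store (HBase-like),
# applied between batch rebuilds.
nrt_updates = {"doc2": {"price": 180}, "doc3": {"price": 300}}

def lookup(doc_id):
    """Read path: recent updates win over the batch snapshot."""
    if doc_id in nrt_updates:
        return nrt_updates[doc_id]
    return batch_index.get(doc_id)

assert lookup("doc2") == {"price": 180}  # NRT update overrides batch
assert lookup("doc1") == {"price": 100}  # falls back to the snapshot
```

Each batch rebuild absorbs the accumulated updates and clears the overlay, which keeps the serving index fast while still reflecting recent changes.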