For quite some time now, Lucidworks has been hosting a community site named Search Hub (aka LucidFind) that consists of a searchable archive of a number of Apache Software Foundation mailing lists, source code repositories and wiki pages, as well as related content that we’ve deemed beneficial. Previously, we’ve had three goals in building and maintaining such a site:
- Provide the community a focused resource for finding answers to questions on our favorite projects like Apache Solr and Lucene
- Dogfood our product
- Associate the Lucidworks brand with the projects we support

With Search Hub 2.0, we wanted to add a few more goals:
- Show others how it's done by open sourcing the code base under an Apache license. (Note: you will need Lucidworks Fusion to run it.)
- Leverage Fusion’s built in Apache Spark capabilities for offline, background enhancement of the index to improve relevance and our analytics.
- Deploy machine learning experiments.
- Build on Lucidworks View.
While we aren’t done yet, we are far enough along that I am happy to announce we are making Search Hub 2.0 available as a public beta. If you want to cut to the chase and try it out, follow the links I just provided; if you want all the gory details on how it all works, keep reading.
Rebooting Search Hub
When Jake Mannix joined Lucidworks back in January, we knew we wanted to significantly expand the machine learning and recommendation story here at Lucidworks, but we kept coming back to the fundamental problem that plagues all such approaches: where to get real data and real user feedback. Sure, we work with customers all the time on these types of problems, but that only goes so far in enabling our team to control its own destiny. After all, we can’t run experiments on a customer’s website (at least not in any reasonable time frame for our goals), nor can we always get the data that we want, due to compliance and security reasons. As we looked around, we kept coming back to, and finally settled on, rebooting Search Hub to run on Fusion, but this time striving for the goals outlined above.

We also have been working with the academic IR research community on ways to share our user data while hoping to avoid another AOL query log fiasco. It is too early to announce anything on that front just yet, but I am quite excited about what we have in store and hope we can do our part at Lucidworks to help close the “data gap” in academic research by providing the community with a significantly large corpus with real user interaction data. If you are an academic researcher interested in helping out and are up on differential privacy and other data-sharing techniques, please contact me via the Lucidworks Contact Us form and mention this blog post and my name. Otherwise, stay tuned.

In the remainder of this post, I’ll cover what’s in Search Hub, highlight how it leverages key Fusion features and finish up with where we are headed next.
The Search Hub beta currently consists of:
- 26 ASF projects (e.g. Lucene, Solr, Hadoop, Mahout) and all public Lucidworks content, including our website, knowledge base and documentation, with more content added automatically via scheduled crawls.
- 90+ datasources (soon to be 120+) spanning email, GitHub, websites and wikis, each with a corresponding schedule defining its update rate.
- Nine index pipelines and two query pipelines for processing incoming content and requests.
- Five different signal capture mechanisms in the UI: page views, page pings (heartbeats), searches, document clicks and typeahead search clicks. See below for the gory details on signals.
- Lucidworks Fusion 2.4.1
- A Python Flask middle tier and Python-based project bootstrapping mechanism.
- An HTTPd server with mod_wsgi, in a Docker container running on AWS.
- An MBOX Parsing Stage for dealing specifically with email.
- A custom, scheduled Spark job that runs periodically to coalesce mail threads and help reduce the effects of email thread hijacking. (There is always more work to be done here.)
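To give a flavor of what the thread-coalescing job does, here is a minimal plain-Python sketch of the grouping idea: normalize away reply/forward prefixes so replies land in the same thread as the original message. The real job runs as a scheduled Spark job over the mail index; the function names and the subject-based heuristic here are illustrative assumptions, not the actual implementation.

```python
import re
from collections import defaultdict

# Matches a leading "Re:", "Fwd:", "Fw:" or "AW:" prefix (case-insensitive).
_PREFIX = re.compile(r"^\s*(re|fwd?|aw)\s*:\s*", re.IGNORECASE)

def normalize_subject(subject: str) -> str:
    """Repeatedly strip reply/forward prefixes, then lowercase the rest."""
    prev = None
    while prev != subject:
        prev = subject
        subject = _PREFIX.sub("", subject)
    return subject.strip().lower()

def coalesce_threads(messages):
    """Group message ids by normalized subject.

    messages: iterable of dicts with 'id' and 'subject' keys.
    Returns a dict mapping normalized subject -> list of message ids.
    """
    threads = defaultdict(list)
    for msg in messages:
        threads[normalize_subject(msg["subject"])].append(msg["id"])
    return dict(threads)
```

A subject-only heuristic like this is exactly where thread hijacking bites: a reply that reuses an old subject for a new topic still lands in the old thread, which is why the real job has more work to do.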
Next Generation Relevance
While other search engines are touting their recent adoption of search ranking functions (BM25) that have been around for 20+ years, Fusion is focused on bringing next generation relevance to the forefront. Don’t get me wrong, BM25 is a good core ranking algorithm and it should be the default in Lucene, but if that’s your answer to better relevance in the age of Google, Amazon and Facebook, then good luck to you. (As an aside, I once sat next to Illya Segalovich from Yandex at a SIGIR workshop where he claimed that at Yandex, BM25 only got relevance about ~52% of the way to the answer. Others in the room disputed this, saying their experience was more like ~60-70%. In either case, it’s got a ways to go.)

If BM25 (and other core similarity approaches) only gets you 70% (at best) of the way, where does the rest come from? We like to define Next Generation Relevance as being founded on three key ideas (which Internet search vendors have been deploying for many years now), which I like to call the “3 C’s”:
- Content — This is where BM25 comes in, as well as things like how you index your content, what fields you search, editorial rules, language analysis and update frequency. In other words, the stuff Lucene and Solr have been doing for a long time now. If you were building a house, this would be the basement and first floor.
- Collaboration — What signals can you capture about how users and other systems interact with your content? Clicks are the beginning, not the end of this story. Extending the house analogy, this is the second floor.
- Context — Who are you? Where are you? What are you doing right now? What have you done in the past? What roles do you have in the system? A user in Minnesota searching for “shovel” in December is almost always looking for something different than a user in California in July with the same query. Again, with the house analogy: this is the attic and roof.
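To make the layering concrete, here is a toy sketch of how the three C’s might combine into a single ranking score: a content score from the engine (e.g. BM25) multiplied by a collaboration boost derived from click signals and a context boost derived from the user’s situation. This is purely illustrative; it is not how Fusion’s pipelines actually compute relevance, and every name and formula here is an assumption.

```python
import math

def combined_score(content_score: float, clicks: int, context_boost: float = 1.0) -> float:
    """Toy "3 C's" combination (illustrative only, not Fusion's scoring):
    - content_score: core engine relevance, e.g. a BM25 score
    - clicks: collaboration signal, log-damped so popular docs don't dominate
    - context_boost: multiplier for who/where/when the user is (default: neutral)
    """
    collaboration_boost = 1.0 + math.log1p(clicks)
    return content_score * collaboration_boost * context_boost
```

The log-damping on clicks is one common design choice: it lets user feedback nudge rankings without letting a heavily clicked document swamp the content score entirely.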
On Search Hub, we capture these collaboration signals via Snowplow. The captured signals include:
- Page visits.
- Time on page (approximated by the page ping heartbeat in Snowplow).
- Queries executed, including the capture of all documents and facets displayed.
- What documents were clicked on, including unique query id, doc id, position in the SERP, facets chosen, and score.
- Typeahead click information, including what characters were typed, the suggestions offered and which suggestion was chosen.
With each of these signals, Snowplow sends a myriad of information, including things like User IDs, Session IDs, browser details and timing data. All of these signals are captured in Fusion. Over the coming weeks and months, as we gather enough signal data, we will be rolling out a number of new features highlighting how to use this data for better relevance, as well as other capabilities like recommendations.
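As a concrete illustration, a document-click signal carrying the attributes listed above might look like the following once captured. The field names and values here are hypothetical, not the exact Snowplow or Fusion signal schema.

```python
import json
import time
import uuid

# Hypothetical document-click signal; field names are illustrative
# assumptions, not the actual schema used by Snowplow or Fusion.
click_signal = {
    "type": "click",
    "query_id": str(uuid.uuid4()),   # ties the click back to the query that produced it
    "doc_id": "SOLR-1234",           # the document that was clicked
    "position": 3,                   # rank of the document in the SERP
    "score": 7.42,                   # engine score at display time
    "facets": ["project:solr", "type:email"],  # facets chosen when clicking
    "user_id": "u-42",
    "session_id": "s-99",
    "timestamp_ms": int(time.time() * 1000),
}

# Signals are typically shipped to the collector as JSON.
payload = json.dumps(click_signal)
```

Keeping the query id, position and score on every click is what later makes click-model style relevance work possible, since you can reconstruct exactly what the user saw when they clicked.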
Getting Started with Spark on Search Hub
The core of Fusion consists of two key open source technologies: Apache Solr and Apache Spark. If you know Lucidworks, then you already likely know Solr. Spark, however, is something that we’ve added to our stack in Fusion 2.0, and it opens up a host of possibilities that were previously something our customers had to do outside of Fusion, in what was almost always a significantly more complex application. At its core, Spark is a scalable, distributed compute engine. It ships with machine learning and graph analytics libraries out of the box. We’ve been using Spark for a number of releases now to do background, large-scale processing of things like logs and system metrics. As of Fusion 2.3, we have been exposing Spark (and Spark-shell) to our users. This means that Fusion users can now write and submit their own Spark jobs, as well as explore our Spark Solr integration on the command line, simply by typing `$FUSION_HOME/bin/spark-shell`. This includes the ability to take advantage of all Lucene analyzers in Spark, which Steve Rowe covered in this blog post.
All of these capabilities are showcased in the SparkShellHelpers.scala file. As the name implies, this file contains commands that can be cut and pasted into the Fusion spark shell (`bin/spark-shell`). I’m going to save the details of running this to a future post, as there are some very interesting data engineering discussions that fall out of working with this data set in this manner.
Our long-term intent as we move out of beta is to support all Apache projects. Currently, the project specifications are located in the project_config folder. If you would like your project supported, please issue a pull request and we will take a look and try to schedule it. If you would like to see some other feature supported, we are open to suggestions; please open an issue or a pull request and we will consider it.

If your project is already supported and you would like to add a search box similar to the one on Lucene’s home page, have it submit to `http://searchhub.lucidworks.com/?p:PROJECT_NAME`, passing in your project name (not label) for PROJECT_NAME, as specified in the project_config. For example, for Hadoop, it would be `http://searchhub.lucidworks.com/?p:hadoop`.
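For instance, a tiny helper following the URL pattern above might look like this; the function name is hypothetical, only the `?p:PROJECT_NAME` pattern comes from the description above.

```python
SEARCHHUB_BASE = "http://searchhub.lucidworks.com/"

def project_search_url(project_name: str) -> str:
    """Build a Search Hub URL scoped to one project, using the
    http://searchhub.lucidworks.com/?p:PROJECT_NAME pattern.
    project_name must be the project name (not label) from project_config."""
    return SEARCHHUB_BASE + "?p:" + project_name
```

A project home page would then point its search box at the URL this returns for its own project name.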
In the coming months, we will be rolling out:
- Word2Vec for query and index time synonym expansion. See the Github Issue for the details.
- Classification of content to indicate what mailing list we think the message belongs to, as opposed to what mailing list it was actually sent to. Think of it as a “Did you mean to send this to this list?” classifier.
- User registration and personalized recommendations, with alerting. For a preview, check out our webinar on June 30th.
- Content-based and collaborative filtering recommendations.
- Community analytics, powered by Spark. Find out who in the community you should be listening to for answers!
- User Interface improvements.