Reddit handles ~10M monthly posts and almost 3M daily comments across 138K active communities. Search is a key way Reddit’s more than 330M monthly users find content. After trying five different search stacks in 12 years, it needed a solution that could scale and quickly locate relevant content for 47M daily searches.
With Fusion, Lucidworks can deliver the fast, scalable, responsive, and relevant search solution the Reddit community requires. While matching the relevancy of the previous search engine in the initial implementation, Fusion also laid the foundation for future improvements to search results.
Index time was reduced, with a full corpus index that previously required two weeks now taking just two days with ingestion rates of five thousand updates per second. Search clusters were reduced from 200 to 30 nodes. Uptime is at 99.99% with 400 queries and 1,000 updates per second including live streaming updates.
Reddit has a deluge of content that’s constantly being submitted, voted on, and discussed, so building a great search experience is no easy feat. The team needed a scalable solution that could quickly locate relevant content for millions of searches per day. Over the past 12 years, Reddit implemented five different search solutions with varying success.
When the site first launched in 2005, it used a very basic search built with Postgres. This solution lasted three years, but traffic logs showed that while search was just 2% of site traffic, it was using more than half of system resources. After that, the team tried open source Apache Lucene with a Python RPC client. Later, a solution using open source Apache Solr was deployed. In 2010, as Reddit reached an unprecedented number of pageviews per month, the team deployed another search, this time built with Memcached and Cassandra, to try to keep ahead of Reddit’s user base and growth trajectory. The next approach was to outsource search to an outside company, but the partnership was short-lived because the vendor was acquired several months later. Reddit then switched to a cloud-based search provider, but it did not meet Reddit’s growing requirements: query response times got longer, searches started timing out, and users grew increasingly impatient. At a virtual town hall in 2016, Reddit’s CEO was asked, “Where do you see Reddit in 10 years?” The CEO responded, “Reddit search might work by then.”
Reddit partnered with Lucidworks to build the site’s new search with Fusion, an advanced platform for developing smart apps. Reddit wanted a solution that included Apache Solr, the open source search technology known for its reliability and scalability. Fusion includes Solr as part of its search stack, so the team was confident the new search solution would be able to scale to the magnitude needed for such a popular site.
The new search app was separated into three parts for faster, more iterative development. The first part is a search microservice which runs two to three instances a day. The second part is the existing data pipeline architecture, which pulls in nightly batch updates to build a canonical view of all content, along with the live updates streaming in through Kafka. The third part is the Fusion cluster itself, which is 24 nodes, indexing all the content of the front page of the internet. The previous search app had weighed in at 200 instances.
Index time was significantly reduced as well: a full corpus index that previously required two weeks now takes just two days, with ingestion rates of five thousand updates per second. To test scaling and performance of the Fusion cluster, all search traffic sent to the live site was also routed to Fusion.
Then over a few weeks, more and more queries were gradually sent to and processed by the Fusion cluster with results sent back to the live site and viewed by more and more users. Error rates dropped significantly to practically zero.
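The two rollout stages described above (mirroring all live traffic to the new cluster, then gradually serving its results to more users) can be sketched roughly as follows. This is an illustrative sketch only, not Reddit's or Lucidworks' actual code: the function names, the hash-based bucketing, and the `rollout_percent` and `mirror` parameters are all assumptions introduced here.

```python
import hashlib
from typing import Callable, List

# A backend is any callable that takes a query and returns result IDs.
Backend = Callable[[str], List[str]]


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def handle_search(
    query: str,
    user_id: str,
    legacy: Backend,
    fusion: Backend,
    rollout_percent: int,
    mirror: bool = True,
) -> List[str]:
    """Serve a search during a staged migration (hypothetical helper).

    Users whose bucket falls below rollout_percent see results from the
    new backend. Everyone else is served by the legacy backend; when
    mirror is True the query is also sent to the new backend and its
    results are discarded, so the new cluster still sees the full load.
    """
    if bucket(user_id) < rollout_percent:
        return fusion(query)
    if mirror:
        try:
            # A real service would do this asynchronously.
            fusion(query)
        except Exception:
            pass  # a shadow failure must never affect the live response
    return legacy(query)
```

With `rollout_percent=0` and `mirror=True` this behaves like the load test (every query reaches the new cluster, no user sees its results); raising `rollout_percent` step by step corresponds to the gradual ramp. Hashing the user ID keeps each user on the same backend between requests.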
In addition to scalability and performance, the new search app had to achieve the same level of relevance in search results. By tracking clicks on search results, the team was able to determine that the search results coming from Fusion had the same quality as the existing search, so users would not experience a drop in relevance. Easy wins were achieved by filtering out spam and other bad queries. Relevancy is also being tuned to aggregate results by Reddit communities (a.k.a. “Subreddits”), so a search for news about the Super Bowl won’t be incorrectly routed to the community dedicated to sharing photos of majestic owls, /r/SuperbOwl.
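One simple way to compare two search backends from click tracking is to compute a click-through rate for each. The source does not say which metric the team actually used, so the metric, the event format, and the function name below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def click_through_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of searches that led to a click, per backend.

    Each event is a (backend, clicked) pair: the backend that served one
    search, and whether the user clicked any result from it.
    """
    searches: Dict[str, int] = defaultdict(int)
    clicks: Dict[str, int] = defaultdict(int)
    for backend, clicked in events:
        searches[backend] += 1
        if clicked:
            clicks[backend] += 1
    return {name: clicks[name] / searches[name] for name in searches}


# Hypothetical sample log, not real Reddit data.
sample = [
    ("legacy", True), ("legacy", False),
    ("fusion", True), ("fusion", False),
]
rates = click_through_rates(sample)
```

Similar rates across the two backends would suggest that users find results from the new search about as useful as before. A real comparison would need far more traffic and a significance test before drawing that conclusion.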
The new search app built with Fusion reached 100% availability to all of Reddit’s 330+ million users in September 2017, and so far the reaction has been overwhelmingly positive. The search cluster is indexing a quarter of a billion posts with 1,000+ updates per second, serving 400 queries per second to both users and third-party apps through Reddit’s APIs.