Fanfeedr.com is a real-time, personalized sports aggregation website with a social networking layer on top. It now aggregates more than 3,500 sources providing information on more than 55,000 athletes and over 4,000 teams, including those from over 1,700 colleges and universities across 15 different sports. By aggregating data in a database but using Solr to index the documents and the relationships between them, Fanfeedr can both deliver highly relevant content and keep pace with the rapid growth and variety of incoming content.
To really “get” Fanfeedr, imagine this: You are the World’s Number 1 Derek Jeter fan. You know everything there is to know about him, from his annual RBI totals to what he had for breakfast. You even like the Yankees, mostly because they bask in Jeter’s aura. You love to see him play in person whenever you can, but doing so is a bit tough, given that you live in Tulsa, Oklahoma.
So what do you do to follow your main man? Truth is, you probably spend a lot of time on the Internet, visiting maybe 5 or 6 sites on any given day. Now ask yourself: If you could get all the data on the Web from a single source, would you do it? In a New York minute!
- The typical sports fan’s interests are structured around aspects of the sport such as players, teams, and leagues, rather than their information sources, i.e., sports Websites and other publishers
- That fan normally visits 4 to 6 Websites to obtain the information on their area(s) of interest
- 57% of such fans live in a metropolitan area other than the location of their favorite team or player
Taken together, these facts added up to an opportunity and a challenge: What if every sports fan could completely control the online world to match his or her own personal interests, and easily select and manage the content he or she is presented with?
Fanfeedr.com is a real-time, personalized sports aggregation website with a social networking layer on top; it was founded to answer that question and to serve these sports fans. Within its first year it has expanded to aggregate more than 3,500 sources providing information on more than 55,000 athletes and over 4,000 teams, including those from over 1,700 colleges and universities across 15 different sports.
Collectively speaking, that’s an enormous amount of data, and as we’ll see shortly, the volume is only growing. On the one hand, the data a particular sports fan might want is rarely all in the same place, so aggregation has real value. On the other, if each user could access all that data at once, he or she would be at the end of a fire hose of data. The site needs to let users pare down the stream so they get just what they want. How does Fanfeedr do it?
CEO Ty Ahmad Taylor puts it simply: “Search is key to our business. It’s where our users need us to excel.” Fanfeedr’s lead search engineer, Sangraal Aiken, agrees: “Search is the main way we help to ensure that the content users get is relevant. We continue to refine and improve it, and there’s still work to be done. From a developer’s perspective, handling massive and ever-increasing amounts of data in real time is key. For example, the gap between data acquisition and search-index availability needs to be as close to zero as possible.”
The company has developed an interesting approach to gathering, extracting, indexing, and delivering data to its end users, and has chosen Lucene/Solr to be an integral component of these services.
Search Architecture and Process Flow Overview
To meet the challenge of sports fans’ constant, focused appetite for sports data, Fanfeedr has deployed a search architecture based on the following process flow:
- Start by acquiring, extracting, and parsing data from an XML data feed that includes, among other things, team names, player rosters, and live score data. These items define, or structure, the sports world as the system knows it, and this structure is reflected in the Fanfeedr core database.
- Next, crawl the list of 3,500+ (and growing) RSS feeds for news sources across the Web. This is performed by a system component called the ‘Aggregator,’ which extracts the data and analyzes keywords using a Bayesian algorithm to calculate the probability a given article or story falls into a database category. When this process is complete, the database contains an ever-growing list of categorized sports content.
- Another component, a Python script called the ‘Indexer,’ polls the Aggregator constantly for newly updated content. It builds XML documents based on Solr’s schema and assigns a priority to each. The priority system is essential for maintaining control over which commands Solr processes first.
- The Indexer then passes these XML documents to an ‘Index Syndicator’ servlet, which runs on every Solr server in the environment, up to what is effectively an indefinitely large number of servers.
- In addition, multiple Indexer scripts can be run on different types of content and on different schedules, allowing Fanfeedr to maintain further control over index currency.
- The Index Syndicator maintains a priority-based queue of Solr commands and “syndicates” those commands, in order of priority or synchronously in the order in which they were received, to the other Solr servers in its pool. Every Solr machine is capable of building its own index and handling the syndication of commands to its peers, which gives Fanfeedr flexibility when scaling and/or partitioning its index horizontally.
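The queueing behavior of the Index Syndicator can be sketched in a few lines. This is an illustration only, written in Python rather than as a servlet: the class name `SyndicationQueue`, the peer names, the `send` callback, and the convention that a lower number means higher priority are all assumptions, not details of Fanfeedr's implementation.

```python
import heapq
import itertools


class SyndicationQueue:
    """Priority queue of Solr commands (assumed: lower number = more urgent)."""

    def __init__(self, peers):
        self.peers = list(peers)           # hypothetical peer server names
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def enqueue(self, command, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), command))

    def drain(self, send):
        """Pop commands in priority order and pass each one to every peer."""
        delivered = []
        while self._heap:
            _, _, command = heapq.heappop(self._heap)
            for peer in self.peers:
                send(peer, command)
            delivered.append(command)
        return delivered


sent = []
queue = SyndicationQueue(["solr-2", "solr-3"])
queue.enqueue("<add>old story</add>", priority=5)
queue.enqueue("<add>live score</add>", priority=1)
queue.enqueue("<add>older story</add>", priority=5)
order = queue.drain(lambda peer, cmd: sent.append((peer, cmd)))
```

The running counter is what lets commands of equal priority leave the queue in the order they arrived, while a more urgent command (here the live score) still jumps ahead.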
To avoid multiple commit pile-ups, Solr is configured to auto-commit its index after a short duration or when the ‘new_doc’ count gets above a certain threshold. Note that this threshold bears tuning over time to ensure Lucene/Solr keeps up with the ever-growing body of content. Optimization of the index is done daily by cron.
When Fanfeedr’s aggregator subsystem posts an update, the ‘updated_at’ column for that entity is revised in the database. This causes the Indexer to re-fetch the document and start the process all over again, thus ensuring that the Solr index tracks behind the database as closely as possible. At Fanfeedr’s current scale, this latency is about a minute, mainly due to the need to avoid stacking Solr commits, but as the system grows, attention will be required to keep that latency minimal, and reduce it to as close to zero as possible.
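A minimal sketch of this polling loop follows, assuming a `content` table with `id`, `title`, and `updated_at` columns and an in-memory SQLite database standing in for the real one. The table layout, field names, and sample rows are hypothetical; only the `<add><doc><field name="...">` envelope is Solr's standard XML update format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def fetch_updated(conn, since):
    """Return rows changed after `since` (ISO-8601 strings sort correctly)."""
    cur = conn.execute(
        "SELECT id, title, updated_at FROM content "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return cur.fetchall()


def build_add_xml(rows):
    """Build a Solr <add> message with one <doc> per row."""
    add = ET.Element("add")
    for doc_id, title, updated_at in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in (("id", doc_id), ("title", title),
                            ("updated_at", updated_at)):
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample data standing in for the aggregator's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?)",
    [
        (1, "Jeter homers", "2009-06-01T12:00:00"),
        (2, "Yankees win", "2009-06-01T12:05:00"),
    ],
)

last_seen = "2009-06-01T12:01:00"   # timestamp of the previous poll
rows = fetch_updated(conn, last_seen)
xml_message = build_add_xml(rows)   # only story 2 is newer than last_seen
if rows:
    last_seen = rows[-1][2]         # advance the high-water mark
```

Tracking a high-water mark on `updated_at` means each poll re-fetches only what changed since the previous one, which is what keeps the index close behind the database.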
Most of Fanfeedr’s data needs to be not only indexed by Solr but also stored and returned by Solr, including relationships between documents. This is done in order to eliminate database queries whenever possible, and represents a practical decision: Lucene/Solr scales more easily than the database and, dollar for dollar, requires less hardware for the same performance.
Fanfeedr currently has 750,000 documents in its index; 15,000 of these are categories, and the rest are content. The index grows by 5,000 to 10,000 documents a day, and the rate is increasing. Indexing user-generated content will add significantly to the challenge as user volume increases.
Thus far, the need for real-time indexing has ruled out using Solr’s built-in index-replication system, mainly because of the limitations of rsync in the face of big indexes. To a large extent, the Index Syndicator is a workaround for this problem. It simplifies indexing, since the Indexer script still only needs to send the XML to a single server, yet all servers are actively building their own indexes. The servlet handles updating its peers without any additional effort from the developer. Scalability is readily addressed: Adding a new Solr server requires adding a one-line configuration property to include the new machine in the servlet’s tracked-server list. It also simplifies server deployment, since every server is ideally configured in the same way.
In effect, Fanfeedr flattens its relational database into Solr. Any given content item on the site can be related to several, and sometimes hundreds of, categories. At index time, these relations are selected from the database and indexed in a multi-valued field. This allows, for example, users to search for all content related to the player “Derek Jeter.” More importantly, it allows all content that falls into multiple categories to be very easily built into customized “FanFeeds” based on what users define as their filters. With this system, a user’s highly customized “FanFeed” view on the site becomes a single efficient Solr search. Doing an equivalent query with a relational database might easily become a performance nightmare.
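The flattening step and the resulting single-query feed can be sketched as follows. The field name `category_ids`, the sample IDs, and the helper names are assumptions made for illustration; the query string uses ordinary Lucene syntax, and `matches` is a local stand-in for what Solr would evaluate against the multi-valued field.

```python
from collections import defaultdict


def flatten(content_rows, relation_rows):
    """Join content with its category relations into one flat document each."""
    categories = defaultdict(list)
    for content_id, category_id in relation_rows:
        categories[content_id].append(category_id)
    return [
        {"id": content_id, "title": title,
         "category_ids": sorted(categories[content_id])}
        for content_id, title in content_rows
    ]


def fanfeed_query(category_ids, field="category_ids"):
    """Build a Lucene-syntax query matching any of the user's categories."""
    terms = " OR ".join(str(c) for c in sorted(set(category_ids)))
    return f"{field}:({terms})"


def matches(doc, category_ids):
    """Local stand-in for Solr's match against the multi-valued field."""
    return bool(set(doc["category_ids"]) & set(category_ids))


# Hypothetical IDs: 10 = Derek Jeter, 20 = Yankees, 30 = Celtics.
content = [(1, "Jeter homers"), (2, "Celtics trade")]
relations = [(1, 10), (1, 20), (2, 30)]

docs = flatten(content, relations)
query = fanfeed_query([20, 10])
feed = [d["title"] for d in docs if matches(d, [10, 20])]
```

Because each document already carries all of its category IDs, the feed needs no joins at query time: one query over one field covers every filter the user has chosen.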
Even with these optimizations in place, Sangraal is already preparing for the next round. Opportunities on the drawing board for performance improvement include:
- Continually optimizing and performance-testing Solr itself to identify the best configuration for Fanfeedr’s unique use case.
- Purging, archiving, and/or sharding old content.
- Reducing the amount of indexed and stored data to only the items that are most needed.
- Scaling the search architecture horizontally, utilizing multiple cores and/or Solr installations, a task made easier by the Index Syndication servlet system.
But any critical next steps will await the release of Solr 1.4; when it comes out, Fanfeedr can start utilizing features such as partial-index reopening, which should help with real-time indexing performance. Also of interest to Sangraal is the pure Java implementation of index replication. Indeed, there is the risk that the index syndication method, though an interesting extension to current Solr functionality, might become obsolete with Solr 1.4’s new replication framework.
In addition to supporting growth in the number of users, and concomitant increases in the volume of user-generated content, plans are also afoot to integrate significant new areas of capability into the site. For example, the site’s “reputation economy,” in which reputation points are allocated for such things as correct predictions of game outcomes, is currently in its most nascent stages. This is an area that is slated for additional development.
And in order to further refine the search process itself, Sangraal is investigating an attribute he refers to as “hotness,” which will be a component of the future relevancy algorithm. Hotness is to be based on several internally tracked metrics that quantify users’ actions around a story as well as its currency. In addition to Solr’s text-match scoring system, hotness and some other measures can be used to boost relevant documents at index and/or query time.
Also being considered is the development of a system for computing “internal page-rank” based on how many stories in the system reference another story in the system. These internal page ranks could ultimately be factored into the “hotness” algorithm.
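As a rough illustration of the idea, the simplest form of such a score is an inbound-reference count. The sketch below does only that; it is not the iterative PageRank algorithm, and the function name, the choice to ignore self-links and duplicate links, and the sample story IDs are all assumptions rather than anything Fanfeedr has described.

```python
from collections import Counter


def internal_rank(references):
    """Count inbound references per story, ignoring self-links and duplicates."""
    unique_links = {(src, dst) for src, dst in references if src != dst}
    return Counter(dst for _, dst in unique_links)


# Hypothetical (source story, referenced story) pairs.
links = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "c"), ("c", "b")]
ranks = internal_rank(links)
```

Here story "c" is referenced by two distinct stories and "b" by one, so a count like this could feed into a boost for "c" ahead of "b".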
The Bottom Line
Fanfeedr is already technically well positioned to exploit the business potential of its core idea. Prospective revenue streams depend on, for example, capabilities for lead generation in areas like ticket sales and merchandising. The accuracy of these leads will in turn stem significantly from user search behaviors, so the ability of users to refine their searches, and of the site to track and record them, will become increasingly vital.