Finding just the right thing when shopping can be exhausting. You can spend hours sifting through kind-of-related results only to give up in frustration. Maybe that’s why 79 percent of people who don’t like what they find will jump ship and search for another site.

There has to be a better way to serve customers with better search.

And there is. Learning to rank is a machine learning method that helps you serve up results that are not only relevant but are … wait for it … ranked by that relevancy. 

Search engines have been using learning to rank for almost two decades, so you may be wondering why you haven’t heard more about it. 

Like earlier many machine learning processes, we needed more data, and we were using only a handful of features to rank on, including term frequency, inverse document frequency, and document length.

79 percent of people who don’t like what they find will jump ship and search for another site. 

So people tuned by hand, and iterated over and over. Now the data scientists are the exhausted ones instead of the shoppers. Exhaustion all around!

Today, we have larger training sets and better machine learning capabilities.

But there are still challenges, notably around defining features; converting search catalog data into effective training sets; obtaining relevance judgments, including both explicit judgments by humans and implicit judgments based on search logs; and deciding which objective function to optimize for specific applications.

What Is Learning To Rank?

Learning to rank (LTR) is a class of algorithmic techniques that apply supervised machine learning to solve ranking problems in search relevancy. In other words, it’s what orders query results. Done well, you have happy employees and customers; done poorly, at best you have frustrations, and worse, they will never return.

To perform learning to rank you need access to training data, user behaviors, user profiles, and a powerful search engine such as SOLR.. 

The training data for a learning to rank model consists of a list of results for a query and a relevance rating for each of those results with respect to the query. Data scientists create this training data by examining results and deciding to include or exclude each result from the data set.

This vetted set of data becomes the gold standard that a model uses to make predictions. We call it the ground truth, and we measure our predictions against it.

Pointwise, Pairwise, and Listwise LTR Approaches

The three major approaches to LTR are known as pointwise, pairwise, and listwise.

Pointwise

Pointwise approaches look at a single document at a time using classification or regression to discover the best ranking for individual results. 

Classification means putting similar documents in the same class–think of sorting fruit into piles by type; strawberries, blackberries, and blueberries belong in the berry pile (or class), while peaches, cherries, and plums belong in the stone fruit pile.

Regression means giving similar documents a similar function value, so that we can assign them similar preferences during the ranking procedure. 

During these processes, we give each document points for how well they fit. We add those up and sort the result list. Note here that each document score in the result list for the query is independent of any other document score, i.e each document is considered a “point” for the ranking, independent of the other “points” 

Pairwise

Pairwise approaches look at two documents together. They also use classification or regression — to decide which of the pair ranks higher. 

We compare this higher-lower pair against the ground truth (the gold standard of hand ranked data that we discussed earlier) and adjust the ranking if it doesn’t match. The goal is to minimize the number of cases where the pair of results are in the wrong order relative to the ground truth (also called inversions).

Listwise

Listwise approaches decide on the optimal ordering of an entire list of documents. Ground truth lists are identified, and the machine uses that data to rank its list. Listwise approaches use probability models to minimize the ordering error., They can get quite complex compared to the pointwise or pairwise approaches.

Figure 1 – Learning to (Retrieve and) Rank – Intuitive Overview – Part III

Practical Challenges in Implementing Learning to Rank

You need to decide on the approach you want to take before you begin building your models. 

 Given the same data, is it better to train a single model across the board or to train multiple models for different data sets? How do well-known learning to rank models perform for the task?

You also need to:

  • Decide on the features you want to represent and choose reliable relevance judgments before creating your training dataset.
  • Choose the model to use and the objective to be optimized.

 In particular, the trained models should be able to generalize to:

  • Previously unseen queries not in the training set and
  • Previously unseen documents to be ranked for queries seen in the training set.

Additionally, increasing available training data improves model quality, but high-quality signals tend to be sparse, leading to a tradeoff between the quantity and quality of training data. Understanding this tradeoff is crucial to generating training datasets.

Microsoft Develops Learning to Rank Algorithms

The available options for learning to rank algorithms has expanded in the past few years, giving you more options to make those practical decisions about your learning to rank project. These are fairly technical descriptions, so please don’t hesitate to reach out with questions.

RankNet, LambdaRank, and LambdaMART are popular learning to rank algorithms developed by researchers at Microsoft Research. All make use of pairwise ranking.

RankNet introduces the use of the Gradient Descent (GD) to learn the learning function (update the weights or model parameters) for a LTR problem. Since the GD requires calculation of gradient, RankNet requires a model for which the output is a differentiable function — meaning that its derivative always exists at each point in its domain (they use neural networks but it can be any other model with this property).

Learning to rank (LTR) is a class of algorithmic techniques that apply supervised machine learning to solve ranking problems in search relevancy. In other words, it’s what orders query results. Done well, you have happy employees and customers; done poorly, at best you have frustrations, and worse, they will never return.

RankNet is a pairwise approach and uses the GD to update the model parameters in order to minimize the cost (RankNet was presented with the Cross-Entropy cost function). This is like defining the force and the direction to apply when updating the positions of the two candidates (the one ranked higher up in the list while the other one down but with the same force). As an optimization final decision, they speed up the whole process using the Mini-batch Stochastic Gradient Descent (computing all the weight updates for a given query, before actually applying them).

LambdaRank is based on the idea that we can use the same direction (gradient estimated from the candidates pair, defined as lambda) for the swapping, but scaling it by the change of the final metric, such as nDCG, at each step (e.g., swapping the pair and immediately computing the nDCG delta). This is a very tractable approach since it supports any model (with differentiable output) with the ranking metric we want to optimize in our use case.

LambdaMART is inspired by LambdaRank but it is based on a family of models called MART (Multiple Additive Regression Trees). These models exploit the Gradient Boosted Trees that is a cascade of trees, in which the gradients are computed after each new tree, to estimate the direction that minimizes the loss function (that will be scaled by the contribution of the next tree). In other words, each tree contributes to a gradient step in the direction that minimizes the loss function. The ensemble of these trees is the final model (i.e., Gradient Boosting Trees). LambdaMART uses this ensemble but it replaces that gradient with the lambda (gradient computed given the candidate pairs) presented in LambdaRank.

This algorithm is often considered pairwise since the lambda considers pairs of candidates, but it actually has to know the entire ranked list (i.e., scaling the gradient by a factor of the nDCG metric, that keeps into account the whole list) – with a clear characteristic of a Listwise approach.

Learning to Rank Applications in Industry

Wayfair

In a post in their tech blog, Wayfair talks about how they used learning to rank for the purpose of keyword searches. Wayfair is a public e-commerce company that sells home goods. Search is therefore crucial to the customer experience since. Wayfair addresses this problem by using LTR coupled with machine learning and natural language processing (NLP) techniques to understand a customer’s intent and deliver appropriate results.

The diagram below shows Wayfair’s search system. They extract text information from different datasets including user reviews, product catalog, and clickstream. Next, they use a variety of NLP techniques to extract entities, analyze sentiments, and transform data.

Wayfair then feeds the results into its in-house Query Intent Engine to identify customer intent on a large portion of incoming queries and to send many users directly to the right page with filtered results. When the Intent Engine can’t make a direct match, they use the keyword search model. This is where LTR comes to the rescue.

Figure 2 – Wayfair Search System

In their keyword search approach, Wayfair issues the incoming search to produce results across its entire product catalog. Under the hood, they have trained a LTR model (used by Solr) to assign a relevance score to the individual products returned for the incoming query. Wayfair’s then trains its LTR model on clickstream data and search logs to predict the score for each product. These scores ultimately will determine the position of a product in search results. The model improves itself over time as it receives feedback from the new data that is generated every day.

The results show that this model has improved Wayfair’s conversion rate of customer queries

Slack

The Search, Learning, and Intelligence team at Slack also used LTR to improve the quality of Slack’s search results. More specifically, they built a personalized relevance sort and a section search called top results, which presents both personalized and recent results in one view. Note here that search inside of Slack is very different from typical web search, because each Slack user has access to a unique set of documents and what’s relevant at the time frequently changes.

Slack provides two strategies for searching: recent and relevant. Recent search finds the messages that match all terms and then presents them in reverse chronological order. Relevant search relaxes the age constraint and takes into account how well the document matches the query terms. 

Slack employees noticed that relevant search performed slightly worse than recent search according to the search quality metrics, such as the number of clicks per search and the click-through rate of the search results in the top several positions.

Figure 3 – Top Results for the query “platform roadmap”

Incorporating additional features would surely improve the ranking of results for relevant search. To accomplish this, the Slack team uses a two-stage approach: (1) leveraging Solr’s custom sorting functionality to retrieve a set of messages ranked by only the select few features that were easy for Solr to compute, and (2) re-ranking those messages in the application layer according to the full set of features, weighted appropriately. In building a model to determine these weights, the first task was to build a labeled training set. The Slack team used the pairwise technique discussed earlier to judge the relative relevance of documents within a single search using clicks.

This approach has been incorporated to Slack’s top results module, which shows a significant increase in search sessions per user, an increase in clicks per search, and a reduction in searches per session. This indicates that Slack users are able to find what they are looking for faster.

Skyscanner

Skyscanner, a travel app where users search for flights and book an ideal trip uses LTR for flight itinerary search.

Search is complex and involves prices, available times, stopover flights, travel windows, and more. Skyscanner’s goal is to help users find the best flights for their circumstances.

The Skyscanner team translates the problem of ranking items into a binary regression one. They label their data about items that users think of as relevant to their queries as positive examples and data about items that users think of as not relevant to their query as negative examples. Then, they use such data to train a machine learning model to predict the probability that a user will find a flight to be relevant to the search query. More specifically, the term relevance is defined to be the commitment click-through to the airline and travel agent’s website to purchase it, since this requires many action steps from the user.

Figure 4 – Relevance in flight search: a search result is relevant if you bought it

After applying LTR to the data, they do both offline and online experiments to test the model performance. In particular, they compare users who were given recommendations using machine learning, users who were given recommendations using a heuristic that took only price and duration into account, and users who were not given any recommendations at all. The results indicate that the LTR model with machine learning leads to better conversion rates – how often users would purchase a flight that was recommended by Skyscanner’s model.

These examples show how LTR approaches can improve search for users. 

All three LTR approaches compare unclassified data to a golden truth set of data to determine the how relevant search results are. As data sets continue to grow, so will the accuracy of LTR.