Table Stakes ML for Smart Search

Presented at virtual Activate 2020. Maximizing business value in a digital environment where search queries are the primary view onto your users’ needs dictates that Search is a Machine Learning problem. But how much “ML” is really needed these days, and what kind of infrastructure is required to support all this ML?

In this talk, Jake discusses a short list of what is required (in terms of product feature, architectural component, and engineering technique) for our search engines to get a “seat at the table” of our users’ highly divided attention.

Speaker:
Jake Mannix, Search Relevance Architect, Salesforce.com, Inc.

Attendee Takeaway:
For some, these will be reminders that yes: you need a Feature Store, personalization, session history, and the like. For those new to ML-in-Search, it’ll be a short list of places to start (but it might be 3-5 years before you’ve added it all!)

Intended Audience:
Engineers, Data Scientists, and Product Managers for Search teams, Executives, and Managers making Search Engine tech purchasing decisions


Transcript

Jake Mannix:  As a search relevance architect, people often ask me, how do I make the search functionality of my site or app or startup smarter? What is required versus what can come later?

Over the years of responding to these questions, I’ve cataloged a list of table-stakes smart search features and the architectural components to support them, and I’ll share these with you today.

But before we dive in, we have to ask: what makes search smart? Well, good search relevance, right? But what is that? To maximize business value in a search-driven flow, you have to be able to predict, with as much accuracy as possible, what your users will do after querying you. Will they click on a record to go to another page? Will they actually add to cart or buy an item on your eCommerce site? Will they share your article on social media? Whatever business outcome you’re trying to encourage, the best way to get users to do more of what you want is to have a good idea of what they’ll do, given what you show them.

Thus search relevance is a complex data science problem built on detailed instrumentation to learn predictive models based on your users, their questions, and the potential answers.

That’s a quote from me a couple of years ago at another conference, but that’s okay.

Jumping in: for search to be smart, it must be conversational. By this I mean something very specific. A search session is not typically a single query and a single set of responses. If I’m looking for “white black tie coat”, i.e., a white dinner jacket for a formal black-tie event, and you issue that to an untuned Lucene/Solr-based system, you’re likely going to be disappointed by the first results.

If you’re patient, perhaps you’ll try to reformulate your query a few times: “black tie coat that is white”, or even how I described it earlier, “white dinner jacket for a formal black tie event”. Hopefully you won’t get into a chair-throwing match with your search engine.

Incidentally, it turns out that Google not only understands this difficult query, but correctly returns different results for “white black tie coat” versus “black white tie coat”.

On the other hand, Amazon does not handle this query correctly. They do handle white dinner jackets, but including “black tie” throws them off completely. My main point about making search conversational is perhaps better phrased as “session aware”, or “no short-term memory loss”.

Remember what your users are querying; they’re trying to teach you what they mean. Some of you may say, no, no, no, no, I’d solve this particular kind of query with X or Y or Z technique from natural language processing, and it would do great. I’d agree: search should be natural. Long gone are the days of simply relying on stop words, fancy tokenizers, stemmers, and lemmatizers.

Table stakes for natural search go even beyond statistical phrase identification. Take a query like “how to open an account if I have one already”: if you run it through a full NLTK stop-word list, you’re likely to get a filtered query of the form “open account already”, which, while technically what the user is asking about, has lost the structure of the question.
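To make that concrete, here’s a minimal sketch of what blunt stop-word filtering does to this query, assuming NLTK’s standard English stop-word list (the exact output depends on the list you use):

```python
# A minimal sketch of why blunt stop-word removal hurts: filtering the query
# "how to open an account if I have one already" with NLTK's English list
# strips most of the words that carry the user's intent.
# (Requires: pip install nltk, then nltk.download('stopwords') once.)
from nltk.corpus import stopwords

query = "how to open an account if I have one already"
stop_words = set(stopwords.words("english"))

filtered = [tok for tok in query.lower().split() if tok not in stop_words]
print(" ".join(filtered))
# Roughly "open account one already" -- the exact output depends on the NLTK
# version's stop-word list, but the conditional structure is gone either way.
```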

Frankly, a simple keyword-based system is unlikely to succeed with that. These days users expect to be able to use queries like this, or even, my shameless plug, “Maria’s top open opportunities in Seattle with stage Prospecting which were modified this month”, because Einstein Search at Salesforce can indeed handle this particular complex natural query.

You may note in that example, there was a pretty blatant ambiguity in the second case. Who the heck is Maria? There are probably many Marias in my company. It’s a pretty common name. Well, yes, but perhaps there’s only one Maria on my team who I interact with or query about frequently.

If your search engine is personalized, you’re gonna be able to handle that. Even in the enterprise, there’s a notion of the social graph a lot of the time. Beyond a simple user to user social graph, personalization means storing recent and not so recent queries and other interactions of your users and possibly knowing how to segment or cluster them into similar demographic groups.

It also means building user interest profiles, whether sophisticated ones or simple counting-based ones, and keeping track of your users’ queries and other interactions.
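As a toy illustration of the “simple counting-based” end of that spectrum, here’s a minimal sketch; the helper names and event shape are made up for illustration:

```python
# A toy counting-based user interest profile: count the terms a user queries
# for and interacts with, and read off the top-k as their "interests".
from collections import Counter, defaultdict

profiles = defaultdict(Counter)  # user_id -> term counts

def record_interaction(user_id, terms):
    """Increment this user's term counts for one observed query or click."""
    profiles[user_id].update(terms)

def top_interests(user_id, k=5):
    """The user's k most frequent terms -- a crude but useful interest profile."""
    return profiles[user_id].most_common(k)

record_interaction("u42", ["opportunities", "seattle", "prospecting"])
record_interaction("u42", ["opportunities", "pipeline"])
print(top_interests("u42"))  # [('opportunities', 2), ('seattle', 1), ...]
```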

Which brings me to a reminder that search must be responsive. To run a natural and personalized smart search system, we need to hear what our users are asking and understand whether what they do is what they truly wanted to do. Is a click always a positive signal? Not always. What if they go back and look at the search results a second time? They found the result was wrong, so they abandoned that page. What if they add an item to their cart, then remove it from their cart? Is that a positive signal? It may depend on your business’s analytics. This is yet another thorny data science problem, actually, but you don’t wanna have to drive it by hand: engineers cost way too much for that.
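Here’s a hedged sketch of what mapping those events to labels might look like; the event types and weights are illustrative assumptions, and the right mapping depends on your own analytics, as noted above:

```python
# A hedged sketch of mapping raw interaction events to graded relevance labels.
# The event types and weights below are illustrative, not a standard.
def label_event(event):
    """Map one logged search interaction to a relevance label in [0, 1]."""
    if event["type"] == "purchase":
        return 1.0                                   # strongest positive signal
    if event["type"] == "add_to_cart":
        # an add-to-cart that was later removed is a much weaker signal
        return 0.2 if event.get("removed_later") else 0.7
    if event["type"] == "click":
        # a click followed by a quick return to the results page is arguably
        # a negative signal (the user "pogo-sticked" back)
        return 0.0 if event.get("returned_to_results") else 0.4
    return 0.0                                       # impression with no interaction
```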

You’re gonna need to teach your machines to learn this. Most of what I’ve been describing requires machine learning precisely because you probably don’t have as much data as Google. As alluded to earlier, they can often get away with counting as their machine learning algorithm. You can too sometimes, but when you can’t, your more sophisticated ML models need to be modular. Just like with the rest of software engineering, you need loosely coupled subsystems for ML as well. There’s a balance to strike here, though: totally uncoupled leads to underpowered models and duplicated effort.

Tightly coupled leads to unstable systems that are brittle and fragile. After doing a little multivariate testing with this slide, I found that breaking my one-slide-per-concept pattern was necessary for this ace-in-the-hole concept of ML modularity.

Let’s dig in a tiny bit.

Once you start down the path of what I’ve described so far, you may have many components: personalization modules with collaborative filtering layers and user-ID embeddings, deep learning models, text classifiers of different types, embeddings of various document and query encodings, and multiple ranking layers across the latency/quality tradeoff spectrum, from a linear model deep inside your Lucene layer, up through L2 and L3 re-rankers, to whole-page relevance at the top level. Any amount of ML coupling between these will entail that these models feed into each other.

You’ll have query classification probabilities as features in your ranking model. You’ll have ranking scores as features in your text-to-SQL engine. You’ll have personalization layers and recommenders as shared components in many different models.

To give you a few different examples of what I’m talking about let’s look at some pitfalls in the modularity landscape, the good, the bad, and the ugly.

The good case. The query understanding team builds an awesome personalization embedding model. You take it, you save that frozen embedding layer and you share it with the ranking team where they can fine tune a ranking model without changing the weights of the underlying personalization embedding model. Win-win for both teams and both products.
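Here’s a minimal PyTorch-flavored sketch of that pattern, assuming the shared component is a user-ID embedding; the sizes, names, and ranking head are illustrative, not Salesforce’s actual models:

```python
# A minimal sketch: the ranking team reuses the personalization team's
# user-embedding layer with its weights frozen, and trains only a ranking
# head on top of it.
import torch
import torch.nn as nn

# Pretend this was trained and published by the query-understanding team;
# in practice you'd load their published weights, e.g. via load_state_dict().
user_embedding = nn.Embedding(num_embeddings=100_000, embedding_dim=64)
for p in user_embedding.parameters():
    p.requires_grad = False  # freeze: ranking training won't touch these weights

class Ranker(nn.Module):
    def __init__(self, shared_embedding, n_doc_features=20):
        super().__init__()
        self.user_emb = shared_embedding               # shared, frozen component
        self.head = nn.Linear(64 + n_doc_features, 1)  # the part the ranking team trains

    def forward(self, user_ids, doc_features):
        u = self.user_emb(user_ids)                    # frozen personalization signal
        return self.head(torch.cat([u, doc_features], dim=-1))

ranker = Ranker(user_embedding)
# Only trainable (head) parameters go into the optimizer:
optimizer = torch.optim.Adam(p for p in ranker.parameters() if p.requires_grad)
```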

Bad case. The document understanding team builds a sophisticated text-to-vector encoder, but it’s highly specialized on long-form documents and doesn’t work on queries, so it can’t be shared with the query-side teams.

The ugly example. This is a real-world example we’ve run into at Salesforce. The whole-page relevance model gets trained using the level-three re-ranking scores as input features, but then the ranking team launches a new ranker whose scores are now scaled between 0 and 1 instead of 1 to 100. The whole-page relevance model breaks entirely, unless you carefully recalibrate your models.
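One defensive pattern (a sketch, not necessarily what Salesforce did) is to feed the downstream model a scale-free transform of the upstream scores rather than the raw values:

```python
# Feed the whole-page model within-result-set ranks instead of raw L3 scores,
# so rescaling the upstream ranker from 1-100 to 0-1 doesn't change its features.
import numpy as np

def rank_normalize(scores):
    """Replace raw scores with within-result-set percentile ranks in [0, 1]."""
    ranks = scores.argsort().argsort()   # rank of each score in this result set
    return ranks / max(len(scores) - 1, 1)

old_scale = np.array([87.0, 42.0, 13.0, 99.0])   # L3 scores on a 1-100 scale
new_scale = old_scale / 100.0                    # same ranker, rescaled to 0-1
assert np.allclose(rank_normalize(old_scale), rank_normalize(new_scale))
```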

It gets tricky. There are a lot of problems you can run into when coupling the models together, but if you don’t do it, each of them is too underpowered to drive proper search relevance.

Let’s say we’re gonna start diving into building these features for your search engine and addressing all these tricky data science problems. What kind of architecture do you need as support? It should go without saying that you need solid data pipelines, but it bears repeating: treat relevance feedback from your UI tier as a first-class citizen in your data landscape, right up there with your find and add-to-index APIs.

It should also go without saying that you need compute for training your models somewhere scalable and secure. You’re not having your data scientists train on their desktops, are you?

You also need the full life cycle of data transformations on top of your data pipelines to compute and eventually populate low-latency feature stores with data which doesn’t necessarily belong in the search index, but is used at query time.

Which brings me to data-driven products. These require a couple of different kinds of data storage: not just your inverted index, which is the bread and butter of search, but also a document store, which stores the uninverted form of the original documents.

This is used for learning a document understanding model, as well as for providing raw features at re-ranking time that you may not want actually stored in your index for space reasons. You fetch, say, the top 50 results from a first retrieval layer, pull their raw data from the document store, and feed it all into a re-ranker: a model that is maybe too slow to run over every document, but is perfectly applicable to the top 50.
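Here’s a minimal sketch of that two-stage pattern; `search_index`, `document_store`, and `heavy_ranker` are stand-ins for your own components, not any particular product’s API:

```python
# Cheap first-pass retrieval over the index, raw fields from the document
# store, and an expensive model run over only the top candidates.
TOP_K = 50

def search(query, search_index, document_store, heavy_ranker):
    candidate_ids = search_index.retrieve(query, k=TOP_K)             # cheap first pass over the index
    docs = document_store.fetch(candidate_ids)                        # uninverted raw fields, not in the index
    scored = [(heavy_ranker.score(query, doc), doc) for doc in docs]  # slow model, but only 50 docs
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]
```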

You’ll also need a feature store, which stores all the features keyed on a given query that comes in. You fetch the features of the query, maybe all the natural language features or the personalization features: the embedding for the user ID that you wanna pull out, the most recent queries the user did, information from the session, or the features of the documents you wanna re-rank.
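A hedged sketch of what that query-time assembly might look like; the key scheme and feature groups are assumptions made for illustration:

```python
# Assemble ranking inputs from a low-latency feature store at query time.
def build_ranking_features(feature_store, user_id, query, doc_ids):
    return {
        "query": feature_store.get(f"query:{query}"),                 # e.g. NL / intent features
        "user": feature_store.get(f"user:{user_id}"),                 # e.g. embedding, recent queries, session info
        "docs": {d: feature_store.get(f"doc:{d}") for d in doc_ids},  # per-document features to re-rank with
    }
```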

A feature store is kind of a table stakes architectural component that wasn’t thought about maybe five, six years ago.

Lastly, a model store. Because no one wants their models stuffed into Lucene or, worse, serialized into ZooKeeper. You definitely want no homeless models.

For efficient model serving, inference should live outside of Lucene and Solr. That way you can get low-latency, independently scalable predictions.

What do I mean by that? It means that if your index size starts growing by large amounts and you don’t retrain a model, you don’t wanna end up having to replicate your model onto all the different index shards.

On the other hand, if you shift from a small linear model up to a deep learning model, you don’t wanna overburden all your index shards with that scaling change either.

You want them to be able to scale independently, so you can tune your relevance-versus-latency trade-off at will.
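A minimal sketch of inference living outside the search nodes: the query service calls a separately deployed, separately scaled serving endpoint. The URL and payload shape are hypothetical, not a specific serving framework’s API:

```python
import requests

def score_candidates(features, endpoint="http://model-serving.internal/score"):
    # Call the model-serving tier under a tight latency budget; it scales
    # independently of the index shards.
    resp = requests.post(endpoint, json=features, timeout=0.05)
    resp.raise_for_status()
    return resp.json()["scores"]
```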

Lastly, you want to ensure that you have low training/serving skew. This is a tricky data science problem that mostly means that when you’re running inference at runtime, you’re supplying the same features, with the same pre- and post-processing, that you had at training time.

This can be a major nightmare when done wrong, and very tricky to debug. But if you have a proper online/offline feature store that ensures features are available at training time and runtime in the exact same form, with the same pre-processing steps, you’ll be able to avoid this. You’ll also need a way to validate that a model loaded from the model store into your decoupled serving layer behaves the same as it did in training.

Then you’ll be good here.
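As a concrete illustration of the “same pre-processing at training time and runtime” point, here’s a minimal sketch; the feature names are illustrative:

```python
# Training and serving both import this one preprocess() function, so features
# are computed identically offline and online.
def preprocess(raw):
    """Single source of truth for feature transforms, shared by both paths."""
    return {
        "query_lower": raw["query"].lower(),
        "query_len": len(raw["query"].split()),
        "days_since_last_visit": min(raw.get("days_since_last_visit", 365), 365),  # same clipping everywhere
    }

# Offline: training examples are run through preprocess() before being written out.
# Online: the serving layer calls the identical preprocess() on the live request.
```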

In closing, I would be remiss if I didn’t have another couple of shameless plugs. You should try out Einstein Search in Salesforce’s flagship CRM product, and you’ll see in action the outcome of many of these practices: natural language search, personalization, and so forth.

You won’t actually be able to see the architectural components, but they’re hiding under the hood. If you wanna learn more about how to build some of this stuff yourself, you should also see our talk on ml4ir, our new open source deep learning library for a variety of search problems, later in this conference.
