Semantic vector search has become a major component in modern search ecosystems. It ties together catalog data with customer behavior data within your retrieval system and helps you automatically adjust search results to recent trends.
In this talk, we will cover the process of encoding all available multi-modal product data (images, attributes, descriptions) into input features for the relevance model, and training this model using customer behavior data.
As a result, a semantic vector search system will be able to successfully tackle many tough cases such as out-of-vocabulary, thematic, symptom and subjective searches.
Researchers and practitioners in information retrieval, e-commerce search, vector retrieval, and deep learning. Data and ML engineers. Business, product, and engineering leaders responsible for the e-commerce product discovery domain.
Learn how to boost your search experience and get great business impact from all available data in your system.
Stanislav Stolpovskiy, Search Architect, Grid Dynamics
Eugene Steinberg, Technical Fellow, Head of Ecommerce, Grid Dynamics
Hello, welcome to the Semantic Vector Search presentation. My name is Stanislav Stolpovskiy, I'm a Search/ML Architect at Grid Dynamics, and I will be presenting today with my colleague Eugene. I've been working with search for the last 10 years. We've built a lot of big search platforms, both Lucene-based and not. And for the last five years, I've been working with machine learning, and deep learning especially. I love applying machine learning to information retrieval. And I have Eugene with me today. Eugene?
Hello guys. My name is Eugene Steinberg. I'm a technical fellow and founding engineer of Grid Dynamics, and I'm heading our digital commerce practice. Today I will co-present with Stas on a very exciting topic: semantic vector search. We have been very interested in this topic for the last few years and are very glad to be here with you.
Okay, Eugene, go ahead.
So today we are going to talk about search: why, in many situations, e-commerce search and product discovery queries do not provide the results we expect, and how to make search better. As you know, the majority of approaches to e-commerce search typically just try to match terms from the query to the data which we have in our e-commerce product catalog.
Very often, this is the main approach to product retrieval, and to search in our case. However, in many, many cases, some very legitimate queries from customers do not find good matches, because we do not match the data especially well. That often leads to a zero-results page or completely irrelevant results. And of course, we all know that if customers can't find it, they will never buy it. So our goal as search and relevance engineers is to always find relevant products for our customers.
Let's go to the next one. So, why are we getting zero and irrelevant results? There are many reasons, and we can classify them into a few broad categories. Very often we have an out-of-vocabulary search, which means a term our customers are using doesn't exist in our catalog attributes or product descriptions, or in our synonym base or knowledge base. Very often our customers don't even talk about the attributes of the product they want to buy. They are talking about their problem, about the symptoms they have: "I have a problem, what kind of product can help me?" In this case, traditional search is pretty clueless about how to retrieve relevant products which can solve a particular problem.
Sometimes customers are inspired by a particular theme, and they will search for Black History Month or some other kind of thematic query. In this case, again, we often don't have the kind of data in our product catalog or knowledge base to satisfy those queries. There are compatibility queries, when we are searching for something which complements a different product. And there are even non-product searches, when customers are searching for something which does not exist in our catalog and have to be referred to a different part of the site; those are also a problematic kind of search.
So let's look at the next slide and the evolution of product search, which we've built over the last decade or more. It all started with the typical traditional approach of full-text search, when we built systems based on traditional information retrieval techniques. Those techniques were developed for large-article, full-text search scenarios, and the main approach is to find the terms the customer used in the query, match them to the same terms in the data, and then rank the results by the frequency of those matches. Sometimes we augment this kind of solution with large lists of synonyms, trying to improve its ability to understand queries. But it all had a lot of deficiencies: false positive matches, where a term is misunderstood in a different context.
The next big advancement was when we migrated to the idea of concept search. That's when we started to match things, not strings: we started to match particular combinations of terms in the context of a particular field, using things like semantic query parsing. We expanded our synonym base with many different types of linguistic relationships: things like one-way synonyms, instructions not to associate particular terms with particular concepts, and a general knowledge graph about the domain, which helps to better understand customer queries. To this day it remains one of the most powerful techniques in the industry. It's often augmented with business rules to improve relevance, and even more often with machine learning approaches such as learning to rank, which use customer behavior data to rerank the top results and drive up relevance based on the way customers interact with the products.
And the next great idea, which we will be focusing on today, is to fully embrace the power of deep learning and natural language processing: to depart from the idea of matching particular terms in particular data fields inside products and go completely into vector space. The main idea here is to encode both products and queries as semantic vectors. We call these vectors semantic because they carry meaning: similar products will sit very close together in this vector space, and they will be close to their corresponding queries. So we convert the problem of matching terms into the problem of matching vectors, which is a very well-solved problem. To make the whole system work, we will need to train our deep learning and natural language processing models on a lot of domain data and a lot of customer behavior data, to derive the correct vector representations of our products and queries. So that's the story we will be talking about today.
So this is a great idea, but how do we pragmatically apply it to an e-commerce solution? If we look at the typical distribution of e-commerce queries, we will see a combination of head, torso, and long-tail queries. Head queries are usually very popular and very simple: a brand name, a title, combinations of those. Those queries are very well served by the typical traditional solution, where we just match what customers are asking for to particular attributes of a product. However, as we go towards the long tail, we very quickly see that the queries become much more complicated. They start to involve unknown terms, symptoms, and out-of-vocabulary terms.
And this is where vector search, or semantic vector search, really shines. This is where we can prevent zero results and drive relevant results, even in situations where we never saw a particular term in our data, based purely on the deep learning understanding of the query and on the information about customer behavior we can accumulate in the context of similar queries. And Stas will talk about the details of how to make it work.
Thank you. So the core idea under the hood of this process is to have a neural network, a specially trained model, which takes user input as text and converts it into a query vector, which we call an embedding. This special network can take search queries and product descriptions and place them all together into one big multi-dimensional space. The special property of this network is that products which are neighbors will be similar to each other, and the nearest neighbors of a search query will be the relevant products themselves. So if we place everything in the same dimensional space and want to find products for a search query, we just need to run a simple algorithm called KNN, k-nearest neighbor search, and find the product embeddings which are closest to the query embedding. That's how the system retrieves the documents it thinks are relevant for the query.
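To make the retrieval step concrete, here is a minimal sketch of brute-force k-nearest-neighbor search over product embeddings using cosine similarity. The product names and 3-dimensional vectors are toy values invented for illustration; a production system would use a learned, high-dimensional embedding and an approximate nearest-neighbor index rather than an exhaustive scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn(query_vec, product_vecs, k=2):
    """Return ids of the k products whose embeddings are closest to the query."""
    scored = sorted(product_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [pid for pid, _ in scored[:k]]

# Toy 3-d embeddings: the "red" apparel items sit near the red-shirt query.
products = {
    "red-shirt":  [0.9, 0.1, 0.0],
    "blue-jeans": [0.0, 0.2, 0.9],
    "red-dress":  [0.8, 0.3, 0.1],
}
print(knn([1.0, 0.0, 0.0], products, k=2))  # → ['red-shirt', 'red-dress']
```

In practice the exhaustive scan shown here is replaced with an approximate index (e.g. HNSW or IVF) so retrieval stays fast over millions of products.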
But since we're talking about a special model, let's talk about how we can create it. We have lots of models that can convert a query into an embedding; there's a whole family of transformers which take text as input and produce embeddings. But out of the box, none of them works very well for placing relevant products close to queries. We can improve this, as always, by providing proper training data. This training data will consist of three main elements: we will take the search query as an anchor; we will use a product which we think is relevant for this query as a positive sample; and we will use a product which we think is not relevant for this query as a negative sample. Training this network means optimizing the vector distances between the query embedding and the positive and negative embeddings: we bring the query and the positive closer to each other during training, and we push the query embedding and the negative embedding apart, so they end up far away from each other.
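The objective just described is the standard triplet loss. Below is a minimal sketch of it in plain Python, with toy 2-d vectors standing in for real learned embeddings; in a real pipeline this value would be computed by a deep learning framework and backpropagated through the encoder.

```python
import math

def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the positive is not closer to the anchor
    than the negative by at least `margin`."""
    return max(0.0, euclid(anchor, positive) - euclid(anchor, negative) + margin)

# Query embedding (anchor), one relevant and one irrelevant product:
q        = [0.0, 0.0]
relevant = [0.1, 0.0]   # close to the query -> small positive distance
random_p = [3.0, 4.0]   # far from the query -> large negative distance
print(triplet_loss(q, relevant, random_p))  # 0.1 - 5.0 + 1.0 < 0 → loss 0.0
```

When the loss is zero, the triplet is already well separated and contributes no gradient; the interesting triplets during training are the "hard" ones where the negative sits closer to the query than the positive.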
Running this training over many samples, we can reach model generalization: it will not only work for the queries we trained on, it will generalize the whole concept of the query and work even for new queries it has never seen before. And when we're talking about this training data, the biggest challenge is to find the negative samples, because we almost always know which signals we can use to find the positive products. The negatives are where we have a gap: we are not always sure which products are not relevant to a query. There can be new products which have no signals because they are new, or products which have no signals simply because they have never been shown to a user.
So there are several different techniques, starting from the simplest random sampling up to more complex approaches based on intersecting attributes, which we can use to find these negatives. This is one of the challenges we face during training. Also, we don't have to use only text as input. Most papers being published right now explore the setup where you have text from the query and a text description of the product. That's because most open-source datasets have only a single line of text description per product. But in the real world, in an e-commerce catalog, you have much more information you can use to feed this neural network with richer, more enhanced signals. It's not just the product description: it also includes a structured catalog with separate attributes which can be relevant, and it can include product image information, which can also be relevant for some categories.
As an example, if we're talking about the fashion category, product images have a much bigger impact on the result than even text, because visual information is sometimes something you don't even spend time describing. If I'm looking for a tee shirt with red lines, the red lines may not be mentioned in the text description, but they can be present in the image, and that can be a very strong signal for the neural network under the hood. To combine all these signals together, we have to come up with some complex joins for these features. In our use case, we create a relevance model which consists of two big parts: a part which is used in real time, and a part which is used for offline processing. In real time, we focus only on the query embedding: we take the plain text as input, run it through several layers which convert this text into the query embedding, and then use that embedding for search.
As an offline process, we take a product and try to extract all the useful information from it using different encoding techniques, including simple bigram text encoding, transformer encoding, or even state-of-the-art convolutional neural networks to extract features from the images. After we extract all these features, we combine them by joining them together and sending them to a pre-final fully connected layer, which helps us aggregate all this information and set proper weights for all these features during training, based on their importance. And the final layer of this big neural network will be the same layer we use for the real-time query path. If we had two completely separate neural networks, the dimensionality and distribution of the final embedding vectors would be different. But if we use the same layer, they will be the same for the query embedding and for the product embedding, and they will be distributed well inside the same dimensional space when you do the actual search.
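The two-tower structure with a shared final layer can be sketched like this. The sub-encoders below are hypothetical toys (a real system would use trigram/transformer text encoders and a fine-tuned CNN for images); the point is only that both towers feed their joined features through the same final projection, so query and product embeddings land in one comparable space.

```python
def matvec(W, x):
    """Dense layer without bias: W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# --- hypothetical sub-encoders; real ones would be learned networks ---
def text_encoder(text):          # stand-in for trigram/transformer encoding
    return [float(len(text)), float("red" in text)]

def image_encoder(pixels):       # stand-in for a fine-tuned CNN
    return [sum(pixels) / len(pixels)]

# The SAME final projection is shared by both towers, so query and
# product embeddings end up in one comparable 2-d space.
SHARED_W = [[0.1, 1.0, 0.5],
            [0.2, 0.0, 0.5]]

def embed_query(text):
    feats = text_encoder(text) + [0.0]          # queries have no image signal
    return matvec(SHARED_W, feats)

def embed_product(description, pixels):
    feats = text_encoder(description) + image_encoder(pixels)  # feature join
    return matvec(SHARED_W, feats)

q = embed_query("red shirt")
p = embed_product("red cotton shirt", [0.9, 0.8, 1.0])
assert len(q) == len(p) == 2                    # same dimensionality
```

The query tower stays small because it must run at request time; the heavier product tower runs offline, and its embeddings are indexed ahead of time.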
During the several projects we've completed with vector search, we've found several key aspects you should follow when doing this feature join. First of all, don't use pre-trained networks as-is; this applies to transformers and to image embeddings alike. It's always good to fine-tune your network before doing the actual feature extraction. It's also useful to use a visual search neural network for image encoding, because it will provide better features. Another key point is to use character trigrams, because they allow your vector search to handle misspellings as well. As for things you shouldn't do: don't use products with low CTR metrics as negatives. They can confuse the neural network, because sometimes a product is not being clicked not because it's irrelevant, but because the price is too high, the image is bad, and so on.
If we feed the neural network negative samples like that, it will get very confused. So now that we have this neural network, let's talk about how we can use it in a real-world application. There are two big scenarios. First, we can use it as a fallback: when the regular search doesn't return any results, we can use our neural network to show something we think is relevant. This is the safest, easiest, and simplest integration: if we don't have results, show results from semantic search. The more complex use case is when we try to blend results from the regular search and from the semantic vector search. In this case, we can have different strategies: we can compute metrics on how well we matched the query. If we understood it well, we return the regular search results; if we did not understand it well, we return the semantic vector search results. Or we can simply check whether the query is a known head query: if we've seen this query before, we can assume that regular search will handle it better than vector search.
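The routing logic just described can be sketched as a small decision function. The head-query set, the parse-confidence score, and the threshold are all hypothetical placeholders; in production the confidence would come from the query parser and the threshold from offline evaluation.

```python
def choose_retriever(query, head_queries, parse_confidence, threshold=0.6):
    """Decide which retriever handles the query:
    - known head queries go to the traditional engine,
    - queries the parser understands well go to the traditional engine,
    - everything else falls through to semantic vector search."""
    if query in head_queries:
        return "traditional"
    if parse_confidence(query) >= threshold:
        return "traditional"
    return "vector"

# Toy inputs: a tiny head-query set and a stub confidence function.
head = {"iphone", "nike shoes"}
conf = lambda q: 0.9 if "shirt" in q else 0.2

assert choose_retriever("iphone", head, conf) == "traditional"
assert choose_retriever("red shirt", head, conf) == "traditional"
assert choose_retriever("gift for grandma", head, conf) == "vector"
```

A fuller blend would merge and re-rank results from both retrievers instead of picking one, but even this simple router already keeps vector search focused on the long tail where it shines.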
There are many scenarios like this, and it's always challenging to find and compute the proper metrics when you're using a blending method. But it's still a solvable task, and it will give you much better results than just a fallback. So let's summarize. Semantic vector search is where the industry is going right now. Every big company is experimenting with it, and there are lots of success stories. But it's still not a silver bullet: it doesn't solve everything, and it's always tricky with a cold start; you can't use this model from a cold start. But you should definitely invest in it, and you should definitely try it out, because that's the future of search.
And that's our presentation. We will be glad to answer all your questions. Happy searching!