The Power of Semantic Vector Search

Recent advances in Deep Learning allow us to push the boundaries of search engines to new frontiers; one is Semantic Vector Search. By leveraging its functionality, we are able to tackle challenges that classical search engines cannot handle well out of the box. Some examples include underperforming and zero-result queries, natural language questions, recommendations and virtual assistants.

In this talk, we’ll dive into Semantic Vector Search as well as scalable methods used to train cutting-edge deep neural encoders for production. We will also share several use cases showing how we leverage the power of Semantic Vector Search to help Lucidworks customers increase essential KPIs today.

Intended Audience

Researchers and practitioners in information retrieval, ecommerce search, vector retrieval and deep learning. Data and ML engineers. Business, product and engineering leaders responsible for the ecommerce product discovery domain.

Key Takeaway

Learn how to leverage the power of Semantic Vector Search to increase KPIs with several Lucidworks customer examples and use cases.

Speaker

Sava Kalbachou, Research Engineer, Lucidworks


[Sava Kalbachou]

Hi, everyone. My name is Sava Kalbachou. I’m an AI research engineer at Lucidworks and today we are going to talk about the power of semantic vector search. 

Let me start by providing a high-level overview of what semantic vector search is and how it differs from classical keyword-matching search. Classical search engines use an inverted index to store term-document associations. It's very efficient at search time, but it's limited to keyword matching and cannot capture context.
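To make that limitation concrete, here is a minimal inverted-index sketch in Python. It is purely illustrative (not Lucidworks' implementation): exact terms match efficiently, but a synonymous query returns nothing.

```python
# Minimal inverted index: the core data structure of classical keyword search.
from collections import defaultdict

docs = {
    1: "external hard drive 2TB",
    2: "learn spanish audio course",
    3: "portable ssd external storage",
}

index = defaultdict(set)  # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(query):
    """Return ids of documents containing ALL query terms (boolean AND)."""
    postings = [index[t] for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(keyword_search("external drive"))  # {1} -- exact terms match
print(keyword_search("usb disk"))        # set() -- synonyms find nothing
```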

Semantic vector search, on the other hand, encodes text into vectors in such a way that semantically similar texts are located near each other in the vector space, whereas dissimilar texts are far away from each other. In other words, semantic vector search lets us represent arbitrary objects as vectors in some high-dimensional vector space and use a vector similarity function for searching. It has a variety of applications in search, ranking, recommendation systems, face recognition, speaker verification, and so on. For example, here's a 3D projection and visualization of the vector space of e-commerce products obtained from one of our models.

As you can see, there are a lot of distinct groups of similar products. For example, the highlighted group consists of external hard drive products. Here is another group: although in this case the products are from different departments, they're still grouped together because they're about the same thing, learning the Spanish language.
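As a minimal sketch of this encode-and-compare idea, here is what query-to-item search looks like with the open-source sentence-transformers library. The model name and product texts are illustrative assumptions, not Lucidworks' production models.

```python
# Semantic search sketch: encode texts into vectors, rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a general pre-trained encoder

products = ["2TB external hard drive",
            "Spanish language audio course",
            "portable SSD storage"]
product_vecs = encoder.encode(products, normalize_embeddings=True)

query_vec = encoder.encode("usb disk for backups", normalize_embeddings=True)
scores = util.cos_sim(query_vec, product_vecs)[0]  # similarity to each product
best = int(scores.argmax())
print(products[best])  # storage products rank highest despite no shared keywords
```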

Semantic vector search has a lot of advantages. First of all, it has rich semantic capabilities, which lead to much better relevancy. By design, it doesn't rely on keyword matching, as classical search engines do, so it handles a lot of problems out of the box, such as misspelled queries, synonyms, phrases, and underperforming, long-tail and zero-result queries. Moreover, in many ways it can act as a recommender system: we can get all kinds of similarities, not only query-to-item but also item-to-item, query-to-query and so on.
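For instance, because items live in the same vector space, the very same product vectors can double as an item-to-item recommender. A small sketch, with random stand-in vectors purely for illustration:

```python
# Item-to-item similarity over the same vectors used for query search.
import numpy as np

item_vecs = np.random.rand(5, 128)  # stand-in for encoded product vectors
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)  # unit length

def similar_items(item_id, k=3):
    scores = item_vecs @ item_vecs[item_id]  # cosine similarity (unit vectors)
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked if i != item_id][:k]

print(similar_items(0))  # "customers also viewed" style recommendations
```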

And in general, potentially anything can be encoded and mapped into the same vector space, even objects of a different nature, like text and images. Fortunately, training data isn't strictly required to get started: there are plenty of general pre-trained encoders for cold-start scenarios. But if training data can be obtained, it can be used to improve relevancy even further, since models are directly trained to rank higher the items that users will most likely interact with. There are a few particular challenges though.

Semantic vector search requires considerably more hardware resources. There is also some lack of explainability: although visualizations can be built, like the one we saw a couple of slides ago, it still might be hard to explain why some objects are grouped together. There is a lack of real-time control: there is no simple way to change a particular behavior of a trained and deployed encoder. It is also quite challenging to apply to long documents: typical deep learning models perform great at the sentence or paragraph level but struggle to efficiently encode long documents.
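One common workaround for the long-document challenge, sketched below as an assumption rather than anything this talk prescribes, is to encode a document in chunks and aggregate the chunk vectors:

```python
# Chunk-and-pool sketch for long documents; `encode` is any hypothetical
# sentence/paragraph encoder returning a NumPy vector.
import numpy as np

def encode_long_document(text, encode, max_words=100):
    words = text.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    chunk_vecs = np.asarray([encode(c) for c in chunks])  # one vector per chunk
    doc_vec = chunk_vecs.mean(axis=0)                     # simple mean pooling
    return doc_vec / np.linalg.norm(doc_vec)
```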

All right, but how do we vectorize our data? What do we use for encoding text into vectors? As you might know, the most common way to do text search is to use TF-IDF, and variations of this formula like BM25 are used in all major classical search engines. Interestingly, we can also use it for vectorizing our text data into big sparse vectors, but as it relies solely on word matching, it cannot account for any semantic similarities between texts. Fortunately, in 2013, the Word2Vec algorithm was introduced, which allowed us to learn semantically rich word embeddings, so that words with similar meanings have similar vectors.
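A quick sketch of both representations, using scikit-learn for TF-IDF and gensim for Word2Vec; the toy corpus is only for illustration (real Word2Vec embeddings need far more data):

```python
# Sparse TF-IDF vectors vs. dense Word2Vec word embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = ["external hard drive",
          "portable external storage drive",
          "learn spanish course"]

# TF-IDF: one dimension per vocabulary word; pure word matching, no semantics.
tfidf = TfidfVectorizer()
sparse_vecs = tfidf.fit_transform(corpus)
print(sparse_vecs.shape)  # (3 documents, vocabulary size)

# Word2Vec: dense embeddings where related words get similar vectors.
tokenized = [doc.split() for doc in corpus]
w2v = Word2Vec(tokenized, vector_size=32, min_count=1, epochs=50)
print(w2v.wv.most_similar("drive", topn=2))
```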

Unfortunately, although Word2Vec embeddings capture some global word contexts, they don't account for word ordering in a particular text, thus losing some of the semantic capabilities. This is why we go even further by using neural encoders: we train deep learning models that are able to account for word ordering and produce semantically rich embeddings for texts. To achieve that, we use metric learning techniques. Metric learning is an area of machine learning with the goal of learning a similarity function that measures how similar or related two objects are. By using metric learning, we are able to train neural networks to encode arbitrary objects into the same high-dimensional vector space in such a way that similar objects are as close as possible to each other, whereas dissimilar objects are far away from each other.

The training data can be obtained from a variety of sources. It can be existing FAQ and knowledge base articles, customer support logs and history, or signals of users' interactions with search results. Here's a great picture that provides a high-level overview of the deep metric learning training procedure. We start with the original vector space that we get right after encoder initialization. It can be one encoder, if the objects are of the same nature, like texts, or two separate encoders if you have different types of data, like text and images. Then we choose a metric function, which can be, for example, Euclidean distance; cosine similarity is another very popular choice. The goal of the objective function is to minimize the distance between similar objects and maximize the distance between dissimilar ones.
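One standard form of such an objective is the contrastive loss, sketched here in PyTorch; the talk doesn't specify Lucidworks' exact loss function, so treat this as an illustrative assumption:

```python
# Contrastive loss: pull similar pairs together, push dissimilar pairs
# at least `margin` apart in Euclidean distance.
import torch.nn.functional as F

def contrastive_loss(vec_a, vec_b, is_similar, margin=1.0):
    """is_similar: tensor of 1.0 for similar pairs, 0.0 for dissimilar pairs."""
    dist = F.pairwise_distance(vec_a, vec_b)                # Euclidean distance
    pull = is_similar * dist.pow(2)                         # shrink for similar pairs
    push = (1 - is_similar) * F.relu(margin - dist).pow(2)  # grow up to the margin
    return (pull + push).mean()

# Training step sketch: encode both sides of each pair (one shared encoder,
# or two encoders for mixed modalities) and backpropagate:
#   loss = contrastive_loss(encoder(batch_a), encoder(batch_b), labels)
#   loss.backward(); optimizer.step()
```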

With that objective function, we train our encoder on pairs of objects. At the end, we get a learned vector space where similar objects are grouped together. Now let's take a look at how semantic vector search can be implemented in production, using the high-level architecture of what we build at Lucidworks. In our platform, on the modeling side, we have general pre-trained encoders that can be used right out of the box for cold-start scenarios, but we also have training capabilities that allow us to train encoders on customers' domain-specific data to achieve even better, personalized vector search results. The training sets can be built from customers' existing data, like FAQ pairs or call center logs, as well as from collected search signals, like users' interactions with products.

Once models are obtained, they are deployed into the cluster for real-time usage. During the data ingestion process, our index pipeline calls the deployed models to encode incoming data into vectors. Those vectors are then stored in a vector similarity search engine, whereas companion data is stored in metadata storage. At query time, the query pipeline interacts with the deployed models to encode an incoming query into a vector. The query vector is then sent to the vector search engine for a fast and efficient vector similarity search against the indexed documents. Finally, the data of the most relevant documents is retrieved from the metadata storage and sent back to users. Signals of users' interactions with these search results are captured and constantly used in the improvement loop for relevancy fine-tuning.
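Here is an end-to-end sketch of that ingest/query flow, with FAISS standing in for the vector similarity engine and a plain dict for the metadata storage; the component choices are illustrative assumptions, not Lucidworks' actual stack:

```python
# Index pipeline: encode documents, store vectors plus companion metadata.
# Query pipeline: encode the query, run vector search, fetch metadata.
import faiss
import numpy as np

dim = 384                          # must match the encoder's output size
index = faiss.IndexFlatIP(dim)     # inner product == cosine on unit vectors
metadata = {}                      # doc id -> companion data

def ingest(docs, encode):
    vecs = np.asarray([encode(d["text"]) for d in docs], dtype="float32")
    faiss.normalize_L2(vecs)
    start = index.ntotal                   # ids FAISS will assign next
    index.add(vecs)                        # vectors go to the vector engine
    for i, d in enumerate(docs):
        metadata[start + i] = d            # companion data to metadata storage

def search(query, encode, k=5):
    qvec = np.asarray([encode(query)], dtype="float32")
    faiss.normalize_L2(qvec)
    scores, ids = index.search(qvec, k)    # fast vector similarity search
    return [metadata[int(i)] for i in ids[0] if i != -1]
```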

Now, let me show you a few applications where we use semantic vector search at Lucidworks. First, we have a product called Smart Answers. It utilizes deep learning-based vector search to answer questions, and it can be used to power chatbot and virtual assistant applications. For example, here is a customer support demo app where users ask natural language questions about backup capabilities. The user's question here is how to speed up the backup. As you can see, semantic search is capable of inferring the meaning of the question and suggesting the right answer despite the different wording used: "speed up" in the question versus "is slow" and "troubleshoot the slowness" in the answer.

Here's another example with the query "can I see uploaded data?" Vector search is able to understand the intent of the question and suggest the most relevant answer, which explains that it takes up to five minutes for the uploaded data to appear in the knowledge reports. Another great example is the Never Null product that we recently built. It uses semantic vector search capabilities to ensure that users find exactly what they're looking for, and because it operates in the semantic vector space, it drastically reduces the number of zero-result searches. Here are a few examples of how it works on an e-commerce website, compared with classical keyword-matching search. Let's start with a simple example: misspelled queries. The query "drifter" is a real English word, but the brand Driftr doesn't have the letter E at the end. Standard search would fail to return anything unless specific manual rules are added, whereas semantic vector search is capable of handling such misspellings out of the box.

Here's another, more challenging misspelling example, with swapped letters and an additional letter. Yet again, semantic vector search handles it very well. But things get more interesting when users search for brands that aren't in the product catalog. Kuhl doesn't carry products from the Samsonite brand, yet semantic vector search is capable of returning the most relevant products from other brands in the catalog. Similarly, if the user searches for ChapStick, semantic vector search can understand that the user wants to buy lip balm and returns the only available product from the catalog. And if users search for some abstract clothing style, or use slightly different wording, like the query "Outlaw", semantic vector search can infer the semantics and suggest the most relevant products, such as "Above The Law" and "The Lawless".

Now, as we are building a new SaaS platform, we want to incorporate machine learning into every application. That will allow us to automatically optimize models based on prescribed KPIs for each use case and to leverage the power of semantic vector search right out of the box with built-in models, but it will also allow us to train custom models for a personalized customer experience. We're also working on making it automated end-to-end, for a hassle-free experience with standardized data inputs. And we certainly want to expand semantic vector search with advanced features like multilingual and cross-lingual capabilities, and even multimodal search, which allows us to incorporate additional information like images or structured content.

And I think that's all from me for today. Thank you very much for attending this presentation. Please feel free to reach out to me with any questions.