The best search experiences feel like they’re one step ahead of you. When you type “shorts” into the search box at your favorite clothing retailer, you get a page full of the short pants you were looking for, without seeing short-sleeved shirts or short skirts or, somehow, skorts.
Creating that seamless interaction requires a system that’s part psychic and part translator. That’s because the words that people type into the search bar don’t always reflect what they’re after. They can be riddled with ambiguity: Does a “hot dress” mean one that’s sexy? Or one that’s popular? Or one that can be worn during the summer? Vague or incomplete queries abound, and although typos generally can be handled, you may be surprised at how many ways someone can misspell “mattress.”
Those missed connections may seem trivial, but they pose a considerable threat to your business. Visitors who used an on-site search tool converted 1.8 times better than the industry average, according to a report by eConsultancy. In the same study, visitors using search were responsible for 13.8 percent of the revenues of the sites studied.
But when people don’t see what they want on the first page of search results (or, even worse, get the “your search returned no results” message), they won’t take the time to check their spelling, try another word, or narrow the results with a filter. They simply leave, even if your site has what they’re looking for.
“In the absence of good search results, the average user will simply think you don’t have what they’re looking for and will go somewhere else,” said Lauryn Smith, a senior user experience researcher at Baymard Institute.
Smith pointed to one of Baymard’s reports that showed almost one-third (31 percent) of test subjects were either unable to find what they wanted or abandoned their search. And nearly two out of three times (65 percent), it took more than one attempt for people to find what they were looking for.
There are any number of ways that a search can go wrong, but the good news is that there’s a solution that catches nearly all of those errors. A system that can automatically detect synonyms can turn those ambiguous queries into precise results. And search is key to customer experience.
Why Machines Create Better Synonym Lists
It’s possible to create a synonym list on your own. Plenty of resources provide related words to pull from, plus you’re likely already an expert on all the terms around your particular business. The most common way to create a synonym list by hand is by looking at the search terms people use on your site. Even with a list containing thousands of words, you could do pretty well matching up misspellings and writing redirects for related words so that a search for “computer monitor” returns results for “display.”
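To make the manual approach concrete, here is a minimal sketch of a hand-maintained rule table. The rule entries and the function name rewrite_query are made up for illustration; a real site would keep thousands of rules and would likely store them in the search engine's own synonym configuration rather than in application code.

```python
# Hypothetical hand-written rules: each key is rewritten to its value.
MANUAL_RULES = {
    "computer monitor": "display",  # redirect for a related term
    "matress": "mattress",          # known misspelling
}


def rewrite_query(query, rules=MANUAL_RULES):
    """Lowercase the query, collapse whitespace, then apply a rule if one matches."""
    normalized = " ".join(query.lower().split())
    return rules.get(normalized, normalized)


print(rewrite_query("Computer  Monitor"))
print(rewrite_query("mattress"))
```

Every new product name or misspelling means another entry in that table, which is where the maintenance burden comes from.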
All of that, however, takes time. And your initial set of rules may prove difficult to keep up to date as product names change and new products are introduced. The Shopper-First Retailing study by Salesforce found that 69 percent of shoppers expect to see new merchandise at a site or store whenever they visit, and 75 percent of site search queries use new terms each month.
Manual updates also put a huge burden on your staff to make judgment calls. Peter Curran, the president and co-founder of ecommerce technology company Cirrus10, pointed out in a recent Q&A session with Lucidworks how a term as simple as “leopard print” could lead to mismatched results.
“In the results from a website I showed … I get prints, as in wall art, with various animal images in my search results, but I don’t get any garments with a leopard pattern, which is what I wanted,” Curran said. “This started when someone — probably someone focused on home décor — decided that the word ‘leopard’ should be equivalent to the word ‘animal’ but didn’t think about leopard-print fabrics.”
That points out the main problem with manual updating: When it comes to search results, linguistics don’t really matter; it’s all about figuring out what people are actually looking for. That requires pattern matching, where the machines have us beat.
“The system is agnostic as far as grammar goes,” said Carlos Valcarcel, a senior solutions architect at Lucidworks who has spent more than a decade working with various search systems. “It doesn’t actually care what the words mean. What it cares about is the intent of the users.”
WordNet vs Word2vec
Most synonym-matching algorithms have two common starting points. There’s WordNet, a database of English-language synonyms first built in 1985. It’s now up to 117,000 sets of words grouped together by their meanings. It’s a remarkable resource, but it has shortcomings in the ecommerce world. For example, it knows that “galaxy” is a star system, but doesn’t know that it’s also a popular Samsung phone.
Then there’s Word2vec, a computational model published in 2013 by a team of Google researchers led by Tomas Mikolov. It converts each word into a vector (the “vec” of Word2vec), a list of numbers learned from the words that tend to appear around it. Word2vec differs from WordNet in that it isn’t concerned with grammar or dictionary definitions. By measuring the mathematical similarity between those vectors, Word2vec gives a computer a sense of context by highlighting words that are “close to” other words.
It’s a good place to begin, but Word2vec can fall short as customers search for products because the word pairs are related, but not always interchangeable. Word2vec knows that “king” and “queen” are related words, but it doesn’t know that someone searching for king-sized sheets doesn’t want queen-sized sheets.
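The idea of words being “close to” each other can be illustrated with cosine similarity, a common way to compare word vectors. The three-number vectors below are invented for this sketch; real Word2vec vectors are learned from large amounts of text and typically have hundreds of dimensions.

```python
import math

# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "mattress": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(TOY_VECTORS["king"], TOY_VECTORS["queen"])
king_mattress = cosine_similarity(TOY_VECTORS["king"], TOY_VECTORS["mattress"])

# "king" and "queen" score as far closer than "king" and "mattress",
# but the score alone cannot say whether two words are interchangeable.
print(king_queen > king_mattress)
```

A high score only says that two words show up in similar contexts. It cannot tell a shopper's “king” apart from “queen,” which is exactly the gap described above.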
Detecting Synonyms With Machine Learning
To solve the problems inherent in WordNet and Word2vec, Lucidworks developed a five-step synonym detection algorithm as part of its Fusion platform.
1. Find Similar Search Queries
Rather than beginning with a set of predetermined synonyms or related words, the algorithm uses customer behavior as the seed for building the list of synonyms. What are people typing in the search box? And what links do they click on from that list of results?
For example, a page gets 500 clicks when it appears in search results for “apple mac charger.” That same page also gets 200 clicks when it appears in search results for “mac power.” In that case, you can assume that “apple mac charger” and “mac power” are similar queries, because they lead to the same set of results.
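Here is a rough sketch of that first step, using the click counts from the example. The log format, the page identifiers, the cosine scoring, and the 0.5 threshold are all assumptions made for illustration; the article does not say how Fusion actually scores query similarity.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical aggregated click log: (query, clicked page, click count).
CLICKS = [
    ("apple mac charger", "page_42", 500),
    ("mac power", "page_42", 200),
    ("mac power", "page_77", 20),
    ("leopard print dress", "page_9", 300),
]


def click_profiles(rows):
    """Map each query to a {page: total clicks} dictionary."""
    profiles = defaultdict(dict)
    for query, page, count in rows:
        profiles[query][page] = profiles[query].get(page, 0) + count
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse click profiles."""
    dot = sum(p[page] * q[page] for page in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def similar_queries(rows, threshold=0.5):
    """Return (query_a, query_b, score) for pairs whose clicks overlap enough."""
    profiles = click_profiles(rows)
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        score = cosine(profiles[a], profiles[b])
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


print(similar_queries(CLICKS))
```

With this toy data, only “apple mac charger” and “mac power” clear the threshold, because nearly all of their clicks land on the same page.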
2. Query Pre-Processing
At this stage, there are a few clean-up steps to get down to a usable synonym list. First there’s stemming, which reduces various forms of a word down to a common form (deconstructing connection, connective, connected, and connecting to connect). Then there’s removing stop words, or the most common words in a language. Taking out words like the, is, at, which, and on will speed up search performance.
This is also the time to address misspellings. It’s better to treat them as unidirectional synonyms, rather than bi-directional. So “matress” will lead to results for “mattress,” but not vice versa. Before moving on to the next stage, the algorithm also changes any multi-word phrases into something it can read as a single word, typically by putting an underscore between the words, so that “mac book” becomes “mac_book.”
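The clean-up steps above can be sketched in a few lines. Everything here is a simplified stand-in: the stop-word list, the misspelling map, and the phrase list are tiny hypothetical samples, and naive_stem is a crude suffix stripper used in place of a real stemmer such as the Porter algorithm.

```python
STOP_WORDS = {"the", "is", "at", "which", "on"}
MISSPELLINGS = {"matress": "mattress"}   # one-directional: typo -> correct form
PHRASES = {("mac", "book"): "mac_book"}  # known two-word phrases
SUFFIXES = ("ion", "ive", "ing", "ed")   # toy suffix list, not a full stemmer


def naive_stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def join_phrases(tokens):
    """Replace known adjacent word pairs with a single underscore-joined token."""
    joined, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            joined.append(PHRASES[pair])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


def preprocess(query):
    """Lowercase, fix known typos, drop stop words, join phrases, then stem."""
    tokens = query.lower().split()
    tokens = [MISSPELLINGS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = join_phrases(tokens)
    return [naive_stem(t) for t in tokens]


print(preprocess("the Mac Book charger"))
print(preprocess("matress"))
```

Because the misspelling map only points from the typo to the correct spelling, a search for “mattress” is never rewritten to “matress.”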
3. Extract Synonyms
Now it’s time to pull a set of synonyms out of the cleaned-up user queries. The algorithm finds them by looking for words and phrases that appear before or after the same word. From the similar queries “laptop charger” and “laptop power,” you can deduce that “charger” and “power” are synonyms, because they both follow “laptop.”
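One simple way to picture this step: take two queries already judged similar, strip the words they share at the start or end, and treat what is left over as a candidate synonym pair. The function names and sample pairs below are hypothetical, and this is a sketch of the idea rather than the Fusion implementation.

```python
from collections import Counter

# Hypothetical pairs of similar, already pre-processed queries.
SIMILAR_PAIRS = [
    ("laptop charger", "laptop power"),
    ("mac_book charger", "mac_book power"),
    ("bose earbud", "bose earphone"),
]


def candidate_synonym(query_a, query_b):
    """Return the differing parts of two queries that share leading or trailing words."""
    a, b = query_a.split(), query_b.split()
    shortest = min(len(a), len(b))
    start = 0
    while start < shortest and a[start] == b[start]:
        start += 1
    end = 0
    while end < shortest - start and a[-1 - end] == b[-1 - end]:
        end += 1
    if start == 0 and end == 0:
        return None  # no shared context word
    rest_a = a[start:len(a) - end]
    rest_b = b[start:len(b) - end]
    if not rest_a or not rest_b:
        return None  # one query is just a shorter version of the other
    return "_".join(rest_a), "_".join(rest_b)


def count_candidates(pairs):
    """Count how often each candidate pair is seen (order ignored in this sketch)."""
    counts = Counter()
    for query_a, query_b in pairs:
        candidate = candidate_synonym(query_a, query_b)
        if candidate:
            counts[tuple(sorted(candidate))] += 1
    return counts


print(count_candidates(SIMILAR_PAIRS))
```

Pairs that turn up in several different contexts, like “charger” and “power” here, are stronger candidates than pairs seen only once.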
4. Get Rid of the Noise With a Graph Model
At this step, it’s helpful to use a graph model to further define the relationships among your group of potential synonyms. Similar terms are grouped together on a graph based on the probability that they’re related to each other. If words are very likely to be synonyms, they’ll appear close to each other on the graph. And we know they’re related to one another because they lead to the same results page, as established in the first step of this process.
A good example would be “mac,” “apple mac,” and “macbook.” They’re close enough to be grouped together so that any one of them would be considered a synonym for the other. On the other hand, the graph model would reveal that “LED TV” is similar to “TV,” and “LCD TV” is similar to “TV,” but “LED TV” isn’t similar to “LCD TV.”
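The TV example shows why grouping everything that is connected is not enough: “LED TV” and “LCD TV” are both linked to “TV” but not to each other. One way to respect that, assumed here purely for illustration, is to accept a group only when every pair of terms in it is linked, which in graph terms is a maximal clique. The probabilities, the 0.5 cut-off, and the choice of the Bron-Kerbosch clique search are assumptions of this sketch, since the article does not name the grouping method Fusion uses.

```python
from collections import defaultdict

# Hypothetical candidate pairs with an estimated probability of being synonyms.
EDGES = [
    ("mac", "apple_mac", 0.9),
    ("mac", "macbook", 0.85),
    ("apple_mac", "macbook", 0.8),
    ("led_tv", "tv", 0.7),
    ("lcd_tv", "tv", 0.7),
    ("led_tv", "lcd_tv", 0.1),
]


def build_graph(edges, min_prob=0.5):
    """Keep only edges at or above min_prob, stored as adjacency sets."""
    graph = defaultdict(set)
    for a, b, prob in edges:
        if prob >= min_prob:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Basic Bron-Kerbosch search: groups in which every pair of terms is linked."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


for group in maximal_cliques(build_graph(EDGES)):
    print(sorted(group))
```

With this toy data the three Mac terms form one group, while “led_tv” and “lcd_tv” each pair up with “tv” separately and never with each other.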
5. Categorize: Synonym Pair or Context Match
The last step goes through the synonym list to separate true synonyms from words and phrases that merely match up with each other in context. The fact that “earbud” and “earphone” often appear before and after the brand name “Bose” gives you high confidence that they’re synonyms. On the other hand, “game” and “PlayStation” aren’t synonyms, but they’re related in context because they consistently appear before and after the word “console.”
How Does Lucidworks Compare?
The Lucidworks method doesn’t contain a lot of fancy data modeling or deep learning techniques, but the streamlined approach gets results. (Fusion saves the math for predicting next steps.)
In a presentation at Lucidworks’ 2018 Activate conference, VP of Data Science Chao Han compared the company’s synonym-detection approach to the Word2vec method using the product catalog of a national electronics retailer. The Lucidworks method produced synonyms with 82 percent accuracy, while Word2vec came in at 32 percent.
“The graph method beats out things like Word2vec and comparable deep learning techniques at the moment,” said Ian Pointer, a senior data engineer at Lucidworks. “It shows that there’s still some power left in traditional NLP techniques.”
Better Synonyms Equal Better Search Results
A well-crafted search leaves people feeling confident and satisfied, while a search that returns a bunch of irrelevant results (or even worse, nothing at all) creates frustration, doubt, and, for retailers, a chance that customers will move on to your competitors.
Better synonym detection will always be at the core of accurate searches, even as new modes of search, like voice queries on smart speakers, car dashboards, or mobile digital assistants, grow more popular.
Steve Jobs once said, “Customers don’t know what they want until we’ve shown them.” And in a way, that’s the real challenge of search: being able to figure out the question that is really being asked.