Mintel has been a trusted source of market intelligence for nearly 50 years, with over 35 thousand unique users each month. With a breadth of subject matters covered in Mintel’s reports and a worldwide customer base that speaks a great number of languages, providing an intuitive search experience for each customer was a challenge for Mintel’s Engineering Team. Learn how Mintel uses Fusion query pipelines to deliver relevant search results to every user in their preferred language.
Speaker: Adrian Rogers, Mintel, Architect
In this session attendees will learn how we’ve structured our index and our query pipelines to allow searches to be made against content indexed in multiple languages and to be returned in the preferred language of the user.
This session would be useful for software developers, particularly those working with content in multiple languages.
Search Relevance; Commerce – Ecommerce Search and Personalization;
Adrian Rogers works as an architect within the IT team at Mintel. He’s worked across a variety of projects within the business including helping to build out the content search capabilities using Solr and Fusion.
Welcome, everyone. My name’s Adrian Rogers. I’m an architect working for Mintel. And today I’m going to cover how we’ve tackled the issue of multi-language search.
I’m gonna start with a brief overview of the challenges that we’ve faced, and then introduce how Lucidworks Fusion has assisted us in this area. I’ll also go into a little bit more of the technical details behind our solution before wrapping up at the end.
So as a trusted source of market intelligence that’s been around for almost 50 years, Mintel has an ever-increasing body of research content. Initially, this content was just written in English, but as we began to move into new markets more recently, we’ve started wanting to tailor our content for those markets, and specifically, we’ve been wanting to translate the content into the local languages for those markets. Currently, we support six different languages for our content, and those are English, Chinese, Brazilian Portuguese, German, Thai and Japanese.
Now, the challenge we have here is that we want our users to be able to search for these pieces of content in the language that they’re most comfortable with. And I should probably clarify at the start that we’re not trying to translate their searches; we’re just trying to make sure that if they’re searching in a language we support, they are able to search against those pieces of content and surface them correctly. On top of that, we also want them to see their results in the translation that matches their preference, without showing other translated versions at the same time and cluttering up the search results with duplicates.
So as a user of Lucidworks Fusion, we rely on one of its key features here to assist us with that problem. And the feature that we are using for this mainly is query pipelines. So we make extensive use of pipelines to ensure that no matter where a search has come from, we’re able to handle language and translations appropriately. And we do this by dynamically making changes to both the queries that we send to Solr, as well as the fields that are requested back, and then returned through to the end user.
I’ll start with some basics. So before we can actually talk about the pipelines, we need to look at the schema because without being able to store and process fields properly in the different languages, we can’t search on them accurately and provide the right functionality. So this here is just a sample of our schema. But each language we support essentially has its own field type, and this allows the appropriate tokenizers and stemmers to be applied, as well as any extra special stop words, or other language-specific pieces that need to go into the analyzers. These field types are then applied to any of our searchable fields for those specific languages.
For a given piece of content, any translations of that content are indexed into a single Solr document. So this means that one Solr document will contain the English versions of fields, as well as the localized language, and if there’s been more than one localized language, then that would be included as well. This is what helps us to meet our criteria of never showing more than one translation of a piece of content because there’ll only ever be a single document returned, and we’ll just call the appropriate fields from there.
And speaking of those fields, as well as the fields for searching, we also have fields that are stored as plain strings for retrieval. They’re plain strings because we don’t need to do any tokenization or anything; we just want the values. And these are set up with the same field name structure, so everything we have ends with .lang, that is, a dot and then the language code. Once we have the schema set up and in place, the next step that we had was to tackle the query pipeline.
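To make the single-document layout concrete, here is a minimal Python sketch. The field names (`title`, `id`), the two-letter language codes, and the helper functions are all hypothetical; the talk only says that per-language fields end with a dot and a language code, and that every translation of a piece of content lives in one Solr document.

```python
def localized_field(base: str, lang: str) -> str:
    """Build a per-language field name such as 'title.de' (assumed naming)."""
    return f"{base}.{lang}"


# One document carries every available translation of the same report,
# so a search can only ever return this piece of content once.
report_doc = {
    "id": "report-123",  # hypothetical identifier
    localized_field("title", "en"): "Coffee trends",
    localized_field("title", "de"): "Kaffeetrends",
}


def available_languages(doc: dict, base: str) -> list:
    """List the language codes for which the document has the given field."""
    prefix = base + "."
    return sorted(key[len(prefix):] for key in doc if key.startswith(prefix))


print(available_languages(report_doc, "title"))
```

Because translations are fields rather than separate documents, choosing a language becomes a question of which fields to search and which to return, not which documents to filter out.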
Now, in the pipeline, we start by taking two non-standard input variables that Solr wouldn’t know what to do with. First of all, we have to check that both of these are present, because we will be using them in later pipeline steps, and we also need to check that they’re actually valid language codes. So we have here the input language, which is the language that the user is trying to search in. And currently, we actually end up deriving that from the output language. However, we do have plans here at Mintel to explore tools to try and auto-detect the input language to try and make that a bit more all-encompassing. The output language represents the language that the user is trying to view our website in, and the language that they prefer to read their content in.
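The validation step might look something like the following Python sketch. The parameter names `input_lang` and `output_lang`, the language codes, and the function itself are assumptions made for illustration; the talk does not show the actual stage, and a Fusion pipeline stage would not be written in Python.

```python
# Assumed codes for the six supported languages: English, Chinese,
# Brazilian Portuguese, German, Thai and Japanese.
SUPPORTED_LANGUAGES = {"en", "zh", "pt", "de", "th", "ja"}


def resolve_languages(params: dict) -> tuple:
    """Check that both language parameters are present and supported.

    Returns (input_lang, output_lang) or raises ValueError.
    """
    resolved = []
    for name in ("input_lang", "output_lang"):  # hypothetical parameter names
        value = params.get(name)
        if value is None:
            raise ValueError(f"missing required parameter: {name}")
        if value not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language for {name}: {value!r}")
        resolved.append(value)
    return tuple(resolved)


print(resolve_languages({"input_lang": "de", "output_lang": "de"}))
```

Failing early here means the later steps can assume both values are usable when they build field names.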
And so by making sure that these settings are present and that they’re for a language we support, then this sets us up to be able to perform the next steps of the pipeline. The second step is to modify those fields that Solr actually searches on. So this is done by adding qf arguments for the specific input languages being requested.
Now, we’ll always include English as one of the fields to search on, since the majority of our content is still in English and we don’t want to curtail the users’ results so heavily if there isn’t actually much in the way of results, and also particularly because we’re not auto-detecting the input language yet. However, what we will do is apply a boost to any of the requested language fields, so that any articles that match that language will hopefully bubble up a little bit higher and the user should see articles and content in the language that they prefer.
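As a rough illustration of this step, the sketch below builds a `qf` value that always searches the English fields and adds boosted fields for the requested input language. The base field names, the boost value of 2.0, and the function are hypothetical; only the general shape (English always included, requested language boosted) comes from the talk. The `field^boost` syntax is the standard form accepted by Solr's `qf` parameter.

```python
def build_qf(input_lang: str, base_fields=("title", "body"), boost: float = 2.0) -> str:
    """Build a Solr qf string: English fields always, boosted input-language fields on top."""
    # English is always searched because most content exists only in English.
    fields = [f"{base}.en" for base in base_fields]
    if input_lang != "en":
        # Boost the requested language so matching translations rank higher.
        fields += [f"{base}.{input_lang}^{boost}" for base in base_fields]
    return " ".join(fields)


print(build_qf("de"))  # title.en body.en title.de^2.0 body.de^2.0
```

Keeping the English fields unboosted means English-only content can still match, while translated content in the user's language is favored in the ranking.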
Once we’ve finished rewriting the return fields to match the output language, the pipeline continues to apply the rest of our business logic. So this is anything that is essentially language-agnostic, things like relevancy boosting, signal-based user preferences, all of these other things where we’re basing it on structured metadata and not free text fields. These steps aren’t affected by the input and output languages, so those are run before finally executing the query, and returning the user hopefully the results that they want to see.
So just to summarize briefly. In order to handle multi-language searching, first of all, we updated our schema so that our search index could properly support searching in the languages that we wanted to use.
Then we added a step to identify which language the user wants to search in, and which language they want to receive their results in. We added a step to update the Solr request so that searches would be run against the user’s preferred input language. We added a step to update the Solr request as well to return the fields in the preferred language that the user wants to read. And then finally, we run any other language-agnostic steps, as usual.
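The return-field step mentioned in the summary could be sketched along these lines. The field names and the function are hypothetical, and the talk does not describe how a missing translation is handled, so this sketch does not attempt that; it only shows requesting the stored fields for the output language through Solr's `fl` parameter.

```python
def build_fl(output_lang: str, base_fields=("title", "summary"), always=("id",)) -> str:
    """Build a Solr fl string requesting the stored fields for the output language."""
    # Language-independent fields come first, then the localized stored fields.
    localized = [f"{base}.{output_lang}" for base in base_fields]
    return ",".join(list(always) + localized)


print(build_fl("ja"))  # id,title.ja,summary.ja
```

Since each document holds all of its translations, changing `fl` is enough to switch the language the user reads, without changing which documents match.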
Thank you very much for listening. I hope this has been helpful.