Lucidworks' Advanced Linguistics Package provides tokenization in more than 30 non-English languages and advanced entity extraction for 19 languages.

AI-powered search is challenging enough when you’re only working with datasets in the English language. With multilingual content, the search engine must also identify the correct language, normalize non-English characters, and improve content recall without sacrificing precision. Non-English languages have distinct semantics, grammar, and slang, not to mention the unique characters found in Asian, European, and Arabic base languages.

Language Identification

  • Fast, dependable language identification in 55 languages
  • Customizable dictionaries, script conversion, and orthographic normalization
  • Advanced morphological features include tokenization, lemmatization, and decompounding

Entity Extraction

  • Find entities that cannot be exhaustively listed in rules
  • Field training kits create personalized entity extraction models for your use case
  • Foundation for apps in eDiscovery, social media analysis, and financial compliance

European Languages

  • Lemmatization for words with inflection (beau, beaux, belle)
  • Distinguish words with common stems (animal, animate)
  • Noun decompounding for German, Dutch, and Scandinavian languages

Asian Languages

  • Tokenization of Asian characters improves search precision
  • Normalization of meaningless character variations
  • Convert older-style Japanese kanji to their modern forms

Lucidworks offers an Advanced Linguistics Package for Fusion, powered by partner BasisTech and its Rosette technology. Fusion with Rosette delivers personalized, relevant results regardless of the language used to search or browse. Rosette Entity Extractor (REX) delivers structure, clarity, and insight by revealing the key information (names, places, organizations, products, and key phrases) in 19 languages.

Fusion + Rosette: A Whole New World of Finding

The Advanced Linguistics Package for Fusion personalizes search in Asian languages (Chinese, Japanese, and Korean), European base languages, and Arabic base languages. Global organizations that support multiple languages for their digital commerce or digital workplace search solutions can make the content they manage more accessible, more relevant, and more personalized to a global audience.

Asian Base Linguistics

Rosette’s Chinese base linguistics converts Chinese text to a single script, either traditional or simplified, so that it can be searched and processed correctly. Japanese base linguistics tools normalize katakana spelling variations and convert older kanji to modern kanji. The system understands the difference between Chinese and Japanese text written in Han script, and accurately returns pronunciation information.
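
As a rough illustration of this kind of normalization (not Rosette’s implementation, whose internals are not described here), the Python sketch below uses the standard library’s Unicode NFKC normalization to fold half-width katakana into full-width forms. It then applies a tiny hand-written table to map a few older kanji to their modern equivalents. The table OLD_TO_MODERN_KANJI and the function normalize_japanese are assumptions made up for this example; a production system would rely on far larger dictionaries.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping of older kanji forms to modern ones.
OLD_TO_MODERN_KANJI = str.maketrans({"國": "国", "學": "学", "櫻": "桜"})


def normalize_japanese(text: str) -> str:
    """Fold width variants with NFKC, then modernize a few older kanji."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(OLD_TO_MODERN_KANJI)
```

Indexing and querying through the same function means that a half-width and a full-width spelling of the same katakana word end up as the same search term.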

Arabic Base Linguistics

Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, gender, and others. Designed to plug into mainstream search engines and data mining applications, Arabic Base Linguistics performs orthographic and lexical normalization of Arabic text for use in Fusion queries. Fusion’s Advanced Linguistics Package also supports base linguistics for Persian (Farsi and Dari), Pashto, and Urdu.
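
To make orthographic normalization concrete, here is a minimal Python sketch of two common steps: stripping optional vowel marks and the tatweel elongation character, and folding alef variants to a bare alef. It is an illustration only, not the package’s actual rules; the function name normalize_arabic and the choice of which characters to fold are assumptions, and real systems typically apply more extensive, configurable rules.

```python
import re

# Arabic vowel and related marks (U+064B to U+0652) plus tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Fold alef with madda/hamza (U+0622, U+0623, U+0625) to bare alef (U+0627).
_ALEF_FOLD = str.maketrans(
    {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}
)


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then fold alef variants."""
    return _MARKS.sub("", text).translate(_ALEF_FOLD)
```

With this in place, a fully vowelled spelling of a word and its unvowelled spelling normalize to the same string, so a query in either form can match documents written in the other.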

European Base Linguistics

The Advanced Linguistics Package includes language-specific tools for lemmatization and decompounding, which are particularly useful for European languages. Words in French, Spanish, and Italian can be highly inflected: the French word for “beautiful” can appear as beau, beaux, belle, or belles. Lemmatization links words based on their meaning, not on how they look, and that linking is useful for entity recognition and search relevancy. Decompounding is useful for German, Dutch, and Nordic languages such as Danish, Norwegian, Swedish, and Finnish. For example, Fusion can understand the German word Jugendarbeitslosigkeit as a compound of Jugend (youth) and Arbeitslosigkeit (unemployment).
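
The two ideas can be sketched in a few lines of Python. This is a toy, dictionary-based illustration and not how Fusion or Rosette implements them: the lexicons FRENCH_LEMMAS and GERMAN_WORDS and the functions lemmatize and decompound are hypothetical names invented for the example, and the splitter simply prefers the longest known first part.

```python
# Hypothetical toy lexicons; real systems use large language-specific dictionaries.
FRENCH_LEMMAS = {"beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau"}
GERMAN_WORDS = {"jugend", "arbeit", "arbeitslosigkeit"}


def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    key = word.lower()
    return FRENCH_LEMMAS.get(key, key)


def decompound(word: str, lexicon: set[str] = GERMAN_WORDS) -> list[str]:
    """Split a compound into known words, preferring the longest first part."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest candidate head first, then progressively shorter ones.
    for cut in range(len(word) - 1, 0, -1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if rest:
                return [head, *rest]
    return []  # no complete split into known words was found
```

Under these assumptions, all four French forms map to the single lemma beau, and Jugendarbeitslosigkeit splits into jugend and arbeitslosigkeit, so a search for either part can reach documents that contain only the compound.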

Let’s Talk

Contact us today to learn how Fusion can help you and your team put the power of machine learning and search to work to dazzle your customers and empower your employees.