Advanced Linguistics Capabilities

Lucidworks’ Advanced Linguistics Package provides tokenization in more than 30 non-English languages and advanced entity extraction for 19 languages.

AI-powered search is challenging enough when you’re only working with English-language datasets. With multilingual content, the search engine must also identify the correct language, normalize non-English characters, and improve content recall without sacrificing precision. Each language has its own semantics, grammar, and slang, not to mention the distinct characters and scripts found in Asian, European, and Arabic base languages.

Lucidworks offers an Advanced Linguistics Package for Fusion, powered by partner BasisTech and its Rosette technology. Fusion with Rosette delivers personalized, relevant results regardless of the language used to search or browse. Rosette Entity Extractor (REX) adds structure, clarity, and insight by revealing the key information (names, places, organizations, products, and key phrases) in 19 languages.

Language Identification

  • Fast, dependable language identification in 55 languages
  • Customizable dictionaries, script conversion, and orthographic normalization
  • Advanced morphological features include tokenization, lemmatization, and decompounding

Entity Extraction

  • Find entities which cannot be exhaustively listed in rules
  • Field training kits create personalized entity extraction models for your use case
  • Foundation for apps in eDiscovery, social media analysis, and financial compliance

European Languages

  • Lemmatization for words with inflection (beau, beaux, belle)
  • Distinguish words with common stems (animal, animate)
  • Noun decompounding for German, Dutch, and Scandinavian languages
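
To make the difference between lemmatization and stem matching concrete, here is a minimal Python sketch. It is an illustration only and not how Rosette works internally: the `LEMMAS` table, `lemmatize`, and `naive_stem` are hypothetical names, and the four-letter truncation stands in for a crude stemmer.

```python
# Toy lemma table (hypothetical); a real lemmatizer uses full
# language-specific dictionaries and morphological analysis.
LEMMAS = {"beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau"}

def lemmatize(word: str) -> str:
    """Map an inflected form to its dictionary form, if known."""
    word = word.lower()
    return LEMMAS.get(word, word)

def naive_stem(word: str, length: int = 4) -> str:
    """Crude stand-in for a stemmer: keep only the first few letters."""
    return word.lower()[:length]

# Inflected forms of one word share a lemma...
print(lemmatize("beaux") == lemmatize("belle"))      # True
# ...while truncation keeps them apart.
print(naive_stem("beaux") == naive_stem("belle"))    # False
# Truncation also merges unrelated words that share a stem.
print(naive_stem("animal") == naive_stem("animate")) # True
print(lemmatize("animal") == lemmatize("animate"))   # False
```

The point of the sketch is the contrast: a lemma lookup groups beau, beaux, and belle while keeping animal and animate apart, which a purely surface-based rule cannot do.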

Asian Languages

  • Tokenization of Asian characters improves search precision
  • Normalization of character variations that do not change meaning
  • Convert older-style Japanese kanji characters to their modern forms
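
One familiar example of a character variation that carries no meaning is half-width versus full-width text. The sketch below uses Unicode NFKC normalization from Python's standard library to fold such variants into one form. It is a general-purpose illustration, not Rosette's implementation, and `normalize_width` is a hypothetical helper name. NFKC does not convert older kanji to modern kanji; that requires a dedicated mapping.

```python
import unicodedata

def normalize_width(text: str) -> str:
    """Fold compatibility variants (half-width kana, full-width Latin)
    into a single canonical form using Unicode NFKC."""
    return unicodedata.normalize("NFKC", text)

# Half-width katakana becomes standard full-width katakana.
print(normalize_width("ｶﾀｶﾅ"))       # カタカナ
# Full-width Latin letters become plain ASCII.
print(normalize_width("Ｆｕｓｉｏｎ"))  # Fusion
```

Applying the same normalization at index time and at query time means that both spellings match the same documents.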

Fusion + Rosette

The Advanced Linguistics Package for Fusion personalizes search in Asian languages (Chinese, Japanese, and Korean), European base languages, and Arabic base languages. Global organizations that support multiple languages for their commerce, customer service, and workplace solutions can make the content they manage more accessible, more relevant and more personalized to a global audience.

Asian Base

Rosette’s Chinese base linguistics converts Chinese scripts to a single form, whether traditional or simplified, so that text can be searched and processed correctly. Japanese base linguistics tools normalize katakana spelling variations and convert older kanji to modern kanji. The system distinguishes Chinese from Japanese text written in Han script and accurately returns pronunciation information.
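
To see why telling Chinese from Japanese is hard when both use Han characters, consider the simple heuristic sketched below in Python. It is an assumption-laden illustration, not the method Rosette uses: `guess_han_language` is a hypothetical function that only checks Unicode code point ranges, so it can recognize Japanese when kana is present but cannot decide for text made up of Han characters alone.

```python
def guess_han_language(text: str) -> str:
    """Very rough script-based guess for CJK text (illustrative only)."""
    # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) are Japanese-only.
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    # CJK Unified Ideographs are shared by Chinese and Japanese.
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    if has_kana:
        return "ja"
    if has_han:
        # Han characters only: ambiguous without a statistical model.
        return "zh-or-ja"
    return "unknown"

print(guess_han_language("検索エンジン"))  # ja
print(guess_han_language("搜索引擎"))      # zh-or-ja
```

Short Japanese strings written entirely in kanji fall into the ambiguous case, which is where a trained language identifier is needed.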

European Base

The Advanced Linguistics Package includes language-specific tools for lemmatization and decompounding. Words in French, Spanish, and Italian can be highly inflected (e.g., “beautiful” in French can be spelled beau, beaux, belle, or belles). Lemmatization links words based on their meaning, not on how they look, by mapping each inflected form to a shared dictionary form. This is useful for entity recognition and search relevancy. Decompounding, which splits a compound word into its component words, is useful for German, Dutch, and Scandinavian languages.
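
As a rough illustration of decompounding, the Python sketch below splits German compounds against a tiny word list. Everything here is hypothetical: the `LEXICON`, the `LINKERS` tuple, and the `decompound` function are made up for the example, and a production decompounder relies on much larger dictionaries and language-specific rules.

```python
from typing import List, Optional

# Tiny hypothetical lexicon of known component words.
LEXICON = {"kinder", "garten", "haus", "tür", "schlüssel", "arbeit", "zimmer"}
LINKERS = ("", "s")  # optional linking element between parts
MIN_PART = 3         # ignore very short fragments

def _split(word: str) -> Optional[List[str]]:
    """Return the component words, or None if no full split exists."""
    if word in LEXICON:
        return [word]
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head, tail = word[:i], word[i:]
        if head not in LEXICON:
            continue
        for linker in LINKERS:
            if tail.startswith(linker):
                rest = _split(tail[len(linker):])
                if rest is not None:
                    return [head] + rest
    return None

def decompound(word: str) -> List[str]:
    """Split a compound into known parts; return it unchanged if unknown."""
    word = word.lower()
    parts = _split(word)
    return parts if parts is not None else [word]

print(decompound("Kindergarten"))      # ['kinder', 'garten']
print(decompound("Arbeitszimmer"))     # ['arbeit', 'zimmer']
print(decompound("Haustürschlüssel"))  # ['haus', 'tür', 'schlüssel']
```

Indexing the parts alongside the full compound lets a query for one component word match documents that contain only the compound.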

Arabic Base

Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, gender, and others. Designed to plug into mainstream search engines and data mining applications, Arabic Base Linguistics performs orthographic and lexical normalization of Arabic text for use in Fusion queries. Fusion’s Advanced Linguistics Package also supports base linguistics for Persian (Farsi and Dari), Pashto, and Urdu.
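
For readers unfamiliar with orthographic normalization, the Python sketch below shows the kind of folding commonly applied to Arabic text before indexing: removing short-vowel marks and the tatweel stretching character, and unifying letter variants. It is a generic illustration under stated assumptions, not Rosette's implementation; `normalize_arabic` and the specific folding choices are hypothetical, and some of them (such as folding teh marbuta to heh) trade precision for recall.

```python
import re

# Short-vowel and related marks (U+064B-U+0652) plus tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Letter variants folded to a single form (illustrative choices).
_FOLD = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})

def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then fold common letter variants."""
    return _MARKS.sub("", text).translate(_FOLD)

# A vocalized spelling and a plain spelling normalize to the same string.
vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"
plain = "\u0643\u062A\u0627\u0628"
print(normalize_arabic(vocalized) == normalize_arabic(plain))  # True
```

Running queries and documents through the same function means that spelling differences of this kind no longer block a match.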