AI-powered search is challenging enough when you’re only working with datasets in the English language. Multilingual content raises the bar: the search engine must identify the correct language, normalize non-English characters, and improve content recall without sacrificing precision. Languages other than English have their own semantics, grammar, and slang, not to mention the unique characters found in Asian, European, and Arabic-script languages.
Lucidworks offers an Advanced Linguistics Package for Fusion, powered by partner BasisTech and their Rosette technology. Fusion with Rosette delivers personalized, relevant results, regardless of the language used to search or browse. Rosette Entity Extractor (REX) delivers structure, clarity, and insight by revealing the key information in text (names, places, organizations, products, and key phrases) in 19 languages.
Rosette’s Chinese base linguistics converts Chinese text, whether traditional or simplified, to a single script form so that it can be searched and processed correctly. Japanese base linguistics tools normalize katakana spelling variations and also normalize older kanji to modern kanji. The system distinguishes Chinese from Japanese text written in Han script and accurately returns pronunciation information.
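As a rough illustration of what script normalization means in practice, the sketch below uses Python’s standard unicodedata module to fold half-width katakana into the standard full-width forms. It is not Rosette’s API or algorithm. The function name fold_width is hypothetical, and Unicode NFKC folding covers only character-width variants, not the spelling variations or kanji modernization described above.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold compatibility variants (e.g. half-width katakana) to standard forms.

    Illustrative only: NFKC is a generic Unicode normalization, not the
    language-aware normalization a linguistics package performs.
    """
    # NFKC applies compatibility decomposition followed by canonical
    # composition, so a half-width kana plus a half-width voicing mark
    # ends up as a single full-width character.
    return unicodedata.normalize("NFKC", text)


half_width = "\uff76\uff80\uff76\uff85"  # "katakana" written in half-width kana
full_width = "\u30ab\u30bf\u30ab\u30ca"  # the same word in full-width kana

# Both spellings map to the same form, so an index and a query
# normalized this way would match each other.
print(fold_width(half_width) == fold_width(full_width))
```

Applying the same folding at both index time and query time is what lets differently typed variants of a word retrieve the same documents.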
The Advanced Linguistics Package includes language-specific tools for lemmatization and decompounding. Words in French, Spanish, and Italian can be highly inflected (e.g., “beautiful” in French can be spelled beau, beaux, belle, or belles). Lemmatization maps each inflected form to its dictionary form (the lemma), linking words by meaning rather than by how they look. This is useful for entity recognition and search relevancy. Decompounding, which splits a compound word into its component words, is useful for German, Dutch, and Scandinavian languages.
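To make the two ideas concrete, here is a toy sketch of lemmatization and decompounding. It assumes a tiny hand-written lemma table and vocabulary (LEMMAS and VOCAB are hypothetical, as are the function names) and does not reflect how Rosette or Fusion implement these steps. Real systems rely on large lexicons and morphological analysis.

```python
from typing import List, Optional

# Hypothetical lemma table: every inflected form points to one dictionary form.
LEMMAS = {"beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau"}

# Hypothetical vocabulary of known component words (lowercased).
VOCAB = {"kinder", "buch", "haus", "tür"}


def lemmatize(token: str) -> str:
    """Return the lemma for a token, or the lowercased token if unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)


def _split(word: str, min_len: int) -> Optional[List[str]]:
    """Try to cover the whole word with vocabulary entries, left to right."""
    if word in VOCAB:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in VOCAB:
            rest = _split(tail, min_len)
            if rest is not None:
                return [head] + rest
    return None


def decompound(word: str, min_len: int = 3) -> List[str]:
    """Split a compound into known parts; return the word itself if no split fits.

    Simplification: linking elements between parts (such as the German
    connecting "s") are ignored.
    """
    word = word.lower()
    parts = _split(word, min_len)
    return parts if parts is not None else [word]


print(lemmatize("belles"))       # all four French forms share one lemma
print(decompound("Kinderbuch"))  # a query for "Buch" can now match this compound
```

Indexing the lemma and the compound parts alongside the original token is one common way to raise recall without matching on unrelated look-alike strings.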
Arabic words frequently incorporate grammatical elements that indicate attributes such as verb aspect, object, conjugation, person, number, and gender. Designed to plug into mainstream search engines and data mining applications, Arabic Base Linguistics performs orthographic and lexical normalization of Arabic text for use in Fusion queries. Fusion’s Advanced Linguistics Package also supports base linguistics for Persian (Farsi and Dari), Pashto, and Urdu.
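Orthographic normalization of Arabic commonly involves steps such as removing optional vowel marks, dropping the elongation character (tatweel), and unifying variant spellings of alef. The sketch below shows those three steps using only the Python standard library. It is a simplified assumption about what such normalization can look like, not Arabic Base Linguistics itself, and normalize_arabic is a hypothetical name.

```python
import re
import unicodedata

# Short vowel marks, tanween, shadda, and sukun (U+064B..U+0652),
# plus the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Tatweel (kashida) only stretches a word visually and carries no meaning.
_TATWEEL = "\u0640"

# Alef with madda, hamza above, or hamza below -> bare alef.
_ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
})


def normalize_arabic(text: str) -> str:
    """Apply a small, illustrative set of orthographic normalizations."""
    # Compose first so that alef + combining hamza becomes a single code point.
    text = unicodedata.normalize("NFC", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)


vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"  # "Ahmad" with vowel marks
plain = "\u0627\u062D\u0645\u062F"                        # the same name, unmarked

# Both spellings reduce to the same string, so either one
# typed as a query can match the other in the index.
print(normalize_arabic(vocalized) == normalize_arabic(plain))
```

This handles spelling only. Separating the grammatical prefixes and suffixes mentioned above from the word stem requires morphological analysis, which is outside the scope of this sketch.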