Rosette’s Chinese base linguistics converts Chinese text to a single script form, either traditional or simplified, so that it can be searched and processed consistently. The Japanese base linguistics tools normalize Katakana spelling variations and also normalize older kanji to their modern forms. The system distinguishes between Chinese and Japanese text written in Han script and returns accurate pronunciation information for each.
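As a rough illustration of what script normalization means, the sketch below folds a few traditional Chinese characters to their simplified forms and folds half-width Katakana to full-width. It is not Rosette's implementation: the character table `TRAD_TO_SIMP` and the function names are hypothetical, the table covers only four characters, and real conversion is context-sensitive. The width folding uses Unicode NFKC normalization from Python's standard library, which addresses only one simple kind of Katakana variation and also folds other compatibility characters.

```python
import unicodedata

# Hypothetical toy table: a real converter needs a full, context-aware mapping.
TRAD_TO_SIMP = str.maketrans({
    "漢": "汉",
    "語": "语",
    "學": "学",
    "體": "体",
})


def to_simplified(text: str) -> str:
    """Map the traditional characters in the toy table to simplified forms."""
    return text.translate(TRAD_TO_SIMP)


def fold_width(text: str) -> str:
    """Fold half-width Katakana (and other compatibility forms) using NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(to_simplified("漢語"))
    print(fold_width("ｶﾀｶﾅ"))
```

Indexing and querying through the same normalization functions is what lets a query in one script form match documents written in the other.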
The Advanced Linguistics Package includes language-specific tools for lemmatization and decompounding. Words in French, Spanish, and Italian can be highly inflected (e.g., "beautiful" in French can be spelled beau, beaux, belle, or belles). Lemmatization links such words by their shared dictionary form (the lemma) rather than by their surface spelling, which is useful for entity recognition and search relevancy. Decompounding, which splits compound words into their component words, is useful for German, Dutch, and the Scandinavian languages.
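The two ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's algorithm: `LEMMAS` and `LEXICON` are tiny hypothetical dictionaries, lemmatization is a plain lookup, and decompounding is a longest-prefix-first recursive split that ignores linking elements such as the German "s".

```python
from typing import List, Optional

# Hypothetical toy lemma table using the French forms mentioned above.
LEMMAS = {"beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau"}

# Hypothetical toy lexicon of German component words.
LEXICON = {"dampf", "schiff", "fahrt"}


def lemmatize(word: str) -> str:
    """Return the lemma for a known form, else the lowercased word itself."""
    key = word.lower()
    return LEMMAS.get(key, key)


def _split(word: str) -> Optional[List[str]]:
    """Try to cover the whole word with lexicon entries, longest prefix first."""
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in LEXICON:
            continue
        if end == len(word):
            return [head]
        rest = _split(word[end:])
        if rest is not None:
            return [head] + rest
    return None


def decompound(word: str) -> List[str]:
    """Split a compound into components; fall back to the word unchanged."""
    lowered = word.lower()
    parts = _split(lowered)
    return parts if parts is not None else [lowered]


if __name__ == "__main__":
    print([lemmatize(w) for w in ["beaux", "Belles"]])
    print(decompound("Dampfschifffahrt"))
```

With these in place, a search for "beau" can match "belles", and a search for "Schiff" can match a document containing "Dampfschifffahrt", provided the index stores the lemmas and components alongside the original tokens.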
Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, and gender. Arabic Base Linguistics, which is designed to plug into mainstream search engines and data mining applications, performs orthographic and lexical normalization of Arabic text for use in Fusion queries. Fusion’s Advanced Linguistics Package also supports base linguistics for Persian (Farsi and Dari), Pashto, and Urdu.
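To make "orthographic normalization" concrete, here is a minimal sketch of three steps that are common in Arabic search preprocessing: removing short-vowel diacritics, removing the tatweel (elongation) character, and unifying the hamza and madda variants of alef. The function name `normalize_arabic` and the choice of rules are assumptions for illustration; they do not describe what Arabic Base Linguistics actually does, and lexical normalization (for example, stripping attached particles) is not covered.

```python
import re

# Diacritics U+064B..U+0652 (fathatan through sukun) plus tatweel U+0640.
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

_BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = _MARKS.sub("", text)
    return _ALEF_VARIANTS.sub(_BARE_ALEF, text)


if __name__ == "__main__":
    # Vocalized "kitab" (book) is reduced to its unvocalized spelling.
    print(normalize_arabic("\u0643\u0650\u062A\u064E\u0627\u0628"))
```

Applying the same function to both indexed text and queries means that vocalized and unvocalized spellings of a word resolve to the same search term.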