
They type it how? Difficulties in Japanese spell-check

by Lucidworks
June 1, 2011

The first part of Takahiko Ito’s talk on Day 2 of Lucene Revolution was interesting, but the second half introduced me to a problem — and a serious one — that I hadn’t even known existed. Slides for this session:

http://www.slideshare.net/takahi-i/lucene-revolution-2011

Ito, of Japan’s social network mixi, first described the tool mixi had built, Anuenue, which simplifies the creation of Solr clusters. It’s not yet integrated with SolrCloud and ZooKeeper (that’s one of their future goals), but it does seem pretty simple. A single XML configuration file defines three kinds of nodes: masters, which do the indexing; slaves, which receive the replicated index and serve requests from the mergers; and mergers, which take requests from the client, get the data from the slaves, combine all the returned data into a single set, and return it to the client. For example:

<nodes>
  <masters>
    <master iname="master1">
      <host>host1</host>
      <port>8983</port>
    </master>
  </masters>
  <slaves>
    <slave>
      <host>host2</host>
      <port>8983</port>
      <replicate>master1</replicate>
    </slave>
    <slave>
      <host>host3</host>
      <port>8983</port>
      <replicate>master1</replicate>
    </slave>
  </slaves>
  <mergers>
    <merger>
      <host>host4</host>
      <port>8983</port>
    </merger>
    <merger>
      <host>host5</host>
      <port>8983</port>
    </merger>
  </mergers>
</nodes>
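To get a feel for how much the single file captures, here is a rough sketch that reads the example with Python's standard library and lists each role. It is not Anuenue's own code: `parse_topology` and the layout of the dictionary it returns are made up for this illustration, and only the element and attribute names come from the file shown above.

```python
import xml.etree.ElementTree as ET

# The example configuration from the post, unchanged apart from whitespace.
NODES_XML = """
<nodes>
  <masters>
    <master iname="master1"><host>host1</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>host2</host><port>8983</port><replicate>master1</replicate></slave>
    <slave><host>host3</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
  <mergers>
    <merger><host>host4</host><port>8983</port></merger>
    <merger><host>host5</host><port>8983</port></merger>
  </mergers>
</nodes>
"""


def parse_topology(xml_text):
    """Hypothetical helper: turn the XML into a plain dictionary of roles."""
    root = ET.fromstring(xml_text)

    def endpoint(node):
        # Every node type carries a <host> and a <port> child.
        return (node.findtext("host"), int(node.findtext("port")))

    masters = {m.get("iname"): endpoint(m) for m in root.findall("./masters/master")}
    slaves = [
        {"endpoint": endpoint(s), "replicates": s.findtext("replicate")}
        for s in root.findall("./slaves/slave")
    ]
    mergers = [endpoint(m) for m in root.findall("./mergers/merger")]

    # Sanity check: each slave must point at a master defined in the same file.
    for slave in slaves:
        if slave["replicates"] not in masters:
            raise ValueError(f"unknown master: {slave['replicates']}")

    return {"masters": masters, "slaves": slaves, "mergers": mergers}


if __name__ == "__main__":
    topology = parse_topology(NODES_XML)
    for role, nodes in topology.items():
        print(role, nodes)
```

The point of the sketch is only that one document describes who indexes, who replicates from whom, and who merges, which is what lets a tool drive the whole cluster from a single place.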

So no more having to manually edit every configuration file in the cluster and manage each server individually. Once your cluster is installed, Anuenue also provides cluster commands, such as for starting or stopping the entire cluster, or posting data to it.

Anuenue also supports Japanese “did you mean” suggestions, which is where things get really interesting. As it turns out, in English (and probably a lot of other languages written in the Latin alphabet) “did you mean”, or spellchecking, isn’t necessarily easy, but it’s straightforward. In most cases, you can get a pretty good result by looking at the “edit distance” between two words: the number of single-character insertions, deletions, and substitutions needed to turn one into the other. For example, “like” and “likes” are close, because they’re only off by one character. On the other hand, “like” and “foobar” are completely different, so there would be a large edit distance between them.

But here’s the problem, Ito points out. In English, you can tell that a word is a good candidate correction for a misspelled word if the edit distance between them is small. But in Japanese, that all goes out the window. A word and its misspelling may have no common characters at all. Why? Because as a rule, Japanese people don’t type Japanese characters directly. Instead, they type the Latin characters for the phonetic spelling of the word, and the computer offers them the possible (or probable) Japanese terms. If the user chooses the wrong one, there may be no common characters between the misspelling and its correct equivalent.

The solution? A dictionary that maps common misspellings to their corrected equivalents. Ito talked about a tool called Oluolu, which analyses a site’s log and determines wrong/right pairs by watching users’ sessions, then comparing the phonetic versions (or “readings”) to see if they are similar.
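To make the contrast concrete, here is a minimal sketch in Python. It computes a plain Levenshtein edit distance, shows how it behaves on the English examples, and then applies the same measure to the readings of two Japanese words that sound identical but share no characters. Everything beyond that is an assumption made for illustration: the `READINGS` table is hand-written (a real system would get readings from a morphological analyzer), and `candidate_pairs`, its threshold, and the rule that the later query in a session is the correction are guesses at the general idea, not Oluolu's actual algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical, hand-written readings in hiragana (illustration only).
READINGS = {
    "公園": "こうえん",    # "park"
    "講演": "こうえん",    # "lecture", pronounced the same as "park"
    "東京": "とうきょう",  # "Tokyo"
}

# Hypothetical sessions: each inner list is one user's consecutive queries.
SESSIONS = [
    ["講演", "公園"],  # wrong conversion picked first, then re-searched
    ["東京", "公園"],  # two unrelated queries
]


def candidate_pairs(sessions, readings, max_reading_distance=1):
    """Return (earlier, later) query pairs whose readings are nearly the same.

    Assumption: a user who immediately re-searches with a similar-sounding
    term was correcting the first query.
    """
    pairs = []
    for queries in sessions:
        for earlier, later in zip(queries, queries[1:]):
            if earlier == later:
                continue
            if earlier not in readings or later not in readings:
                continue
            distance = edit_distance(readings[earlier], readings[later])
            if distance <= max_reading_distance:
                pairs.append((earlier, later))
    return pairs


if __name__ == "__main__":
    print(edit_distance("like", "likes"))    # 1
    print(edit_distance("like", "foobar"))   # 6
    # Every character differs, so the surface forms look unrelated...
    print(edit_distance("公園", "講演"))      # 2
    # ...but the readings are identical.
    print(edit_distance(READINGS["公園"], READINGS["講演"]))  # 0
    print(candidate_pairs(SESSIONS, READINGS))  # [('講演', '公園')]
```

On the written forms, 公園 and 講演 are as far apart as two two-character words can be, so a spellchecker that only looks at characters would never suggest one for the other. Comparing readings instead brings them back together, which is why the session log plus readings is a workable source of wrong/right pairs.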
In the long run, this “problem” might actually lead to a better solution: because the dictionaries that Oluolu creates are based on real user interactions, they’re likely to be more accurate than machine “guesses”, and provide better results. Ever wonder what happened to artificial intelligence (as in HAL, the IBM-minus-one beast of the 1968 movie 2001)? It seems that automating existing human intelligence, noisy as it is, turns out to be much more effective.

Cross-posted with Lucene Revolution Blog. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.
