Commercial vs. Open Source Language Packs: An Interview with Andrew Paulsen, Regional Director at Basis Technology
Today we’re talking with Andrew Paulsen of Basis Technology about their commercial language packs.
Since Basis first came to market, open source has made huge strides forward in supporting multiple languages.
Not only does Solr support many European languages, but it also has multiple options for Japanese and Chinese, including morphological tokenization.
But despite all this progress, Basis is still around selling their wares.
Why would anybody pay for software when open source alternatives exist, especially when using an open source search engine?
What are the advantages of the commercial packages? And how is it that Basis is still in business!?
Andrew was gracious enough to endure our interrogation!
Hi Andrew, thanks for chatting with us.
First of all, can you tell us a little bit about Basis, and maybe also a bit about your background?
Sure. Basis Technology has been around for over 18 years providing text analytics software to some of the world’s most successful and innovative software companies, such as Adobe, EMC, Google, Microsoft, HP, Salesforce.com, Oracle, Symantec and Yahoo!… to name just a few. I have worked out of our San Francisco offices since 2000. Yes, you heard correctly, 12 years, which is a long time in the software industry. Over the years I have helped numerous software companies (including all of the aforementioned) improve both their web and enterprise search quality by implementing language-specific linguistic support.
Let’s get right into it, and I assume you get this question a lot. Solr now comes with a fair number of language-specific analyzers, so why are companies still willing to purchase language packs from commercial vendors?
Yes, very good question. One would think that our business would be shrinking, but it is actually growing considerably. I think that is because software companies are getting more sophisticated and can derive more value out of high-quality linguistic analysis. Also, companies are focusing more on markets outside of the US for growth and are putting more effort and resources into creating high-quality support for search in foreign languages, as opposed to just shipping checkbox features.
To sum up our value proposition in relation to open source linguistics: we provide higher quality, more in-depth features, a wider breadth of language coverage, and better performance and reliability. And as you know, software engineers are expensive these days, especially search engineers with an NLP background. Companies can actually save money and increase development productivity by licensing a commercially ready NLP platform, as opposed to having these well-paid engineers implement and test various linguistic modules from around the world with varying levels of quality and performance.
For Western languages, one of the things Basis enables is lemmatization vs. simple stemming. This was something that FAST ESP used to talk about a lot, before being acquired by Microsoft. Can you tell us more about this, and maybe why it matters? Any examples?
Sure… As a side note, FAST was one of our customers. When Microsoft acquired FAST, many independent software vendors who licensed FAST moved over to Solr/Lucene. These former FAST customers have high expectations for non-English search, and so many of them licensed our software to integrate with their Solr implementations.
As for the benefits of lemmatization versus stemming: stemming uses context-insensitive algorithms that normalize tokens by chopping off the ends of words based on rules. The resulting stem is not an actual word, but an artifact of what used to be a word.
Example:
- The word “babies” stems to “babie”
- The word “copying” stems to “copi”
Lemmatization, on the other hand, is context-sensitive and normalizes a word to its true dictionary form.
Example:
- The word “babies” lemmatizes to “baby”
- The word “copying” lemmatizes to “copy”
Stemming can create a lot of problems, such as different words collapsing to the same stem, or different forms of the same word producing different stems. There are also many instances where stemming simply fails and does nothing to normalize the word. These types of problems and failures can create havoc for a company trying to provide high-quality search.
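To see what raw stemming actually produces, here is a minimal sketch, assuming a recent Lucene release (the library underneath Solr). It runs Lucene's own Porter stemmer, not Basis's lemmatizer, and the exact stems vary between stemmer implementations:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemDemo {
    public static void main(String[] args) throws Exception {
        // Tokenize two inflected English words and pass them through a rule-based stemmer.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("babies copying"));
        TokenStream stems = new PorterStemFilter(tokenizer);
        CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
        stems.reset();
        while (stems.incrementToken()) {
            // Prints truncated, rule-based stems that are not dictionary words;
            // a lemmatizer would instead produce the dictionary forms "baby" and "copy".
            System.out.println(term.toString());
        }
        stems.end();
        stems.close();
    }
}
```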
Stemming was originally developed to support English search, and although there are some problems with English stemming, it generally works well for keyword search applications. The real problems come into play with other European languages.
Many European languages are highly inflected compared with English, meaning that, depending on the context, the same word can be written many different ways. Do you remember trying to conjugate Spanish verbs in high school, or dealing with masculine and feminine forms? The more inflected a language is, the more important it is to use morphological analysis to produce the correct lemma for indexing.
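As a concrete illustration of that inflection, here is a small sketch (again assuming a recent Lucene release) that feeds several conjugated forms of the Spanish verb "hablar" through Lucene's rule-based Snowball Spanish stemmer; a dictionary-based lemmatizer would instead map every one of these forms back to the single lemma "hablar":

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SpanishStemDemo {
    public static void main(String[] args) throws Exception {
        // Five surface forms of "hablar" (to speak): I speak, you speak, he/she spoke,
        // they were speaking, we will speak.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("hablo hablas habló hablaban hablaremos"));
        TokenStream stems = new SnowballFilter(tokenizer, "Spanish");
        CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
        stems.reset();
        while (stems.incrementToken()) {
            // Prints rule-based stems, which are not guaranteed to agree with one another
            // or with the dictionary lemma "hablar".
            System.out.println(term.toString());
        }
        stems.end();
        stems.close();
    }
}
```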
Here is a whitepaper with extensive examples and explanations:
When indexing text, the first step is to break it up into words. For Asian languages this is tough because they often don’t have spaces between words. In the early days of Solr, there was the primitive n-gram-based CJK module, which just chopped text into character pairs and had no concept of linguistic integrity. Obviously Basis was superior to that. But in recent years morphological analyzers have been added for Chinese and Japanese, and it looks like Korean is in the works. How will you compete with these?
Agreed, the open source analyzers have certainly gotten better, but so has our technology. For example, the Solr Chinese analyzer only supports Simplified Chinese and does not support Traditional Chinese. This is a deal killer for many companies, since supporting Traditional Chinese is often a requirement.
Also, we provide more in-depth features for companies that are serious about providing high-quality search, such as a user dictionary, a de-compounding option, part-of-speech tagging, base noun phrase extraction, and, for Chinese, pinyin readings… We also provide various knobs and dials to fine-tune the text processing to meet a company’s unique search requirements. This set of features, which we refer to as Base Linguistics, is available across all the languages that Basis Technology’s Rosette supports.
We’ve taken a look at the Solr Japanese analyzer and it has a good set of features; it was created by a former FAST employee who is well respected in this space. Since the Solr module for Korean isn’t available yet, I cannot comment on that technology.
But the real takeaway is this: there are about 20-25 commercially significant markets by language in the world. We currently provide linguistic analysis for over 40 languages, whereas Solr only supports a couple of languages with morphological analysis. What happens when you need quality search support for Russian, German, Spanish, Dutch, Danish, and so on?
One of the key values we bring is that we handle all these languages to the highest linguistic quality standards. A development or product manager needs to ask himself: do I want my highly skilled (and expensive) engineers integrating, testing and supporting different pieces of linguistic software, developed by different groups and individuals with varying degrees of quality, performance and stability, when there is a commercially ready platform to take these tasks off their plate?
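To make the bigram-versus-morphology distinction from the question concrete, here is a small sketch, assuming a recent Lucene release with the kuromoji analyzer module on the classpath. It compares the character-pair output of the old-style CJK analyzer with the dictionary-driven tokens of the Japanese analyzer that Solr now ships; the exact tokens depend on the Lucene version and dictionaries in use:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkTokenizationDemo {
    static void dump(String label, Analyzer analyzer, String text) throws Exception {
        System.out.print(label + ": ");
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "関西国際空港"; // "Kansai International Airport", written without spaces
        // Overlapping character bigrams with no notion of real word boundaries.
        dump("bigrams   ", new CJKAnalyzer(), text);
        // Dictionary-driven (kuromoji) segmentation into morphologically meaningful tokens.
        dump("morphology", new JapaneseAnalyzer(), text);
    }
}
```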
Are there any specialized or niche languages where you have an advantage?
It is true that Asian and Middle Eastern languages are more complex than European languages and need high-quality linguistic processing to get high-quality search results, so one might say this is our niche domain. But the fact is that European languages such as French, Italian, Spanish and German are also highly inflected and require context-sensitive morphological analysis. We are seeing very strong demand in Europe for our platform because of this. So my answer is that we have an advantage across the board.
How is your software packaged? A simple jar with some config and dictionary files?
The Rosette SDK also provides Entity Extraction, which identifies entities such as people, places and organizations for over 20 languages. Entity Extraction can be used to implement what we refer to as “Discovery Search” features such as faceted navigation, trending terms, etc… And like our Base Linguistics, Entity Extraction works out of the box with Solr. Over the past year or so we have seen a significant increase in demand for Entity Extraction. This is an interesting topic and perhaps we can talk more about it in the near future. In the interim, here is a white paper that may be of interest:
Do you use some type of horrible license manager that makes distributed installations a huge hassle?
Our licensing mechanisms are basically invisible to developers and do not inhibit distributed installations in any manner. The licenses are not restricted by CPUs or throughput. The software resides on our clients’ servers and never talks back to Basis Technology. Our customers have complete control over the software and are not technically restricted in any way. As we speak, our software is installed and running on some of the largest distributed platforms on the planet.
As you know, this interview is targeted at developers. If a coder would like to “kick the tires”, how do they get started? What can they download and try? And do you have config examples for Solr?
Yes, we try to make this as easy as possible. The request form is located here: http://www.basistech.com/text-analytics/requests/evaluation-request.html
Yes, yes, I know developers don’t like forms. But the form only requires your basic contact information, and the benefit is that we compile the software for the developer’s platform of choice and provide email support for any questions that arise. In order to provide the software, documentation and support, we simply need to know who we are sending our software to.
As for config examples for Solr: yes, absolutely. It is simply a matter of making a few changes to Solr’s schema file. Our documentation provides specific examples that illustrate the changes to schema.xml needed to enable our analyzers. We also include the source for the Java code used to connect Solr to our linguistic modules, for power users who want to augment what our Solr connector already does.
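For orientation, a schema.xml change of this kind generally amounts to defining a field type whose analyzer chain points at the vendor’s factory classes. The snippet below is only a sketch: the tokenizer factory class name and its "language" attribute are made-up placeholders, not Basis’s actual class names or parameters, which are documented with their connector.

```xml
<!-- Sketch only: "com.example.rosette.RosetteTokenizerFactory" and its "language"
     attribute are placeholders, not the real Basis class name or parameter. -->
<fieldType name="text_rosette_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.example.rosette.RosetteTokenizerFactory" language="jpn"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Fields that should receive this analysis are then declared with type="text_rosette_ja" elsewhere in the same schema file.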
Andrew, if folks have read this far, then maybe you’ve piqued their interest. So how much does this all cost? How many zeros? And if I’m a startup company without much cash, is there any point in even trying it?
Our pricing is based on how and where the software is being deployed, meaning that smaller companies pay significantly less for our software than, say, Microsoft, Apple or Google. We work with plenty of small startups.
See here:
This is interesting stuff, thanks Andrew! For the techies out there, where can they get more technical info? And presumably somebody would eventually need to talk to a salesperson; what’s that link?
We are pretty open with our documentation and software. If readers are interested they can start here:
If you have specific questions, yes, you might eventually have to talk to a salesperson, but our salespeople are very technical (not pushy) and will be more than happy to put you in contact directly with one of our developers. And after all, they are really nice people (including me).