Developing a Solr Plugin
Thanks to Andrew Janowczyk of Searchbox.com for this post.
For our flagship product, Searchbox.com, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.
There are four main types of custom plugins that can be created. We’ll discuss their differences here:
Search Components: search components are plugins that operate on the result set of a query. The results they produce typically appear at the end of the search response. For example (explained fully, with source code, in our slideshare here), suppose we have a field called “myfield” and want to count various words at query time. The response would look something like this:
[code language="xml"]
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">79</int>
      </lst>
      <result name="response" numFound="13262" start="0">
        <doc>
          <str name="id">f73ca075-3826-45d5-85df-64b33c760efc</str>
          <arr name="myfield">
            <str>dog body body body fish fish fish fish orange</str>
          </arr>
        </doc>
        <doc>
          <str name="id">bc72dbef-87d1-4c39-b388-ec67babe6f05</str>
          <arr name="myfield">
            <str>the fish had a small body. the dog likes to eat fish</str>
          </arr>
        </doc>
      </result>
      <lst name="demoSearchComponent">
        <lst name="f73ca075-3826-45d5-85df-64b33c760efc">
          <double name="body">3.0</double>
          <double name="fish">4.0</double>
          <double name="dog">1.0</double>
        </lst>
        <lst name="bc72dbef-87d1-4c39-b388-ec67babe6f05">
          <double name="body">1.0</double>
          <double name="fish">2.0</double>
          <double name="dog">1.0</double>
        </lst>
      </lst>
    </response>
[/code]
We see the typical Solr response enclosed in the <result> element, but beneath it we now also see a <lst> element, provided by our demoSearchComponent, which contains the word counts for both of the result documents. In other words, this process runs for every document in a result set, on every query. This could be useful for real-time language detection, snippet generation, or tagging (all of which are available as Searchbox plugins for free trial). But keep in mind: if your content is static, does it make sense to repeat this process every time the same document appears in a result set? Not really, which is where process factories come in.
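Before moving on, here is a minimal sketch of how a search component like this might be registered in solrconfig.xml. The class name com.example.DemoSearchComponent is a placeholder made up for illustration, not the actual class from our plugin, and the handler name /select is assumed. Listing the component under last-components asks Solr to run it after the standard components, which fits with its output showing up after the regular results:

[code language="xml"]
    <!-- Hypothetical class name; replace it with your own SearchComponent subclass -->
    <searchComponent name="demoSearchComponent" class="com.example.DemoSearchComponent"/>

    <requestHandler name="/select" class="solr.SearchHandler">
      <!-- Run our component after Solr's standard search components -->
      <arr name="last-components">
        <str>demoSearchComponent</str>
      </arr>
    </requestHandler>
[/code]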
Process Factories: when the content is static and the result a Search Component would compute is deterministic, it makes more sense to use a Process Factory (what Solr itself calls an update request processor). At index time, a process factory can be added to the data import chain so that every document is analyzed and the resulting information is stored in new fields. For example, the language of a document never changes, so at index time we can perform language detection once and store the result in a new field. Search queries can then return that information without any additional overhead (namely, re-detecting the language).
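As a rough sketch, assuming the custom factory extends Solr's UpdateRequestProcessorFactory, the solrconfig.xml wiring could look something like the following. The chain name and the class com.example.DemoLanguageProcessorFactory are invented for illustration; the Log and Run processors are standard Solr classes:

[code language="xml"]
    <!-- default="true" applies this chain to every update request -->
    <updateRequestProcessorChain name="detectLanguage" default="true">
      <!-- Hypothetical custom factory that adds a language field to each document -->
      <processor class="com.example.DemoLanguageProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <!-- RunUpdateProcessorFactory performs the actual indexing, so it goes last -->
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>
[/code]

Without default="true", the chain would have to be selected per request, for example through the update.chain request parameter.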
Request Handlers: request handlers provide a REST-style endpoint on the Solr instance to get some work done. An example would be: http://localhost:8982/collection1/detectLanguage?q="text to use to detect language". Notice that the difference here is that we can provide plain text as the query and have the handler process it directly, producing an XML result similar to the following:
[code language="xml"]
    <response>
        <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">0</int>
        </lst>
        <lst name="results">
            <str name="language">en</str>
        </lst>
    </response>
[/code]
These types of plugins are very useful not only for processing text, but also for working with the underlying repository. You also have access to the information contained in the documents, so you could, for example, use a request handler to count how many times the word “searchbox” appears across all of the indexed documents. Since that result is not on a per-document basis but instead represents the entire repository, it makes more sense to use a request handler.
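Exposing such an endpoint is mostly a matter of configuration. A minimal sketch of the solrconfig.xml entry, assuming a handler class of your own (com.example.DetectLanguageHandler is a made-up name, not our actual plugin class), might be:

[code language="xml"]
    <!-- The name becomes the URL path, e.g. .../collection1/detectLanguage -->
    <!-- Hypothetical class; a custom handler would typically extend RequestHandlerBase -->
    <requestHandler name="/detectLanguage" class="com.example.DetectLanguageHandler"/>
[/code]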
Filters: at index and query time, we often create a set of filter chains for text analysis, such as lowercasing, stemming, n-gramming, etc. There is no reason why you can’t add your own type of analysis! From our experience, though, most of the common filters already exist, with enough configurable parameters to cover almost all use cases. So, before you try to develop your own filter, it makes sense to look thoroughly through the list of all available filters and see whether some combination of them meets your needs exactly.
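To give an idea of how far the stock filters can go, here is a sketch of a field type in schema.xml that chains lowercasing, stemming, and n-gramming using factories that ship with Solr. The field type name text_demo and the gram sizes are arbitrary choices for illustration, and the commented-out custom filter class is hypothetical:

[code language="xml"]
    <fieldType name="text_demo" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
        <!-- A custom filter would slot in the same way (hypothetical class): -->
        <!-- <filter class="com.example.DemoFilterFactory"/> -->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
[/code]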
In the coming weeks we’ll have tutorials and source code available for developing your own Search Components, Process Factories, and Request Handlers. Of course, if you have a specific project in mind, the Solr experts over at Searchbox.com would love to hear about it!
You can see the original post here.