Overview

Lucidworks Fusion uses a data pipeline paradigm for both data ingestion (Index Pipelines) and search (Query Pipelines).  A Pipeline consists of one or more ordered Pipeline Stages.  Each Stage takes input from the previous Stage and provides input to the following Stage.  In the Index Pipeline case, the input is a document to be transformed prior to indexing in Apache Solr.

In the Query Pipeline case, the first stages manipulate a Query Request.  A middle stage submits the request to Solr, and the following stages can be used to manipulate the Query Response.
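The stage-to-stage hand-off described above can be pictured with a few lines of plain JavaScript.  This is only a conceptual sketch and not how Fusion is implemented; runPipeline and the two sample stages are hypothetical names, and a plain object stands in for the document.

```javascript
// Conceptual sketch only: each stage receives the previous stage's output.
function runPipeline(stages, input) {
    var current = input;
    for (var i = 0; i < stages.length; i++) {
        current = stages[i](current);
    }
    return current;
}

// Two hypothetical index stages operating on a plain object.
function lowerCaseTitle(doc) {
    doc.title = String(doc.title).toLowerCase();
    return doc;
}
function addTitleLength(doc) {
    doc.titleLength = doc.title.length;
    return doc;
}

var indexed = runPipeline([lowerCaseTitle, addTitleLength], { title: 'Hello Fusion' });
```

Because the stages are ordered, addTitleLength sees the title already lower-cased by the stage before it.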

The out-of-the-box stages included in Lucidworks Fusion let the user perform many common tasks such as field mapping for an Index Pipeline or specialized Facet queries for the Query Pipeline.  However, as described in a previous article, many projects have specialized requirements that call for the flexibility of the JavaScript stage.

The code snippets in this article have been simplified and shortened for convenience.  The full examples can be downloaded from my GitHub repo https://github.com/andrewshumway/FusionPipelineUtilities.

Taking JavaScript to the Next Level with Shared Scripts, Utility Functions and Unit Tests

Throwing a few scripts into a pipeline to perform some customized lookups or parsing logic is all well and good, but sophisticated ingestion strategies could benefit from some more advanced techniques.

  • Reduce maintenance problems by reusing oft-needed utilities and functions.  Some of the advanced features of the Nashorn JavaScript engine largely eliminate the need to copy/paste code into multiple Pipelines, so only a single copy has to be maintained.
  • Use a modern IDE for editing.  The code editor in Fusion is functional, but it provides little help with code completion, syntax highlighting, identifying typos, illuminating global variables, or generally speeding development.
  • Use Unit Tests to help reduce bugs and ensure the health of a deployment.

Reusing Scripts

Lucidworks Fusion uses the standard Nashorn JavaScript engine that ships with Java 8.  The load() command, combined with an Immediately Invoked Function Expression (IIFE), allows a small pipeline script to load another script.  This allows common functionality to be shared across pipelines.  Here’s an example:

// Load a shared script and return its result, or null if loading fails.
var loadLibrary = function(url){
    var lib = null;
    try{
        logger.info('\n\n*********\n* Trying to load library from: ' + url);
        lib = load(url); // jshint ignore:line
        logger.info('\n\n**********\n* The library loaded from: ' + url);
    }catch(e){
        logger.error('\n\n******\n* The script at ' + url + ' is missing or invalid\n' + e.message);
    }
    return lib;
};
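The reason this works is that load() evaluates the script and returns the value of its last expression, so a library written as an IIFE hands back whatever the IIFE returns.  The sketch below mimics that behavior with an indirect eval so that it can run outside Nashorn; fakeLoad, librarySource and util.shout are hypothetical names used only for illustration, and inside Fusion you would call load(url) instead.

```javascript
// Stand-in for Nashorn's load(): evaluate script text and return the value
// of its last expression. (Assumption: used here only so the sketch runs
// outside Nashorn.)
var fakeLoad = function (scriptText) {
    return (0, eval)(scriptText); // indirect eval runs in the global scope
};

// Hypothetical library source: an IIFE whose return value is the library.
var librarySource =
    '(function () {\n' +
    '    "use strict";\n' +
    '    var util = {};\n' +
    '    util.shout = function (s) { return String(s).toUpperCase() + "!"; };\n' +
    '    return util;\n' +
    '}).call(this);';

var loadedUtil = fakeLoad(librarySource);
var greeting = loadedUtil.shout('loaded');
```

Only the returned object is exposed to the caller; anything else declared inside the IIFE stays private to the library.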

Get Help From an IDE

Any sort of JavaScript functions or objects can be placed in a shared library such as utilLib.js and loaded as shown above.  Below is a simple example of a library containing two handy functions.
Explanatory notes:

  • The wrapping structure i.e. (function(){…}).call(this); makes up the IIFE structure used to encapsulate the  util object.  While this is not strictly necessary, it provides a syntax easily understood by the IntelliJ IDE.
  • The globals comment at the top, as well as the jshint comment at the bottom, are hints to the JSHint code validation engine used in the IDE.  These suppress error conditions resulting from the Nashorn load() functionality and global variables set by the Java environment which invokes the JavaScript Pipeline Stage.
  • The IDE underlines potentially illegal code in red.  This is an opportunity to fix typos without having to repeatedly test-load the script and hunt through a log file only to find a cryptic error message from the Nashorn engine.  Also, note the use of the “use strict” directive.  This tells JSHint to also look for things like the inadvertent declaration of global variables.
/* globals  Java,arguments*/
(function(){
    "use strict";
    var util = {};
    // True when obj exposes the methods every java.lang.Object has.
    util.isJavaType = function(obj){
        return !!(obj &&
            typeof obj.getClass === 'function' &&
            typeof obj.notify === 'function' &&
            typeof obj.hashCode === 'function');
    };
    /**
     * For Java objects, return the short name, 
     * e.g. 'String' for a java.lang.String
     * 
     * For JavaScript objects, return the lower-case type name,
     * e.g. 'string' for a JavaScript String
     *
     */
    util.getTypeOf = function getTypeOf(obj){
        'use strict';
        var typ = 'unknown';
        //test for java objects
        if( util.isJavaType(obj)){
            typ = obj.getClass().getSimpleName();
        }else if (obj === null){
            typ = 'null';
        }else if (typeof(obj) === typeof(undefined)){
            typ = 'undefined';
        }else if (typeof(obj) === typeof(String())){
            typ = 'string';
        }else if (Array.isArray(obj)) {
            // typeof [] is 'object', so arrays must be tested explicitly
            typ = 'array';
        }else if (Object.prototype.toString.call(obj) === '[object Date]'){
            typ = 'date';
        }else {
            typ = obj ? typeof(obj) : typ;
        }
        return typ;
    };


    //return util to make it publicly accessible
    return util;
}).call(this); // jshint ignore: line
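The “use strict” note above can also be seen at run time, not just in the IDE.  As a minimal sketch (strictAssign and undeclaredTotal are hypothetical names), assigning to a variable that was never declared throws a ReferenceError in strict mode instead of silently creating a global:

```javascript
// In strict mode, a missing `var` is an error rather than a new global.
function strictAssign() {
    "use strict";
    try {
        undeclaredTotal = 1; // jshint ignore:line
        return 'assigned';
    } catch (e) {
        return e.name;
    }
}

var outcome = strictAssign();
```

Without the directive, the same assignment would quietly add undeclaredTotal to the global scope, where it could collide with variables from other scripts.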

Overview of Utility Functions

Here is a summary description of some of the utility functions included in utilLib.js

index.concatMV(doc, fieldName, delim) Return a delimited String containing all values for a given field, with duplicate values listed only once.  If the names field contains values for 'James', 'Jim', 'Jamie', and 'Jim', calling index.concatMV(doc, 'names', ', ') would return "James, Jim, Jamie".

index.getFieldNames(doc, pattern) Return an array of field names in doc which match the pattern regular expression.
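As a rough illustration of the pattern matching involved, the sketch below filters a plain array of names with a regular expression.  It is not the code from utilLib.js; matchFieldNames is a hypothetical name, and it assumes the field names have already been pulled out of the document into a JavaScript array.

```javascript
// Sketch: keep only the names that match `pattern` (a RegExp or pattern string).
function matchFieldNames(names, pattern) {
    var re = (pattern instanceof RegExp) ? pattern : new RegExp(pattern);
    var matches = [];
    for (var i = 0; i < names.length; i++) {
        re.lastIndex = 0; // guard against stateful /g patterns
        if (re.test(names[i])) {
            matches.push(names[i]);
        }
    }
    return matches;
}

var dateFields = matchFieldNames(['title_s', 'created_dt', 'modified_dt'], /_dt$/);
```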

index.trimField(doc, fieldName) Normalize whitespace in all values of the specified field.  Leading and trailing whitespace is removed and redundant whitespace within values is replaced with a single space.
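The whitespace handling can be sketched for a single value as follows.  This is an assumption about the behavior described, not the library's own code, and normalizeWhitespace is a hypothetical name.

```javascript
// Sketch: collapse runs of whitespace, then drop a leading/trailing space.
function normalizeWhitespace(value) {
    return String(value)
        .replace(/\s+/g, ' ')   // tabs, newlines and repeated spaces become one space
        .replace(/^ | $/g, ''); // remove the single space left at either end
}

var cleaned = normalizeWhitespace('  Lucidworks \t  Fusion\n');
```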

util.concat(varargs) Here varargs can be one or more arguments of String or String[].  They will all be concatenated into a single String and returned.
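One way to accept a mix of Strings and String arrays is to walk the arguments object.  The sketch below assumes plain JavaScript strings and arrays (the real function may also handle Java types), and concatAll is a hypothetical name.

```javascript
// Sketch: concatenate any mix of strings and arrays of strings.
function concatAll() {
    var parts = [];
    for (var i = 0; i < arguments.length; i++) {
        var arg = arguments[i];
        if (Array.isArray(arg)) {
            parts.push(arg.join(''));
        } else {
            parts.push(String(arg));
        }
    }
    return parts.join('');
}

var joined = concatAll('a', ['b', 'c'], 'd');
```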

util.dateToISOString(date) Convert a Java Date or JavaScript Date into an ISO 8601 formatted String.

util.dedup(arr) Remove redundant elements in an array.
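A minimal sketch of de-duplication, assuming that the first occurrence of each element should be kept and that order should be preserved (dedupArray is a hypothetical name, not the library function):

```javascript
// Sketch: keep the first occurrence of each element, preserving order.
function dedupArray(arr) {
    var result = [];
    for (var i = 0; i < arr.length; i++) {
        if (result.indexOf(arr[i]) === -1) {
            result.push(arr[i]);
        }
    }
    return result;
}

var unique = dedupArray(['a', 'b', 'c', 'a', 'b']);
```

The indexOf scan is quadratic, which is fine for the handful of values a typical multi-valued field holds.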

util.decrypt(toDecrypt) Decrypt an AES encrypted String.

util.encrypt(toEncrypt) Encrypt a string with AES encryption.

util.getFusionConfigProperties() Read in the default Fusion config/config.properties file and return it as a Java Properties object.

util.isoStringToDate(dateString) Convert an ISO 8601 formatted String into a Java Date.

util.queryHttp2Json(url) Perform an HTTP GET on a URL and parse the response into JSON.

util.stripTags(markupString) Remove markup tags from an HTML or XML string.
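For simple, well-formed markup, a regular expression is often enough.  The sketch below is deliberately naive and is not the library's implementation: it does not decode entities and can be fooled by comments, CDATA sections or a ">" inside an attribute value.  stripTagsNaive is a hypothetical name.

```javascript
// Naive sketch: drop anything that looks like <...>.
function stripTagsNaive(markup) {
    return String(markup).replace(/<[^>]*>/g, '');
}

var plainText = stripTagsNaive('<p>Hello <b>Fusion</b></p>');
```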

util.truncateString(text, len, useWordBoundary) Truncate text to a length of len.  If useWordBoundary is true break on the word boundary just before len.
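The word-boundary option can be sketched as below.  This is one possible reading of the description rather than the code in utilLib.js, and truncateAt is a hypothetical name; it treats a single space as the word separator.

```javascript
// Sketch: cut text to at most `len` characters, optionally backing up to
// the last space so that a word is not split in the middle.
function truncateAt(text, len, useWordBoundary) {
    var s = String(text);
    if (s.length <= len) {
        return s;
    }
    var cut = s.substring(0, len);
    // If the character right after the cut is a space, the cut already
    // falls on a word boundary and nothing more needs to be removed.
    if (useWordBoundary && s.charAt(len) !== ' ') {
        var lastSpace = cut.lastIndexOf(' ');
        if (lastSpace > 0) {
            cut = cut.substring(0, lastSpace);
        }
    }
    return cut;
}

var hardCut = truncateAt('The quick brown fox', 12, false); // 'The quick br'
var softCut = truncateAt('The quick brown fox', 12, true);  // 'The quick'
```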

Testing the Code

Automated unit testing of Fusion stages can be complicated.  Unit testing shared utility functions intended for use in Fusion stages is even more difficult.  A full test harness is beyond the scope of this article, but the essentials can be accomplished with the command-line curl utility or a REST client like Postman.

  • Start with a well-known state in the form of a pre-made PipelineDocument. To see an example of the needed JSON, look at what is produced by the Logging Stage which comes with Fusion.
  • POST the PipelineDocument to Fusion using the Index Pipelines API.  You will need to pass an ID and Collection name as parameters, as well as the trailing “/index” path, in order to invoke the pipeline.
  • The POST operation should return the document as modified by the pipeline.  Inspect it and signal Pass or Fail events as needed.

Unit tests can also be performed manually by running the Pipeline within Fusion.  This could be part of a Workbench simulation or an actual Ingestion/Query operation.  The utilLib.js contains a rudimentary test harness for executing tests and comparing the results to an expected String value.  The results of tests are written to connections.log or api.log and are also pushed into the Stage’s context map in the _runtime_test_results element.  The first test calls util.dedup with 'a', 'b', 'c', 'a', 'b' and checks that the results do not contain the duplicates.  Other common tests are also performed.  For complete details see the index.runTests() function in utilLib.js.
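The compare-to-an-expected-String idea can be sketched in a few lines.  This is a hypothetical mini harness, not the index.runTests() code from utilLib.js: runTest, testResults and context are illustrative names, and a plain object stands in for the Stage's context map.

```javascript
// Hypothetical mini harness: run a test function, compare its stringified
// result to the expected String, and record the outcome.
function runTest(results, name, testFn, expected) {
    var actual;
    try {
        actual = String(testFn());
    } catch (e) {
        actual = 'ERROR: ' + e.message;
    }
    results.push({
        name: name,
        expected: expected,
        actual: actual,
        pass: actual === expected
    });
}

var testResults = [];
runTest(testResults, 'join', function () { return ['a', 'b', 'c'].join(','); }, 'a,b,c');
runTest(testResults, 'upper', function () { return 'abc'.toUpperCase(); }, 'abc');
runTest(testResults, 'throws', function () { throw new Error('boom'); }, 'ok');

// Stand-in for the Stage's context map (a plain object here).
var context = {};
context._runtime_test_results = testResults;
```

Catching exceptions inside the harness means one broken utility shows up as a failed test instead of aborting the whole stage.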

Summary

This article demonstrates how to load shareable JavaScript into Fusion’s Pipeline Stages so that common functions can be shared across pipelines.  It also contains several handy utility functions which can be used as-is or as building blocks in more complex data manipulations.  Additionally, ways to avoid common pitfalls such as JavaScript syntax typos and unintended global variables were shown.  Finally, a Pipeline Simulation was run and the sample unit-test results were shown.

Acknowledgements

Special thanks to Carlos Valcarcel and Robert Lucarini of Lucidworks as well as Patrick Hoeffel and Matt Kuiper at Polaris Alpha for their help and sample scripts.