Fusion Working for You: A Custom RSS Crawler JavaScript Stage

By Kevin Cowan on December 22, 2016

One of the most powerful features of Fusion is the built-in JavaScript stage.  However, you shouldn’t think of this stage as merely a JavaScript stage.  Fusion uses the Nashorn JavaScript engine, which means you have at your fingertips all the Java class libraries used in the application.  In effect, you can script your own customized Java and/or JavaScript processing stages, and really make the Indexing Pipeline work for you, the way you want it to work.

So first off, we need to get our feet wet with Nashorn JavaScript.  Nashorn (German for “rhinoceros”) was open-sourced around the end of 2012 and shipped with Java 8.  While much faster than its predecessor, Rhino, it also delivered one long-desired feature: the combining of Java and JavaScript.  Though two entirely separate languages, these two have, as the saying goes, “had a date with destiny” for a long time.


Hello World

To start with, let us take a look at the most fundamental function in a Fusion JavaScript Stage:

    logger.info("Doc ID: " + doc.getId());
    return doc;

At the heart of things, this is the most basic function. Note that you are passed a ‘doc’ argument, which will always be a PipelineDocument, and you return that document (or, alternatively, an array of documents, but that’s a story we’ll cover in a separate article).  The ‘id’ of this document will be the URL being crawled; and, thanks to the Tika parser, the ‘body’ attribute will be the raw XML of our RSS feed.

To that end, the first thing you’ll want to do is open your Pipeline Editor and select the Apache Tika Parser.  Make sure the “Return parsed content as XML or HTML” and “Return original XML and HTML instead of Tika XML output” checkboxes are checked.  The Tika Parser in this case will really only be used to initially pull in the RSS XML from the feed you want to crawl.  Most of the remainder of the processing will be handled in our custom JavaScript stage. 

Now let’s add the stage to our Indexing Pipeline.  Click “Add Stage” and select “JavaScript” stage from the menu. 

Our function will operate in two phases.  The first phase will pull the raw XML from the document and use Jsoup to parse it, building a java.util.ArrayList of URLs to be crawled.  The second phase will use Jsoup to extract text from the various elements and set it on each PipelineDocument.  Our function will then return the ArrayList of parsed documents and send them on to the Solr Indexer stage.

So now that we’ve defined our processes, let’s show the work:


  Parsing the XML, Parsing the HTML

    function (doc) {
        var Jsoup = org.jsoup.Jsoup;
        var Parser = org.jsoup.parser.Parser;
        var PipelineDocument = com.lucidworks.apollo.common.pipeline.PipelineDocument;

        var doclist = new java.util.ArrayList();
        var outdocs = new java.util.ArrayList();
        var elementsToExtract = ["p", "span", "div", "a"];
        var targetElement = "loc";

        // Phase 1: parse the raw feed XML and build a list of URLs to crawl.
        try {
            var xmlstr = doc.getFirstFieldValue("body");
            var jdoc = Jsoup.parse(xmlstr, "", Parser.xmlParser());
            for each (var element in jdoc.select(targetElement)) {
                var docurl = element.ownText();
                if (docurl !== null && docurl !== "") {
                    logger.info("Parsed URL: " + docurl);
                    doclist.add(new PipelineDocument(docurl));
                }
            }
        } catch (ex) {
            logger.error("Failed to parse feed XML: " + ex);
        }

        // Phase 2: fetch each URL and extract its text into the PipelineDocument.
        try {
            for each (var pipelineDoc in doclist) {
                var docurl = pipelineDoc.getId();
                var jdoc = Jsoup.connect(docurl).get();
                if (jdoc !== null) {
                    logger.info("FOUND a JSoup document for url " + docurl);
                    var extractedText = "";
                    // get the title
                    var ele = jdoc.select("title").first();
                    if (ele !== null && ele.ownText() !== "") {
                        pipelineDoc.addField("title", ele.ownText());
                    }
                    // get the meta keywords and description
                    ele = jdoc.select("meta[name=keywords]").first();
                    if (ele !== null) {
                        pipelineDoc.addField("meta.keywords", ele.attr("content"));
                    }
                    ele = jdoc.select("meta[name=description]").first();
                    if (ele !== null) {
                        pipelineDoc.addField("meta.description", ele.attr("content"));
                    }
                    // then iterate over the target elements and pull just the text
                    for each (var val in elementsToExtract) {
                        for each (ele in jdoc.select(val)) {
                            if (ele !== null && ele.ownText() !== null) {
                                extractedText += " " + ele.ownText();
                            }
                        }
                    }
                    pipelineDoc.addField("body", extractedText);
                    logger.info("Extracted: " + extractedText);
                    outdocs.add(pipelineDoc);
                } else {
                    logger.warn("Jsoup Document was NULL for url " + docurl);
                }
            }
        } catch (ex) {
            logger.error("Failed to extract text: " + ex);
        }

        return outdocs;
    }

So in the above function, the first step is to parse the raw XML into a Jsoup Document. From there, we iterate over the <loc> elements found in the feed (jdoc.select("loc")) and build an ArrayList of PipelineDocuments, using each URL as the document id.  We then pass that list to the second block of script, which loops through it and again uses the Jsoup selector syntax to extract text from the various elements, e.g. jdoc.select("p").  In this example, I create a simple JavaScript array of four element types (a, p, div, span) and compile the extracted text.  I also extract the title of the document, and the meta keywords/description.
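If you want to experiment with the phase-one URL extraction outside of Fusion, here is a minimal plain-JavaScript stand-in. The real stage uses Jsoup’s xmlParser; the regex, function name, and sample URLs below are hypothetical, for illustration only:

```javascript
// Stand-in for phase 1: pull <loc> URLs out of sitemap-style XML.
// (The Fusion stage uses Jsoup's xmlParser; a simple regex illustrates
// the same selection without the Jsoup dependency.)
function extractLocs(xml) {
  var urls = [];
  var re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  var m;
  while ((m = re.exec(xml)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

var sample =
  '<urlset>' +
  '<url><loc>https://example.com/a</loc></url>' +
  '<url><loc>https://example.com/b</loc></url>' +
  '</urlset>';

console.log(extractLocs(sample)); // [ 'https://example.com/a', 'https://example.com/b' ]
```

Each extracted URL would then become the id of a new PipelineDocument, as in the stage above.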

Once we’ve extracted the text, we set whatever fields are relevant to our collection.  For this example I created four fields: title, body, meta.keywords and meta.description.  Note that I don’t save the raw markup itself.  You want to avoid putting raw text into your collection unless you have a specific need to do so; it’s best to just extract all the critical data/metadata and discard the raw text.
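The title and meta extraction in phase two can be sketched outside of Fusion the same way. Here is a plain-JavaScript stand-in (the real stage uses Jsoup’s select() and attr(); the function names and sample markup below are hypothetical):

```javascript
// Stand-in for phase 2's field extraction: pull the <title> text and a
// named meta tag's content attribute, as the stage does with Jsoup.
function extractTitle(html) {
  var m = /<title>([^<]*)<\/title>/i.exec(html);
  return m ? m[1].trim() : null;
}

function extractMeta(html, name) {
  var re = new RegExp('<meta\\s+name="' + name + '"\\s+content="([^"]*)"', 'i');
  var m = re.exec(html);
  return m ? m[1] : null;
}

var page =
  '<html><head><title>Fusion RSS Crawler</title>' +
  '<meta name="keywords" content="rss,fusion,jsoup">' +
  '<meta name="description" content="A custom crawler stage">' +
  '</head><body></body></html>';

console.log(extractTitle(page));               // Fusion RSS Crawler
console.log(extractMeta(page, "keywords"));    // rss,fusion,jsoup
console.log(extractMeta(page, "description")); // A custom crawler stage
```

In the stage itself, each of these values is written to the PipelineDocument with addField() before the document is handed to the Solr Indexer stage.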

Now that we’ve populated our PipelineDocuments, the function returns the ‘outdocs’ ArrayList to the Pipeline process, and the documents are then persisted by the Solr Indexer Stage. 

And that’s really all there is to it.  This implementation has been tested on Fusion 2.4.2. 




Krijn Mossel

Thanks, Kevin! This is a very useful example of how to use the Javascript custom stages.

I have one question: in your code, you submit the document to Solr via a CloudSolrClient, but Fusion also has an out-of-the-box ‘Solr Indexer’ stage, which could be used for the same purpose. Can you speak to why you chose to do it directly from your code, and what the pros and cons of either approach are?

Thanks again!

Kevin Cowan

Hi Krijn,
Thanks very much! I’m happy that you found the article helpful. My initial intent with the article was to show that you could “do it all” from a single JavaScript stage. You are quite correct, however, that there is the built-in Solr Indexer stage, and using it is definitely considered best practice; in fact, after some internal feedback I went back to the article and factored out the CloudSolrClient in favor of the Solr Indexer. The biggest ‘pro’ to using a custom, embedded CloudSolrClient would be if you wanted to a) do some type of massaging of your SolrInputDocument that would be outside the scope of the built-in stage; or b) control batch commits at a more granular level. I hope this helps.