Getting Started with LucidWorks Big Data

January 29, 2013

LucidWorks Big Data (LWBD) is LucidWorks' newest, developer-focused platform. It combines the power of search, via LucidWorks Search, with the big data processing capabilities of the Hadoop ecosystem and the machine learning and Natural Language Processing (NLP) capabilities of tools like Mahout, OpenNLP and UIMA, without the pain of figuring out how to wire them all together or make them scale. It is designed to help companies gain deeper insight into their data by bringing Search, Discovery and Analytics together in a single platform. At LucidWorks, we firmly believe that one must take a multi-faceted approach to understanding data, and we think search is a key part of that approach, as it is one of the most ubiquitous user interfaces on the planet. When dealing with big data, it is not enough to have just Hadoop, just Hive or just search; you often need them all, or at least the ability to add them easily when you are ready.

Moreover, when dealing with data, it isn't just about the raw content or just about the logs produced by the system. You need a platform that ties the two together, as that leads to a deeper understanding of both the raw content and the users who interact with it. For more details on LWBD's features, please refer to the product description page and to the LWBD documentation. If you'd like to learn more, contact us and we can show you a demo and discuss how we can shave months off your next big data implementation.

I'll use the rest of this post to focus on getting started with the actual product: a simple example of ingesting web content, making it searchable and then doing some aggregate analysis on that data using the platform. Finally, I'll finish up with some ideas on where to go next.

Getting Started

First off, note that this is a developer platform, so I am assuming you are comfortable on the command line. Second, the VM you are downloading is designed to help developers get started, not to be a production system. If you are interested in installing LWBD on your own cluster or in Amazon AWS, please contact our sales department.

To get started, here’s what you will need to do first:

  1. Download the LWBD Virtual Machine image (you’ll have to fill out a form).  It’s a 3 GB file, so get a cup of coffee.
  2. See the VM prerequisites and start the VM using the install instructions.
  3. SSH into the machine or work from the command line in the VM (I find SSH easier to copy and paste from).  In this case, I did ssh ubuntu@192.168.1.84 to access the system.
  4. Consider creating a JSON formatting script for convenience.  This is not a requirement to make the examples work.
  5. Verify your system is running by executing the following on the command line in the VM (after you’ve logged in):
    curl -u administrator:foo http://localhost:8341/sda/v1/client/collections

    You should see something like (abbreviated for space):

    [
        {
            "status": "EXISTS",
            "createTime": 1359071751934,
            "collection": "collection1",
            "id": "collection1",
            "throwable": null,
            "children": [
                {
                    "status": "EXISTS",
                    "createTime": 1359071751934,
                    "collection": "collection1",
                    "children": [],
                    "id": "collection1",
                    "throwable": null,
                    "properties": {
                        "service-impl": "LucidWorksDataManagementService"
                    }
                },
                {
                    "status": "EXISTS",
                    "createTime": 1359071751937,
                    "collection": "collection1",
                    "children": [],
                    "id": "collection1",
                    "throwable": null,
                    "properties": {
                        "path": "hdfs://localhost:50001/data/collections/collection1",
                        "service-impl": "HadoopDataManagementService"
                    }
                }
            ]
        }
    ]
  6. If the last step does not work, please refer to support.lucidworks.com for help or to ask a question.
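The JSON formatting script mentioned in step 4 can be as simple as a short Python helper. Here is one possible sketch (the file name pretty.py is just my choice; Python's built-in `python -m json.tool` works just as well):

```python
import json

def pretty(raw):
    """Re-serialize a JSON string with 4-space indentation and sorted keys."""
    return json.dumps(json.loads(raw), indent=4, sort_keys=True)

# To use it as a pipe target, save this as pretty.py, add
#   import sys; print(pretty(sys.stdin.read()))
# and then run:  curl -u administrator:foo http://localhost:8341/... | python pretty.py
print(pretty('{"collection": "searchhub", "status": "EXISTS"}'))
```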

Content Acquisition and Search

Now that the prerequisites are out of the way, let's start bringing some content into the system.  I'll use curl for the examples here, but you can use any REST client you feel comfortable with, as LWBD speaks JSON over REST.
  1. Let’s create a collection where I can organize all the data:
    curl -s -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"collection":"searchhub"}' http://localhost:8341/sda/v1/client/collections

    The results should look like:

    {
        "status": "CREATED",
        "createTime": 1359200703437,
        "collection": "searchhub",
        "children": [
            {
                "status": "CREATED",
                "properties": {
                    "name": "searchhub",
                    "instance_dir": "searchhub_3"
                },
                "collection": "searchhub",
                "id": "searchhub",
                "throwable": null,
                "children": []
            },
            {
                "status": "CREATED",
                "properties": {
                    "path": "hdfs://localhost:50001/data/collections/searchhub",
                    "service-impl": "HadoopDataManagementService"
                },
                "collection": "searchhub",
                "id": "searchhub",
                "throwable": null,
                "children": []
            },
            {
                "status": "CREATED",
                "createTime": 1359200717458,
                "collection": "searchhub",
                "children": [],
                "id": "searchhub",
                "throwable": null,
                "properties": {
                    "service-impl": "HBaseDataManagementService"
                }
            }
        ],
        "id": "searchhub",
        "throwable": null,
        "properties": {}
    }
  2. Next, let’s create a Web Data source to crawl a web site:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"crawler":"lucid.aperture","type":"web","url":"http://old.searchhub.org","crawl_depth":-1,"name":"SearchHub", "bounds":"tree", "output_type":"com.lucid.crawl.impl.HBaseUpdateController", "output_args":"localhost:2181", "mapping":{"original_content":"true"}}' http://localhost:8341/sda/v1/client/collections/searchhub/datasources

    The results should look like:

    {
        "status": "CREATED",
        "createTime": 1359200742420,
        "collection": "searchhub",
        "children": [],
        "id": "14fd8c7ad12346b1a058d0f5e342d98b",
        "throwable": null,
        "properties": {
            "proxy_password": "",
            "parsing": true,
            "ignore_robots": false,
            "commit_on_finish": true,
            "max_bytes": 10485760,
            "id": "14fd8c7ad12346b1a058d0f5e342d98b",
            "add_failed_docs": false,
            "proxy_host": "",
            "verify_access": true,
            "log_extra_detail": false,
            "mapping": {
                "mappings": {
                    "pagecount": "pageCount",
                    "title": "title",
                    "fullname": "author",
                    "filelastmodified": "lastModified",
                    "content-type": "mimeType"
                },
                "verify_schema": true,
                "dynamic_field": "attr",
                "unique_key": "id",
                "lucidworks_fields": true,
                "multi_val": {
                    "body": false,
                    "mimeType": false,
                    "description": false,
                    "title": false,
                    "author": true,
                    "acl": true,
                    "fileSize": false,
                    "dateCreated": false
                },
                "datasource_field": "data_source",
                "default_field": null,
                "types": {
                    "date": "DATE",
                    "lastmodified": "DATE",
                    "filesize": "LONG",
                    "datecreated": "DATE"
                }
            },
            "output_args": "localhost:2181",
            "crawl_depth": -1,
            "commit_within": 900000,
            "include_paths": [],
            "collection": "searchhub",
            "fail_unsupported_file_types": false,
            "proxy_port": -1,
            "name": "SearchHub",
            "exclude_paths": [],
            "url": "http://old.searchhub.org/",
            "max_docs": -1,
            "bounds": "tree",
            "proxy_username": "",
            "caching": false,
            "output_type": "com.lucid.crawl.impl.HBaseUpdateController",
            "auth": [],
            "crawler": "lucid.aperture"
        }
    }

    Make a note of the “id” value, as I will use it later.

  3. Next, kick off the ingestion of the data:
    curl -u administrator:foo -X POST  http://localhost:8341/sda/v1/client/collections/searchhub/datasources/14fd8c7ad12346b1a058d0f5e342d98b

    The last bit of that URL is the ID from the previous step.  Your ID will be different. The result should be something like:

    {
        "id": "14fd8c7ad12346b1a058d0f5e342d98b",
        "createTime": 1359200823711,
        "status": "RUNNING",
        "collection": "searchhub",
        "children": [],
        "throwable": null
    }
  4. Let that run for a bit so there is data in the system, or log into the LucidWorks Search admin (http://HOST:8989/, username: admin, password: admin) and watch the documents flow in. I let mine run for a while; here's what the LWS admin panel looked like: [screenshot: LucidWorks Search admin panel]
  5. Next, we need to run a workflow to extract text from the raw HTML in order to make it indexable:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"parentWfId":"searchhub","workingDir":"/data/collections/searchhub-subwf/tmp/","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/extract","collection":"searchhub","zkConnect":"localhost:2181","tikaProcessorClass":"com.digitalpebble.behemoth.tika.TikaProcessor"}' http://localhost:8341/sda/v1/client/workflows/extract

    This will kick off a Hadoop job that processes all of the raw content through Tika. Since this can be a long-running job when you have a lot of content, we simply return a job ID that you can use to check the status of the results, something like:

    {
        "id": "0000006-130123050323975-oozie-hado-W",
        "workflowId": "extract",
        "createTime": 1359201918000,
        "status": "RUNNING",
        "children": [],
        "throwable": null
    }
  6. We should now have searchable content. You can search via curl or via the Admin UI. Since I’ve been using the APIs, I’ll continue here with the command:
    curl -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"query":{"q":"*:*","rows":1, "fl":"id,title,score"}}' http://localhost:8341/sda/v1/client/collections/searchhub/documents/retrieval

    The results look like:

    {
        "QUERY": {
            "json": {
                "responseHeader": {
                    "status": 0,
                    "QTime": 7,
                    "params": {
                        "rows": "1",
                        "version": "2.2",
                        "collection": "searchhub",
                        "q": "*:*",
                        "wt": "json",
                        "fl": [
                            "id,title,score",
                            "id"
                        ]
                    }
                },
                "response": {
                    "start": 0,
                    "maxScore": 1,
                    "numFound": 2287,
                    "docs": [
                        {
                            "score": [
                                1
                            ],
                            "id": "http://old.searchhub.org/2013/01/24/apache-solr-4-1-is-here/",
                            "title": [
                                "Apache Lucene/Solr 4.1 is here!"
                            ]
                        }
                    ]
                },
                "requestToken": "SDA_USER~779187014188fc3a"
            }
        }
    }

    See the documentation for more details on how to write queries and process the results.
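Incidentally, since every response is JSON, steps 2 and 3 are easy to script together: rather than copying the datasource "id" by hand, you can pull it out programmatically. A minimal Python sketch, using an abbreviated version of the creation response above (in practice you would capture the live curl output, e.g. `response=$(curl ...)`, instead of a pasted string):

```python
import json

# Abbreviated datasource-creation response from step 2 above
response = '{"status": "CREATED", "collection": "searchhub", "id": "14fd8c7ad12346b1a058d0f5e342d98b"}'

datasource_id = json.loads(response)["id"]
print(datasource_id)  # pass this id to the POST that starts the crawl
```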

Digging Deeper

So far we’ve covered the basics; now let’s try running a workflow to automatically extract Statistically Interesting Phrases (SIPs) from the content.  What’s a SIP?  It is a phrase whose words co-occur more often than one would expect given a random distribution of words.  SIPs are often useful for exploring new data sets, as they let you discover potentially important word combinations that you might not think of on your own.  Keep in mind that you will likely spend some time tuning your SIP process to improve the quality of the results; this usually involves stopword analysis, data cleansing and more.  For the sake of the example here, I’m only doing a few basic things to clean up the data.
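To make the statistic concrete: phrase scoring in the Mahout ecosystem is typically Dunning's log-likelihood ratio, which is what the sips_minLLR parameter in the workflow below thresholds on. The following is my own rough Python sketch of that calculation, not LWBD code; phrases whose words always travel together score high, while independent words score near zero:

```python
from math import log

def xlogx(x):
    return x * log(x) if x else 0.0

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio over a 2x2 contingency table:
    k11 = count of (wordA, wordB) together, k12 = wordA without wordB,
    k21 = wordB without wordA, k22 = neither word."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(10, 0, 0, 10))   # words that always co-occur: high score
print(llr(1, 9, 9, 81))    # statistically independent words: ~0
```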
To familiarize yourself with the workflows available in LWBD, run the following command, which will tell you all of the current workflows and the parameters they accept:

curl -u administrator:foo http://localhost:8341/sda/v1/client/workflows
Due to the length of the output, I will not include the results here, so please refer to the documentation.  For this example, I will be using a few of the ETL subworkflows.  The steps to run are:
  1. The ETL Vectorize Subworkflow:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"workingDir": "/data/collections/searchhub-subwf/tmp/","documentsAsText": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-text","documentsAsVectors": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-vectors","vec_nGrams": "2","vec_analyzer": "com.lucid.sda.hadoop.analysis.StandardStopwordAnalyzer","collection": "searchhub","zkConnect": "localhost:2181", "parentWfId":"searchhub","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/vectorize"}' http://localhost:8341/sda/v1/client/workflows/vectorize

    The output should look something like:

    {"id":"0000000-130123050323975-oozie-hado-W","workflowId":"vectorize","createTime":1359153226000,"status":"RUNNING","children":[],"throwable":null}

    Note, this step will take a bit to complete. If you wish to monitor it, point your browser at the Oozie UI: http://HOST:11000/

  2. Once it is complete, kick off the SIPs workflow:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"parentWfId":"","workingDir":"/data/collections/searchhub-subwf/tmp/","input":"hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-vectors/tokenized-documents","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/sips","sips_minLLR":"1","sips_maxNGramSize":"2"}' http://localhost:8341/sda/v1/client/workflows/sips

    You should get a similar output to the last workflow. You can also watch along in the Oozie UI for this workflow.

  3. Last, we need to index the SIPs to LWS by calling the index_sips workflow:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"parentWfId":"","workingDir":"/data/collections/searchhub-subwf/tmp/","input":"hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/sips-output/ngrams","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/index_sips","solr_zkHost":"localhost:2181/solr","collection":"searchhub","solrFailedThresholdPercent":"25"}' http://localhost:8341/sda/v1/client/workflows/index_sips
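Each of these workflow calls returns an Oozie job id, and besides watching the Oozie UI you can poll a job over Oozie's standard REST API (`/oozie/v1/job/<id>?show=info`). This sketch assumes the VM exposes that API on the same port as the UI; adjust the host and port if yours differs:

```python
import json
from urllib.request import urlopen

# Standard Oozie REST endpoint; assumes the VM serves it on the UI port (11000)
OOZIE_INFO = "http://localhost:11000/oozie/v1/job/{}?show=info"

def workflow_status(job_id, fetch=urlopen):
    """Return the Oozie status string for a job: RUNNING, SUCCEEDED, KILLED, ..."""
    with fetch(OOZIE_INFO.format(job_id)) as resp:
        return json.loads(resp.read())["status"]

# e.g. workflow_status("0000000-130123050323975-oozie-hado-W")
```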

When you are all done, you should be able to ask for SIPs via the API or view them in the LWS UI (see the searchhub_collocations collection). For instance, the following API call returns SIPs about performance, sorted by score:

 curl -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"STATISTICALLY_INTERESTING_PHRASES":{"q":"sip:performance","rows":5, "sort":"sip_score desc"}}' http://localhost:8341/sda/v1/client/collections/searchhub/documents/retrieval

The results should look like:

{
    "STATISTICALLY_INTERESTING_PHRASES": {
        "json": {
            "responseHeader": {
                "status": 0,
                "QTime": 5,
                "params": {
                    "sort": "sip_score desc",
                    "rows": "5",
                    "version": "2.2",
                    "collection": "searchhub_collocations",
                    "q": "sip:performance",
                    "wt": "json"
                }
            },
            "response": {
                "start": 0,
                "numFound": 357,
                "docs": [
                    {
                        "timestamp": "2013-01-26T14:55:06.505Z",
                        "sip": "performance increases",
                        "id": "performance increases",
                        "sip_score": 2342.1557061168896,
                        "_version_": 1425237193790586880
                    },
                    {
                        "timestamp": "2013-01-26T14:55:20.845Z",
                        "sip": "query performance",
                        "id": "query performance",
                        "sip_score": 1352.6313966933812,
                        "_version_": 1425237208827166720
                    },
                    {
                        "timestamp": "2013-01-26T14:55:06.499Z",
                        "sip": "performance improvements",
                        "id": "performance improvements",
                        "sip_score": 1101.159630701979,
                        "_version_": 1425237193784295424
                    },
                    {
                        "timestamp": "2013-01-26T14:55:06.266Z",
                        "sip": "performance googles",
                        "id": "performance googles",
                        "sip_score": 1082.749393145892,
                        "_version_": 1425237193539977216
                    },
                    {
                        "timestamp": "2013-01-26T14:53:39.277Z",
                        "sip": "high performance",
                        "id": "high performance",
                        "sip_score": 893.2830382194115,
                        "_version_": 1425237102325399552
                    }
                ]
            }
        }
    }
}

Where to Next?

I’ve covered a lot of ground so far, ranging from setup and basic content acquisition through to running workflows and exploring statistically interesting phrases.  From here, I’d encourage you to explore:

  1. The variety of input parameters available in the workflows
  2. Running more queries (perhaps adding facets, etc.)
  3. Exploring the Log Analysis capabilities (or writing your own!)
  4. Creating and running your own Pig scripts or Hadoop Map-Reduce jobs
  5. Running clustering or other machine learning algorithms
  6. Creating your own Mahout classifier and serving up classification results

Whatever you choose, keep in mind we are here to help via our forums or via the LucidWorks team!

 
