LucidWorks Big Data & Oozie Workflow With VizOozie

February 7, 2013

In this post we will discuss how to create a visualized workflow graph for Oozie, a workflow management system for Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions: http://oozie.apache.org

At LucidWorks we use Oozie in our LucidWorks Big Data product. The workflows we provide with the platform are configured and run with Oozie. Developers create workflow.xml files, the workflow definition files for Oozie, and deploy them to Hadoop. A good explanation of how this works is provided here: http://www.infoq.com/articles/oozieexample

Some workflows get complicated quickly and may include subworkflows, forks, joins, and other actions that are hard to follow in XML. A visualization tool helps streamline workflow design and makes it easy to grasp what a workflow does.

VizOozie is an open source tool that converts your static XML workflow definitions into dot files, which the Graphviz dot program can render as PDF or other formats: http://www.graphviz.org/

You will need a Unix-like environment, Python, and Graphviz dot installed to run this.

Check it out from GitHub and run:

python vizoozie/vizoozie.py example/workflow.xml example/workflow.dot

or use your own Oozie workflow xml file.

This generates a dot file, which can be converted to PDF with dot:

dot -Tpdf example/workflow.dot -o example/workflow.pdf
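
If you render workflows often, the two commands above can be chained from a small script. The following is only a sketch: the helper names build_commands and render are hypothetical and not part of VizOozie, and it assumes it is run from the VizOozie checkout with python and dot on the PATH.

```python
import os
import subprocess


def build_commands(workflow_xml, fmt="pdf"):
    """Return the two command lines from this post for a given workflow file.

    Hypothetical helper: the .dot and output files are placed next to the
    input XML, mirroring the example/workflow.* paths used above.
    """
    base, _ext = os.path.splitext(workflow_xml)
    dot_file = base + ".dot"
    out_file = base + "." + fmt
    return [
        # Step 1: VizOozie turns the workflow XML into a dot file.
        ["python", "vizoozie/vizoozie.py", workflow_xml, dot_file],
        # Step 2: Graphviz dot renders the dot file in the requested format.
        ["dot", "-T" + fmt, dot_file, "-o", out_file],
    ]


def render(workflow_xml, fmt="pdf"):
    """Run both steps in order, stopping at the first failing command."""
    for cmd in build_commands(workflow_xml, fmt):
        subprocess.check_call(cmd)
```

Calling render("example/workflow.xml") would then run both steps; passing fmt="png" or fmt="svg" selects another of dot's output formats.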

Standard workflow shapes are used for the start, end, process, join, fork and decision nodes. The fill colors of action nodes are configurable in the vizoozie.properties file (e.g. the java action is shown in blue).

The code is pretty simple: for each node type, it converts the XML to a dot string using xml.dom.minidom and writes it out. For example, given an XML snippet:

  <fork name="post-process">
    <path start="complex-math" />
    <path start="more-complex" />
    <path start="geek-candy-process" />
  </fork>

the code for a fork node looks like this:

    def processFork(self, doc):
        output = ''
        for node in doc.getElementsByTagName("fork"):
            name = self.getName(node)
            output += '\n' + name.replace('-', '_') + " [shape=octagon];\n"
            for path in node.getElementsByTagName("path"):
                start = path.getAttribute("start")
                output += '\n' + name.replace('-', '_') + " -> " + start.replace('-', '_') + ";\n"
        return output

This method does two things for each fork. It normalizes the node name with name.replace('-', '_'), since hyphens are not allowed in unquoted dot identifiers, and it assigns the node its shape (shape=octagon). It then walks the fork's paths, such as <path start="complex-math" />, and emits one edge per path. For the example above, the method produces output like this:

post_process [shape=octagon];
post_process -> complex_math;
post_process -> more_complex;
post_process -> geek_candy_process;
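
To try the fork conversion outside of VizOozie, here is a minimal standalone sketch of the same idea. The function names dot_id, fork_to_dot and wrap_digraph are made up for this illustration, and the digraph wrapper is an assumption: dot only accepts statements inside a graph block, so the full tool must emit something similar, but this is not its actual code.

```python
from xml.dom import minidom

SNIPPET = """<fork name="post-process">
  <path start="complex-math" />
  <path start="more-complex" />
  <path start="geek-candy-process" />
</fork>"""


def dot_id(name):
    """Normalize an Oozie node name into an unquoted dot identifier."""
    return name.replace("-", "_")


def fork_to_dot(doc):
    """Emit one octagon node per <fork> and one edge per <path start=...>."""
    lines = []
    for node in doc.getElementsByTagName("fork"):
        name = dot_id(node.getAttribute("name"))
        lines.append(name + " [shape=octagon];")
        for path in node.getElementsByTagName("path"):
            lines.append(name + " -> " + dot_id(path.getAttribute("start")) + ";")
    return "\n".join(lines)


def wrap_digraph(body, graph_name="workflow"):
    """Wrap statements in a digraph block so dot can parse them (assumed wrapper)."""
    return "digraph " + graph_name + " {\n" + body + "\n}\n"


doc = minidom.parseString(SNIPPET)
print(wrap_digraph(fork_to_dot(doc)))
```

Saving the printed text to a file and running dot -Tpdf on it should be enough to render this one fork on its own.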

When passed to the dot program, this generates a fork node with three child nodes. I hope you find this explanation useful.

IP
