One of the goals of LucidWorks Big Data (LWBD) is to facilitate interoperation of components within the Hadoop ecosystem. One pathway we commonly use in LWBD is reading and writing data in HBase through Pig. At the most basic level, this is made possible through the HBaseStorage load/store backend provided by Pig. We have found there is a bit of an impedance mismatch between the way HBaseStorage returns records and the way Pig manipulates data. Specifically, HBaseStorage returns associative arrays (maps) corresponding to columns, but Pig is very record-oriented and lacks support for map manipulation.

To solve this problem, we have written a simple Pig UDF (User Defined Function) called AddToMap, which adds one or more key-value pairs to a map, updating any keys that already exist. Let's see it in action.
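As a rough mental model of what AddToMap does (this is an illustrative Python sketch, not the actual UDF, which is a Java EvalFunc shipped with LWBD as com.lucid.sda.pig.AddToMap):

```python
def add_to_map(m, *pairs):
    """Return a copy of map m with each (key, value) pair added or updated.

    Mirrors the semantics of the AddToMap Pig UDF: existing keys are
    overwritten, new keys are appended, and the input map is not mutated.
    """
    if len(pairs) % 2 != 0:
        raise ValueError("arguments after the map must come in (key, value) pairs")
    result = dict(m or {})
    for i in range(0, len(pairs), 2):
        result[pairs[i]] = pairs[i + 1]
    return result

row = {"title": "Record One", "collection": "collection1"}
updated = add_to_map(row, "extra", "Some additional data for id=1")
```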

First, let's define two text files. The first, text1.txt:

1,collection1,Record One
2,collection1,Record Two

And the second, text2.txt:

1,extra,Some additional data for id=1
1,foo,bar

We will read in the first text file and use the built-in TOMAP function to construct a map, which we will store in column family "cf1" in HBase.

-- Load from a text file
A = load 'text1.txt' using PigStorage(',')
     as (id:chararray, collection:chararray, title:chararray);

-- Map the fields to a single map[] field
B = foreach A generate id, TOMAP('title', title, 'collection', collection);

-- Store in HBase
store B into 'hbase://test-table'
     using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1');

If we scan “test-table” in HBase, we see:

hbase(main):003:0> scan 'test-table'
1         column=cf1:collection, timestamp=1361806763289, value=collection1
1         column=cf1:title, timestamp=1361806763289, value=Record One
2         column=cf1:collection, timestamp=1361806763290, value=collection1
2         column=cf1:title, timestamp=1361806763290, value=Record Two

Now we read these records from HBase, read the second text file, join the two, and update our existing records in HBase.

-- Load from HBase
C = load 'hbase://test-table'
   using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1, cf2', '-loadKey true')
   as (rowkey:bytearray, cf1:map[], cf2:map[]);

-- Load another text file and join to our data
D = load 'text2.txt' using PigStorage(',') as (id:chararray, key:chararray, value:chararray);
E = join C by rowkey, D by id;
F = foreach E generate C::rowkey, com.lucid.sda.pig.AddToMap(C::cf1, D::key, D::value);

-- Store in HBase
store F into 'hbase://test-table'
   using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1');

Here we load two column families into maps, although only "cf1" has any data (mainly just to demonstrate the syntax for multiple column families). The "-loadKey" flag indicates that you want to include the row key as a field in Pig (it is the first field and has type bytearray).

The interesting part can be seen when we create relation "F". Here we use the AddToMap UDF to modify our "cf1" map, which gets mapped back to individual columns in HBase when stored. AddToMap makes this kind of read-modify-write cycle possible.
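To make the dataflow concrete, here is an illustrative Python model of the join-and-update step (not LWBD code; table contents and names follow the example above). The inner join produces one tuple per matching row of text2.txt, and each tuple's updated map is written back, with HBase merging columns within a row:

```python
# Current state of "test-table", modeled as row key -> cf1 map
hbase = {
    "1": {"title": "Record One", "collection": "collection1"},
    "2": {"title": "Record Two", "collection": "collection1"},
}

# Rows of text2.txt as (id, key, value)
text2 = [
    ("1", "extra", "Some additional data for id=1"),
    ("1", "foo", "bar"),
]

# E = join C by rowkey, D by id  (an inner join, like Pig's default)
# F = foreach E generate rowkey, AddToMap(cf1, key, value)
# store F: each write adds columns to the row; HBase merges them
for row_id, key, value in text2:
    if row_id in hbase:                 # only matching rows survive the join
        updated = dict(hbase[row_id])   # AddToMap: copy, then add the pair
        updated[key] = value
        hbase[row_id].update(updated)   # store back; columns accumulate

# Row "1" now has four columns. Row "2" had no match in text2.txt, so it
# is absent from F, but its previously stored columns remain in the table.
```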

If we scan “test-table” again, we now see the new fields:

hbase(main):004:0> scan 'test-table'
1         column=cf1:collection, timestamp=1361815334160, value=collection1
1         column=cf1:extra, timestamp=1361815334160, value=Some additional data for id=1
1         column=cf1:foo, timestamp=1361815334160, value=bar
1         column=cf1:title, timestamp=1361815334160, value=Record One
2         column=cf1:collection, timestamp=1361815309532, value=collection1
2         column=cf1:title, timestamp=1361815309532, value=Record Two

The same technique can be applied to reading and writing documents in LWBD’s document service. The major difference is that we use complex row keys in HBase for efficient lookups. To enable encoding/decoding our custom row keys, we have provided two UDFs: ToDocumentRowKey and FromDocumentRowKey. These UDFs transform the non-human-readable row keys into a tuple of (collection, id) for convenient manipulation in Pig.
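Purely to illustrate the idea of such a pair of UDFs, here is a Python sketch of a round-trippable row-key codec. The actual LWBD encoding is internal and almost certainly different; this sketch simply assumes a key that packs (collection, id) as length-prefixed UTF-8 fields.

```python
import struct

# Hypothetical encoding: [2-byte len][collection bytes][2-byte len][id bytes].
# This is NOT the real LWBD row-key format, just a stand-in to show the
# encode/decode round trip that ToDocumentRowKey/FromDocumentRowKey provide.

def to_document_row_key(collection, doc_id):
    """Encode (collection, id) into a compact binary row key."""
    c = collection.encode("utf-8")
    d = doc_id.encode("utf-8")
    return struct.pack(">H", len(c)) + c + struct.pack(">H", len(d)) + d

def from_document_row_key(row_key):
    """Decode a binary row key back into a (collection, id) tuple."""
    (clen,) = struct.unpack_from(">H", row_key, 0)
    collection = row_key[2:2 + clen].decode("utf-8")
    (dlen,) = struct.unpack_from(">H", row_key, 2 + clen)
    doc_id = row_key[4 + clen:4 + clen + dlen].decode("utf-8")
    return collection, doc_id
```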

HBase gives you fast random access to your data, and Pig makes it very easy to process heterogeneous data sets. This bridge between the two lays the foundation for a very powerful big data processing pipeline.