XMLLoader for Pig – Big Data

Big Data sources come in various forms: structured, semi-structured, or unstructured. In this blog post, we explore a way to load and parse an XML source using Apache Pig.

Pig is a procedural data processing tool from Apache, typically used for:
a) Extract-Transform-Load (ETL)
b) Iterative Data processing

Pig scripts are written in Pig Latin and run on Hadoop as MapReduce jobs, with data stored in the Hadoop Distributed File System (HDFS).

To load XML data into Pig, the XML must first be parsed into a form Pig can understand. For this we use the XMLLoader() function from Apache's PiggyBank, a repository of user-defined functions (UDFs) written in Java.

Follow the steps below to load the data into Pig:

1) Consider a sample XML file, names.xml:
[Image: contents of names.xml]
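
The contents of the sample file are not reproduced in the text, so here is a minimal sketch that writes a hypothetical stand-in. The element names (NAMES, NAME, JOB, COUNTRY, SALARY, CURRENCY) are taken from the regular expression in step 6. The `<PEOPLE>` root element, the helper name `write_sample`, and every value are assumptions made up for illustration. The currency is written as a numeric code only because step 6 declares that field as an int.

```python
from pathlib import Path

# Hypothetical data: tag names follow the regex in step 6; the <PEOPLE> root
# element and all values are invented for illustration.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<PEOPLE>
  <NAMES>
    <NAME>Asha</NAME>
    <JOB>Engineer</JOB>
    <COUNTRY>India</COUNTRY>
    <SALARY>55000.50</SALARY>
    <CURRENCY>356</CURRENCY>
  </NAMES>
  <NAMES>
    <NAME>Ben</NAME>
    <JOB>Analyst</JOB>
    <COUNTRY>USA</COUNTRY>
    <SALARY>72000.00</SALARY>
    <CURRENCY>840</CURRENCY>
  </NAMES>
</PEOPLE>
"""


def write_sample(path="names.xml"):
    """Write the sample document to a local file and return its path."""
    target = Path(path)
    target.write_text(SAMPLE_XML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"wrote {write_sample()}")
```

Running the script leaves a names.xml in the current directory, which can then be copied to HDFS as in step 2.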





2) Copy the XML file to HDFS:

> hadoop fs -put names.xml /piginput/

3) Start the Pig shell and register piggybank.jar:

> pig
grunt> register '/usr/lib/pig/piggybank.jar';

4) Define a short alias for the XMLLoader function so it can be used in your scripts:

grunt> define XMLLoader org.apache.pig.piggybank.storage.XMLLoader();

5) Load the XML data into a relation. The argument 'NAMES' is the tag that delimits each record:

grunt> xmldata = load '/piginput/names.xml' USING org.apache.pig.piggybank.storage.XMLLoader('NAMES') AS (doc:chararray);
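
To make the shape of `xmldata` concrete: each row holds one complete `<NAMES>...</NAMES>` element as a single string in the `doc` field. The sketch below imitates that splitting in plain Python. It is only a conceptual illustration, not how XMLLoader is implemented. The function name `split_records` and the sample text are hypothetical, and the sketch only handles tags without attributes.

```python
import re


def split_records(xml_text, tag):
    """Return each <tag>...</tag> element as one string, like one 'doc' row."""
    # Non-greedy match with DOTALL so that a record may span several lines.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(xml_text)


if __name__ == "__main__":
    # Hypothetical input with two records under an invented <PEOPLE> root.
    sample = (
        "<PEOPLE>"
        "<NAMES><NAME>Asha</NAME></NAMES>"
        "<NAMES><NAME>Ben</NAME></NAMES>"
        "</PEOPLE>"
    )
    for doc in split_records(sample, "NAMES"):
        print(doc)
```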

6) Parse each record with a regular expression:

grunt> data = foreach xmldata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<NAMES>\\s*<NAME>(.*)</NAME>\\s*<JOB>(.*)</JOB>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<SALARY>(.*)</SALARY>\\s*<CURRENCY>(.*)</CURRENCY>\\s*</NAMES>')) AS (name:chararray, job:chararray, country:chararray, salary:float, currency:int); 
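
To see what the regular expression pulls out of one record, the same pattern can be tried outside Pig. The sketch below is a Python approximation: Pig evaluates the pattern with Java's regex engine, and the Pig string needs doubled backslashes (`\\s`) where a Python raw string needs only one. The record contents and the function name `parse_record` are hypothetical.

```python
import re

# Same pattern as the Pig statement above, with single backslashes because
# Python raw strings do not need Pig's extra escaping.
PATTERN = re.compile(
    r"<NAMES>\s*<NAME>(.*)</NAME>\s*<JOB>(.*)</JOB>\s*"
    r"<COUNTRY>(.*)</COUNTRY>\s*<SALARY>(.*)</SALARY>\s*"
    r"<CURRENCY>(.*)</CURRENCY>\s*</NAMES>"
)

# Hypothetical record; the values are invented for illustration.
SAMPLE_DOC = """<NAMES>
  <NAME>Asha</NAME>
  <JOB>Engineer</JOB>
  <COUNTRY>India</COUNTRY>
  <SALARY>55000.50</SALARY>
  <CURRENCY>356</CURRENCY>
</NAMES>"""


def parse_record(doc):
    """Return (name, job, country, salary, currency), or None if no match."""
    match = PATTERN.fullmatch(doc)  # the whole record must match the pattern
    if match is None:
        return None
    name, job, country, salary, currency = match.groups()
    # Mirror the AS (...) schema: salary as float, currency as int.
    return name, job, country, float(salary), int(currency)


if __name__ == "__main__":
    print(parse_record(SAMPLE_DOC))
```

Because the pattern lists the five elements in a fixed order, a record with a missing or reordered element does not match at all, so real data with optional fields would need a more tolerant pattern.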

7) View the output with:

grunt> dump data;

[Image: output of dump data]

All the source code is available in the Apache-Pig-XMLLoader repository on GitHub.
