Big Data sources come in various forms: structured, semi-structured, or unstructured. In this blog, we explore a way to process and parse an XML source using Apache Pig.
Pig is a procedural data-processing tool from Apache, commonly used for the following:
a) Extract-Transform-Load (ETL)
b) Iterative data processing
Pig scripts are written in Pig Latin and run on Hadoop as MapReduce jobs, reading from and writing to the Hadoop Distributed File System (HDFS).
To load XML data into Pig structures, the XML must first be parsed into a form Pig can understand. For this we use the XMLLoader() function from Apache PiggyBank, a community repository of Java user-defined functions (UDFs).
Follow the steps below to load the data into Pig:
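1) Create an input file. The element names below come from the regular expression used in step 6; the values themselves are made-up sample data:

```xml
<NAMES>
  <NAME>John Doe</NAME>
  <JOB>Engineer</JOB>
  <COUNTRY>USA</COUNTRY>
  <SALARY>75000</SALARY>
  <CURRENCY>USD</CURRENCY>
</NAMES>
```

Save one or more such <NAMES> records as names.xml.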
2) Load the XML file into HDFS:
> hadoop fs -put names.xml /piginput/
3) Start the Grunt shell and register piggybank.jar:
> pig
grunt> register '/usr/lib/pig/piggybank.jar';
4) Define an alias for the XMLLoader UDF so it can be referenced by a short name:
grunt> define XMLLoader org.apache.pig.piggybank.storage.XMLLoader();
5) Load the data into a relation, with each <NAMES> element read as a single chararray:
grunt> xmldata = load '/piginput/names.xml' USING XMLLoader('NAMES') as (doc:chararray);
6) Parse each record with a regular expression:
grunt> data = foreach xmldata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<NAMES>\\s*<NAME>(.*)</NAME>\\s*<JOB>(.*)</JOB>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<SALARY>(.*)</SALARY>\\s*<CURRENCY>(.*)</CURRENCY>\\s*</NAMES>')) AS (name:chararray, job:chararray, country:chararray, salary:chararray, currency:chararray);
Note that REGEX_EXTRACT_ALL returns every captured field as a chararray, so the schema above declares them all as chararray; if a numeric salary is needed, cast it explicitly (e.g. (float)salary) in a later foreach.
7) The output can be seen with:
grunt> dump data;
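To sanity-check what the regular expression in step 6 captures, the same pattern can be tried outside Pig. Here is a small standalone Python sketch; the record string is a hypothetical example of the single chararray that XMLLoader('NAMES') would hand to Pig for one <NAMES> element:

```python
import re

# The same pattern the Pig script passes to REGEX_EXTRACT_ALL
# (Java and Python regex syntax agree for this simple pattern).
PATTERN = re.compile(
    r"<NAMES>\s*<NAME>(.*)</NAME>\s*<JOB>(.*)</JOB>\s*"
    r"<COUNTRY>(.*)</COUNTRY>\s*<SALARY>(.*)</SALARY>\s*"
    r"<CURRENCY>(.*)</CURRENCY>\s*</NAMES>"
)

# One hypothetical record: the whole <NAMES>...</NAMES> element as one string.
record = (
    "<NAMES> <NAME>John Doe</NAME> <JOB>Engineer</JOB> "
    "<COUNTRY>USA</COUNTRY> <SALARY>75000</SALARY> "
    "<CURRENCY>USD</CURRENCY> </NAMES>"
)

match = PATTERN.search(record)
if match:
    # Each capture group maps to one field of the Pig schema.
    name, job, country, salary, currency = match.groups()
    print(name, job, country, salary, currency)
```

Like REGEX_EXTRACT_ALL, this yields all five fields as strings ("John Doe", "Engineer", "USA", "75000", "USD"); any numeric conversion is a separate, explicit step.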
All the source code is available in the Apache-Pig-XMLLoader repository on GitHub.