Introduction to PIG

This article introduces you to "Pig", a part of the Hadoop ecosystem.
I have explained the basics of Hadoop and MapReduce functionality in a few of my earlier articles; now it's time to move ahead and work with the ecosystem tools.

Pig is an open-source framework that works on top of a Hadoop cluster. It was developed at Yahoo and is now part of the Apache organization.
Pig can handle all kinds of data - structured, unstructured, and semi-structured (like a pig, it eats anything) - hence the name. Like Hive, it is an abstraction over MapReduce, but Pig offers more data-processing flexibility than SQL or HiveQL.
Pig combines the strengths of both worlds: a declarative query language (like SQL) and a low-level procedural programming language that generates MapReduce code. Pig exploits the power of Hadoop.

Pig is divided into two parts:
  1. Pig Latin - a language for expressing data flows; it operates like ETL (Extract, Transform, and Load).
  2. An execution environment to run Pig Latin programs, which comes in two forms:
    1. Local execution in a single JVM
    2. Distributed execution on a Hadoop cluster
Pig Architecture:
   Shell (Grunt) - for interactive command execution.
   PigLatin - Data flow language.
   Execution engine - translates Pig Latin commands into MapReduce execution plans and executes them.
Fig. Pig Architecture
Data Types in Pig:
  1. Scalar Data type : int, long, float, double
  2. Array Data type : bytearray, chararray
  3. Composite Data Type : tuple, bag, atom (table -> bag, row/record -> tuple, attribute/field -> atom)
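The types above can be illustrated by declaring a schema in a Pig Latin LOAD statement. A minimal sketch (the file name emp.txt and its fields are assumptions for illustration):

          emp = LOAD 'emp.txt' USING PigStorage(',')
                AS (id:int, name:chararray, salary:double);  -- scalar types declared in a schema
          DESCRIBE emp;  -- prints the declared schema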
Installation and running of Pig:
  1. Download Pig from here.
  2. Extract the tar file. (e.g., extract it to the Desktop as pig-0.11.0)
  3. Move that folder to /usr/local/ (mv ~/Desktop/pig /usr/local/)
  4. Set the PIG_HOME path in the ~/.bashrc file. (gedit ~/.bashrc)
  5. Add the following 2 lines to the .bashrc file and reload it using the command (source ~/.bashrc)
          export PIG_HOME=/usr/local/pig
          export PATH=$PATH:$PIG_HOME/bin
     Check the installation of Pig by:

          $ pig -h

Pig Execution types:
  1. Local mode : $ pig -x local  // stores files in the local file system (not on HDFS) and processes them there.
  2. Hadoop cluster mode : $ pig -x mapreduce  // stores files in HDFS and processes them there.
  3. Executing a Pig script : a script is a file containing Pig commands; it can run in either mode.
          $ pig -x local demoscript.pig  (local mode)
          $ pig demoscript.pig  (mapreduce mode)
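A demoscript.pig like the one referenced above could be sketched as follows (the input file emp.txt and its fields are assumptions for illustration):

          -- demoscript.pig: load a comma-separated file, filter it, and print the result
          emp = LOAD 'emp.txt' USING PigStorage(',')
                AS (id:int, name:chararray, salary:double);
          rich = FILTER emp BY salary > 20000.0;  -- keep only high-salary records
          DUMP rich;                              -- write results to the console

Run it locally with $ pig -x local demoscript.pig, or against HDFS with $ pig demoscript.pig.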
Running Pig Programs:
There are 3 ways to run Pig programs; all of them work in both local and Hadoop modes:
  1. Script : Pig can run a script file that contains Pig commands. ex. demoscript.pig
  2. Grunt : an interactive shell for running Pig commands.
  3. Embedded : run Pig programs from Java using the PigServer class, much like using JDBC to run SQL programs.
PIG Data: 
Scalar : ex. int 2, chararray 'two', double 2.0

tuple : (fields are grouped together into tuples)
           ex. (101,ram,25000.0) 

bag : (tuples are grouped into bags)
         ex. {(101,ram,25000.0),(102,sham)}
     Tuples in a bag need not all contain the same fields.

Map :  ex. [1#ram,2#sham]  (key-value format) 

Relation : (outer_bag{inner_bag})
                (Sachin {bat,ball,field})
                (Rahul {bat,field})
                (Kumar {ball,field})
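A relation of outer and inner bags like the one above is what the GROUP operator produces. A minimal sketch (the file players.txt and its fields are assumptions for illustration):

          players = LOAD 'players.txt' USING PigStorage(',')
                    AS (name:chararray, skill:chararray);
          by_name = GROUP players BY name;  -- yields (name, {bag of matching tuples})
          DUMP by_name;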
INPUT functions in PIG:
 There are several built-in functions for loading input. All function names are case-sensitive.
 1) BinStorage()
     Data in machine-readable format; used internally by Pig.
 2) PigStorage('field_delimiter')
     Loads delimited text; the default delimiter is tab. This is the default load function.
 3) TextLoader()
     Loads unstructured text.
 4) JsonLoader()
     Loads data in standard JSON format.
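Each of these input functions plugs into a LOAD statement. A hedged sketch (the file names and schemas below are assumptions for illustration):

          a = LOAD 'data.txt' USING PigStorage('|') AS (f1:chararray, f2:int);   -- '|'-delimited text
          b = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);           -- unstructured text, one line per record
          c = LOAD 'data.json' USING JsonLoader('name:chararray, age:int');      -- JSON records with a declared schema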
