Introduction to PIG

This article introduces the Hadoop ecosystem component "Pig".
I have explained the basics of Hadoop and MapReduce functionality in a few of my earlier articles; now it is time to move ahead and work with the Hadoop ecosystem.



Introduction: 
Pig is an open-source framework that works on top of a Hadoop cluster. It was originally developed at Yahoo and is now part of the Apache organization. 
Pig can handle all kinds of data - structured, unstructured, and semi-structured (just as a pig eats anything), hence the name. Like Hive, it is an abstraction over MapReduce, but Pig offers more data-processing flexibility than SQL and HiveQL.
Pig sits between a declarative query language (SQL) and a low-level procedural programming language: you write procedural data-flow scripts that are compiled into MapReduce code. Pig exploits the power of Hadoop.


Pig is divided into two parts:
  1. Pig Latin - a language for expressing data flows. It operates like ETL (Extract, Transform, and Load).
  2. An execution environment to run Pig Latin programs, which comes in two flavors:
    1. Local execution in a single JVM
    2. Distributed execution on a Hadoop cluster 
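A Pig Latin data flow follows the ETL pattern mentioned above: load, transform, store. A minimal sketch (the file name employees.txt and its columns are hypothetical):

```pig
-- Extract: load a tab-separated file (hypothetical name and schema)
emps = LOAD 'employees.txt' AS (id:int, name:chararray, salary:double);

-- Transform: keep only employees earning more than 20000
rich = FILTER emps BY salary > 20000.0;

-- Load (store): write the result to an output directory
STORE rich INTO 'rich_employees';
```

The same script runs unchanged in local mode or on a Hadoop cluster; only the execution environment differs.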
Pig Architecture:
   Shell (Grunt) - for interactive command execution.
   Pig Latin - the data flow language.
   Execution engine - translates Pig Latin commands into MapReduce execution plans and executes them. 
Fig. Pig Architecture
Data Types in Pig:
  1. Scalar data types : int, long, float, double
  2. Array data types : bytearray, chararray
  3. Composite data types : tuple, bag, map (as an analogy: table -> bag, row/record -> tuple, field/attribute -> atom)
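These types show up when you declare a schema while loading data. A sketch (the file name and columns are made up for illustration):

```pig
-- Declare scalar types in the schema of a LOAD (hypothetical file)
emps = LOAD 'employees.txt' AS (id:int, name:chararray, salary:double);

-- DESCRIBE prints the schema of a relation, something like:
-- emps: {id: int,name: chararray,salary: double}
DESCRIBE emps;
```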
Installation and running of Pig:
  1. Download Pig from the Apache Pig downloads page.
  2. Extract the tar file. (e.g., extract it on the Desktop as pig-0.11.0)
  3. Move the extracted directory to /usr/local/ (mv ~/Desktop/pig-0.11.0 /usr/local/pig)
  4. Set the PIG_HOME path in the ~/.bashrc file. (gedit ~/.bashrc)
  5. Add the following two lines to the bashrc file, then reload it using the command (source ~/.bashrc)
          export PIG_HOME=/usr/local/pig
          export PATH=$PATH:$PIG_HOME/bin
     Verify the installation with:

          $ pig -h

Pig Execution types:
  1. Local mode : $ pig -x local  // stores files in the local file system (not on HDFS) and processes them there.
  2. Hadoop cluster mode : $ pig -x mapreduce  // stores files in HDFS and processes them on the cluster.
  3. Executing a Pig script : a script is a file containing Pig commands; it can run in either mode.
          $ pig -x local demoscript.pig  (local mode)
          $ pig demoscript.pig  (MapReduce mode) 
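The demoscript.pig referenced above can contain any Pig Latin data flow. A minimal, hypothetical example (file name and columns are assumptions):

```pig
-- demoscript.pig: count hits per IP from a tab-separated log file
logs   = LOAD 'access_log.txt' USING PigStorage('\t')
         AS (ip:chararray, url:chararray);
by_ip  = GROUP logs BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;
STORE counts INTO 'hits_per_ip';
```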
Running Pig Programs:
There are three ways to run Pig programs; all of them work in both local and Hadoop mode:
  1. Script : Pig can run a script file that contains Pig commands. e.g., demoscript.pig
  2. Grunt : an interactive shell for running Pig commands.
  3. Embedded : run Pig programs from Java using the PigServer class, much like using JDBC to run SQL programs.
PIG Data: 
Scalar : e.g. int 2, chararray two, double 2.0 

tuple : (fields are grouped together into tuples)
           e.g. (101,ram,25000.0) 

bag : (tuples are grouped into bags)
         e.g. {(101,ram,25000.0),
                (102,sham,35000.0),
                (103,ajay)}
     The tuples in a bag need not all contain the same fields. 

Map : e.g. [1#ram,2#sham]  (key#value format) 

Relation : an outer bag of tuples, each of which may contain an inner bag (outer_bag{inner_bag})
                (Sachin,{(bat),(ball),(field)})
                (Rahul,{(bat),(field)})
                (Kumar,{(ball),(field)})
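Relations like the one above are exactly what GROUP produces. A sketch, assuming a hypothetical tab-separated file players.txt with (name, skill) columns:

```pig
-- players.txt (hypothetical), one (name, skill) pair per line
players = LOAD 'players.txt' AS (name:chararray, skill:chararray);

-- Grouping yields tuples of (group, inner bag of matching tuples)
by_name = GROUP players BY name;
DUMP by_name;
-- e.g. (Sachin,{(Sachin,bat),(Sachin,ball),(Sachin,field)})
```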
  
INPUT functions in PIG:
 Pig provides several built-in functions for input. All function names are case sensitive.
 1) BinStorage()
     Reads data in a machine-readable binary format. Used internally by Pig.
 2) PigStorage('field_delimiter')
     The default input function. The default delimiter is tab.
 3) TextLoader()
     Loads unstructured text.
 4) JsonLoader()
     Loads data in standard JSON format.
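These load functions are passed to LOAD via the USING clause. A sketch with hypothetical file names and schemas:

```pig
-- PigStorage with an explicit comma delimiter (hypothetical CSV file)
csv = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- TextLoader: each line of unstructured text becomes a one-field tuple
raw = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);

-- JsonLoader with a declared schema (hypothetical JSON file)
js = LOAD 'data.json' USING JsonLoader('id:int, name:chararray');
```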



For more reference:
http://pig.apache.org/docs/r0.10.0/start.html

