HIVE in Hadoop

What is Hive in Hadoop?
Hive is a data warehousing framework built on top of Hadoop. It is mainly used for data analysis. It was developed at Facebook for people who are comfortable with SQL, and it is now an open source Apache project. Hive is designed for structured data. It is not a full database and does not replace an RDBMS; it works alongside one. You do not need to learn Java or the Hadoop API to use Hive, because Hive abstracts the complexity of Hadoop. Hive provides a SQL-like query language called HiveQL. A Hive query is internally converted into MapReduce jobs. Facebook uses Hive to analyze several TB of data every day.


Where to use Hive? (Hive applications)
1) Log Processing
2) Data Mining
3) Customer-facing business Intelligence
4) Predictive Modeling
5) Document Indexing, and many more.

Limitations of Hive:
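1) Does not offer real-time processing of data.
2) High query latency, because each query runs as a batch job.
3) Hive lacks full SQL support, and older versions do not provide row-level insert, update, or delete.
4) Older versions do not support transactions (transactions and row-level changes were added in Hive 0.14 for ACID tables), and subquery support is limited.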

Hive setup contains 3 folders:
lib - contains the jar files that collectively make up the functionality of Hive.
bin - contains the Hive scripts that launch the various Hive services.
conf - contains the Hive configuration files.

Interfaces and components of Hive:
1) Hive Web Interface (hwi)
2) Applications connecting over JDBC, ODBC, or the Thrift API
3) Command Line Interface (Beeline is the newer version)
4) MetaStore (stores metadata of tables and partitions, the system catalog, and so on)
5) HCatalog - built on top of the Hive metastore, provides an interface for Pig and MapReduce
6) Hive Driver, Compiler, Optimizer, and Executor - these work together to turn a query into a MapReduce job.

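MetaStore in Hive:
There are 3 configurations for the metastore:
1) Embedded metastore: the metastore service and its database run in the same process as Hive. The database is an embedded Derby database.
2) Local metastore: the metastore service runs in the same process as Hive, but the database runs in a separate process.
3) Remote metastore: the metastore service runs in its own process, separate from Hive, and clients connect to it over the network.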
 
Organization of data in hive:
Database-Tables-Partitions-Buckets
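
To make the hierarchy above concrete, here is a minimal sketch of one table that uses both partitions and buckets. The table name student_by_year, the partition column admission_year, and the bucket count of 4 are made up for illustration.

```sql
-- Hypothetical table: the names and the bucket count are illustrative only.
-- "partitioned by" creates one subdirectory per admission_year value.
-- "clustered by" hashes rows on id into 4 bucket files inside each partition.
create table student_by_year (id int, name string)
partitioned by (admission_year int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;
```

With this layout, the rows for one year would sit in a subdirectory such as admission_year=2015 under the table's directory, so a query that filters on admission_year only needs to read that subdirectory.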

Data Types:
Primitive Data Types:
Integer: tinyint (1 byte), smallint (2 bytes), int (4 bytes), bigint (8 bytes)
Boolean: boolean (true/false)
Floating: float, double, decimal
String: string, varchar
DateTime: timestamp (yyyy-mm-dd hh:mm:ss)
Binary: binary
Complex Data Types:
array, struct, map, uniontype
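
As a rough sketch of how the complex types look in practice, the table below declares an array, a map, and a struct column. The table name student_profile, its columns, and the choice of delimiters are assumptions made for this example.

```sql
-- Hypothetical table showing array, map, and struct columns.
create table student_profile (
  id int,
  name string,
  subjects array<string>,
  marks map<string, int>,
  address struct<city:string, pin:int>
)
row format delimited
  fields terminated by ','
  collection items terminated by '|'
  map keys terminated by ':'
stored as textfile;

-- Element access uses an index for arrays, a key for maps, and a dot for structs.
select subjects[0], marks['maths'], address.city from student_profile;
```

With these delimiters, a matching line in the data file would look like: 101,raj,maths|physics,maths:80|physics:75,pune|411001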

Commands for Database:
The examples below use a database named mydb.
The default database in Hive is 'default'.
Every command in Hive ends with a semicolon (;).


1) hive> create database mydb;

2) hive> create database mydb location 'path in hdfs';

3) hive> create database mydb comment 'This is back up database';

4) hive> create database mydb with dbproperties ('createdby'='Kishor');

5) hive> show databases; //show all databases present in hive

6) hive> describe database mydb;

7) hive> describe database extended mydb; //detailed description of database

8) hive> use mydb; //use database

9) hive> drop database if exists mydb; //delete database

10) hive> drop database if exists mydb cascade; //delete database along with its tables

11) hive> alter database mydb set dbproperties ('createdby'='kishor');

12) To show the current database in the prompt, set the following property to true. Its default value is false.

    hive> set  hive.cli.print.current.db=true;

    hive (mydb)> 

// Another way to set this property is to add it to the hive-site.xml file in Hive's conf folder

  <property>
     <name>hive.cli.print.current.db</name>
     <value>true</value>
  </property>


Tables in Hive: 
There are two types of tables in hive:
 1) Managed Tables
 2) External Tables

Managed Tables: 
- Hive controls both the metadata and the lifecycle of the data.
- Data is stored under the directory set by hive.metastore.warehouse.dir.
- Dropping a managed table deletes all of its data and the related metadata.

To practice the commands, first create a file containing the data in the local file system, before creating the table. For example, create the file "/home/demo/kb.txt" with the following content:
        101,raj
        102,sham
        103,dipak
        104,sachin
        105,ravi
Then create a table that matches the content of the file, i.e. id and name.


1) hive > create table student(id int, name string)
          > row format delimited fields terminated by ','
          > stored as textfile;

  (The 'student' table is created with 2 columns, id and name. The type of each column must be specified. The row format clause indicates that the fields in each row of the data file are separated by commas.)

2) hive > show tables;

3) hive > describe student;

4) hive > describe extended student;
   //shows all details of the table (owner, data types, table type (managed/external), creation time, storage format, location where the table data is stored, and so on)

5) hive > describe formatted student;
  //shows details of the table in a formatted manner.

6) The alter table command has many options:
  (rename, drop, add columns, change column name, replace columns)

   hive > alter  table student rename to stud;

7) hive > drop table if exists stud;

8) hive > load data local inpath '/home/demo/kb.txt' into table student; //the file is in the local file system, so use the 'local' keyword.

9) hive > load data inpath '/input/kb.txt' into table student; //the file kb.txt is in HDFS, so do not use the 'local' keyword. The file is moved into the table's directory.

10) hive > select * from student;

11) hive > select count(*) from student;
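
The alter table options listed in command 6 could be sketched as follows. This is a minimal sketch against the student table created above; the column names city and town are made up for illustration.

```sql
-- Add a new column at the end of the column list.
alter table student add columns (city string);

-- Rename a column. The column type must be restated.
alter table student change city town string;

-- Replace the whole column list, which here drops the extra column again.
alter table student replace columns (id int, name string);
```

These statements change only the table metadata. The data files already loaded into the table are left as they are.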

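External Tables:
- Data is stored in a directory outside of Hive's warehouse directory.
- Useful when sharing data with other tools.
- Dropping the table deletes only the metadata, not the actual data.
- The create statement must include the external keyword, and normally a location clause pointing at the data directory.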
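
A minimal sketch of an external table over the same kind of comma-delimited data is shown below. The table name student_ext and the HDFS path '/user/demo/student_ext' are assumptions for this example; point the location at a directory that actually holds your files.

```sql
-- Hypothetical external table: the name and the HDFS path are illustrative only.
create external table student_ext (id int, name string)
row format delimited fields terminated by ','
stored as textfile
location '/user/demo/student_ext';

-- Dropping it removes only the metadata.
-- The files under the location directory stay in HDFS.
drop table if exists student_ext;
```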
