What is Hive in Hadoop?
Hive is a data warehousing framework built on top of Hadoop and is mainly used for data analysis. It was developed at Facebook for people who are comfortable with SQL, and it is now an open source Apache project. Hive is designed for structured data. It is not a full database and does not replace an RDBMS; it works alongside one. You do not need to learn Java or the Hadoop API to use Hive, because Hive abstracts away the complexity of Hadoop. It provides an SQL-like query language called HiveQL, and each Hive query is internally converted into MapReduce jobs. Facebook uses Hive to analyze several terabytes of data every day.
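As a rough illustration of what "converted into MapReduce" means, the sketch below mimics in plain Python how a query such as SELECT user, COUNT(*) FROM logs GROUP BY user could be evaluated as a map step, a shuffle step, and a reduce step. The table logs, its columns, and its rows are made up for this example. Real Hive generates Java MapReduce jobs, and its execution plan is more involved than this.

```python
from collections import defaultdict

# Hypothetical rows of a table "logs" with columns (user, page).
rows = [("ann", "home"), ("bob", "home"), ("ann", "cart"), ("ann", "home")]

def map_phase(rows):
    # Emit (key, 1) for each row; the key is the GROUP BY column.
    for user, _page in rows:
        yield user, 1

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the values per key, which corresponds to COUNT(*).
    return {key: sum(values) for key, values in grouped.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

Each user becomes a key, and the reducer adds up the ones emitted for that key, which is how a grouped count can be spread over many machines.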
Where to use Hive? / Hive applications
1) Log processing
2) Data mining
3) Customer-facing business intelligence
4) Predictive modeling
5) Document indexing, and many more
Limitations of Hive:
1) Does not offer real-time processing of data.
2) High query latency.
3) Lacks full SQL support; versions before 0.14 do not provide row-level insert, update, or delete.
4) Does not support transactions (before 0.14) and has limited subquery support.
A Hive installation contains three main folders:
lib - contains all the jar files that collectively make up the functionality of Hive.
bin - contains the Hive scripts that launch the various Hive services.
conf - contains the Hive configuration files.
Interfaces and components of Hive:
1) Hive Web Interface (HWI)
2) Applications connecting over JDBC, ODBC, or the Thrift API
3) Command Line Interface (Beeline is the updated version)
4) MetaStore (stores metadata of tables and partitions, as well as the system catalog and so on)
5) HCatalog - built on top of the Hive metastore, provides an interface for Pig and MapReduce
6) Hive Driver, Compiler, Optimizer, and Executor - these work together to turn a query into MapReduce jobs.
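MetaStore in Hive:
There are 3 configurations for the metastore:
1) Embedded metastore: the metastore service and its database run in the same process as Hive. The database is an embedded Derby database.
2) Local metastore: the metastore service runs in the same process as Hive, but the database runs in a separate process.
3) Remote metastore: the metastore service runs in its own process, separate from Hive, and clients connect to it remotely.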
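The three modes are mostly a matter of which properties are set in hive-site.xml. The sketch below renders a minimal set of properties per mode. The property names are standard Hive settings, but the host names, ports, and the MySQL database are assumptions chosen only for illustration, and a real setup would also need credentials and other options.

```python
import xml.etree.ElementTree as ET

# Illustrative values only: dbhost, metastorehost and the MySQL database are assumptions.
METASTORE_MODES = {
    "embedded": {
        "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=metastore_db;create=true",
    },
    "local": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://dbhost:3306/metastore",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    },
    "remote": {
        "hive.metastore.uris": "thrift://metastorehost:9083",
    },
}

def render_hive_site(mode):
    # Build a <configuration> element with one <property> per setting.
    root = ET.Element("configuration")
    for name, value in METASTORE_MODES[mode].items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

print(render_hive_site("remote"))
```

In the remote case the client only needs the Thrift URI of the metastore service, while the database connection settings live on the machine that runs that service.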
Organization of data in Hive:
Primitive Data Types:
Integer: tinyint (1 byte), smallint (2 bytes), int (4 bytes), bigint (8 bytes)
Floating point: float, double; fixed point: decimal
String: string, varchar
DateTime: timestamp (YYYY-MM-DD HH:MM:SS)
Binary: binary
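The byte sizes of the integer types determine the values each one can hold. A small sketch, assuming signed two's-complement integers (which is how Hive's integer types behave), derives the ranges from the sizes listed above:

```python
# Byte widths as listed above for Hive's integer types.
INT_BYTES = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

def signed_range(num_bytes):
    # A signed integer of n bits covers -2^(n-1) .. 2^(n-1) - 1.
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for name, size in INT_BYTES.items():
    low, high = signed_range(size)
    print(f"{name}: {low} to {high}")
```

For example, tinyint covers -128 to 127, so a column that stores values above 127 needs at least smallint.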
Complex Data Types:
Arrays, structs, maps, unions (uniontype)
Commands for Database:
The database name mydb is used here for reference.
The default database in Hive is 'default'.
Every command in Hive ends with a semicolon (;).
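A few common HiveQL database statements for mydb are collected below. This is a sketch, not a complete list: the Python helper only makes sure every statement ends with a semicolon and joins the statements into a script.

```python
# HiveQL statements for a database named mydb.
statements = [
    "CREATE DATABASE IF NOT EXISTS mydb",
    "SHOW DATABASES",
    "USE mydb",
    "DESCRIBE DATABASE mydb",
    # CASCADE also drops any tables that are still inside the database.
    "DROP DATABASE IF EXISTS mydb CASCADE",
]

def to_script(statements):
    # Terminate every statement with exactly one semicolon.
    return "\n".join(s.rstrip(";") + ";" for s in statements) + "\n"

script = to_script(statements)
print(script)
```

Saved to a file (for example mydb.hql, a name chosen here for illustration), the script could be run with hive -f mydb.hql, or the statements can be typed one by one at the Hive prompt.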
Tables in Hive:
There are two types of tables in Hive:
1) Managed Tables
2) External Tables
Managed Tables:
- Hive controls both the metadata and the lifecycle of the data.
- Data is stored under hive.metastore.warehouse.dir.
- Dropping a managed table deletes all of its data and the related metadata.
To practice the commands, first create a file containing data in the local file system before creating a table. For example, create a file "/home/demo/kb.txt" whose records each hold an id and a name.
Then create a table that matches the content of the file, i.e. with id and name columns.
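A minimal sketch of these two steps follows. The sample rows, the comma delimiter, and the table name kb are assumptions, since the original contents of kb.txt are not shown. The file is written to a temporary directory here so the snippet runs anywhere; on a real machine the path would be /home/demo/kb.txt.

```python
import os
import tempfile

# Hypothetical sample rows; the original contents of kb.txt are not known.
sample_rows = [(1, "alice"), (2, "bob"), (3, "carol")]

# Temporary location standing in for /home/demo/kb.txt.
data_path = os.path.join(tempfile.mkdtemp(), "kb.txt")
with open(data_path, "w") as f:
    for row_id, name in sample_rows:
        f.write(f"{row_id},{name}\n")

# HiveQL: a table whose columns match the file (id and name).
create_table = (
    "CREATE TABLE kb (id INT, name STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
    "STORED AS TEXTFILE;"
)
# HiveQL: load the local file into the table.
load_data = f"LOAD DATA LOCAL INPATH '{data_path}' INTO TABLE kb;"

print(create_table)
print(load_data)
```

The FIELDS TERMINATED BY clause has to match the delimiter used in the file, otherwise Hive reads the whole line into the first column or returns NULL values.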
External Tables:
- Data is stored in a directory outside of the Hive warehouse.
- Useful when sharing data with other tools.
- Dropping the table deletes only the metadata, not the actual data.
- The EXTERNAL and LOCATION keywords must be added to the CREATE statement.
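The different drop behaviour of managed and external tables can be modelled with a toy simulation. This is not Hive code: the dictionary stands in for the metastore, plain directories stand in for table data, and all names are made up. In HiveQL the external case corresponds to a statement of the form CREATE EXTERNAL TABLE ... LOCATION '...'.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
warehouse = os.path.join(root, "warehouse")   # stands in for hive.metastore.warehouse.dir
outside = os.path.join(root, "shared", "kb")  # a directory outside the warehouse
metastore = {}                                # table name -> (kind, data directory)

def create_table(name, external=False, location=None):
    # Managed tables live under the warehouse; external tables keep their own location.
    path = location if external else os.path.join(warehouse, name)
    os.makedirs(path, exist_ok=True)
    metastore[name] = ("external" if external else "managed", path)
    return path

def drop_table(name):
    kind, path = metastore.pop(name)   # metadata is always removed
    if kind == "managed":
        shutil.rmtree(path)            # data is removed only for managed tables

managed_path = create_table("kb_managed")
external_path = create_table("kb_external", external=True, location=outside)
drop_table("kb_managed")
drop_table("kb_external")
print(os.path.exists(managed_path), os.path.exists(external_path))
```

After both drops the metastore is empty, the managed directory is gone, and the external directory is still there for other tools to use.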