What is Hadoop?

Introduction:

With the rapid development of IT and social networking, data volume is growing exponentially. Such huge amounts of data, measured in GB, TB, PB, and even ZB, are what we call "Big Data". Many organizations face this explosion of data. As data grows day by day at an exponential rate, it is not possible to store and process it on existing relational database technologies. The main drawback of a relational database is that it cannot handle unstructured data, yet roughly 70-80% of all data is unstructured and only the remaining 20-30% is semi-structured or structured.
Hadoop:
Hadoop is a solution for handling such Big Data. Hadoop is an open-source framework for storing and processing huge data sets on clusters of commodity hardware. It is a reliable, scalable, distributed computing framework.
History of Hadoop:
Hadoop was created by Doug Cutting in 2005 with the help of white papers published by Google on MapReduce (2004) and the Google File System, GFS (2003). It later became an Apache open-source project. Hadoop is written in Java. The charming yellow elephant in the Hadoop logo is named after Doug's son's toy elephant. All components of Hadoop are available under the Apache open-source license (http://hadoop.apache.org). Yahoo developed about 80% of core Hadoop.
Hadoop Projects:
1. Avro - Data serialization system.
2. MapReduce - Distributed data processing model (Yahoo)
3. HDFS - Distributed File System (Yahoo)
4. Pig - Data flow language and parallel execution framework. (Yahoo)
5. Hive - Data warehouse infrastructure. (Facebook)
6. HBase - Column-oriented database
7. ZooKeeper - Distributed Coordination service
8. Sqoop - Data transfer tool between RDBMS and HDFS
9. Oozie - Service for running and scheduling workflows of Hadoop jobs.
HDFS:
The Hadoop Distributed File System (HDFS) is a file system specially designed to store very large data sets reliably on a cluster of commodity hardware, with a streaming access pattern. HDFS stores metadata and application data separately. The key components of HDFS are the NameNode, the DataNodes, and the Secondary NameNode.
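To make this concrete, here is a minimal sketch of writing a file through the HDFS Java API. The NameNode address and the file path are illustrative assumptions, not values from this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (hdfs://namenode:9000 is an assumed address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file into HDFS (path is illustrative).
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Verify the file now exists in the distributed file system.
        System.out.println("Exists: " + fs.exists(path));
        fs.close();
    }
}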
MapReduce:
MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages. MapReduce programs are inherently parallel. The user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same key.
Map(k,v) -> list(k1,v1)
Reduce(k1,list(v1)) -> list(v2)
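The classic example is word counting. The sketch below uses Hadoop's Java MapReduce API: the map function emits (word, 1) pairs and the reduce function sums the counts for each word. The class names and the input/output paths taken from the command line are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}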
Apache Pig:
Pig is a scripting language for exploring large data sets. Writing programs directly with MapReduce is complex. In fact, Pig was initiated with the idea of creating and executing commands on big data sets more easily. The basic attribute of Pig programs is parallelization, which helps them manage large data sets. Apache Pig consists of a 'Pig Latin' language layer, which lets SQL-like data-flow queries be run on data stored in Hadoop, and a compiler that turns those scripts into a series of MapReduce programs.
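As a small illustration, Pig Latin scripts can be submitted from Java through Pig's PigServer API. The input file, its schema, and the output path below are assumptions made up for the example:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Execute Pig Latin on the Hadoop cluster (use ExecType.LOCAL to test locally).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load a tab-separated file from HDFS (path and schema are illustrative).
        pig.registerQuery("users = LOAD '/user/demo/users.tsv' AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");

        // Storing the result triggers the compiled MapReduce jobs.
        pig.store("adults", "/user/demo/adults_out");
        pig.shutdown();
    }
}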
Apache Hive:
Hive is a data warehouse system built on top of Hadoop that enables quick data summarization, handles ad-hoc queries, and evaluates huge data sets located in HDFS, while maintaining full support for MapReduce. It is mainly used for data analysis. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to speed up queries. Apache Hive was originally developed by Facebook.
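For a feel of how this looks from application code, Hive can be queried over JDBC through HiveServer2. The server address, credentials, and the sales table in the sketch are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Ensure the Hive JDBC driver is registered (older driver versions need this).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 (host, port, and database are illustrative).
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");

        try (Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
        conn.close();
    }
}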
Apache HBase:
HBase is a subproject of Hadoop which belongs to Apache. HBase is an open-source, distributed, column-oriented database built on top of HDFS. On one hand, it manages batch-style computations using MapReduce; on the other hand, it handles point queries (random reads). The key components of HBase are the HBase Master and the RegionServers. HBase is used when you need random, real-time read/write access to your Big Data. It is similar to Google's Bigtable.
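The sketch below shows one random write and one random read with the HBase Java client. The table name, column family, and row key are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id ("users"/"info" are illustrative).
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read: fetch that row straight back.
            Result result = table.get(new Get(Bytes.toBytes("user123")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}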
Apache Zookeeper:
Apache ZooKeeper is another member of the Hadoop ecosystem. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In fact, HBase depends on ZooKeeper for its functioning.
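As a minimal sketch of the idea, the ZooKeeper Java client below stores a piece of configuration under a znode so that every node in the cluster can read the same value. The ensemble address, znode path, and payload are assumptions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address and timeout are illustrative).
        // Real code should wait for the connection event before issuing requests.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});

        // Publish a small piece of shared configuration under a znode.
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replicas=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client connected to the ensemble now sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}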
Advantages of Hadoop:
1. Highly scalable.
2. Cost-effective.
3. Distributed processing.
4. Fault tolerant system.
5. Fast processing of data.
6. Handles all types of data (structured and unstructured).
Limitations of Hadoop:
1. Not a good fit for small data sets.
2. Weak security.
3. Stability issues.
4. Limited SQL support.
5. Not suitable for streaming data.
6. Batch processing only.
7. Single point of failure (the NameNode).
Companies using Hadoop:
Facebook, Twitter, LinkedIn, Yahoo, AOL, eBay, Adobe, and many more.

That's all for this article, a very short and quick introduction to Hadoop. In the next post, we will explain how to install Hadoop on a Linux system. Please share your comments.
