MapReduce in Hadoop

This article is about MapReduce in Hadoop. Doug Cutting implemented it based on the MapReduce paper published by Google.
Hadoop is divided mainly into two parts:
1. MapReduce - Processing of data
2. HDFS - Storage of data
In this article we briefly study the "MapReduce" concept of Hadoop. The Hadoop framework is written in Java, but MapReduce functions can also be written in other languages.



Simple steps in MapReduce:
1. The main job is split into sub-jobs.
2. The sub-jobs are mapped to different CPUs/processors.
3. The output from the different processors (mappers) is collected.
4. The output is reduced to produce the final result.

Real-life analogy for MapReduce:
Suppose we need the sum of 1000 numbers.
How can we compute it quickly and efficiently?
We have 10 members with us.
1. Input Splitting – Divide the 1000 numbers among the 10 people.
2. Map Method – Each member adds up their 100 numbers.
3. Reduce Method – After all 10 members have finished, their 10 partial sums are collected by a single person, who adds them to get the final output.
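
The analogy above can be sketched in a few lines of Python. This is only an illustration of the split / map / reduce structure, not Hadoop code: the function names (split_input, map_sum, reduce_sum) are made up for this example, the numbers 1 to 1000 are assumed as input, and a thread pool stands in for the 10 members.

```python
from concurrent.futures import ThreadPoolExecutor

def split_input(numbers, parts):
    """Input splitting: cut the list into `parts` equal-sized chunks.
    Assumes len(numbers) is evenly divisible by `parts`."""
    size = len(numbers) // parts
    return [numbers[i * size:(i + 1) * size] for i in range(parts)]

def map_sum(chunk):
    """Map step: one member adds up their own chunk of numbers."""
    return sum(chunk)

def reduce_sum(partial_sums):
    """Reduce step: one person adds the collected partial sums."""
    return sum(partial_sums)

numbers = list(range(1, 1001))       # assumed input: 1, 2, ..., 1000
chunks = split_input(numbers, 10)    # 10 chunks of 100 numbers each

# Each worker handles one chunk, like each member in the analogy.
with ThreadPoolExecutor(max_workers=10) as pool:
    partial_sums = list(pool.map(map_sum, chunks))

total = reduce_sum(partial_sums)
print(total)  # 500500
```

In a real cluster the chunks would live on different machines; here the thread pool only mimics the division of labour.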

Another example: counting votes after an election in India.

MapReduce Principles:

The MapReduce concept comes from functional programming languages. MapReduce uses the "divide and conquer" approach, which is very powerful when the subtasks are executed in parallel. MapReduce is a mechanism by which data is processed in parallel in a distributed environment.
The MapReduce framework consists of two functions:
1. Mapper
2. Reducer
The MapReduce model imposes key-value input/output:
- map : (k1,v1) -> list(k2,v2)
- reduce : (k2, list(v2)) -> list(k3,v3)
The processing flow is:
1. The map function is applied to every input key-value pair.
2. The map function generates intermediate key-value pairs.
3. The intermediate key-value pairs are sorted and grouped by key.
4. The reduce function is applied to the sorted and grouped intermediate key-value pairs.
5. The reduce function emits the result key-value pairs.
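
The five steps above can be traced with a small in-memory driver. This is a sketch under stated assumptions, not how Hadoop implements them: run_map_reduce, max_temp_mapper and max_temp_reducer are hypothetical names, and the city/temperature records are invented for illustration.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Toy single-process driver following the map/reduce signatures:
    map(k1, v1) -> list of (k2, v2); reduce(k2, list of v2) -> list of (k3, v3)."""
    # Steps 1-2: apply map to every (k1, v1) pair and collect
    # the intermediate (k2, v2) pairs it generates.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Step 3: group the intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Steps 4-5: visit the keys in sorted order, apply reduce to each
    # (k2, list(v2)), and collect the emitted (k3, v3) pairs.
    results = []
    for k2 in sorted(groups):
        results.extend(reducer(k2, groups[k2]))
    return results

# Hypothetical input: (line number, "city temperature") records.
records = [(0, "Pune 31"), (1, "Delhi 38"), (2, "Pune 35"), (3, "Delhi 36")]

def max_temp_mapper(line_no, line):
    city, temp = line.split()
    return [(city, int(temp))]

def max_temp_reducer(city, temps):
    return [(city, max(temps))]

result = run_map_reduce(records, max_temp_mapper, max_temp_reducer)
print(result)  # [('Delhi', 38), ('Pune', 35)]
```

Only the mapper and reducer are specific to the problem; the grouping and sorting in the middle stay the same for any job, which is the part a framework takes over.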

Let's see the Word Count example in MapReduce format:
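
A minimal Python sketch of word count, run in memory on two made-up lines of text. It is not the Hadoop WordCount program; the names map_words and reduce_counts and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_words(offset, line):
    """Map: (offset, line) -> list of (word, 1), one pair per word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return (word, sum(counts))

# Assumed sample input: two short lines of text.
lines = ["Hadoop stores data", "Hadoop processes data"]

# Map phase: emit (word, 1) for every word of every line.
mapped = []
for offset, line in enumerate(lines):
    mapped.extend(map_words(offset, line))

# Shuffle phase: group the 1s by word.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: sum the 1s for each word, visiting words in sorted order.
word_counts = [reduce_counts(w, grouped[w]) for w in sorted(grouped)]
print(word_counts)
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```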
A MapReduce computation can process many terabytes or petabytes of data on a large cluster.
The Hadoop framework manages all aspects of job execution, parallelization, and coordination.


For the source code and execution of the WordCount program in Hadoop, click here
For further study of MapReduce, refer to the paper:
 


