Hadoop Series Part 18 - Hadoop wordcount in Python

This article describe how to write and execute simple MapReduce program for Hadoop in Python. Hadoop is written in Java, but hadoop provide api to MapReduce so we can write map and reduce function in any language. Hadoop Streaming uses Unix standard streams as the interface, so you can use any language that supports read from standard input and write to standard output.

Python MapReduce Code:
Steps for Mapper Code: 
  • Open terminal then "gedit mapper.py" and paste following code and save. I kept mapper.py file on desktop "/home/kb/Desktop/mapper.py".
#!/usr/bin/env python

import sys

for line in sys.stdin:
    
    line = line.strip() # remove leading and trailing whitespace
    
    words = line.split() # split the line into words
    
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1) 
  • Change the permission of mapper file:
> chmod +x /home/kb/Desktop/mapper.py
  •  Check your mapper.py program running properly:
root@kb:/home/kb# echo "kb mb kb kb sb sb" | /home/kb/Desktop/mapper.py | sort -k1,1
kb 1
kb 1
kb 1
mb 1
sb 1
sb 1  

Steps for Reducer Code:
  • Open terminal then "gedit reducer.py" and paste following code and save. I kept reducer.py file on desktop "/home/kb/Desktop/reducer.py".
#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()

    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count) 
  • Change the permission of reducer file:

> chmod +x /home/kb/Desktop/reducer.py
  • Check your reducer.py program running properly:

root@kb:/home/kb# echo "kb mb kb mb sb kb mb" | /home/kb/Desktop/mapper.py | /home/kb/Desktop/reducer.py
kb 3
sb 1
mb 3

Steps for Executing program:
  • For executing program we require jar file "hadoop-streaming-2.7.0.jar".

root@kb:/home/kb# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar 
-files /home/kb/Desktop/mapper.py,/home/kb/Desktop/reducer.py 
-mapper /home/kb/Desktop/mapper.py -reducer /home/kb/Desktop/reducer.py 
-input /pcode/wcinput.txt 
-output /output 
  •  Output : 
 
Refrence : 
http://www.michael-noll.com/

1 comments:

Post a Comment