Hadoop Series Part 13 - WordCount in Hadoop

WordCount program:
This program counts the number of occurrences of each word in a file. It is the "Hello World" program of Hadoop.
 
 Any Hadoop MapReduce program needs three classes:
  1. Driver class i.e. main() method class // configures and submits the job
  2. Mapper class // processes input splits in parallel and emits intermediate (key, value) pairs
  3. Reducer class // aggregates the values for each intermediate key into the final output
Hadoop has an old API (in the org.apache.hadoop.mapred package) and a new API (in org.apache.hadoop.mapreduce) with small differences between them.
This program is written with the new API.
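A rough sketch of the main difference between the two APIs (the class names OldStyleMapper and NewStyleMapper are only for illustration): the old API declares Mapper as an interface and writes results through an OutputCollector, while the new API uses abstract classes and a Context object, as in the code later in this post.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): Mapper is an interface; jobs are
// configured with JobConf and submitted via JobClient.
class OldStyleMapper extends org.apache.hadoop.mapred.MapReduceBase
  implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
 public void map(LongWritable key, Text value,
   org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
   org.apache.hadoop.mapred.Reporter reporter) throws java.io.IOException {
  // word-splitting logic would go here
 }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class; jobs are
// configured with Job and results are written through Context.
class NewStyleMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
 public void map(LongWritable key, Text value, Context context) {
  // word-splitting logic would go here
 }
}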
Input file: Sample text file containing any readable data.
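For example, if the input file (call it wcip.txt, as in the run steps below) contains the two lines

hello hadoop
hello world

then the job writes one line per distinct word with its count, tab-separated and sorted by key (the default behavior of TextOutputFormat with a single reducer):

hadoop	1
hello	2
world	1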


Driver Code:


import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

 public static void main(String[] args) throws IOException,
   ClassNotFoundException, InterruptedException {
  if (args.length != 2) {
   System.err.println("Usage: Driver <input path> <output path>");
   System.exit(-1);
  }
  // Job.getInstance() replaces the deprecated new Job() constructor.
  Job job = Job.getInstance();
  job.setJarByClass(Driver.class);
  job.setJobName("WordCount");
  job.setMapperClass(WMapper.class);
  job.setReducerClass(WReducer.class);

  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  // Key/value types emitted by the reducer: (word, count).
  // The value class must be IntWritable to match WReducer's output.
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }

}

Mapper Code:


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


public class WMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 // Called once per input line; the key is the byte offset of the line
 // in the file and the value is the line itself.
 @Override
 public void map(LongWritable key, Text value, Context context)
   throws IOException, InterruptedException {
  String str = value.toString();
  StringTokenizer token = new StringTokenizer(str);
  // Emit (word, 1) for every whitespace-separated token in the line.
  while (token.hasMoreTokens()) {
   word.set(token.nextToken());
   context.write(word, one);
  }
 }

}
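For the sample input above, the mapper turns the line "hello hadoop" into the intermediate pairs

(hello, 1)
(hadoop, 1)

and "hello world" into (hello, 1) and (world, 1). Duplicate words are not combined at this stage; that is the reducer's job.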

Reducer Code:


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


public class WReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

 // Called once per distinct word with all of its counts.
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
   throws IOException, InterruptedException {
  // sum must be a local variable: as an instance field it would keep
  // accumulating across keys and produce wrong counts.
  int sum = 0;
  for (IntWritable val : values) {
   sum += val.get();
  }
  context.write(key, new IntWritable(sum));
 }
}
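Between the map and reduce phases, the shuffle and sort step groups the intermediate pairs by key, so each reduce() call sees one word together with all of its 1s. For the sample input:

(hadoop, [1])    ->  hadoop  1
(hello, [1, 1])  ->  hello   2
(world, [1])     ->  world   1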

How to run:
Steps to execute the WordCount program on Hadoop (a sketch of the corresponding commands follows this list).
a) Start Hadoop and verify with the jps command that all daemons are running.
b) Copy the input file, e.g. wcip.txt, from the local file system to HDFS.
c) Compile all the Java files.
d) Package the compiled classes into a jar file.
e) Run the jar file using hadoop.
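A minimal sketch of those steps from the command line; the HDFS paths and the jar name wc.jar are assumptions for illustration and may differ on your setup:

# a) start the daemons and verify
start-dfs.sh
start-yarn.sh
jps    # should list NameNode, DataNode, ResourceManager, NodeManager, ...

# b) copy the input file into HDFS
hdfs dfs -put wcip.txt /user/hadoop/wcip.txt

# c) compile against the Hadoop classpath
javac -classpath "$(hadoop classpath)" Driver.java WMapper.java WReducer.java

# d) package the classes into a jar
jar cf wc.jar Driver.class WMapper.class WReducer.class

# e) run the job; the output directory must not already exist
hadoop jar wc.jar Driver /user/hadoop/wcip.txt /user/hadoop/wcout

# inspect the result
hdfs dfs -cat /user/hadoop/wcout/part-r-00000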
