Hadoop practice projects


PROBLEM STATEMENT:

Apply Hadoop MapReduce to derive some statistics from the White House visitor log.
Download the data from the site given below. More than 3 million records are currently available at
http://www.whitehouse.gov/briefing-room/disclosures/visitor-records


The data is available as a web-only spreadsheet view and as a downloadable raw file in CSV (Comma-Separated Values) format. In CSV format, the columns on each line are separated by commas. The first line holds the column headings for all the other lines. We are going to use this raw data for our MapReduce jobs.
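
As a rough illustration of handling that layout, here is a minimal sketch in plain Java of the per-line parsing a mapper might perform: skip the heading row, split the line on commas, and look fields up by column name. The class VisitorCsv and its methods are hypothetical names, not part of Hadoop or of the assignment. The sketch assumes the heading row starts with NAMELAST and that fields containing commas are wrapped in double quotes; check both against the downloaded file. Hadoop classes are left out so the snippet runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for reading one line of the visitor-log CSV. */
final class VisitorCsv {
    private VisitorCsv() {}

    /** Splits one CSV line on commas, keeping commas inside double-quoted fields. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }

    /** Maps each upper-cased column name of the heading row to its position. */
    static Map<String, Integer> columnIndex(String headerLine) {
        Map<String, Integer> index = new HashMap<>();
        List<String> names = splitCsvLine(headerLine);
        for (int i = 0; i < names.size(); i++) {
            index.put(names.get(i).toUpperCase(Locale.ROOT), i);
        }
        return index;
    }

    /** Assumption: the heading row is the only line that starts with NAMELAST. */
    static boolean isHeader(String line) {
        return line.toUpperCase(Locale.ROOT).startsWith("NAMELAST");
    }

    /** Builds a "LAST, FIRST" key from two named columns, or null if unavailable. */
    static String nameKey(List<String> fields, Map<String, Integer> index,
                          String lastCol, String firstCol) {
        Integer last = index.get(lastCol.toUpperCase(Locale.ROOT));
        Integer first = index.get(firstCol.toUpperCase(Locale.ROOT));
        if (last == null || first == null
                || last >= fields.size() || first >= fields.size()) {
            return null; // unknown column or short (malformed) line
        }
        return fields.get(last).toUpperCase(Locale.ROOT) + ", "
                + fields.get(first).toUpperCase(Locale.ROOT);
    }
}
```

In a real job the mapper would call isHeader on each input line, skip it if true, and emit the nameKey result with a count of 1. Looking columns up by name rather than by fixed position is a design choice that keeps the code working if the column order differs between downloads.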

You are required to write efficient Hadoop MapReduce programs in Java to find the following information:
(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST) to the White House.
(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.
(iii) The 10 most frequent visitor-visitee combinations.
(iv) The average number of visitors for each month of the year. For example, what is the average number of visitors in January over the last 4 years? Your program should produce an average for each of the 12 months. Use APPT_START_DATE as the visit date.
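
The two kinds of aggregation asked for above can be sketched in plain Java, outside Hadoop, to make the logic concrete. This is only one possible approach, not a required design. The class VisitorStats and its methods are hypothetical names. topN keeps a bounded min-heap, as a reducer might do before writing its final 10 entries; averageVisitorsByMonth counts visits per (month, year) and divides each month's total by the number of years in which that month appears. The date format M/D/YYYY followed by an optional time is an assumption; verify it against the actual APPT_START_DATE values.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical helper showing the aggregation logic without Hadoop classes. */
final class VisitorStats {
    private VisitorStats() {}

    /** Returns the n keys with the highest counts, largest first (ties: A to Z). */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        // Weakest entry first: lowest count, and on ties the alphabetically later key.
        Comparator<Map.Entry<String, Long>> weakestFirst = (a, b) -> {
            int byCount = Long.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.add(new AbstractMap.SimpleImmutableEntry<>(e)); // defensive copy
            if (heap.size() > n) {
                heap.poll(); // evict the weakest so the heap never exceeds n
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }

    /** Parses {month, year} from a date assumed to look like "1/15/2010 9:00". */
    static int[] monthAndYear(String apptStartDate) {
        String datePart = apptStartDate.trim().split("\\s+")[0];
        String[] mdy = datePart.split("/");
        if (mdy.length != 3) {
            return null;
        }
        try {
            int month = Integer.parseInt(mdy[0]);
            int year = Integer.parseInt(mdy[2]);
            return (month >= 1 && month <= 12) ? new int[] {month, year} : null;
        } catch (NumberFormatException ex) {
            return null; // malformed date; the caller skips the record
        }
    }

    /** Average visits per calendar month, over the years in which that month appears. */
    static Map<Integer, Double> averageVisitorsByMonth(Iterable<String> apptStartDates) {
        Map<Integer, Map<Integer, Long>> perMonthYear = new TreeMap<>();
        for (String date : apptStartDates) {
            int[] my = monthAndYear(date);
            if (my == null) {
                continue;
            }
            perMonthYear.computeIfAbsent(my[0], k -> new TreeMap<>())
                        .merge(my[1], 1L, Long::sum);
        }
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Long>> e : perMonthYear.entrySet()) {
            long total = 0;
            for (long c : e.getValue().values()) {
                total += c;
            }
            averages.put(e.getKey(), (double) total / e.getValue().size());
        }
        return averages;
    }
}
```

In a MapReduce setting the counting itself would happen in the shuffle and reduce phases; the heap only has to hold 10 entries at a time, so memory use stays small regardless of how many distinct names there are. For (iv), one natural mapping is to emit a month-and-year key per record and compute the per-month average in a second step.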

Submission:

Send your solution in the following format:
Create a folder named after yourself containing only these 4 files:
1. Driver Code
2. Mapper Code
3. Reducer Code
4. Output file
Copy the part-r-00000 output of each job to a text file named solution1.txt, solution2.txt, and so on.

** Make a tar or tar.gz file of the folder and mail it to "kishorevbhosale@gmail.com"

** If you have a GitHub account, share the link.
