MapReduce Code

Input Data

2012-01-01    2:01    Omaha    Book    10.51    Visa

The data is tab-delimited; the fields are the date, the time, the store name, a description of the item, the cost, and the method of payment.
Mapper Code (mapper.py)

#!/usr/bin/env python
import sys

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) == 6:  # skip malformed records
        date, time, storename, productname, cost, paymethod = data
        # emit the store name as the key and the cost as the value
        print("{0}\t{1}".format(storename, cost))

Reducer Code (reducer.py)
In my case, I have a single Reducer, because that's the Hadoop default, so it will get all the keys. If I had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys.
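As a rough sketch of that idea (an illustration only, not Hadoop's actual partitioner code), each key is hashed to pick a Reducer, so every value for a given key lands on the same one:

import zlib

NUM_REDUCERS = 4  # hypothetical setting

def partition(key):
    # all values for a given key go to the same Reducer
    return zlib.crc32(key.encode()) % NUM_REDUCERS

print(partition("Omaha"))  # the same store always maps to the same Reducer index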

Hadoop Streaming allows you to write your Mappers and Reducers in pretty much any language, rather than forcing you to use Java.

#!/usr/bin/env python
import sys

salesTotal = 0
oldStore = None

for line in sys.stdin:
    data_sale = line.strip().split("\t")
    if len(data_sale) != 2:  # skip malformed records
        continue

    thisStore, thisSale = data_sale
    if oldStore and oldStore != thisStore:
        # keys arrive sorted, so a new store means the previous one is complete
        print("{0}\t{1}".format(oldStore, salesTotal))
        salesTotal = 0

    oldStore = thisStore
    salesTotal += float(thisSale)

# don't forget the last store
if oldStore is not None:
    print("{0}\t{1}".format(oldStore, salesTotal))
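For example, given this sorted intermediate data (made-up values), the Reducer prints a total each time the store name changes, plus one final total at the end:

Miami    3.25
Omaha    10.50
Omaha    21.25

produces:

Miami    3.25
Omaha    31.75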

Once the Mappers are done, the Hadoop framework passes the intermediate data to the Reducers. What is the process between Mappers and Reducers called?

The process is called the Shuffle and Sort. It ensures that the values for any particular key are collected together, and sends the keys and their lists of values to the Reducer.
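The same grouping is easy to illustrate locally: Python's sort plus itertools.groupby does, in miniature, what Shuffle and Sort does across the cluster (the sample pairs below are made up):

from itertools import groupby

# pretend Mapper output: (key, value) pairs in arbitrary order
mapper_output = [("Omaha", 10.51), ("Miami", 3.25), ("Omaha", 21.10)]

# sorting brings equal keys together; groupby then hands each key its
# full list of values -- which is exactly what a Reducer receives
for store, pairs in groupby(sorted(mapper_output), key=lambda kv: kv[0]):
    print(store, [cost for _, cost in pairs])
# Miami [3.25]
# Omaha [10.51, 21.1]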

Output
Test with a small data set before you run your code on the entire, huge data set.

Command line test

Run the Mapper on its own and type a couple of fake tab-separated records; it should echo back fields 3 and 5 (the store name and the cost):

cd code
./mapper.py
field1    f2    f3    f4    f5    f6
fielda    fb    fc    fd    fe    ff
(Ctrl+D to end input)
f3    f5
fc    fe

Sample data file test

ls ../data
head -50 ../data/purchases.txt > testdata.txt            # small sample of the real data
cat testdata.txt | ./mapper.py                           # Mapper alone
cat testdata.txt | ./mapper.py | sort                    # sort stands in for Shuffle and Sort
cat testdata.txt | ./mapper.py | sort | ./reducer.py     # the full pipeline, locally

Cluster test

hs mapper.py reducer.py myinput output
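Here hs is just a convenience alias on my machine for the Hadoop Streaming jar invocation; a sketch of what it might expand to (the jar path is an assumption and varies by distribution):

# hypothetical definition; adjust the streaming jar path for your install
run_mapreduce() {
    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" \
        -input "$3" -output "$4"
}
alias hs=run_mapreduce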

JobTracker
Open a web browser and go to localhost:50030/jobtracker.jsp.
Here you can see that there’s one running job, and when we click on it we can see the Mappers and Reducers running.

(screenshot: the JobTracker web UI)

If a Mapper or Reducer fails, you can drill down and view the logs from that particular piece of code.
Clicking a job name under Running Jobs shows its progress and other details; clicking a map or reduce task shows the lists of completed and running tasks.

Putting it all together

data > hadoop fs -mkdir myinput
data > hadoop fs -put purchases.txt myinput
code > hs mapper.py reducer.py myinput myoutput
code > hadoop fs -cat myoutput/part-00000 | less
code > hadoop fs -get myoutput/part-00000 localmyoutput.txt
code > hadoop fs -rm -r -f myoutput

MapReduce
http://wiki.apache.org/hadoop/HowManyMapsAndReduce
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/mapreduce-osdi04-slides/index.html
http://developer.yahoo.com/hadoop/tutorial/module4.html

Sqoop – In a real deployment the data would have come from database tables, and we'd use Sqoop to import it into HDFS.
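For example, an import might look something like this (the connection string, credentials, and table name are placeholders):

sqoop import \
    --connect jdbc:mysql://dbhost/retail \
    --username dbuser -P \
    --table purchases \
    --target-dir myinput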

http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-sqoop/
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_13.html
http://blog.cloudera.com/blog/2009/06/introducing-sqoop/
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
http://sqoop.apache.org/
http://doc.mapr.com/display/MapR/Sqoop
http://gethue.com/move-data-in-out-your-hadoop-cluster-with-the-sqoop/
