Tag Archives: Mappers

MapReduce Code

Input Data

2012-01-01    2:01    Omaha    Book    10.51    Visa

The data is tab delimited. The values are the date, the time, the store name, a description of the item, the cost, and the method of payment.
Mapper Code (mapper.py)

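    import sys

    # Read sales records from standard input, one per line.
    for line in sys.stdin:
        data = line.strip().split("\t")
        # Skip malformed lines that do not have exactly six fields.
        if len(data) == 6:
            date, time, storename, productname, cost, paymethod = data
            # Emit the key (store) and the value (cost), tab separated.
            print("{0}\t{1}".format(storename, cost))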

Reducer Code (reducer.py)
In my case, I have a single Reducer, because that’s the Hadoop default, so it will get all the keys. If I had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys.
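The full reducer.py is not shown in this excerpt, so what follows is only a minimal sketch of what such a reducer might look like, not the original code. It assumes the input arrives as `storename<TAB>cost` lines already sorted by key, as the mapper's output would be after the shuffle and sort. The function name `reduce_sales` and the store "Miami" in the sample data are made up for illustration.

```python
def reduce_sales(lines):
    """Yield (store, total) pairs from key-sorted 'store<TAB>cost' lines."""
    current_store = None
    total = 0.0
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        store, cost = fields
        # A change of key means the previous store is complete.
        if current_store is not None and store != current_store:
            yield current_store, total
            total = 0.0
        current_store = store
        total += float(cost)
    # Emit the last store once the input is exhausted.
    if current_store is not None:
        yield current_store, total


# Hypothetical sample input, sorted by store name.
sample = ["Miami\t12.34\n", "Miami\t99.07\n", "Omaha\t10.51\n"]
for store, total in reduce_sales(sample):
    print("{0}\t{1:.2f}".format(store, total))
```

In a real streaming job you would pass `sys.stdin` to the function instead of the sample list. Because the keys arrive sorted, the reducer only has to remember one running total at a time.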


Running a Hadoop Job

In my local directory, I have mapper.py and reducer.py, which contain the code for the mapper and the reducer.
Running a job

MapReduce – Mappers and Reducers

How data is processed with MapReduce.

Processing a large file serially from the top to the bottom could take a long time.

MapReduce is designed to process data in a highly parallel way: your input data is split into many pieces, and each piece is processed simultaneously.
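To make the idea of splitting and simultaneous processing concrete, here is a small single-machine sketch. It is not Hadoop code. It assumes the records are already parsed into `(store, cost)` pairs, and the helper names (`split_into_chunks`, `sum_chunk`, `parallel_totals`) are hypothetical. A thread pool stands in for the cluster, so this only shows the structure, not a real speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Divide the records into n_chunks roughly equal pieces."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def sum_chunk(chunk):
    """Process one piece on its own: partial totals per store."""
    partial = defaultdict(float)
    for store, cost in chunk:
        partial[store] += cost
    return partial


def parallel_totals(records, n_chunks=3):
    """Process every piece concurrently, then merge the partial results."""
    chunks = split_into_chunks(records, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum_chunk, chunks))
    totals = defaultdict(float)
    for partial in partials:
        for store, subtotal in partial.items():
            totals[store] += subtotal
    return dict(totals)
```

Each piece is summed without knowing anything about the others, which is what lets the work be spread out. The merge step at the end plays roughly the role of the reduce phase.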

Real-world scenario: a ledger contains all the sales from thousands of stores around the USA, organized by date. The task is to calculate the total sales generated by each store over the last year. The simple approach is to start at the beginning of the ledger and, for each entry, write down the store name with the amount next to it. For the next entry, if the store name is already there, add the amount to that store's total. If not, add the new store name and that first purchase. And so on.
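The serial approach described here can be sketched with a plain dictionary. This is only an illustration: the function name `total_sales_serial` and the sample entries are hypothetical, and the ledger is assumed to be already parsed into `(store, amount)` pairs.

```python
def total_sales_serial(entries):
    """Walk the ledger from top to bottom, keeping a running total per store."""
    totals = {}
    for store, amount in entries:
        if store in totals:
            # Store already seen: add this sale to its total.
            totals[store] += amount
        else:
            # New store: record it with its first purchase.
            totals[store] = amount
    return totals


# Hypothetical ledger entries.
ledger = [("Omaha", 10.51), ("Miami", 12.34), ("Omaha", 5.00)]
print(total_sales_serial(ledger))
```

This works, but a single process has to read every entry in order, which is exactly what becomes slow on a very large ledger.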