Tag Archives: sqoop

MapReduce Code

Input Data

2012-01-01    2:01    Omaha    Book    10.51    Visa

it’s tab delimited, values will be the date, the time, the store name, a description of the item, the cost, and the method of payment.
Mapper Code (mapper.py)

    for line in sys.stdin:
        data = line.strip().split("\t")
            date, time, storename, productname, cost, paymethod = data
            print "{0}\t{1}".format(storename, cost)

Reducer Code (reducer.py)
In my case, i have a single Reducer, because that’s the Hadoop default, so it will get all the keys. If i had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys. Read more of this post


Hadoop Cluster / Ecosystem

Hadoop Clusterbd1
Core Hadoop consists of a way to store data, known as the Hadoop Distributed File System, or HDFS, and a way to process the data, called MapReduce. Split the data up and store it across a collection of machines, known as a cluster.

Then, when we want to process the data, we process it where it’s actually stored. Rather than retrieving the data from a central server, instead it’s already on the cluster, and we can process it in place. You can add more machines to the cluster (make the cluster bigger) as the amount of data you’re storing grows. The machines in the cluster don’t need to be particularly high-end; although most clusters are built using rack-mount servers.
Read more of this post