Category Archives: Big Data

Length of a Post/Answer Map Reducer Code

Is there a correlation between the length of a post and the length of its answers?
For each post, output the post length and the average length of its answers.

Mapper result          Reducer result
111        35          111
15084      237         15084
2          145         2
3778       164         3848
3778       69          3778
66193      60          66193
66193      34          66199
66193      302         66196
66193      288         66195
7185       86          7185
10000001   140         323.940451745
10000002   625         465.578947368
10000005   0           35.0
10000006   836         99.6666666667
10000007   4224        580.428571429
66193      59          154.666666667
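As a sketch, the mapper/reducer pair for this job might look like the following. The function names and the in-memory driver are my own, for illustration; a real Hadoop Streaming job would read these tab-separated records from stdin instead.

```python
from itertools import groupby

def mapper(lines):
    """Parse tab-separated 'post_id<TAB>answer_length' records
    and emit (post id, answer length) pairs."""
    for line in lines:
        post_id, length = line.strip().split("\t")
        yield post_id, int(length)

def reducer(pairs):
    """Average the answer lengths for each post id. Sorting here stands
    in for the shuffle phase, which groups all values for a key together."""
    for post_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        lengths = [value for _, value in group]
        yield post_id, sum(lengths) / len(lengths)
```

For example, the four mapper records for post 66193 with lengths 60, 34, 302, and 288 reduce to an average of 171.0.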

Read more of this post


Top tag Map Reducer Code

The top 10 tags used in posts, ordered by the number of questions they appear in.
The forum CSV file contains these columns:
“id”    “title”    “tagnames”    “author_id”    “body”    “node_type”    “parent_id”    “abs_parent_id”    “added_at”    “score”    “state_string”    “last_edited_id”    “last_activity_by_id”    “last_activity_at”    “active_revision_id”    “extra”    “extra_ref_id”    “extra_count”    “marked”
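A minimal sketch of the tag count, assuming the "tagnames" field holds whitespace-separated tags (the function name and column indexes are taken from the header above, but the helper itself is my own illustration, not the original code):

```python
from collections import Counter

def top_tags(rows, k=10):
    """Count how many questions each tag appears in and return the k
    most common. Per the header: tagnames is column 2, node_type is
    column 5; only 'question' rows carry the thread's tags."""
    counts = Counter()
    for row in rows:
        if row[5] == "question":
            # assumption: tags are separated by whitespace
            counts.update(row[2].split())
    return counts.most_common(k)
```

In a real MapReduce job the mapper would emit (tag, 1) pairs and the reducers would sum them; the Counter above collapses both steps for a single-machine sketch.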
Read more of this post

Study Group Map Reducer Code

Analyzing each forum thread gives us a list of the students who have posted there – whether they asked the question, answered a question, or added a comment.
The forum CSV file contains these columns:
“id”    “title”    “tagnames”    “author_id”    “body”    “node_type”    “parent_id”    “abs_parent_id”    “added_at”    “score”    “state_string”    “last_edited_id”    “last_activity_by_id”    “last_activity_at”    “active_revision_id”    “extra”    “extra_ref_id”    “extra_count”    “marked”
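One way to sketch this grouping (the function is my own illustration; column positions follow the header above): a question starts a thread under its own id, while answers and comments attach to a thread through "abs_parent_id".

```python
from collections import defaultdict

def study_groups(rows):
    """Collect the set of author ids per forum thread. Per the header:
    id is column 0, author_id is column 3, node_type is column 5,
    abs_parent_id is column 7."""
    groups = defaultdict(set)
    for row in rows:
        node_id, author_id = row[0], row[3]
        node_type, abs_parent_id = row[5], row[7]
        # questions start a thread; everything else joins its root thread
        thread_id = node_id if node_type == "question" else abs_parent_id
        groups[thread_id].add(author_id)
    return groups
```

In the MapReduce version, the mapper would emit (thread id, author id) and the reducer would build each thread's set of students.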
Read more of this post

Hadoop Combiners

Suppose you computed a mean with a Mapper/Reducer like this:
1.    Your mapper went through the records and output key-value pairs that looked like: (day of week, value).
2.    For each day of the week, your reducer kept a running total of the values as well as a count of the number of records.
3.    You divided the total value by the number of records to get the mean.

But there’s a problem here. That second step involves moving a lot of data around your network. What if we could do some of the reduction locally before sending the data to the reducers? Read more of this post
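The catch is that a mean cannot be combined directly: averaging local averages gives the wrong answer. So the combiner must emit partial (sum, count) pairs that the reducer merges. A sketch under that assumption (function names are mine):

```python
from collections import defaultdict

def combiner(pairs):
    """Locally pre-aggregate (day, value) pairs into (day, (sum, count)).
    Emitting partial sums and counts, rather than partial means, keeps
    the final mean exact."""
    partial = defaultdict(lambda: [0.0, 0])
    for day, value in pairs:
        partial[day][0] += value
        partial[day][1] += 1
    for day, (total, count) in partial.items():
        yield day, (total, count)

def reducer(pairs):
    """Merge the partial (sum, count) pairs from all combiners and
    emit the true mean per day."""
    merged = defaultdict(lambda: [0.0, 0])
    for day, (total, count) in pairs:
        merged[day][0] += total
        merged[day][1] += count
    for day, (total, count) in merged.items():
        yield day, total / count
```

Each combiner run stands in for one mapper node's local aggregation; only the small (sum, count) pairs cross the network.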

Hadoop Designing With Patterns

Filtering Patterns: don't change the records, only keep a part of the data! Examples: Simple Filter, Bloom Filter (more efficient), Sampling, Random Sampling, Top K

Top K

In an RDBMS you would normally sort the data first, then take the top K records.  In MapReduce this approach will not work, because the data is not sorted and is processed on several machines. Instead, each mapper first finds its own local top K list, without sorting the data, and sends that local list to the reducer, which then finds the global top K list.

Let’s find the top 5 longest posts in our forum! Read more of this post
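One common way to sketch this pattern is with a min-heap per mapper (the helper names are mine, not the original code): each mapper keeps only its own k largest values, and a single reducer merges the small local lists.

```python
import heapq

def local_top_k(lengths, k=5):
    """Mapper side: keep only the k largest post lengths seen so far,
    using a min-heap so nothing is ever fully sorted."""
    heap = []
    for length in lengths:
        if len(heap) < k:
            heapq.heappush(heap, length)
        elif length > heap[0]:
            # new value beats the smallest retained one; swap it in
            heapq.heapreplace(heap, length)
    return heap

def global_top_k(local_lists, k=5):
    """Reducer side: merge the mappers' local lists into the global
    top k, largest first."""
    return heapq.nlargest(k, (x for lst in local_lists for x in lst))
```

Each mapper ships at most k values to the reducer, so the network cost stays tiny no matter how many posts there are.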

MapReduce Code

Input Data

2012-01-01    2:01    Omaha    Book    10.51    Visa

It's tab-delimited; the values are the date, the time, the store name, a description of the item, the cost, and the method of payment.
Mapper Code

    import sys

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, storename, productname, cost, paymethod = data
            print("{0}\t{1}".format(storename, cost))

Reducer Code
In my case, I have a single Reducer, because that's the Hadoop default, so it will get all the keys. If I had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys. Read more of this post
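A sketch of what that reducer could look like for the sales data. It is written as a function over an iterable of lines rather than reading sys.stdin directly, so it is easy to test; the logic relies on Hadoop sorting the mapper output by key, so all lines for one store arrive together.

```python
def reducer(lines):
    """Sum the cost per store from sorted 'store<TAB>cost' lines,
    emitting a store's total whenever the key changes."""
    current_store, current_total = None, 0.0
    for line in lines:
        store, cost = line.strip().split("\t")
        if current_store is not None and store != current_store:
            # key changed: the previous store's values are complete
            yield "{0}\t{1}".format(current_store, current_total)
            current_total = 0.0
        current_store = store
        current_total += float(cost)
    if current_store is not None:
        yield "{0}\t{1}".format(current_store, current_total)
```

With more than one reducer, each instance would run this same loop over its own share of the keys.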

Running a Hadoop Job

In my local directory, I have the code for the mapper and the reducer.
Running a job
Read more of this post

MapReduce – Mappers and Reducers

How is that data processed with MapReduce?

Processing a large file serially from the top to the bottom could take a long time.

MapReduce is designed to be a very parallelized way of managing data, meaning that your input data is split into many pieces, and each piece is processed simultaneously.

A real-world scenario: a ledger which contains all the sales from thousands of stores around the USA, organized by date. Calculate the total sales generated by each store over the last year. One way is to start at the beginning of the ledger and, for each entry, write down the store name and the amount next to it. For each following entry, if the store name is already there, add the amount to that store. If not, add the new store name and that first purchase. And so on, and so on.
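The serial ledger walk described above is essentially a hash-map accumulation. A tiny sketch (the function name and entry format are my own, for illustration):

```python
from collections import defaultdict

def tally(entries):
    """One serial pass over (store, amount) ledger entries,
    accumulating each store's total in a dictionary keyed by name."""
    totals = defaultdict(float)
    for store, amount in entries:
        totals[store] += amount
    return dict(totals)
```

MapReduce parallelizes exactly this: mappers emit (store, amount) pairs from their own chunk of the ledger, and reducers do the summing per store.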
Read more of this post

Hadoop FS Shell Commands

Hadoop File System commands start with hadoop fs.
I want to put a couple of local files into HDFS.
FS Shell Guide

Read more of this post

Hadoop Cluster / Ecosystem

Core Hadoop consists of a way to store data, known as the Hadoop Distributed File System, or HDFS, and a way to process the data, called MapReduce. Split the data up and store it across a collection of machines, known as a cluster.

Then, when we want to process the data, we process it where it's actually stored. Rather than retrieving the data from a central server, it's already on the cluster, and we can process it in place. You can add more machines to the cluster (make the cluster bigger) as the amount of data you're storing grows. The machines in the cluster don't need to be particularly high-end, though most clusters are built using rack-mount servers.
Read more of this post

Big Data

Every day, billions of gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered from sensors and medical devices, videos, and transaction records.

Not Big Data – not everything is actually a big data problem. There are lots of cases where you can use traditional systems to store, manage, and process your data – for example, the order details for a single purchase at a store, or information about one person's bank account.

What is Big Data? – All the orders across hundreds of branches nationwide, or all the bank transactions made in the state of Wisconsin during the year.

Definition of Big Data – a data set of terabytes or more is considered 'big data'. It's data which can't comfortably be processed on a single machine. Big data is not just about the size of the data: the data is also created very fast, and data from different sources comes in different formats. Most of that data is not worthless but actually has a lot of value. Read more of this post