How Data Is Processed with MapReduce
Processing a large file serially from the top to the bottom could take a long time.
MapReduce is designed to be a highly parallel way of processing data: your input is split into many pieces, and each piece is processed simultaneously.
Real-world scenario: a ledger contains all the sales from thousands of stores around the USA, organized by date. The task is to calculate the total sales generated by each store over the last year. A naive approach is to start at the beginning of the ledger and, for each entry, write down the store name with the sale amount next to it. For each subsequent entry, if the store name is already written down, add the amount to that store's total; if not, write the new store name with that first amount. And so on, entry by entry.
Hashtables (Key -> Value)
In a traditional computing environment, you might use an associative array or hashtable to solve this problem: the key is the store name and the running total is the value. The problem with this approach is scale. Run it on 1 TB of data (millions of sales to process) and it takes an awfully long time just to read the file from disk, let alone process it, and you may even run out of memory to hold the array. So instead, here is how you would do this as a MapReduce job.
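The hashtable approach can be sketched in a few lines of Python. This is a minimal illustration only; the sample data and the tab-separated "store, amount" format are assumptions:

```python
from collections import defaultdict

# Tiny sample ledger: one sale per line as "store<TAB>amount" (format assumed).
ledger_lines = [
    "Miami\t12.34",
    "NYC\t99.07",
    "Miami\t3.21",
]

# Hashtable: key is the store name, value is the running total.
totals = defaultdict(float)
for line in ledger_lines:
    store, amount = line.split("\t")
    totals[store] += float(amount)

for store in sorted(totals):
    print(store, totals[store])
```

This works fine for a small file, but on terabytes of data the single loop and the in-memory table become the bottleneck, which is exactly what MapReduce avoids.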
Mappers and Reducers
Break the ledger down into chunks, and give each chunk to one of the Mappers. All of the Mappers can work at the same time, and each one is working on just a small fraction of the overall data.
Each Mapper takes the first record in its chunk and writes the store name and sale amount for that record on an index card. It then takes the next record and does the same thing. By the end, each Mapper has a pile of cards per store.
Once the Mappers have finished, the Reducers go to all the Mappers, retrieve the piles of cards for their own stores, and combine them into one large pile per store. We tell each Reducer in advance which stores it is responsible for. This collection step is fast, because each Mapper has already separated its cards into a pile per store. Each Reducer then goes through its piles one at a time, adds up all the amounts on all the cards in a pile, and that sum is the total sales for that store. To keep things organized, each Reducer works through its piles in alphabetical order. That is how MapReduce works.
The Mappers are programs which each deal with a relatively small amount of data, and they all work in parallel. The Mappers output what we call 'intermediate records' (the index cards in our analogy). Hadoop deals with all data in the form of key-value pairs. Once the Mappers have finished, a phase of MapReduce called the 'Shuffle and Sort' takes place. The shuffle is the movement of the intermediate records from the Mappers to the Reducers; the sort is the fact that the Reducers organize those sets of records into order by key. Finally, the Reduce phase works on one set of records at a time: it gets the key and a list of all that key's values, processes those values in some way (adding them, in our case), and then writes out its final data for that key.
Hadoop takes care of the Shuffle and Sort phase for you. You do not have to sort the keys in your Reducer code; you receive them in already sorted order.
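The whole pipeline can be simulated in plain Python. This is a sketch of the idea, not Hadoop itself; the sales records and the tab-separated layout are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# -- Map phase: emit (store, amount) key-value pairs (the "index cards").
def mapper(record):
    store, amount = record.split("\t")
    yield store, float(amount)

# -- Reduce phase: receives one key plus the list of all its values.
def reducer(store, amounts):
    return store, sum(amounts)

records = ["NYC\t123", "ROC\t12", "NYC\t45", "ROC\t13", "NYC\t77", "ROC\t100"]

# Shuffle and Sort: sort all intermediate pairs, then group them by key,
# so each reducer call sees every value for exactly one key.
intermediate = sorted(kv for rec in records for kv in mapper(rec))
results = [reducer(store, [v for _, v in group])
           for store, group in groupby(intermediate, key=itemgetter(0))]
print(results)  # [('NYC', 245.0), ('ROC', 125.0)]
```

In real Hadoop the map calls run on many machines at once and the framework performs the shuffle and sort across the network, but the contract is the same: the reducer gets each key with all of its values, already in sorted key order.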
Intermediate Records Example
The key-value pairs in the intermediate records don't have the sum in them. Instead, each key (the cities, in our case) has a list of its corresponding values from each of the Mappers.
NYC: $123, $45, $77, …
ROC: $12, $13, $100, …
This is the form in which the intermediate results pass from the Mappers to the Reducers.
On receiving the intermediate records from the Mappers (each Reducer receives only the records for the keys it is responsible for), the Reducer groups all the values for each identical key together and then processes each key (in our case, summing that key's values).
How could we get the final results in sorted order across all Reducers? You could either use a single Reducer (which doesn't scale very well), or add an extra step that merges the result files after the job completes.
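The merge step can be done efficiently because each Reducer's output file is already sorted by key, so the files can be merged in one pass. A sketch using Python's `heapq.merge`; the per-reducer outputs here are invented for illustration:

```python
import heapq

# Each reducer writes its output already sorted by key.
reducer_0_output = [("IOWA", 304.0), ("NYC", 245.0)]
reducer_1_output = [("NEB", 213.0), ("WIS", 125.0)]

# heapq.merge lazily combines several sorted sequences into one sorted stream,
# without loading or re-sorting everything in memory.
merged = list(heapq.merge(reducer_0_output, reducer_1_output))
print(merged)  # [('IOWA', 304.0), ('NEB', 213.0), ('NYC', 245.0), ('WIS', 125.0)]
```

Because the inputs are pre-sorted, the merge is linear in the total number of records, which is far cheaper than funneling all the data through a single Reducer.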
Which keys will go to the first of the two Reducers (NYC, WIS, NEB, IOWA)? There is no guarantee that each Reducer will get the same number of keys; one of them might even get none.
Overview of partitioning in Hadoop
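Hadoop's default partitioner assigns a key to a Reducer by hashing it and taking the result modulo the number of Reducers (in Java, roughly `key.hashCode() % numReduceTasks`). A rough Python analogue; the toy character-sum hash below is an assumption chosen so the example is deterministic, not Hadoop's actual hash:

```python
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash the key, then take it modulo the number of reducers.
    # (A deterministic toy hash: sum of character codes.)
    return sum(ord(c) for c in key) % num_reducers

for key in ["NYC", "WIS", "NEB", "IOWA"]:
    print(key, "-> reducer", partition(key))
```

The same key always lands on the same Reducer, which is what guarantees that all of a store's cards end up in one place; but nothing balances the number of distinct keys per Reducer, which is why one Reducer can receive more keys than another, or none at all.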
Daemons of MapReduce
When you run a MapReduce job, you submit the code to a machine called the JobTracker. The JobTracker splits the work into Map and Reduce tasks, and those tasks run on the other cluster nodes.
Running the actual Map and Reduce tasks is handled by a daemon called the TaskTracker, which runs on the same machines as the DataNodes. Because of this, the Hadoop framework can have Map tasks work on pieces of data that are stored on the same machine, which saves a lot of network traffic.
As we saw, each Mapper processes a portion of the input data known as an 'InputSplit'. Hadoop will try to make sure that a Mapper works on data stored on the same machine. If this green block, for example, needs processing, then the TaskTracker on one of the machines holding that block will likely be the one chosen to process it. That won't always be possible: the TaskTrackers on all three machines that have the green block could already be busy, in which case a different node will be chosen and the block will be streamed to it over the network. In practice this happens rather rarely.
The Mappers read the input data, and produce intermediate data which the Hadoop framework then passes to the Reducers — that’s the shuffle and sort. The Reducers process that data, and write their final output back to HDFS.