Running a Hadoop Job

In my local directory I have mapper.py and reducer.py, which contain the code for the mapper and the reducer.
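For reference, here is a minimal sketch of what those two scripts might look like, assuming each input line is tab-separated with the store name in the third field and the sale amount in the fifth (adjust the field indices to match your own data):

#!/usr/bin/env python
# mapper.py - emit (store, cost) for every sale record.
# Assumes tab-separated input with the store name in field 3 and the
# sale amount in field 5; change the indices to suit your data.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) >= 5:
        store, cost = fields[2], fields[4]
        print("%s\t%s" % (store, cost))

#!/usr/bin/env python
# reducer.py - sum the sale amounts for each store.
# Hadoop Streaming delivers the mapper output sorted by key, so all the
# lines for one store arrive together.
import sys

current_store = None
total = 0.0

for line in sys.stdin:
    parts = line.strip().split("\t")
    if len(parts) != 2:
        continue
    store, cost = parts
    if store != current_store:
        if current_store is not None:
            print("%s\t%s" % (current_store, total))
        current_store = store
        total = 0.0
    total += float(cost)

if current_store is not None:
    print("%s\t%s" % (current_store, total))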
Running a job

To submit a job, the command has this shape:
hadoop jar <path to the streaming jar> -mapper <mapper> -reducer <reducer> -file <mapper code file> -file <reducer code file> -input <input directory in HDFS> -output <output directory to which the reducers will write their output data>

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input myinput -output joboutput

Once the job has finished, the last line of the output tells us that the output directory is called joboutput.

Look in the job output directory and you’ll see that it contains three things:
_SUCCESS – which just tells me that the job completed successfully.
_logs directory – which contains some log information about what happened during the job’s run.
part-00000 file – the output from the one reducer we had for this job.
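You can list the directory to confirm this with the standard HDFS listing command:

hadoop fs -ls joboutput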

hadoop fs -cat joboutput/part-00000 | less

This displays the output from our reducer: the total sales broken down by store, exactly as we want it.

To retrieve data from HDFS and put it onto your local disk:

hadoop fs -get joboutput/part-00000 mylocalfile.txt


There’s also a shorter way to submit the job, using a shell alias:

hs mapper.py reducer.py myinput joboutput

The hs alias and its relatives are defined in ~/.bashrc:

# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

run_mapreduce() {
    # hs <mapper> <reducer> <input dir in HDFS> <output dir>
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}

run_mapreduce_with_combiner() {
    # hsc <mapper> <reducer> <input dir> <output dir> - reuses the reducer as the combiner
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper "$1" -reducer "$2" -combiner "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}

run_mapreduce_with_combiner2() {
    # hsc2 <mapper> <combiner> <reducer> <input dir> <output dir> - separate combiner script
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper "$1" -combiner "$2" -reducer "$3" -file "$1" -file "$2" -file "$3" -input "$4" -output "$5"
}

alias hs=run_mapreduce
alias hsc=run_mapreduce_with_combiner
alias hsc2=run_mapreduce_with_combiner2


# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
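With these functions in place, the combiner variant is submitted the same way. For example (joboutput2 is just an illustrative output directory, and reusing the reducer as a combiner only makes sense when the operation is associative and commutative, like summing):

hsc mapper.py reducer.py myinput joboutput2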

Important note – when you run a Hadoop job, the output directory must not already exist. If it does, Hadoop refuses to run the job. This is actually a feature of Hadoop: it’s designed to stop you inadvertently deleting or overwriting data that’s already in the cluster. So either specify a different output directory or delete the old one first.

Delete a folder

hadoop fs -rm -r -f joboutput

Processing Logs
Log processing is a really similar problem. Imagine you have a set of log files from a Web server, and you want to know how many times each page has been hit. It’s very close to the sales-per-store example: your Mapper will read a line of the log file at a time and will extract the name of the page – index.html, for example.

The intermediate data will have the name of the page as the key and a 1 as the value. When all the Mappers are done, each Reducer will get a key and the list of all the values for that key; summing the values gives the total number of hits to that page on the Web site. A sketch of such a Mapper is shown below.
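This is a minimal sketch, assuming the requested path is the seventh whitespace-separated field of each log line (as in the common Apache access-log format); adjust the index for your own logs. The Reducer is the same summing pattern as before.

#!/usr/bin/env python
# log_mapper.py - emit (page, 1) for every request in the access log.
# Assumes the requested path is the 7th whitespace-separated field,
# e.g. ... "GET /index.html HTTP/1.1" ...; adjust the index if needed.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        page = fields[6]
        print("%s\t1" % page)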

Number of Reducers
How many Reducers do you want for your job?
The default is to have a single Reducer, but for large amounts of data it often makes sense to have many more than one. Otherwise, that one Reducer will end up having to process a huge amount of data from the Mappers. The Hadoop framework decides which keys get sent to each Reducer, and there’s no guarantee that each Reducer will get the same number of keys. The keys which go to a particular Reducer are sorted, but each Reducer writes its own file into HDFS. So if, for example, we had four keys: a, b, c, and d, and two Reducers, then one Reducer might get keys a and d, the other might get b and c. So the results would be sorted within each Reducer’s output, but just joining the files together wouldn’t produce completely sorted output.
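To ask for more than one Reducer when submitting a streaming job, there is a -numReduceTasks option; as a sketch, asking for two Reducers would look like this:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -numReduceTasks 2 -input myinput -output joboutput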

Which of the following types of problem do you think are good candidates to solve with MapReduce?
Detecting anomalous behavior from a log file
Calculating returns from a large number of stock portfolios
Very large matrix inversion

The first two – detecting anomalous behavior from a log file and calculating returns from a large number of stock portfolios – are good candidates to solve with MapReduce. The reason very large matrix inversion is not is that matrix manipulation tends to require holding the entire contents of the matrices in memory at once, rather than processing individual portions. You can do it with MapReduce, but it turns out to be quite difficult.
