Understanding MapReduce (What, Why, How)
What is MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in-parallel on large clusters of hardware in a reliable, fault-tolerant manner.
The name itself explains a lot: MapReduce represents the two phases of a typical data-processing job, the Map task and the Reduce task.
Map Task
A map task processes input records one at a time and turns each into intermediate output records (typically one or more per input record); the output of the map tasks becomes the input of the reduce tasks.
Each intermediate record can be thought of as a (key, value, partition) triple. Records that fall into the same partition are fetched by the same reducer.
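To make this concrete, here is a minimal sketch of a map task in Java, in the style of the classic word count example. The class and field names are illustrative, not taken from any particular codebase; which reducer ends up with a record is decided by the partitioner, which by default hashes the key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a map task (word count style): for each input line, emit one
// (word, 1) intermediate record. The partition of each record is computed
// later by the partitioner (by default, a hash of the key).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // one intermediate (key, value) record
            }
        }
    }
}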
Reduce Task
A reduce task takes all the map output records that share the same key and aggregates them according to the specified logic.
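A matching sketch of the reduce side, again with illustrative names: all values for one key arrive in a single reduce() call, which here simply sums them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a reduce task: all intermediate values that share the same key
// are handed to one reduce() call, which aggregates them (here, summing counts).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);  // one aggregated record per key
    }
}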
Why we need MapReduce
Why? Let's go back to the definition of MapReduce:
"process vast amounts of data in-parallel on large clusters of hardware"
When the data is vast enough, no single computer, however powerful, can load all of it into memory and run whatever computation you want over it.
Say we have a 1 TB file full of numbers and we want to sum them up. How can we do that?
Remember, we cannot load the whole file into memory, so one feasible approach is to load one chunk at a time and repeat until the entire file has been processed.
This works, but it is inevitably slow, because a single machine does everything sequentially.
What if we instead split the file across different nodes, let each node process its chunk separately, and finally combine the partial results?
That is the philosophy of MapReduce: divide and conquer. A toy sketch of the idea follows.
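This is not MapReduce code, just a plain-Java illustration of the divide-and-conquer idea under the assumption that the data has already been split into chunks: each chunk is summed independently (threads here stand in for cluster nodes), and the partial sums are then combined.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy divide-and-conquer sum: process each chunk independently, then combine.
public class ChunkedSum {
    public static long sum(List<long[]> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> partials = new ArrayList<>();
        for (long[] chunk : chunks) {
            partials.add(pool.submit(() -> {
                long s = 0;
                for (long n : chunk) s += n;   // "map" side: per-chunk work
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> p : partials) total += p.get();  // "reduce" side: combine
        pool.shutdown();
        return total;
    }
}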
How to use MapReduce
I will briefly show how to run one of the official example MapReduce jobs through the CLI.
To run a MapReduce job we first need a running HDFS; you can follow this tutorial to set one up. Then follow these steps (a sketch of the word count job that the example jar runs is shown after the list):
- prepare a local text file
- create an input directory on HDFS:
hdfs dfs -mkdir -p /data/wc/input
- upload local text file to HDFS:
hdfs dfs -put <local_file> /data/wc/input
- go to the directory containing the example jars shipped with Hadoop:
cd $HADOOP_HOME/share/hadoop/mapreduce
- run the wordcount example from the examples jar:
hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /data/wc/input /data/wc/output
- check the generated files:
hdfs dfs -ls /data/wc/output
- check the content of a generated file:
hdfs dfs -cat /data/wc/output/<file_name>
- download the output directory to the local machine:
hdfs dfs -get /data/wc/output /<local_path>
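The output files (typically named something like part-r-00000) contain one word and its count per line, separated by a tab. For reference, the wordcount run above is roughly equivalent to a driver like the one below, which wires together the mapper and reducer sketched earlier; the class names are illustrative (they are not the exact classes inside the example jar), and the input/output paths are the ones used in the steps above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a word count driver: wires the mapper and reducer together,
// points the job at the HDFS input and output paths, and submits it.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // from the map task sketch above
        job.setCombinerClass(SumReducer.class);     // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);      // from the reduce task sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/wc/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/wc/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You do not need to write this yourself to follow the steps above, since the examples jar already contains a word count program; the sketch is only meant to show what the hadoop jar command is actually running.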