MapReduce VS Spark - Wordcount Example

With MapReduce having clocked a decade since its introduction, and with newer big data frameworks emerging, let's do a code comparison between Hadoop MapReduce and Apache Spark, a general-purpose compute engine for both batch and streaming data.
We begin with the hello world program of the big data world, a.k.a. wordcount, run on Mark Twain's collected works dataset.

Problem Statement:
Count the occurrences of every word in Mark Twain's collected works and list the words in descending order of their counts.

Solution:
We look at MapReduce as well as Spark code to see how the problem statement can be solved using each of the frameworks.

MapReduce:

WordCountMapper.java:
Using TextInputFormat, with the byte offset as the key and the record (line) as the value, we create the WordCountMapper class, which removes all special characters using the regex [^\\w\\s]. The endings "'s", "ly", "ed", "ing", "ness" are removed using the regex ('s|ly|ed|ing|ness), and all words are converted to lowercase. The mapper then emits (word, 1) as the K,V pair.
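
Below is a minimal sketch of how such a mapper could look, assuming Text/IntWritable output types and the cleanup regexes described above; field and variable names are illustrative, not necessarily those used in the project:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// WordCountMapper.java -- sketch of the first job's mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Remove special characters and the listed endings, then lowercase the line
    String line = value.toString()
        .replaceAll("[^\\w\\s]", "")
        .replaceAll("('s|ly|ed|ing|ness)", "")
        .toLowerCase();
    // Emit (word, 1) for every token in the line
    for (String token : line.split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}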

WordCountReducer.java:
The reducer logic aggregates the counts for each key and outputs the (word, count) pair, which is later passed on to SortingMapper.java to sort the words in descending order of their counts.
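
A sketch of the corresponding reducer, which simply sums the 1s emitted by the mapper for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// WordCountReducer.java -- sums the counts per word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    // Emit (word, total count)
    context.write(key, new IntWritable(sum));
  }
}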

As the words have to be sorted in descending order of their counts, the output of the first MapReduce job is fed into a second MapReduce job that performs the sorting.

SortingMapper.java:
The SortingMapper takes the (word, count) pairs from the first MapReduce job and emits (count, word) to the reducer. As sorting happens only on keys in a MapReduce job, the count is emitted as the key and the word as the value.
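
Assuming the first job writes its output with the default tab-separated TextOutputFormat, the sorting mapper could look roughly like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// SortingMapper.java -- swaps (word, count) to (count, word) so the shuffle sorts on count
public class SortingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Input lines look like "word<TAB>count" from the first job's output
    String[] parts = value.toString().split("\t");
    if (parts.length == 2) {
      context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
  }
}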

sortingComparator.java:
As MapReduce sorts the results in ascending order by default, we need to write a custom sorting comparator that sorts the keys coming from the mapper in descending order before they are passed on to the reducer.

Keys (counts) from every mapper are compared against one another with the logic int ret = v1.get() < v2.get() ? 1 : (v1.get() == v2.get() ? 0 : -1);
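
A sketch of such a comparator built on Hadoop's WritableComparator (the class name here follows Java convention and may differ from the file name used in the repo):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// sortingComparator.java -- orders the IntWritable count keys in descending order
public class SortingComparator extends WritableComparator {

  protected SortingComparator() {
    super(IntWritable.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    IntWritable v1 = (IntWritable) a;
    IntWritable v2 = (IntWritable) b;
    // Reverse the natural ascending order of the counts
    int ret = v1.get() < v2.get() ? 1 : (v1.get() == v2.get() ? 0 : -1);
    return ret;
  }
}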


sortingReducer.java:
The reducer simply reverses the (K, V) pair and emits (word, count) as output.
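
A minimal sketch of that reducer; since several words can share the same count, each value for a key is emitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// sortingReducer.java -- flips (count, word) back to (word, count) for the final output
public class SortingReducer extends Reducer<IntWritable, Text, Text, IntWritable> {

  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text word : values) {
      context.write(new Text(word), key);
    }
  }
}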

WordCountDriver.java:
We specify two Job objects. The first job, wordcount, performs the word count on the dataset, and the second job, sort, sorts the output of the first job in descending order.
The driver takes three arguments: the first is the input path, the second is the output directory of the first MapReduce job, and the third is the output directory of the second MapReduce job. The output of the first job is used as the input to the second. The rest of the code is self-explanatory.
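
A rough sketch of how the driver could chain the two jobs, using the class names from the sketches above; the actual project may wire the jobs slightly differently:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordCountDriver.java -- chains the wordcount job and the sort job
public class WordCountDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Job 1: wordcount -- reads args[0], writes args[1]
    Job wordcount = Job.getInstance(conf, "wordcount");
    wordcount.setJarByClass(WordCountDriver.class);
    wordcount.setMapperClass(WordCountMapper.class);
    wordcount.setReducerClass(WordCountReducer.class);
    wordcount.setOutputKeyClass(Text.class);
    wordcount.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(wordcount, new Path(args[0]));
    FileOutputFormat.setOutputPath(wordcount, new Path(args[1]));
    if (!wordcount.waitForCompletion(true)) {
      System.exit(1);
    }

    // Job 2: sort -- reads the first job's output (args[1]), writes args[2]
    Job sort = Job.getInstance(conf, "sort");
    sort.setJarByClass(WordCountDriver.class);
    sort.setMapperClass(SortingMapper.class);
    sort.setReducerClass(SortingReducer.class);
    sort.setSortComparatorClass(SortingComparator.class);
    sort.setMapOutputKeyClass(IntWritable.class);
    sort.setMapOutputValueClass(Text.class);
    sort.setOutputKeyClass(Text.class);
    sort.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(sort, new Path(args[1]));
    FileOutputFormat.setOutputPath(sort, new Path(args[2]));
    System.exit(sort.waitForCompletion(true) ? 0 : 1);
  }
}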

Run:
Submit the MapReduce job by passing the source file, the intermediate output directory, and the final output directory as arguments:
yarn jar wordcount-0.0.1-SNAPSHOT.jar com.stdatalabs.mapreduce.wordcount.WordCountDriver MarkTwain.txt mrWordCount mrWordCount_sorted

Output:


Spark:
As can be seen above, MapReduce restricts all the logic to mappers and reducers, and we end up writing a lot of boilerplate code rather than the actual data processing logic. This results in more lines of code for a simple use case like wordcount, which brings us to Apache Spark.
Spark provides a more flexible approach using RDDs, which are efficient for iterative algorithms because the data can be cached in memory after the first read instead of being read from disk multiple times. Spark also provides many more transformations and actions than just map and reduce.

Driver.scala:
The entire wordcount logic can be written in one Scala class.
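
A minimal sketch of what that class could look like, reusing the cleanup regexes from the mapper above; the code in the repo may differ in its details:

import org.apache.spark.{SparkConf, SparkContext}

// Driver.scala -- Spark wordcount with a descending sort on the counts
object Driver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordcount"))

    sc.textFile(args(0))                                  // read the input file
      .map(_.replaceAll("[^\\w\\s]", "")                  // strip special characters
        .replaceAll("('s|ly|ed|ing|ness)", "")            // drop the listed endings
        .toLowerCase)                                     // lowercase the line
      .flatMap(_.split("\\s+"))                           // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))                             // emit (word, 1)
      .reduceByKey(_ + _)                                 // sum counts per word
      .sortBy(_._2, ascending = false)                    // descending order of counts
      .saveAsTextFile(args(1))                            // write to the target directory

    sc.stop()
  }
}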


Run:
Submit the Spark application by passing the source and target directories as arguments to the following command:
spark-submit --class com.stdatalabs.SparkWordcount.Driver --master yarn-client SparkWordcount-0.0.1-SNAPSHOT.jar /user/cloudera/MarkTwain.txt /user/cloudera/sparkWordCount

Output:

The dataset and the complete project can be found in this GitHub repo.

You may also like:
MapReduce VS Spark - Aadhaar dataset analysis

MapReduce VS Spark - Secondary Sort Example

MapReduce VS Spark - Inverted Index Example

Spark Streaming part 1: Real time twitter sentiment analysis
