MapReduce
Learn MapReduce with Playing Cards
What
1. Prelude: functional constructs
In functional programming, functions are first-class objects: they can be passed into a function as arguments, and a function can also be returned from a function as output. JavaScript, Python, and other languages provide such functional programming facilities.
Example
In Python, the functions map(), filter() and reduce() all accept a function as input (in Python 3, reduce() has moved to the functools module).
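For instance (a minimal Python 3 sketch, where reduce() must be imported from functools):

```python
from functools import reduce  # in Python 3, reduce() lives in functools

nums = [1, 2, 3, 4, 5]

# each call takes a function (here a lambda) as its first argument
squares = list(map(lambda x: x * x, nums))      # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
total = reduce(lambda a, b: a + b, nums)          # 15
```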
MapReduce
MapReduce is a programming paradigm invented at Google. It has become wildly popular because it is designed to be applied to Big Data in NoSQL DBs, in a data- and disk-parallel fashion - resulting in dramatic processing gains.
Why
1. Shortcomings of traditional databases, and the need that arises
Modern databases store data as key/value pairs, resulting in explosive growth when it comes to number of rows and file sizes.
Traditional, sequential, single-machine access does NOT work at all - what is needed is a massively parallel way to process the data.
Note that we are talking about the SIMD (single instruction, multiple data) form of parallelism: the same operation applied to many data segments at once.
- In the Big Data era, a single machine can no longer cope; multiple machines must process the data in parallel
How
1. MapReduce works like this:

1) [big] data is split into segments, held in a compute cluster
2) a mapper task is run in parallel on all the segments (i.e. on each cluster node, therefore on each segment); each mapper produces output in the form of (key, value) pairs
3) related key/value pairs from all mappers are forwarded to a shuffler (there are multiple shufflers); each shuffler consolidates the values for its key into a list
4) shufflers forward their keys and lists to reducer tasks; each reducer processes its list and emits a single value (for its key)
- The cluster user (programmer) only needs to supply a mapper task and a reducer task; the rest is handled automatically!
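The four steps above can be sketched as a single-process Python simulation (the classic word-count example; the mapper/shuffle/reducer names here are illustrative, not a real framework API):

```python
from collections import defaultdict

# 1) split: the [big] data arrives pre-split into segments
segments = [["deer", "bear", "river"],
            ["car", "car", "river"],
            ["deer", "car", "bear"]]

# 2) map: each mapper emits (key, value) pairs -- here (word, 1)
def mapper(segment):
    return [(word, 1) for word in segment]

mapped = [pair for seg in segments for pair in mapper(seg)]

# 3) shuffle: consolidate the values for each key into a list
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# 4) reduce: each reducer emits a single value for its key
def reducer(key, values):
    return key, sum(values)

counts = dict(reducer(k, v) for k, v in shuffled.items())
# counts == {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster the mappers, shufflers and reducers run in parallel on different machines; only the mapper and reducer functions are user-supplied.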
2. GFS(Google File System)
Why
Since MapReduce involves accessing (reading, writing) distributed (clustered) data in parallel, there needs to be a high-performance, distributed file system to go along with it.
- By its very design, MapReduce requires a distributed file system
How
GFS abstracts details of network file access so that remote reads/writes and local reads/writes are handled (in code) identically.
3. Hadoop & HDFS
What
Hadoop is modeled after the MapReduce paradigm, and is utilized identically (by having users run mappers and reducers on (big) data).
Hadoop Distributed File System (HDFS) is modeled after Google's GFS, but with some important differences.
4. Hadoop 'ecosystem' (Hadoop stack)

- Hadoop Layer: MapReduce & HDFS
- Tools Layer: Hive, Pig ...
4.1 Hadoop: Hive
Hive provides a SQL-like scripting language called HQL.
Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL abstraction.
- A SQL-like language that makes writing MR jobs convenient
4.2 Hadoop: Pig
- Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows
- Provides the parallel-processing engine that MR-style jobs need
Pig Latin scripts are compiled into MR jobs that are then run on the cluster.
Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
- A scripting language for describing parallel data flows, making MR jobs convenient to write
4.3 Hadoop: Oozie

- A scalable workflow system, Oozie is integrated into the Hadoop stack, and is used to coordinate execution of multiple MapReduce jobs
4.4 Hadoop: Musketeer
5. MRv2: YARN
YARN (Yet Another Resource Negotiator) is "MapReduce v2"
The first version of MR/Hadoop was 'batch oriented', meaning that static, distributed data was processed via mapping, shuffling and reducing steps.

- YARN, on the other hand, makes non-MR applications (e.g. graph processing, iterative modeling) possible, while remaining fully backwards compatible with v1 (i.e. it can still run MapReduce jobs). It offers better scalability and cluster utilization than MRv1, and it also makes (near) real-time applications possible.

6. Beyond MR
6.1 Spark
Why
Spark makes Big Data processing real-time and interactive.
Better efficiency: general execution graphs, in-memory data storage.
SparkSQL adds a SQL interface on top of Spark.
MR could not deal with complex (multi-pass) processing, interactive (ad-hoc) queries or real-time (stream) processing; these are exactly the workloads Spark targets.
How
Big idea: resilient distributed datasets (RDDs)
- distributed collections of objects that can be cached in memory across the cluster
- manipulated through parallel operators
- automatically recomputed on failure
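The RDD idea can be sketched in plain Python (a toy, not the real Spark API; ToyRDD and its methods are made up for illustration): each dataset remembers the function that derives it (its lineage), so a lost in-memory copy can simply be recomputed:

```python
class ToyRDD:
    """A toy RDD: remembers its lineage so it can be recomputed on failure."""
    def __init__(self, compute):
        self._compute = compute   # lineage: a function that (re)builds the data
        self._cache = None

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def cache(self):
        self._cache = self._compute()
        return self

    def collect(self):
        # serve from memory if cached, otherwise recompute from lineage
        return self._cache if self._cache is not None else self._compute()

rdd = ToyRDD.parallelize(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
rdd.cache()
rdd._cache = None        # simulate losing the in-memory copy ("failure")
print(rdd.collect())     # recomputed from lineage: [0, 4, 16]
```

The real Spark additionally partitions each RDD across the cluster and runs the operators in parallel; the lineage-based recomputation shown here is what "resilient" refers to.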
6.2 Flink
What
Similar to MR, Flink is a parallel data processing platform.
Flink offers Map and Reduce functions but also additional transformations like Join, CoGroup, Filter, and Iterations.
Flink's programming model is a superset of the MapReduce programming model.
6.3 Storm
What
- Apache Storm is a free and open source distributed realtime computation system
Why
- Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Usage
realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more
7. BSP: MR alternative
What
Bulk Synchronous Parallel
How
A BSP computation is executed on a set of processors which are connected by a communication network but work independently of each other.
The BSP computation consists of a sequence of iterations, called supersteps.
There are three steps (phases) in each superstep:
- Local computation: every processor performs computations using data stored in local memory, independent of what happens at other processors; a processor can contain several processes (threads)
- Communication: exchange of data between processes (put and get); one-sided communication
- Barrier synchronization: all processes wait until everyone has finished the communication step
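A minimal single-process sketch of these three phases in Python (bsp_max is a hypothetical example that propagates the maximum value held by any processor to all of them):

```python
# Toy BSP: each "processor" holds a number; in every superstep it sends its
# value to all others, then (after the barrier) keeps the maximum seen so far.
def bsp_max(values, supersteps=3):
    procs = list(values)
    inboxes = [[] for _ in procs]
    for _ in range(supersteps):
        # phase 1: local computation -- decide what to send
        outgoing = [(i, v) for i, v in enumerate(procs)]
        # phase 2: communication -- one-sided puts into remote inboxes
        for sender, value in outgoing:
            for receiver in range(len(procs)):
                if receiver != sender:
                    inboxes[receiver].append(value)
        # phase 3: barrier -- only once everyone has finished communicating
        # does each processor read its inbox and update its local state
        procs = [max([procs[i]] + inboxes[i]) for i in range(len(procs))]
        inboxes = [[] for _ in procs]
    return procs

print(bsp_max([3, 9, 1, 7]))   # every processor converges to 9: [9, 9, 9, 9]
```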
The following figure illustrates the actions applied in one superstep

8. BSP->Pregel
What
- The Bulk Synchronous Parallel (BSP) Model inspired Pregel.
How
Key: "Think like a vertex"
In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and that of its outgoing edges, and mutate the graph's topology.
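The "think like a vertex" style can be sketched in plain Python (a toy, single-process version of the classic maximum-value example; pregel_max and the sample graph are made up for illustration):

```python
# Each vertex tells its neighbours its current value and adopts the largest
# value it hears; a vertex that did not change "votes to halt" (goes inactive).
def pregel_max(edges, values):
    # superstep 0: every vertex is active and sends its value to its neighbours
    active = set(values)
    while active:
        inbox = {v: [] for v in values}
        for v in active:
            for neighbour in edges.get(v, []):
                inbox[neighbour].append(values[v])
        # next superstep: a vertex reactivates only if a message raised its value
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)   # changed -> stays active; others vote to halt
    return values

edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(pregel_max(edges, {"a": 3, "b": 6, "c": 2, "d": 1}))
# {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

The computation terminates when every vertex has voted to halt and no messages are in flight, exactly as in Pregel.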
Pregel-> Giraph
What
Giraph is an open-source implementation of Pregel.
Specifically designed for iterative graph computations (and nothing else!).
- Does graph computation only

Pregel-> HAMA
What
- Apache HAMA is a general-purpose Bulk Synchronous Parallel (BSP) computing engine on top of Hadoop. It provides a parallel processing framework for massive iterative algorithms.
How
HAMA performs a series of BSP supersteps. It is well suited to iterative computation because input data that fits in memory can be carried over between supersteps (unlike in MR).
But HAMA is not merely a graph computing engine - it is a general-purpose BSP platform, so arbitrary computations (graph processing, machine learning, MRQL, matrix algorithms, network algorithms, ...) can be implemented on top of it.
- Not limited to graph computation; it can do much more