MapReduce
Learn MapReduce with Playing Cards
What
1. Prelude: functional constructs
In functional programming, functions are first-class objects: they can be passed into a function as arguments, and a function can also be returned from a function as output. JavaScript, Python, and other languages provide such functional programming facilities.
Example
In Python, the functions map(), filter() and reduce() all accept a function as input (in Python 3, reduce() has moved to the functools module).
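For instance (a minimal Python 3 sketch, where reduce() must be imported from functools):

```python
from functools import reduce  # in Python 3, reduce() lives in functools

nums = [1, 2, 3, 4, 5]

# each call takes a function (here a lambda) as its first argument
squares = list(map(lambda x: x * x, nums))      # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
total = reduce(lambda a, b: a + b, nums)          # 15
```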
MapReduce
MapReduce is a programming paradigm invented at Google. It has become wildly popular because it is designed to be applied to Big Data in NoSQL DBs, in a data- and disk-parallel fashion - resulting in dramatic processing gains.
Why
1. Shortcomings of traditional databases, and the need that arises
Modern databases store data as key/value pairs, resulting in explosive growth when it comes to number of rows and file sizes.
Traditional, sequential, single-machine access does NOT work at all - what is needed is a massively parallel way to process the data.
Note that we are talking about the SIMD (single instruction, multiple data) form of parallelism: the same operation applied to many data segments at once.
- In the Big Data era, a single machine can no longer cope; multiple machines must process the data in parallel
How
1. MapReduce works like this:

1) [big] data is split into segments, held in a compute cluster
2) a mapper task is run in parallel on all the segments (i.e. on each cluster node, therefore on each segment); each mapper produces output in the form of (key, value) pairs
3) related key/value pairs from all mappers are forwarded to a shuffler (there are multiple shufflers); each shuffler consolidates the values for its key into a list
4) shufflers forward their keys and lists to reducer tasks; each reducer processes its list and emits a single value (for its key)
- The cluster user (programmer) only needs to supply a mapper task and a reducer task; the rest is handled automatically!
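The four steps above can be sketched as a single-process Python simulation (the classic word-count example; the mapper/shuffle/reducer names here are illustrative, not a real framework API):

```python
from collections import defaultdict

# 1) split: the [big] data arrives pre-split into segments
segments = [["deer", "bear", "river"],
            ["car", "car", "river"],
            ["deer", "car", "bear"]]

# 2) map: each mapper emits (key, value) pairs -- here (word, 1)
def mapper(segment):
    return [(word, 1) for word in segment]

mapped = [pair for seg in segments for pair in mapper(seg)]

# 3) shuffle: consolidate the values for each key into a list
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# 4) reduce: each reducer emits a single value for its key
def reducer(key, values):
    return key, sum(values)

counts = dict(reducer(k, v) for k, v in shuffled.items())
# counts == {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster the mappers, shufflers and reducers run in parallel on different machines; only the mapper and reducer functions are user-supplied.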
2. GFS(Google File System)
Why
Since MapReduce involves accessing (reading, writing) distributed (clustered) data in parallel, there needs to be a high-performance, distributed file system to go along with it.
- By its very design, MapReduce requires a distributed file system
How
GFS abstracts details of network file access so that remote reads/writes and local reads/writes are handled (in code) identically.
3. Hadoop & HDFS
What
Hadoop is modeled after the MapReduce paradigm, and is utilized identically (by having users run mappers and reducers on (big) data).
Hadoop Distributed File System (HDFS) is modeled after Google's GFS, but with some important differences.
4. Hadoop 'ecosystem' (Hadoop stack)

- Hadoop Layer: MapReduce & HDFS
- Tools Layer: Hive, Pig ...
4.1 Hadoop: Hive
Hive provides a SQL-like scripting language called HQL.
Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL abstraction.
- A SQL-like language that makes writing MR jobs convenient
4.2 Hadoop: Pig
- Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows
- Provides the parallel-processing engine that MR-style jobs need
Pig Latin scripts are compiled into MR jobs that are then run on the cluster.
Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
- A scripting language for describing parallel data flows, making MR jobs convenient to write
4.3 Hadoop: Oozie

- A scalable workflow system, Oozie is integrated into the Hadoop stack, and is used to coordinate execution of multiple MapReduce jobs
4.4 Hadoop: Musketeer
5. MRv2: YARN
YARN (Yet Another Resource Negotiator) is "MapReduce v2"
The first version of MR/Hadoop was 'batch oriented', meaning that static, distributed data was processed via mapping, shuffling and reducing steps.

- YARN, on the other hand, makes non-MR applications (e.g. graph processing, iterative modeling) possible, while remaining fully backwards compatible with v1 (i.e. it can still run MapReduce jobs). It offers better scalability and cluster utilization than MRv1, and it also makes (near) real-time applications possible.

6. Beyond MR
6.1 Spark
Why
Spark makes Big Data processing real-time and interactive.
Better efficiency: general execution graphs, in-memory data storage.
SparkSQL adds a SQL interface on top of Spark.
MR could not deal with complex (multi-pass) processing, interactive (ad-hoc) queries or real-time (stream) processing; these are exactly the workloads Spark targets.
How
Big idea: resilient distributed datasets (RDDs)
- distributed collections of objects that can be cached in memory across the cluster
- manipulated through parallel operators
- automatically recomputed on failure
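The RDD idea can be sketched in plain Python (a toy, not the real Spark API; ToyRDD and its methods are made up for illustration): each dataset remembers the function that derives it (its lineage), so a lost in-memory copy can simply be recomputed:

```python
class ToyRDD:
    """A toy RDD: remembers its lineage so it can be recomputed on failure."""
    def __init__(self, compute):
        self._compute = compute   # lineage: a function that (re)builds the data
        self._cache = None

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def cache(self):
        self._cache = self._compute()
        return self

    def collect(self):
        # serve from memory if cached, otherwise recompute from lineage
        return self._cache if self._cache is not None else self._compute()

rdd = ToyRDD.parallelize(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
rdd.cache()
rdd._cache = None        # simulate losing the in-memory copy ("failure")
print(rdd.collect())     # recomputed from lineage: [0, 4, 16]
```

The real Spark additionally partitions each RDD across the cluster and runs the operators in parallel; the lineage-based recomputation shown here is what "resilient" refers to.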
6.2 Flink
What
Similar to MR, Flink is a parallel data processing platform.
Flink offers Map and Reduce functions but also additional transformations like Join, CoGroup, Filter, and Iterations.
Flink's programming model is a superset of the MapReduce programming model.
6.3 Storm
What
- Apache Storm is a free and open source distributed realtime computation system
Why
- Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Usage
realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more
7. BSP: MR alternative
What
Bulk Synchronous Parallel
How
A BSP computation is executed on a set of processors which are connected by a communication network but work independently of each other.
The BSP computation consists of a sequence of iterations, called supersteps.
There are three steps (phases) in each superstep:
- Local computation: every processor performs computations using data stored in local memory, independent of what happens at other processors; a processor can contain several processes (threads)
- Communication: exchange of data between processes (put and get); one-sided communication
- Barrier synchronization: all processes wait until everyone has finished the communication step
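A minimal single-process sketch of these three phases in Python (bsp_max is a hypothetical example that propagates the maximum value held by any processor to all of them):

```python
# Toy BSP: each "processor" holds a number; in every superstep it sends its
# value to all others, then (after the barrier) keeps the maximum seen so far.
def bsp_max(values, supersteps=3):
    procs = list(values)
    inboxes = [[] for _ in procs]
    for _ in range(supersteps):
        # phase 1: local computation -- decide what to send
        outgoing = [(i, v) for i, v in enumerate(procs)]
        # phase 2: communication -- one-sided puts into remote inboxes
        for sender, value in outgoing:
            for receiver in range(len(procs)):
                if receiver != sender:
                    inboxes[receiver].append(value)
        # phase 3: barrier -- only once everyone has finished communicating
        # does each processor read its inbox and update its local state
        procs = [max([procs[i]] + inboxes[i]) for i in range(len(procs))]
        inboxes = [[] for _ in procs]
    return procs

print(bsp_max([3, 9, 1, 7]))   # every processor converges to 9: [9, 9, 9, 9]
```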
The following figure illustrates the actions applied in one superstep

8. BSP->Pregel
What
- The Bulk Synchronous Parallel (BSP) Model inspired Pregel.
How
Key: "Think like a vertex"
In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and that of its outgoing edges, and mutate the graph's topology.
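The "think like a vertex" style can be sketched in plain Python (a toy, single-process version of the classic maximum-value example; pregel_max and the sample graph are made up for illustration):

```python
# Each vertex tells its neighbours its current value and adopts the largest
# value it hears; a vertex that did not change "votes to halt" (goes inactive).
def pregel_max(edges, values):
    # superstep 0: every vertex is active and sends its value to its neighbours
    active = set(values)
    while active:
        inbox = {v: [] for v in values}
        for v in active:
            for neighbour in edges.get(v, []):
                inbox[neighbour].append(values[v])
        # next superstep: a vertex reactivates only if a message raised its value
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)   # changed -> stays active; others vote to halt
    return values

edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(pregel_max(edges, {"a": 3, "b": 6, "c": 2, "d": 1}))
# {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

The computation terminates when every vertex has voted to halt and no messages are in flight, exactly as in Pregel.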
Pregel-> Giraph
What
Giraph is an open-source implementation of Pregel.
Specifically designed for iterative graph computations (and nothing else!).
- Does graph computation only

Pregel-> HAMA
What
- Apache HAMA is a general-purpose Bulk Synchronous Parallel (BSP) computing engine on top of Hadoop. It provides a parallel processing framework for massive iterative algorithms.
How
HAMA performs a series of BSP supersteps. It is well suited to iterative computation because input data that fits in memory can be carried over between supersteps (unlike in MR).
But HAMA is not merely a graph computing engine - it is a general-purpose BSP platform, so arbitrary computations (graph processing, machine learning, MRQL, matrix algorithms, network algorithms, ...) can be implemented on top of it.
- Not limited to graph computation; it can do much more