Programming models, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
MapReduce: programming model (data parallelism, divide‐and‐conquer paradigm, map and reduce functions), cluster architecture (master, workers, message passing, data distribution), map and reduce functions (input arguments, emission and reduction of intermediate key-value pairs, final output), data flow phases (mapping, shuffling, reducing), input parsing (input file, split, record), execution steps (parsing, mapping, partitioning, combining, merging, reducing), combine function (commutativity, associativity), additional functions (input reader, partition, compare, output writer), implementation details (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, …)
Apache Hadoop: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, …); HDFS module: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode and DataNode nodes, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, …); MapReduce module: architecture (JobTracker and TaskTracker nodes), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema
What’s a programming model?
programming model = abstraction of an underlying computer system
implementation details are hidden, only a public interface is exposed (= how the system expects us to behave and how we control it; within these defined bounds we can create our own algorithms and data structures)
parallel programming models
parallel processes can share data (interact) in different ways:
shared memory between parallel processes; this works only on a single machine (not across the nodes of a cluster)
message passing (in clusters)
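The two interaction styles above can be contrasted in a small sketch. This is only an illustration under stated assumptions: Python threads stand in for parallel processes, a `queue.Queue` stands in for the network channel a cluster would use, and the function names `shared_memory_sum` and `message_passing_sum` are made up for this example.

```python
import queue
import threading


def shared_memory_sum(numbers, n_workers=4):
    """Shared memory: all workers update one common variable under a lock."""
    total = {"value": 0}          # state visible to every worker
    lock = threading.Lock()       # protects the shared state

    def worker(chunk):
        partial = sum(chunk)
        with lock:                # only one worker may update at a time
            total["value"] += partial

    chunks = [numbers[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(numbers, n_workers=4):
    """Message passing: workers share nothing and send results as messages."""
    mailbox = queue.Queue()       # stand-in for a communication channel

    def worker(chunk):
        mailbox.put(sum(chunk))   # send the partial result to the collector

    chunks = [numbers[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The collector receives exactly one message per worker.
    return sum(mailbox.get() for _ in range(n_workers))
```

In the first function the workers must coordinate access to the common variable; in the second they never touch each other's state, which is why that style carries over to machines that have no memory in common.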
types of parallelism in programming models:
task parallelism - different tasks run in parallel, over the same data or over different data
data parallelism - the same task over different data
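The difference between the two decompositions can be sketched as follows. This is a minimal illustration, not a reference implementation: a thread pool stands in for parallel workers, and the function names `task_parallel` and `data_parallel` as well as the chosen tasks (min, max, sum, squaring) are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def task_parallel(data):
    """Task parallelism: different tasks run concurrently over the same data."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


def data_parallel(data, n_chunks=4):
    """Data parallelism: the same task (squaring) runs over different chunks."""
    size = max(1, -(-len(data) // n_chunks))   # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    def square_all(chunk):
        return [x * x for x in chunk]

    with ThreadPoolExecutor() as pool:
        # map() keeps the chunk order, so the output order matches the input.
        results = list(pool.map(square_all, chunks))
    return [y for chunk in results for y in chunk]
```

For example, `task_parallel([3, 1, 2])` computes three different statistics of one list, while `data_parallel([1, 2, 3, 4, 5])` applies one operation to separate parts of the list. MapReduce follows the second pattern: the same map function is applied to different input splits.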