Data analytics refers to the ability to extract information from data. It has to cope with rapidly growing volumes of data as well as the increasing complexity of analysis questions and methods. These trends are no longer matched by performance improvements in individual processing units (CPU/GPU cores). Consequently, sequential processing of data on a single machine is no longer a viable option. Instead, systems for data analytics need to embrace parallel and distributed computation so that they can scale by increasing the number of processing units.
This lecture introduces models and methods for building systems for distributed data processing. This includes foundational aspects, ranging from data models through encoding and replication schemes to notions of consistency and consensus. At the same time, the lecture covers practical implementations of distributed data processing based on infrastructures such as Akka, Spark, Flink, and Kafka.