Are you a data science practitioner who primarily uses Python and/or R? Have you found yourself in situations where your data grew too big and your code failed with an out-of-memory error, or your data processing pipeline brought your machine to its limit? Have you attempted to scale up, only to eventually hit the same problems or run into new ones? If so, this workshop might be for you. We'll talk about scaling your analytics, and specifically about how to leverage Apache Spark to scale out beyond a single machine. We'll start with an overview of scaling options and the fundamentals of Apache Spark. After that, we'll explore a simple data processing pipeline in Spark and see how it compares to equivalent implementations in Python and R.
The workshop will focus on Spark's DataFrame API and primarily provide examples using Python/pyspark, but the concepts and considerations conveyed apply equally to R/SparkR/sparklyr. The workshop will not go into the specifics of Spark Structured Streaming, Spark's machine learning library (MLlib), or graph processing (GraphX).
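To give a flavour of what working with the DataFrame API looks like, here is a minimal, illustrative sketch (not taken from the workshop materials). It assumes a local Spark installation and a hypothetical CSV file `events.csv` with columns `user_id` and `amount`:

```python
# Minimal sketch, assuming a local Spark installation and a hypothetical
# CSV file "events.csv" with columns "user_id" and "amount".
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workshop-teaser").getOrCreate()

# Read the CSV into a distributed DataFrame, inferring column types.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple aggregation: total amount per user, largest first.
(df.groupBy("user_id")
   .agg(F.sum("amount").alias("total_amount"))
   .orderBy(F.desc("total_amount"))
   .show(10))

spark.stop()
```

The workshop will walk through pipelines of this kind and discuss when scaling out with Spark is worth the added complexity compared with staying in plain Python or R.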
Presentation: ~ 2.5h
ImpactHub Viadukt - Viaduktstrasse 93, 8005 Zürich
- 4:45 pm - Doors open
- 5:15 pm - Welcome / Start of workshop
- 7:45 pm - End of workshop / closing remarks
- 7:45 - 9:00 pm - Apéro at the bar
Basic knowledge of Python and/or R is highly recommended. No prior knowledge of Apache Spark or its language APIs is needed.
Workshop participants are not required to bring their laptops.