Scale Your Analytics - Leveraging Apache Spark in Python and R - 17th September 2024

Audience

Are you a data science practitioner who primarily uses Python and/or R? Has your data ever grown so large that your code failed with an out-of-memory error, or your data processing pipeline pushed your machine to its limit? Did you try to scale up, only to face the same problems again or run into new ones? If so, this workshop might be for you. We'll talk about scaling your analytics, and specifically about how to leverage Apache Spark to scale out beyond a single machine. We'll start with an overview of scaling options and the fundamentals of Apache Spark. After that, we'll explore a simple data processing pipeline in Spark and see how it compares to equivalent implementations in Python and R.

The workshop will focus on Spark's DataFrame API and primarily provide examples using Python/PySpark, but the concepts and considerations conveyed apply equally to R/SparkR/sparklyr. The workshop will not go into the specifics of Spark Structured Streaming, Spark's machine learning library (MLlib), or graph processing (GraphX).

Duration

Presentation: ~2.5 h

Location

ImpactHub Viadukt - Viaduktstrasse 93, 8005 Zürich

Schedule

  • 4:45 pm - Doors open
  • 5:15 pm - Welcome / Start of workshop
  • 7:45 pm - End of workshop / closing remarks
  • 7:45 - 9:00 pm - Apéro at the bar

Prerequisites

Basic knowledge of Python and/or R is highly recommended. No prior knowledge of Apache Spark and the corresponding language APIs is needed.

Workshop participants are not required to bring their laptops.

References