These are the slides and notebook for the workshop *The magic of PySpark*, given at PyDay Barcelona 2019.
You can find the slides here (some images might look slightly blurry). I recommend you check the version with presenter notes.
This presentation is formatted in Markdown and prepared to be used with Deckset. The drawings were done on an iPad Pro using Procreate. Only the final PDF and the source Markdown are available here. Sadly, the animated GIFs appear as static images in the PDF.
You can run the notebook in Binder.
Note: the Arrow optimisation does not work in Binder. I'll try to fix it, but it won't be ready for the workshop. Check the output notebook to see the impact of this optimisation (a sketch of how to toggle it locally follows below).
Or just read it on GitHub by clicking here (no outputs) or here (with outputs).
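If you run the notebook locally, this is a minimal sketch of how to turn that optimisation on, assuming a plain local SparkSession (the configuration key shown is the one Spark 2.3/2.4 uses; Spark 3 renamed it to `spark.sql.execution.arrow.pyspark.enabled`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Spark 2.3/2.4 configuration key; Spark 3 renamed it to
# spark.sql.execution.arrow.pyspark.enabled
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# With Arrow enabled (and pyarrow installed), toPandas() uses a
# columnar transfer instead of moving rows one by one
pdf = spark.range(100000).toPandas()
```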
- Thanks to FiveThirtyEight for making their datasets available.
- The Binder start scripts are based on Hyukjin Kwon's for his Spark Summit Europe 2019 talk
- Some of the slide images and explanations come from previous talks: *Internals of Speeding Up PySpark with Arrow* (Spark Summit Europe 2019) and *Welcome to Apache Spark* (SoCraTesUK 2018, with Carlos Peña)
- PyBCN for the event and Affectv for supporting it
To take full advantage of the workshop without using Binder (running locally) you'll need:
- PySpark installed (anything more recent than 2.3 should be fine)
- Jupyter installed
- Pandas and Arrow installed
- All able to talk to each other
- One or more datasets
The TL;DR if you don't want to use Docker should just be:
```
pip install pyarrow pandas pyspark numpy jupyter
```
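Once that finishes, a quick smoke test along these lines (my own suggestion, using only standard PySpark/pandas APIs) should confirm the pieces can talk to each other:

```python
import pandas as pd
import pyarrow
import pyspark
from pyspark.sql import SparkSession

# Confirm the versions: PySpark should be 2.3 or newer
print(pyspark.__version__, pd.__version__, pyarrow.__version__)

spark = SparkSession.builder.master("local[*]").getOrCreate()
# A trivial round trip: pandas -> Spark -> pandas
print(spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3]})).toPandas())
spark.stop()
```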
You can install PySpark using `pip install pyspark`; installing it in the same environment where you have Jupyter installed should make them talk to each other just fine. You should also run `pip install pyarrow`, although if this one fails for some reason it's not a big problem. To make analysis more entertaining, also run `pip install pandas`, again, all in the same environment. You can also install these with conda, via `conda install -c conda-forge pyspark`, although it might be more convenient to use `pip` (`pyspark` can easily get confused when there are several Python environments around).
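If `pyspark` does end up picking the wrong interpreter, here is a sketch of one way to pin it, using PySpark's standard `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables (the interpreter name is an assumption about your setup):

```python
import os

# Pin the interpreter PySpark uses, both for the driver and the
# workers, before creating any SparkSession. Adjust "python3" to
# the interpreter of the environment where you installed everything.
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"
```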
If you are familiar enough with Docker, I recommend using a Docker container instead.
Run this before the workshop:
```
docker pull rberenguel/pyspark_workshop
```
During the workshop (or before) you can use this Docker container with

```
docker run --name pyspark_workshop -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v "$PWD":/home/jovyan rberenguel/pyspark_workshop
```
in the folder where you want to create your Jupyter notebooks. To open it in your browser, run

```
docker logs pyspark_workshop
```

and open the URL provided in the logs (it should look like `http://127.0.0.1:8888/?token=36a20c93f0ee8cab4699e2460261e3b16787a68fbb034aee`).
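The container keeps running in the background; when you are done, `docker stop pyspark_workshop` will stop it and `docker start pyspark_workshop` will bring it back later (these are plain Docker commands, nothing workshop-specific).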
This container installs Arrow on top of the usual `jupyter/pyspark` image, to allow for some additional optimisations in Spark.
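As a taste of what those optimisations enable, here is a minimal sketch of an Arrow-backed (vectorised) pandas UDF, available since PySpark 2.3; the data and column names are made up for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A scalar pandas UDF: it receives and returns whole pandas Series,
# shipped between the JVM and Python via Arrow instead of row by row
@pandas_udf("double")
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

df = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
df.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()
```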