PySpark ML Crashcourse

This repository contains exercises and solutions for a one-day crash course on PySpark and Spark ML. The repository contains only Jupyter notebooks, which assume a working PySpark kernel with Python 3.5 and Spark 2.1.

Author

All notebooks have been created by Kaya Kupferschmidt @ dimajix. In case you have any questions, feel free to contact me at k.kupferschmidt@dimajix.de

01 - PySpark DataFrame Introduction

This notebook contains some simple snippets to get a basic understanding of how to interact with Spark DataFrames in Python.

02 - From Pandas to Spark (skeleton + solution)

These notebooks provide some examples of the differences between Pandas and Spark at the API level.

03 - Weather Analysis Exercise (exercise + solution)

A small exercise using some more data for a simple weather analysis.

04 - Pandas UDF (skeleton + solution)

An introduction to the various types of vectorized Pandas UDFs.

05 - Grouped Regression (exercise + solution)

A non-trivial example of using Pandas UDFs.

06 - House Prices (skeleton + solution)

These notebooks contain a simple linear regression exercise as an introduction to machine learning with Spark.

07 - House Prices (exercise + solution)

These notebooks build on the previous ones but add more structure by using Spark ML pipelines.

08 - Text Classification (exercise + solution)

Following the simple linear regression, these notebooks contain an exercise in simple statistical text classification.

09 - Hyper Parameter Tuning (exercise + solution)

Like many complex algorithms and ML pipelines, the text classification has many hyper parameters. These notebooks show how to tune them with PySpark.
