PySpark ML Crashcourse

This repository contains exercises and solutions for a one-day crash course on PySpark and Spark ML. It contains only Jupyter notebooks, which assume a working PySpark kernel with Python 3.5 and Spark 2.1.

Author

All notebooks have been created by Kaya Kupferschmidt @ dimajix. In case you have any questions, feel free to contact me at k.kupferschmidt@dimajix.de.

01 - PySpark DataFrame Introduction

This notebook contains some simple snippets that give a basic understanding of how to interact with Spark DataFrames in Python.

02 - From Pandas to Spark (skeleton + solution)

These notebooks provide some examples of the differences between Pandas and Spark at the API level.

03 - Weather Analysis Exercise (exercise + solution)

A small exercise that uses more data to perform a simple weather analysis.

04 - Pandas UDF (skeleton + solution)

An introduction to the various types of vectorized Pandas UDFs.

05 - Grouped Regression (exercise + solution)

A non-trivial example of using Pandas UDFs.
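
To give a flavour of what a grouped regression with Pandas UDFs can look like, here is a minimal sketch. It is not taken from the notebooks: the column names `station`, `x` and `y` and all function names are hypothetical, and it assumes a Spark 3.x kernel, where `GroupedData.applyInPandas` is available (newer than the Spark version named above).

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Raises ZeroDivisionError if all x values are identical.
    """
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_group(pdf):
    """Called by Spark once per group; `pdf` is a pandas DataFrame
    holding all rows of that group."""
    import pandas as pd  # lazy import so fit_line works without pandas

    slope, intercept = fit_line(pdf["x"].tolist(), pdf["y"].tolist())
    return pd.DataFrame(
        {
            "station": [pdf["station"].iloc[0]],
            "slope": [slope],
            "intercept": [intercept],
        }
    )


def fit_all_groups(df):
    """`df` is a Spark DataFrame with the hypothetical columns
    station (string), x (double) and y (double)."""
    return df.groupBy("station").applyInPandas(
        fit_group, schema="station string, slope double, intercept double"
    )
```

Calling `fit_all_groups(df).show()` would then print one fitted line per `station`. The regression itself is kept in plain Python so that it can be checked without a running Spark session.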

06 - House Prices (skeleton + solution)

These notebooks contain a simple linear regression exercise as an introduction to machine learning with Spark.

07 - House Prices (exercise + solution)

These notebooks build on the previous ones, but add more structure by using Spark ML pipelines.

08 - Text Classification (exercise + solution)

Following the simple linear regression, these notebooks contain an exercise on simple statistical text classification.

09 - Hyper Parameter Tuning (exercise + solution)

As with many complex algorithms and ML pipelines, the text classification has many hyper parameters. These notebooks show how to perform hyper parameter tuning with PySpark.
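
As a rough illustration of grid search in PySpark, the sketch below wraps a pipeline in a `CrossValidator`. It is not the notebooks' solution: it assumes the text classification pipeline contains a `HashingTF` and a `LogisticRegression` stage with a binary label, and the candidate values and function names are made up for this example.

```python
from math import prod

# Hypothetical candidate values; the notebooks may tune other parameters.
NUM_FEATURES = [1 << 10, 1 << 14]
REG_PARAMS = [0.01, 0.1]


def cv_fit_count(value_lists, num_folds):
    """Number of model fits k-fold cross-validation runs over a full grid
    (the final refit of the best model is not counted)."""
    return prod(len(values) for values in value_lists) * num_folds


def build_tuner(pipeline, hashing_tf, lr, num_folds=3):
    """Wrap `pipeline` in a CrossValidator that searches the grid above.

    `hashing_tf` and `lr` must be the HashingTF and LogisticRegression
    stage objects contained in `pipeline`.
    """
    # Imported lazily so cv_fit_count works without a Spark installation.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    grid = (
        ParamGridBuilder()
        .addGrid(hashing_tf.numFeatures, NUM_FEATURES)
        .addGrid(lr.regParam, REG_PARAMS)
        .build()
    )
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(),  # area under ROC by default
        numFolds=num_folds,
    )
```

With a training DataFrame `train_df`, `build_tuner(pipeline, hashing_tf, lr).fit(train_df).bestModel` would return the pipeline fitted with the best parameter combination. The helper `cv_fit_count` shows why grids should stay small: two values for each of two parameters with three folds already means 12 fits.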