The materials in this repository accompany the session delivered on 12th May 2020.
- Install the Databricks CLI and configure it for access to your workspace.
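
  If you don't already have the CLI set up, a minimal sketch of this step looks as follows (assuming the pip-distributed `databricks-cli` package; `databricks configure --token` will prompt for your workspace URL and a personal access token):

  ```
  pip install databricks-cli
  databricks configure --token
  ```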
- Clone the repository to your local environment and navigate to that folder on the command line.
- Create the demo cluster using the command

  ```
  databricks clusters create --json-file cluster-spec.json
  ```
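
  For orientation, a spec file for this command takes roughly the following shape. The values below are illustrative assumptions only (an Azure node type and a Scala 2.11 ML runtime to match the mmlspark build); the `cluster-spec.json` shipped with this repository is the authoritative version:

  ```json
  {
    "cluster_name": "datasci-overview-demo",
    "spark_version": "6.5.x-cpu-ml-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  }
  ```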
- Upload the mmlspark Python library to the cluster:

  ```
  databricks fs cp -r ./libraries dbfs:/FileStore/jars/datasci-overview
  ```
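
  You can check that the copy succeeded by listing the target directory:

  ```
  databricks fs ls dbfs:/FileStore/jars/datasci-overview
  ```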
- Install all of the libraries on your cluster using the cluster UI:

  | library name (or Maven coordinates) | version | type | repository / location |
  | --- | --- | --- | --- |
  | fbprophet | latest | PyPI | default |
  | joblibspark | latest | PyPI | default |
  | plotly | latest | PyPI | default |
  | scikit-learn==0.21.3 | 0.21.3 | PyPI | default |
  | petastorm==0.7.2 | 0.7.2 | PyPI | default |
  | com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 | 1.0.0-rc1 | Maven | https://mmlspark.azureedge.net/maven |
  | mmlspark-0.17 | 0.17 | DBFS (Python whl) | dbfs:/FileStore/jars/datasci-overview/mmlspark-0.17-py2.py3-none-any.whl |
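
  If you prefer to script this step, the CLI's `libraries` group can perform the same installs. A sketch, using a placeholder `<cluster-id>` (find yours with `databricks clusters list`); the UI route above is the one shown in the session:

  ```
  databricks libraries install --cluster-id <cluster-id> --pypi-package fbprophet
  databricks libraries install --cluster-id <cluster-id> \
    --maven-coordinates com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
    --maven-repo https://mmlspark.azureedge.net/maven
  databricks libraries install --cluster-id <cluster-id> \
    --whl dbfs:/FileStore/jars/datasci-overview/mmlspark-0.17-py2.py3-none-any.whl
  ```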
- Upload the repository to your workspace using e.g.

  ```
  databricks workspace import_dir . /Users/email@company.com/datasci-overview
  ```

  where `email@company.com` is the address you use to log into your workspace.
- Attach a notebook to your newly created cluster and try running it.

The notebooks in the repository cover the following ground:

- A first example of accessing data, then training and evaluating a simple model in a Databricks notebook.
- Building on the first notebook, how to log parameters, metrics, models etc. to the MLflow tracking server (see the MLflow sketch after this list).
- Different ways of using MLflow to deploy the models trained in the previous notebook.
- A reset of some of the MLflow components (for demo purposes).
- Using pandas_udf to create forecasts for subgroups within a large dataset (see the grouped-forecast sketch below).
- A simple example of hyperparameter tuning in a single-node or cluster environment using the Hyperopt package (see the Hyperopt sketch below).
- Data-distributed training of a statistical machine learning model using the Microsoft LightGBM implementation.
- Data-distributed training of a deep neural network by pairing Keras and HorovodRunner (see the HorovodRunner sketch below).
- Parallel model selection on scikit-learn models using the joblibspark backend (see the joblibspark sketch below).
- Batch-at-a-time data loading for training neural networks on datasets whose size exceeds available memory.
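
As a taste of the MLflow tracking material, a minimal logging pattern looks like the following. This is a sketch with stand-in data and hyperparameters, not the notebook's actual code:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100  # stand-in hyperparameter
    model = RandomForestRegressor(n_estimators=n_estimators).fit(X_train, y_train)

    # Parameters, metrics and the fitted model are all recorded against the same run
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```

For deployment, a model logged this way can be reloaded generically, e.g. with `mlflow.pyfunc.load_model("runs:/<run-id>/model")`, and used for batch scoring.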
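The grouped-forecast sketch shows the structure of the pandas_udf approach, assuming the Spark 2.4-era GROUPED_MAP API; toy data and a trivial straight-line fit stand in for a real forecasting model such as fbprophet:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Toy data: one short time series per store, standing in for a large dataset
sdf = spark.createDataFrame(pd.DataFrame({
    "store": ["a"] * 10 + ["b"] * 10,
    "t": list(range(10)) * 2,
    "y": [float(i) for i in range(10)] + [2.0 * i for i in range(10)],
}))

@pandas_udf("store string, t long, yhat double", PandasUDFType.GROUPED_MAP)
def forecast(pdf):
    # Each subgroup arrives as an ordinary pandas DataFrame, so any
    # single-node model can be fitted here, once per group.
    slope, intercept = np.polyfit(pdf["t"], pdf["y"], 1)
    return pd.DataFrame({
        "store": pdf["store"],
        "t": pdf["t"],
        "yhat": slope * pdf["t"] + intercept,
    })

# Spark partitions the data by store and applies the function to each subgroup
sdf.groupby("store").apply(forecast).show()
```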
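The Hyperopt sketch minimises a toy objective with TPE; the objective is a stand-in for a real train-and-validate cycle:

```python
from hyperopt import Trials, fmin, hp, tpe

def objective(x):
    # Stand-in for training a model with hyperparameter x and returning a loss
    return (x - 3.0) ** 2

best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=50,
    trials=Trials(),  # on a cluster, SparkTrials() runs trials on Spark workers
)
print(best)  # something close to {'x': 3.0}
```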
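The HorovodRunner sketch shows the skeleton of distributing a Keras training function, assuming a Databricks ML runtime (which provides `sparkdl.HorovodRunner` and Horovod); random data stands in for the notebook's dataset:

```python
def train_fn():
    import horovod.tensorflow.keras as hvd
    import numpy as np
    import tensorflow as tf

    hvd.init()  # one Horovod process per allotted task slot

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    # Wrap the optimizer so gradients are averaged across processes,
    # scaling the learning rate with the number of workers
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")

    X = np.random.rand(256, 4)
    y = np.random.rand(256)
    model.fit(
        X, y, epochs=2,
        # Start all workers from identical weights
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=2 if hvd.rank() == 0 else 0,
    )

from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)  # np=2: train on two workers; a negative np runs locally
hr.run(train_fn)
```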
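Finally, the joblibspark sketch: registering Spark as a joblib backend so that an ordinary scikit-learn grid search fans its candidate fits out to the cluster:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.utils import parallel_backend

from joblibspark import register_spark

register_spark()  # makes "spark" available as a joblib backend

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,
)

# Each candidate/fold fit becomes a Spark task instead of a local process
with parallel_backend("spark", n_jobs=3):
    search.fit(X, y)

print(search.best_params_)
```
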
Slides for the session are located here and should be accessible immediately after the session.
Please contact me at stuart@databricks.com.