The materials in this repository accompany the session delivered on 12th May 2020.
- Install the Databricks CLI and configure it for access to your workspace.
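
  If you don't already have the CLI set up, a minimal sketch of this step looks as follows (assuming the pip-distributed `databricks-cli` package; `databricks configure --token` will prompt for your workspace URL and a personal access token):

  ```
  pip install databricks-cli
  databricks configure --token
  ```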
- Clone the repository to your local environment and navigate to that folder on the command line.
- Create the demo cluster using the command

  ```
  databricks clusters create --json-file cluster-spec.json
  ```
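
  For orientation, a spec file for this command takes roughly the following shape. The values below are illustrative assumptions only (an Azure node type and a Scala 2.11 ML runtime to match the mmlspark build); the `cluster-spec.json` shipped with this repository is the authoritative version:

  ```json
  {
    "cluster_name": "datasci-overview-demo",
    "spark_version": "6.5.x-cpu-ml-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  }
  ```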
- Upload the mmlspark Python library to the cluster:

  ```
  databricks fs cp -r ./libraries dbfs:/FileStore/jars/datasci-overview
  ```
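
  You can check that the copy succeeded by listing the target directory:

  ```
  databricks fs ls dbfs:/FileStore/jars/datasci-overview
  ```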
- Install all of the libraries on your cluster using the cluster UI:

  | library name (or Maven coordinates) | version | type | repository / location |
  | --- | --- | --- | --- |
  | fbprophet | latest | PyPI | default |
  | joblibspark | latest | PyPI | default |
  | plotly | latest | PyPI | default |
  | scikit-learn==0.21.3 | 0.21.3 | PyPI | default |
  | petastorm==0.7.2 | 0.7.2 | PyPI | default |
  | com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 | 1.0.0-rc1 | Maven | https://mmlspark.azureedge.net/maven |
  | mmlspark-0.17 | 0.17 | DBFS (Python whl) | dbfs:/FileStore/jars/datasci-overview/mmlspark-0.17-py2.py3-none-any.whl |
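
  If you prefer to script this step, the CLI's `libraries` group can perform the same installs. A sketch, using a placeholder `<cluster-id>` (find yours with `databricks clusters list`); the UI route above is the one shown in the session:

  ```
  databricks libraries install --cluster-id <cluster-id> --pypi-package fbprophet
  databricks libraries install --cluster-id <cluster-id> \
    --maven-coordinates com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
    --maven-repo https://mmlspark.azureedge.net/maven
  databricks libraries install --cluster-id <cluster-id> \
    --whl dbfs:/FileStore/jars/datasci-overview/mmlspark-0.17-py2.py3-none-any.whl
  ```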
- Upload the repository to your workspace using e.g.

  ```
  databricks workspace import_dir . /Users/email@company.com/datasci-overview
  ```

  where `email@company.com` is the address you use to log into your workspace.
- Attach a notebook to your newly created cluster and try running it.

The notebooks in the repository cover the following ground:

- A first example of accessing data, then training and evaluating a simple model in a Databricks notebook.
- Building on the first notebook, how to log parameters, metrics, models etc. to the MLflow tracking server (see the MLflow sketch after this list).
- Different ways of using MLflow to deploy the models trained in the previous notebook.
- A reset of some of the MLflow components (for demo purposes).
- Using pandas_udf to create forecasts for subgroups within a large dataset (see the grouped-forecast sketch below).
- A simple example of hyperparameter tuning in a single-node or cluster environment using the Hyperopt package (see the Hyperopt sketch below).
- Data-distributed training of a statistical machine learning model using the Microsoft LightGBM implementation.
- Data-distributed training of a deep neural network by pairing Keras and HorovodRunner (see the HorovodRunner sketch below).
- Parallel model selection on scikit-learn models using the joblibspark backend (see the joblibspark sketch below).
- Batch-at-a-time data loading for training neural networks on datasets whose size exceeds available memory.
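
As a taste of the MLflow tracking material, a minimal logging pattern looks like the following. This is a sketch with stand-in data and hyperparameters, not the notebook's actual code:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100  # stand-in hyperparameter
    model = RandomForestRegressor(n_estimators=n_estimators).fit(X_train, y_train)

    # Parameters, metrics and the fitted model are all recorded against the same run
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```

For deployment, a model logged this way can be reloaded generically, e.g. with `mlflow.pyfunc.load_model("runs:/<run-id>/model")`, and used for batch scoring.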
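The grouped-forecast sketch shows the structure of the pandas_udf approach, assuming the Spark 2.4-era GROUPED_MAP API; toy data and a trivial straight-line fit stand in for a real forecasting model such as fbprophet:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Toy data: one short time series per store, standing in for a large dataset
sdf = spark.createDataFrame(pd.DataFrame({
    "store": ["a"] * 10 + ["b"] * 10,
    "t": list(range(10)) * 2,
    "y": [float(i) for i in range(10)] + [2.0 * i for i in range(10)],
}))

@pandas_udf("store string, t long, yhat double", PandasUDFType.GROUPED_MAP)
def forecast(pdf):
    # Each subgroup arrives as an ordinary pandas DataFrame, so any
    # single-node model can be fitted here, once per group.
    slope, intercept = np.polyfit(pdf["t"], pdf["y"], 1)
    return pd.DataFrame({
        "store": pdf["store"],
        "t": pdf["t"],
        "yhat": slope * pdf["t"] + intercept,
    })

# Spark partitions the data by store and applies the function to each subgroup
sdf.groupby("store").apply(forecast).show()
```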
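The Hyperopt sketch minimises a toy objective with TPE; the objective is a stand-in for a real train-and-validate cycle:

```python
from hyperopt import Trials, fmin, hp, tpe

def objective(x):
    # Stand-in for training a model with hyperparameter x and returning a loss
    return (x - 3.0) ** 2

best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=50,
    trials=Trials(),  # on a cluster, SparkTrials() runs trials on Spark workers
)
print(best)  # something close to {'x': 3.0}
```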
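The HorovodRunner sketch shows the skeleton of distributing a Keras training function, assuming a Databricks ML runtime (which provides `sparkdl.HorovodRunner` and Horovod); random data stands in for the notebook's dataset:

```python
def train_fn():
    import horovod.tensorflow.keras as hvd
    import numpy as np
    import tensorflow as tf

    hvd.init()  # one Horovod process per allotted task slot

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    # Wrap the optimizer so gradients are averaged across processes,
    # scaling the learning rate with the number of workers
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")

    X = np.random.rand(256, 4)
    y = np.random.rand(256)
    model.fit(
        X, y, epochs=2,
        # Start all workers from identical weights
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=2 if hvd.rank() == 0 else 0,
    )

from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)  # np=2: train on two workers; a negative np runs locally
hr.run(train_fn)
```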
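Finally, the joblibspark sketch: registering Spark as a joblib backend so that an ordinary scikit-learn grid search fans its candidate fits out to the cluster:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.utils import parallel_backend

from joblibspark import register_spark

register_spark()  # makes "spark" available as a joblib backend

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,
)

# Each candidate/fold fit becomes a Spark task instead of a local process
with parallel_backend("spark", n_jobs=3):
    search.fit(X, y)

print(search.best_params_)
```
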
Slides for the session are located here and should be accessible immediately after the session.
Please contact me at stuart@databricks.com.