Add documentation for deploying packaged Kedro projects on Databricks #2595

Merged: 72 commits from docs/add-databricks-deployment-workflow into main, Jun 1, 2023

Commits (72)
cf03914 Add deployment workflow page (jmholzer, May 3, 2023)
57e1d55 Add table of contents, entry point guide, data and conf upload guide (jmholzer, May 16, 2023)
66fd862 Add detailed instructions for creating a job on Databricks (jmholzer, May 18, 2023)
2b21797 Add images and automated deployment resources (jmholzer, May 20, 2023)
ea1142a Merge branch 'main' into docs/add-databricks-deployment-workflow (jmholzer, May 20, 2023)
5c76bc5 Remove use of 'allows', add summary (jmholzer, May 22, 2023)
db5c404 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 22, 2023)
0f20bb7 Remove link to missing image (jmholzer, May 22, 2023)
1baa5e2 Add deployment workflow to toctree (jmholzer, May 22, 2023)
ed73514 Lint and fix missing link (jmholzer, May 22, 2023)
6057994 Minor style, syntax and grammar improvements (jmholzer, May 22, 2023)
5eedb2b Merge branch 'main' into docs/add-databricks-deployment-workflow (jmholzer, May 22, 2023)
3d22bd6 Fixes for correctness during validation (jmholzer, May 23, 2023)
384aebb Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 23, 2023)
4d27dad Add instructions for creating log output location (jmholzer, May 23, 2023)
8b246d1 Lint (jmholzer, May 23, 2023)
25da49d Lint databricks_run (jmholzer, May 23, 2023)
c384f62 Merge branch 'main' into docs/add-databricks-deployment-workflow (jmholzer, May 23, 2023)
c83107b Minor wording change in reference to logs (jmholzer, May 23, 2023)
df5e743 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 23, 2023)
cfbe80b Modify reference to Pyspark-Iris (jmholzer, May 23, 2023)
7e5406a Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
fdfe925 Fix linter errors to enable docs build for inspection (stichbury, May 24, 2023)
3ac02cc Update build-docs.sh (stichbury, May 24, 2023)
04ca9c0 Fix broken link (stichbury, May 24, 2023)
93997e2 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
74ab283 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
caa9ecc Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
9901b40 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
dd9b4a0 Remove spurious word (jmholzer, May 24, 2023)
bd2ce14 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
f5cd5be Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
f3cf261 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
edbee92 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
f67d7f4 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
5c60a4b Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
1b45757 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
a7aacb5 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
a11624c Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
40c6e62 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
f37a8e8 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
dc8c9ed Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
28c7b87 Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
f392625 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 24, 2023)
1e9a01d Add advantages subheading (jmholzer, May 24, 2023)
c2b406e Update docs/source/integrations/databricks_deployment_workflow.md (jmholzer, May 24, 2023)
ee58551 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 24, 2023)
0407991 Add alternative ways to upload data to DBFS (jmholzer, May 24, 2023)
2149bfd Move note on unpackaged config and data (jmholzer, May 24, 2023)
672c3d0 Fix broken links (jmholzer, May 24, 2023)
d939fba Move databricks back into deployment section (stichbury, May 25, 2023)
89760f1 Remove references to PySpark Iris (pyspark-iris) starter (jmholzer, May 25, 2023)
adbca32 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 25, 2023)
a7dc9e8 Graphics links fixes, revise titles (stichbury, May 25, 2023)
2a43dca Merge branch 'docs/add-databricks-deployment-workflow' of https://git… (stichbury, May 25, 2023)
acae7c3 Fix broken internal link (stichbury, May 25, 2023)
21a1097 Merge branch 'main' into docs/add-databricks-deployment-workflow (stichbury, May 25, 2023)
b24c062 Fix links broken by new folder (stichbury, May 25, 2023)
ba0c184 Remove logs directory (jmholzer, May 26, 2023)
10e7010 Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 26, 2023)
65ccd68 Update image of final job configuration (jmholzer, May 26, 2023)
e88612b Add full stops in list. (jmholzer, May 30, 2023)
085e9e5 Fix conda environment name. (jmholzer, May 30, 2023)
d8bd527 Modify wording and image for creating a new job cluster (jmholzer, May 30, 2023)
1777daa Modify wording in guide to create new job cluster (jmholzer, May 30, 2023)
62a4b9a Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 30, 2023)
1af24e4 Remove --upgrade option (jmholzer, May 30, 2023)
7be8166 Add both ways of creating a new job (jmholzer, May 30, 2023)
593e418 Merge branch 'main' into docs/add-databricks-deployment-workflow (jmholzer, May 30, 2023)
0b3eb4e Merge branch 'docs/add-databricks-deployment-workflow' of github.com:… (jmholzer, May 30, 2023)
6454959 Merge branch 'main' into docs/add-databricks-deployment-workflow (jmholzer, May 31, 2023)
221a77a Merge branch 'main' into docs/add-databricks-deployment-workflow (astrojuanlu, Jun 1, 2023)
2 changes: 1 addition & 1 deletion docs/source/contribution/development_for_databricks.md
@@ -5,7 +5,7 @@ Many Kedro users deploy their projects to [Databricks](https://www.databricks.co
## How to deploy a development version of Kedro to Databricks

```{note}
-This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation for developing a Kedro project on Databricks](../integrations/databricks_workspace.md).
+This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation pages for developing a Kedro project on Databricks](../deployment/databricks/index.md).
```

## Prerequisites
4 changes: 0 additions & 4 deletions docs/source/deployment/databricks.md

This file was deleted.

311 changes: 311 additions & 0 deletions docs/source/deployment/databricks/databricks_deployment_workflow.md

Large diffs are not rendered by default.

@@ -33,7 +33,7 @@ Note your Databricks **username** and **host** as you will need them for the remai

Find your Databricks username in the top right of the workspace UI and the host in the browser's URL bar, up to the first slash (e.g., `https://adb-123456789123456.1.azuredatabricks.net/`):

-![Find Databricks host and username](../meta/images/find_databricks_host_and_username.png)
+![Find Databricks host and username](../../meta/images/find_databricks_host_and_username.png)

```{note}
Your Databricks host must include the protocol (`https://`).
@@ -90,7 +90,7 @@ Create a new repo on Databricks by navigating to the `New` tab in the Databricks wor

In this guide, you will not sync your project with a remote Git provider, so uncheck `Create repo by cloning a Git repository` and enter `iris-databricks` as the name of your new repository:

-![Create a new repo on Databricks](../meta/images/databricks_repo_creation.png)
+![Create a new repo on Databricks](../../meta/images/databricks_repo_creation.png)

### Sync code with your Databricks repo using dbx

@@ -128,15 +128,15 @@ Kedro requires your project to have a `conf/local` directory to exist to success

Open the Databricks workspace UI and, using the panel on the left, navigate to `Repos -> <databricks_username> -> iris-databricks -> conf`, then right-click and select `Create -> Folder` as in the image below:

-![Create a conf folder in Databricks repo](../meta/images/databricks_conf_folder_creation.png)
+![Create a conf folder in Databricks repo](../../meta/images/databricks_conf_folder_creation.png)

Name the new folder `local`. In this guide, we have no local credentials to store, so we will leave the newly created folder empty. Your `conf/local` and `local` directories should now look like the following:

-![Final conf folder](../meta/images/final_conf_folder.png)
+![Final conf folder](../../meta/images/final_conf_folder.png)

### Upload project data to DBFS

-When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../configuration/configuration_basics.md#configuration-environments).
+When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../../configuration/configuration_basics.md#configuration-environments).

There are several ways to upload data to DBFS. In this guide, we recommend the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html) because of the convenience it offers. At the command line in your local environment, use the following Databricks CLI command to upload your locally stored data to DBFS:
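The exact CLI command sits in the collapsed portion of this diff. As an illustration of one of the other "several ways", a single file can also be uploaded through the documented DBFS REST API endpoint `/api/2.0/dbfs/put`; the sketch below uses placeholder host, token and paths, not values taken from this PR:

```python
# Hypothetical sketch: upload one local file to DBFS via the REST API.
# HOST, TOKEN and both paths are placeholders, not values from this PR.
import base64
import requests

HOST = "https://<databricks-host>"
TOKEN = "<personal-access-token>"


def dbfs_put(local_path: str, dbfs_path: str) -> None:
    """Upload a local file to DBFS, overwriting any existing file.

    Note: the JSON form of /api/2.0/dbfs/put is limited to roughly 1 MB;
    larger files need the create / add-block / close streaming endpoints.
    """
    with open(local_path, "rb") as f:
        contents = base64.b64encode(f.read()).decode("utf-8")
    response = requests.post(
        f"{HOST}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"path": dbfs_path, "contents": contents, "overwrite": True},
    )
    response.raise_for_status()


dbfs_put(
    "data/01_raw/iris.csv",
    "/FileStore/iris-databricks/data/01_raw/iris.csv",
)
```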

@@ -169,7 +169,7 @@ Now that your project is available on Databricks, you can run it on a cluster us

To run the Python code from your Databricks repo, [create a new Python notebook](https://docs.databricks.com/notebooks/notebooks-manage.html#create-a-notebook) in your workspace. Name it `iris-databricks` for traceability and attach it to your cluster:

-![Create a new notebook on Databricks](../meta/images/databricks_notebook_creation.png)
+![Create a new notebook on Databricks](../../meta/images/databricks_notebook_creation.png)

### Run your project

@@ -201,15 +201,15 @@ session.run()
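The notebook code itself sits in the collapsed hunk; only the trailing `session.run()` is visible in the hunk header. A hedged reconstruction of such a cell, with an assumed Repos path and the `databricks` environment mentioned earlier in the guide, might be:

```python
# Sketch of a notebook cell that runs the project (path and env are assumptions).
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_root = Path("/Workspace/Repos/<databricks_username>/iris-databricks")
bootstrap_project(project_root)  # register the project's settings and pipelines

with KedroSession.create(project_path=project_root, env="databricks") as session:
    session.run()  # matches the visible tail of the collapsed hunk
```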

After completing these steps, your notebook should match the following image:

-![Databricks completed notebook](../meta/images/databricks_finished_notebook.png)
+![Databricks completed notebook](../../meta/images/databricks_finished_notebook.png)

Run the completed notebook using the `Run All` button in the top right of the UI:

-![Databricks notebook run all](../meta/images/databricks_run_all.png)
+![Databricks notebook run all](../../meta/images/databricks_run_all.png)

On your first run, you will be prompted to consent to analytics; type `y` or `N` in the field that appears and press `Enter`:

-![Databricks notebook telemetry consent](../meta/images/databricks_telemetry_consent.png)
+![Databricks notebook telemetry consent](../../meta/images/databricks_telemetry_consent.png)

You should see logging output while the cell is running. After execution finishes, you should see output similar to the following:

@@ -1,6 +1,6 @@
# Visualise a Kedro project in Databricks notebooks

-[Kedro-Viz](../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs on a web browser, it can be run on a local machine or in Databricks notebooks.
+[Kedro-Viz](../../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs in a web browser; it can be run on a local machine or in Databricks notebooks.

For Kedro-Viz to run with your Kedro project, you need to ensure that both packages are installed in the same scope (notebook-scoped vs. cluster library). This means that if you `%pip install kedro` from inside your notebook, then you should also `%pip install kedro-viz` from inside your notebook.
If your cluster already comes with Kedro installed as a library, then you should also add Kedro-Viz as a [cluster library](https://docs.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries).
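A minimal notebook cell illustrating the same-scope rule (versions left unpinned here, as a sketch only):

```python
# Install both packages notebook-scoped so each can see the other.
%pip install kedro kedro-viz
```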
@@ -15,8 +15,8 @@ Kedro-Viz can then be launched in a new browser tab with the `%run_viz` line mag

This command presents you with a link to the Kedro-Viz web application.

-![databricks_viz_link](../meta/images/databricks_viz_link.png)
+![databricks_viz_link](../../meta/images/databricks_viz_link.png)

Clicking this link opens a new browser tab running Kedro-Viz for your project.

-![databricks_viz_demo](../meta/images/databricks_viz_demo.png)
+![databricks_viz_demo](../../meta/images/databricks_viz_demo.png)
@@ -1,13 +1,13 @@
-# Develop a project with Databricks Workspace and Notebooks
+# Databricks notebooks workflow

This tutorial uses the [PySpark Iris Kedro Starter](https://github.com/kedro-org/kedro-starters/tree/main/pyspark-iris) to illustrate how to bootstrap a Kedro project using Spark and deploy it to a [Databricks cluster on AWS](https://databricks.com/aws).

```{note}
-If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
+If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
```

```{note}
-If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../contribution/development_for_databricks.md).
+If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../../contribution/development_for_databricks.md).
```

## Prerequisites
@@ -144,11 +144,11 @@ The project has now been pushed to your private GitHub repository, and in order
3. Press `Edit`
4. Go to the `Advanced Options` and then `Spark`

-![](../meta/images/databricks_cluster_edit.png)
+![](../../meta/images/databricks_cluster_edit.png)

Then, in the `Environment Variables` section, add your `GITHUB_USER` and `GITHUB_TOKEN` as shown in the picture:

-![](../meta/images/databricks_cluster_env_vars.png)
+![](../../meta/images/databricks_cluster_env_vars.png)
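What the notebook does with these variables sits in the collapsed part of the diff. As a hedged illustration, a cell could read them back to build an authenticated URL for installing the project from the private repository; the org and repo names below are placeholders:

```python
import os

# Assumes GITHUB_USER and GITHUB_TOKEN were set as cluster environment variables,
# as in the screenshot above. A token should never be hard-coded in a notebook.
github_user = os.environ["GITHUB_USER"]
github_token = os.environ["GITHUB_TOKEN"]

# Example use: an authenticated URL for pip-installing a private package.
install_url = (
    f"git+https://{github_user}:{github_token}"
    "@github.com/<your-org>/<your-kedro-project>.git"
)
```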


```{note}
@@ -227,16 +227,16 @@ You should get a similar output:

Your complete notebook should look similar to this (the results are hidden):

-![](../meta/images/databricks_notebook_example.png)
+![](../../meta/images/databricks_notebook_example.png)


### 9. Using the Kedro IPython Extension

You can interact with Kedro in Databricks through the Kedro [IPython extension](https://ipython.readthedocs.io/en/stable/config/extensions/index.html), `kedro.ipython`.

-The Kedro IPython extension launches a [Kedro session](../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).
+The Kedro IPython extension launches a [Kedro session](../../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).

-The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../notebooks_and_ipython/kedro_and_notebooks.md).
+The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../../notebooks_and_ipython/kedro_and_notebooks.md).
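As an illustration, a first notebook cell might look like the following sketch; the path argument to `%reload_kedro` is an assumed Repos location, not something specified in this diff:

```python
# Load the extension, then point it at the project (path is a placeholder).
%load_ext kedro.ipython
%reload_kedro /Workspace/Repos/<databricks_username>/iris-databricks

# The extension injects `catalog`, `context`, `pipelines` and `session`:
catalog.list()  # datasets registered in the Data Catalog
session.run()   # run the project's default pipeline
```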

If you encounter a `ContextualVersionConflictError`, it is likely caused by Databricks using an old version of `pip`. There is therefore one additional step you need to take in the Databricks notebook to make use of the IPython extension. After you load the IPython extension using the command below:

11 changes: 11 additions & 0 deletions docs/source/deployment/databricks/index.md
@@ -0,0 +1,11 @@
# Databricks


```{toctree}
:maxdepth: 1

databricks_workspace.md
databricks_visualisation
databricks_development_workflow
databricks_deployment_workflow
```
4 changes: 2 additions & 2 deletions docs/source/deployment/index.md
@@ -30,7 +30,7 @@ The following pages provide information for deployment to, or integration with,
* [AWS Step functions](aws_step_functions.md)
* [Azure](azure.md)
* [Dask](dask.md)
-* [Databricks](../integrations/databricks_workspace.md)
+* [Databricks](./databricks/index.md)
* [Kubeflow Workflows](kubeflow.md)
* [Prefect](prefect.md)
* [Vertex AI](vertexai.md)
@@ -55,7 +55,7 @@ amazon_sagemaker
aws_step_functions
azure
dask
-databricks
+databricks/index
kubeflow
prefect
vertexai
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -125,7 +125,7 @@ Welcome to Kedro's documentation!
.. toctree::
:maxdepth: 2

-integrations/index.md
+integrations/pyspark_integration.md

.. toctree::
:maxdepth: 2
21 changes: 0 additions & 21 deletions docs/source/integrations/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/integrations/pyspark_integration.md
@@ -1,4 +1,4 @@
-# Build a Kedro pipeline with PySpark
+# PySpark integration

This page outlines some best practices when building a Kedro pipeline with [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html). It assumes a basic understanding of both Kedro and `PySpark`.
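The rest of this page is collapsed in the diff. For orientation, the pattern most commonly associated with it is centralising `SparkSession` creation in a project hook; the sketch below is an approximation under assumed Kedro 0.18-era APIs (a `spark.yaml` config file and the `after_context_created` hook), not code taken from this PR:

```python
# Approximate sketch of hook-based SparkSession initialisation (assumed APIs).
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from the project's spark config."""
        # Load Spark settings via the project's config loader (pattern assumed).
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Build one session shared by the whole run.
        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```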

[5 binary files (likely images) are not rendered in the diff view.]