Create a databricks-iris starter that enables packaged deployment on Databricks #129

Merged

Conversation

jmholzer
Contributor

@jmholzer jmholzer commented May 23, 2023

Motivation and Context

The guide on deploying packaged projects to Databricks proposed in kedro-org/kedro#2595 uses the databricks-iris starter. This PR adds that starter, which is a copy of the pyspark-iris starter with a few changes:

  • databricks_run.py: a module for running the project on Databricks, needed because Click prevents projects from running with the default entry point on Databricks.
  • Project logs are written directly to DBFS (conf/base/logging.yml).
  • All datasets in conf/base/catalog.yml are saved in /dbfs/FileStore.
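
As a rough illustration of the first bullet, an entry point that avoids Click could look something like the sketch below. This is not the code from this PR: the flag names (`--env`, `--conf-source`, `--package-name`) and the `parse_args` helper are assumptions made for the example. It uses `argparse` for the command line and then runs the project through `configure_project` and `KedroSession`.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the command-line options (hypothetical flag names, for illustration)."""
    parser = argparse.ArgumentParser(
        description="Run a packaged Kedro project without going through Click."
    )
    parser.add_argument("--env", dest="env", default=None)
    parser.add_argument("--conf-source", dest="conf_source", default=None)
    parser.add_argument("--package-name", dest="package_name", required=True)
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> None:
    """Configure the packaged project and run its default pipeline."""
    args = parse_args(argv)

    # Imported here so that argument parsing can be exercised
    # on a machine where Kedro is not installed.
    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession

    # Tell Kedro which installed package holds the project settings and pipelines.
    configure_project(args.package_name)

    # Create a session for the chosen environment and configuration folder, then run.
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()
```

With a script of this shape, the packaged project's entry point would reference `main`, and the job would pass the package name and configuration location as parameters.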

This PR has a large diff because it adds a brand new starter; only the following files differ from pyspark-iris:

  • {{ cookiecutter.repo_name }}/src/setup.py: contains an entry point definition databricks_run.
  • {{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/databricks_run.py: contains a script needed to run a packaged Kedro project on Databricks.
  • {{ cookiecutter.repo_name }}/src/conf/base/logging.yml: config for writing logs to DBFS.
  • {{ cookiecutter.repo_name }}/src/conf/base/catalog.yml: points to datasets on DBFS.
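
To make the logging change more concrete, the sketch below expresses the same idea in Python rather than YAML. It is not the content of the starter's logging.yml: the handler name, format string, rotation settings and the `build_logging_config` helper are assumptions for illustration. The mapping follows the standard `logging.config.dictConfig` schema, with a file handler whose `filename` points at a directory of your choice (on Databricks, that would be a path under `/dbfs`).

```python
import logging
import logging.config
from pathlib import Path


def build_logging_config(log_dir: str) -> dict:
    """Return a dictConfig mapping that writes INFO-level logs to <log_dir>/info.log."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "simple": {
                "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
            }
        },
        "handlers": {
            # Hypothetical handler name; the file location is the part that matters.
            "info_file_handler": {
                "class": "logging.handlers.RotatingFileHandler",
                "level": "INFO",
                "formatter": "simple",
                "filename": str(Path(log_dir) / "info.log"),
                "maxBytes": 10_485_760,  # rotate after roughly 10 MB
                "backupCount": 20,
                "encoding": "utf8",
                "delay": True,  # open the file only when the first record is written
            }
        },
        "root": {"level": "INFO", "handlers": ["info_file_handler"]},
    }
```

Calling `logging.config.dictConfig(build_logging_config(some_dir))` applies the configuration; the target directory must already exist when the first record is written.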

How has this been tested?

Tested manually on Databricks, following the new guide.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Assigned myself to the PR
  • Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer marked this pull request as ready for review May 23, 2023 21:03
@jmholzer jmholzer added the enhancement New feature or request label May 23, 2023
@jmholzer jmholzer self-assigned this May 23, 2023
@jmholzer jmholzer changed the title Modify the PySpark Iris starter to enable packaged deployment on Databricks Create a databricks-iris starter that enables packaged deployment on Databricks May 25, 2023
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Contributor

@noklam noklam left a comment

Left some comments, I haven't done any manual testing.

jmholzer and others added 3 commits May 26, 2023 15:21
….yml

Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
…' of github.com:kedro-org/kedro-starters into feat/modify-pyspark-iris-databricks-packaged-deployment

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer requested a review from noklam May 26, 2023 14:57
@astrojuanlu
Member

To test this:

kedro new --starter git+https://github.com/kedro-org/kedro-starters.git --directory databricks-iris --checkout feat/modify-pyspark-iris-databricks-packaged-deployment

Member

@astrojuanlu astrojuanlu left a comment

Left a few comments, see more at kedro-org/kedro#2595

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer
Contributor Author

To test this:

kedro new --starter git+https://github.com/kedro-org/kedro-starters.git --directory databricks-iris --checkout feat/modify-pyspark-iris-databricks-packaged-deployment

Thanks for figuring this out @astrojuanlu!

@jmholzer jmholzer requested a review from astrojuanlu May 31, 2023 13:28
Member

@astrojuanlu astrojuanlu left a comment

LGTM!

Contributor

@noklam noklam left a comment

Thank you! LGTM

@jmholzer jmholzer merged commit c6a1f30 into main Jun 1, 2023
@jmholzer jmholzer deleted the feat/modify-pyspark-iris-databricks-packaged-deployment branch June 1, 2023 14:11
Labels
enhancement New feature or request

3 participants