Add databricks asset bundles docs #4265

Merged
astrojuanlu merged 27 commits into main from noklam/databricks-asset-bundles-docs
Nov 26, 2024

Conversation

noklam
Contributor

@noklam noklam commented Oct 28, 2024

Description

Partially addresses #3360

Development notes

  • Move the old dbx docs to a separate page
  • Take over the go-to IDE integration page with Databricks Asset Bundles, using kedro-databricks

Build: https://kedro--4265.org.readthedocs.build/en/4265/deployment/databricks/index.html

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
@@ -0,0 +1,272 @@
# Use an IDE, dbx and Databricks Repos to develop a Kedro project
Contributor Author

These are not new docs, so there is no need to review them. I moved the original databricks_ide_development.md page to a new file and am using the original page for DAB instead.

@noklam noklam force-pushed the noklam/databricks-asset-bundles-docs branch from 8eec4ba to 2733196 Compare October 29, 2024 15:07
@noklam noklam marked this pull request as ready for review October 31, 2024 01:02
@noklam noklam requested review from DimedS and removed request for yetudada October 31, 2024 01:03
Member

@merelcht merelcht left a comment

Should we also update the index page https://docs.kedro.org/en/stable/deployment/databricks/index.html and mention asset bundles?

@noklam noklam linked an issue Nov 12, 2024 that may be closed by this pull request
@astrojuanlu
Member

It could also be useful to clarify where the user can configure the cluster. For example, I have an existing cluster, but by default databricks bundle run tries to create a new one and fails with: "Unexpected user error while preparing the cluster for the job. Cause: PERMISSION_DENIED: You are not authorized to create clusters. Please contact your administrator."

@noklam
Contributor Author

noklam commented Nov 12, 2024

@astrojuanlu I believe you are trying to run this with our internal Azure Databricks? From my understanding a job cluster is always created fresh.

Job Cluster has been designed to be unique for each run of a job. So, each run of your job would run against a new job cluster.

But it seems it is also possible to run a job on an all-purpose cluster, though it is not recommended: https://docs.databricks.com/en/jobs/compute.html#all-purpose

Is the trick here to use existing_cluster_id?
https://docs.databricks.com/en/dev-tools/bundles/settings.html#examples, pinging @JenspederM to see if you know something about it.

https://community.databricks.com/t5/data-engineering/databricks-job-scheduling-continuous-mode/td-p/38861#:~:text=Job%20Cluster%20has%20been%20designed,use%20a%20dedicated%20interactive%20cluster.

@astrojuanlu
Member

I remember having this problem the first time I tested the plugin, and I forgot where I wrote down the solution 😂

VS Code is showing me an error because the conf/base/databricks.yml is not actually a bundle definition file (it's a "bundle override configuration")

image

In fact I think I found a bug: after I tweak the config override, kedro databricks bundle will not update the resources/ files @JenspederM

@astrojuanlu
Member

astrojuanlu commented Nov 13, 2024

Currently the best I could do is to specify the existing_cluster_id in the task overrides:

default:
    job_clusters:
        - job_cluster_key: default
          new_cluster:
              spark_version: ...
    tasks:
        - task_key: default
          existing_cluster_id: 1111-...  # <----

This seems to work.

Member

@astrojuanlu astrojuanlu left a comment

Found a couple more dbx references

@JenspederM

@astrojuanlu @noklam

Sorry for the late response. I see that you have resolved some of the issues yourselves. You are correct that you should use existing_cluster_id if you do not have permission to create job clusters.

Generally, the api for bundle resources follows the schema for creating a new job. See here.
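
As a rough illustration of the point above, a Databricks Asset Bundle job resource that runs on an existing cluster might look like the following sketch. The job key `kedro_job`, the package name `my_kedro_project`, and the entry point are hypothetical placeholders, and the cluster ID is left elided as in the earlier comment. This is not output generated by kedro-databricks.

```yaml
# Sketch of a bundle job resource; the names below are illustrative assumptions.
resources:
  jobs:
    kedro_job:                               # hypothetical job key
      name: kedro_job
      tasks:
        - task_key: default
          # Reuse an all-purpose cluster instead of creating a job cluster.
          existing_cluster_id: 1111-...      # elided, as in the comment above
          python_wheel_task:
            package_name: my_kedro_project   # hypothetical
            entry_point: my_kedro_project    # hypothetical
```

Because the task references `existing_cluster_id`, it needs neither `job_cluster_key` nor `new_cluster`, so running it should not require permission to create clusters.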

@noklam
Contributor Author

noklam commented Nov 15, 2024

I have added an existing cluster ID section, tested it on Databricks, and added a few more screenshots for the UI flow.

@noklam noklam requested review from merelcht and astrojuanlu and removed request for DimedS, astrojuanlu and merelcht November 19, 2024 06:31
Member

@merelcht merelcht left a comment

Some minor comments, but otherwise this looks great! 👍

Don't forget to add a note in the release notes.

noklam and others added 2 commits November 21, 2024 22:01
…orkflow.md

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
…orkflow.md

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
@noklam noklam enabled auto-merge (squash) November 21, 2024 14:05
Member

@astrojuanlu astrojuanlu left a comment

Approved with a couple of minor comments. Please also address the 2 outstanding review comments from @merelcht. cc @noklam

auto-merge was automatically disabled November 26, 2024 09:18

Pull Request is not mergeable

@noklam noklam mentioned this pull request Nov 26, 2024
9 tasks
@astrojuanlu
Member

In the interest of getting this merged, given that it already has 2 approvals and that @noklam has addressed all the comments, I'm hitting the button.

@astrojuanlu astrojuanlu merged commit cd4a7b8 into main Nov 26, 2024
41 checks passed
@astrojuanlu astrojuanlu deleted the noklam/databricks-asset-bundles-docs branch November 26, 2024 15:46
@astrojuanlu
Member

Pending redirect

Redirect from
/$lang/$version/deployment/databricks/databricks_ide_development_workflow.html
Redirect to
/$lang/$version/deployment/databricks/databricks_dbx_workflow.md

Development

Successfully merging this pull request may close these issues.

Update Databricks docs
4 participants