
Run dbt python models as full-fledged jobs (not one-time job submission) #756

Open
kdazzle opened this issue Aug 6, 2024 · 2 comments
Labels
enhancement New feature or request


@kdazzle
Contributor

kdazzle commented Aug 6, 2024

Describe the feature

I'd like to leverage Databricks' job-running functionality but still use dbt to manage the orchestration and development workflow. So, on the Databricks side, instead of running my Python models via the one-time submission API, I want a full-fledged Workflow Job to be created for certain dbt Python models. This would be a new submission_method. I think it would fill some gaps in the dbt-databricks integration and make it easier for heavy Databricks users to migrate pipelines into dbt/dbt Cloud.
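
For concreteness, here's a rough sketch of what opting a model into this might look like (the `workflow_job` value is just an illustrative name for the proposed submission method, not a settled interface):

```python
# models/my_model.py - illustrative sketch only; "workflow_job" is a
# placeholder name for the proposed submission method.
def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="workflow_job",  # proposed: create a persistent Workflow Job
    )
    # Any normal PySpark transformation; the model would now run inside a
    # real Databricks Workflow instead of a one-time submitted run.
    return dbt.ref("upstream_model")
```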

One of the reasons is that dbt Cloud's job management features are limited compared to Databricks'. For example, dbt doesn't offer retries out of the box or timeout warnings, runs time out after a maximum of 24 hours, and its notifications need a lot of work - all things that Databricks has nailed down pretty well over the years (its run history, for one, is great).

Another reason is that, even if dbt had feature parity, there are plenty of people who already use these features from the Databricks side and are happy with them. I'd like to leverage dbt for the dev lifecycle/orchestration improvements and continue to leverage Databricks for everything else. Allowing models to materialize as Workflows accomplishes both of those goals, and it makes it easier for people to move their workloads into dbt.

Implementation

See #762

TL;DR: it creates a Databricks Workflow for each model, named my-database-my-schema_my-model__dbt. The model gets triggered to run in the usual dbt manner (dbt run --select=my_model).

It also allows post_hook_tasks to be added to the workflow.
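
Purely as an illustration of the shape this could take (the `workflow_job_config` container key here is an assumption on my part; only `post_hook_tasks` is part of the proposal, and the task dict follows the standard Databricks Jobs API 2.1 task spec):

```python
# Hypothetical sketch: attach a follow-up task to the generated workflow.
def model(dbt, session):
    dbt.config(
        submission_method="workflow_job",
        workflow_job_config={  # assumed container key, not a settled name
            "post_hook_tasks": [
                {
                    "task_key": "notify_owner",
                    "depends_on": [{"task_key": "my_model"}],
                    "notebook_task": {"notebook_path": "/Shared/notify_on_finish"},
                },
            ],
        },
    )
    return dbt.ref("upstream_model")
```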

Describe alternatives you've considered

The current setup, via the dbt-databricks all_purpose_cluster/job_cluster submission methods. But these are somewhat limiting because:

  • One-time runs (docs) are difficult to find in the UI, which makes debugging more difficult
  • Backfills and initial runs - because dbt jobs have a maximum runtime of 24 hours, it is difficult to run jobs that may take several days to complete
  • A workflow prevents the same model from running twice at the same time - e.g., when the same model runs in two different pipelines, such as a modified+ selection and a scheduled job
  • We use Databricks' built-in job run history quite a bit to see how well a job performs over time
  • People often want to run a job outside of its schedule for a variety of reasons (some good, some less so). Hitting "play" in Databricks is much easier than going into dbt to create an ad hoc job each time this comes up
  • Alerting/retries - built-in email notification, retry, and tagging support for workflows (see the sketch after this list)
    • The API docs and UI suggest these generally work for one-time job runs in Databricks too, but the way dbt-databricks implements one-time submission means you can only really modify the cluster settings (unless I'm missing something)
      • We use tagging to notify job owners that their job has failed
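
To make those last points concrete, here's a minimal sketch of the job settings a persistent Workflow Job unlocks (notifications, retries, tags, concurrency control). The field names are standard Databricks Jobs API 2.1; the host, token, cluster id, paths, and email addresses are placeholders:

```python
import requests

# Job settings available to a persistent Workflow Job, versus one-time
# submission where dbt-databricks effectively only exposes cluster settings.
job_settings = {
    "name": "my-database-my-schema_my-model__dbt",
    "tags": {"owner": "data-team"},   # we use tags to notify job owners on failure
    "max_concurrent_runs": 1,         # stops the same model running twice at once
    "email_notifications": {"on_failure": ["owner@example.com"]},
    "tasks": [
        {
            "task_key": "my_model",
            "max_retries": 2,         # built-in retry support
            "existing_cluster_id": "<cluster-id>",
            "spark_python_task": {"python_file": "dbfs:/path/to/compiled_model.py"},
        }
    ],
}

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=job_settings,
)
resp.raise_for_status()
print(resp.json()["job_id"])  # persistent job: run history, "play" button, etc.
```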

I've also considered running dbt Cloud jobs via Workflows (docs). However, I'd still like to run all of my jobs through dbt Cloud/dbt Core, and this approach seems like it would make my setup more complicated: do I create a separate dbt Cloud job for every workflow that I want to run? How would this work in development/CI? And if I want to run several models at once, do I run them each separately in Databricks?


Who will this benefit?

Anyone using Python models in dbt-databricks.

Are you interested in contributing this feature?


Yes - I'd be happy to help. I've got some time I could set aside for this in late August/September.

@kdazzle kdazzle added the enhancement New feature or request label Aug 6, 2024
@benc-db
Collaborator

benc-db commented Aug 6, 2024

What is the advantage of doing it this way over using a Databricks Workflow with the dbt task type?

@kdazzle
Contributor Author

kdazzle commented Aug 6, 2024

Hey @benc-db, good point - I forgot about the dbt task type. What I'm proposing is the ability to integrate a Databricks Workflow job into the rest of the DAG, even if other models are being run via dbt Cloud. In that case, if a model were created with the dbt task type, that would be great too (assuming you can use more than just a SQL Warehouse).

My impression is that the dbt task type isn't meant for an individual model - it's more for the whole workflow. But I haven't used that task type, tbh.

Having to use Databricks Workflows for all dbt models wouldn't be ideal, though, as we wouldn't be able to hook into features from dbt Cloud, among other reasons.

@kdazzle kdazzle changed the title Run dbt models as full-fledged jobs (not one-time job submission) Run dbt python models as full-fledged jobs (not one-time job submission) Aug 7, 2024
kdazzle pushed a commit to kdazzle/dbt-databricks that referenced this issue Aug 8, 2024
benc-db added a commit that referenced this issue Oct 10, 2024