Run dbt python models as full-fledged jobs (not one-time job submission) #756
Comments
What is the advantage of doing it this way over using a Databricks Workflow with the dbt task type?
Hey @benc-db, good point; I forgot about the dbt task type. What I'm proposing is being able to integrate a Databricks Workflow job into the rest of the DAG, even if other models are run via dbt Cloud. In that case, if a model were created with the dbt task type, that would be great too (assuming you can use more than just a SQL Warehouse). My impression is that the dbt task type isn't meant for an individual model; it's more for the whole workflow. But I haven't used that task type, tbh. Having to use Databricks Workflows for all dbt models wouldn't be ideal, as we wouldn't be able to hook into features from dbt Cloud, among other reasons.
Describe the feature
I'd like to leverage Databricks' job-running functionality but still use dbt to manage orchestration and the development workflow. So, from the Databricks side, instead of running my Python jobs via the one-time submission API, I would want a full-fledged Workflow Job to be created for certain dbt Python models. This would be a new `submission_method`. I think it would fill some gaps in the dbt-databricks integration and make it easier for heavy Databricks users to migrate pipelines into dbt/dbt Cloud.
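For illustration, opting a model into the proposed method might look something like the sketch below. The `workflow_job` value and the model names are assumptions for illustration, not an existing option:

```python
# models/my_model.py -- a sketch only. `workflow_job` is a hypothetical
# name for the proposed submission method, not a shipped option.
import pyspark.sql.functions as F


def model(dbt, session):
    dbt.config(
        materialized="table",
        # Proposed: create/update a persistent Workflow Job for this model
        # instead of submitting a one-time run.
        submission_method="workflow_job",
    )
    # `upstream_model` is a placeholder for any upstream dbt model.
    df = dbt.ref("upstream_model")
    return df.withColumn("processed_at", F.current_timestamp())
```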
One reason is that dbt Cloud's job-management features are limited compared to Databricks'. For example: dbt doesn't natively offer retries out of the box, there are no timeout warnings, runs time out after a maximum of 24 hours, and its notifications need a lot of work, while the Databricks run history is great. These are all things Databricks has nailed down pretty well over the years.
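For reference, the Databricks Jobs API already exposes these knobs per job and per task, and a one-time run submission can't persist most of them. A minimal sketch of a Jobs 2.1 create payload (host, token, cluster ID, and notebook path are placeholders):

```python
import requests

# Sketch of a Jobs API 2.1 create payload showing the retry, timeout,
# and notification settings mentioned above. All values are placeholders.
payload = {
    "name": "my-database-my-schema_my-model__dbt",
    "timeout_seconds": 3600,  # job-level timeout
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    "tasks": [
        {
            "task_key": "run_model",
            "max_retries": 2,  # automatic retries
            "min_retry_interval_millis": 60_000,
            "existing_cluster_id": "1234-567890-abcde123",
            "notebook_task": {"notebook_path": "/dbt/my_model"},
        }
    ],
}

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```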
Another reason is that, even if dbt had feature parity, plenty of people are already using these features on the Databricks side and are happy with them. I'd like to leverage dbt for the dev-lifecycle/orchestration improvements and continue to leverage Databricks for everything else. Allowing models to materialize as Workflows accomplishes both goals and makes it easier for people to move their workloads into dbt.
Implementation
See #762
TL;DR: creates a Databricks Workflow for each model, named `my-database-my-schema_my-model__dbt`. It gets triggered to run in the usual dbt manner (`dbt run --select=my_model`). Also allows `post_hook_tasks` to be added to the workflow; see the sketch below.
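For a rough idea of the shape (the actual config schema is defined in #762; the keys below are illustrative guesses, not the final interface):

```python
# Sketch only: an illustrative guess at attaching post-hook tasks to the
# generated workflow. The real schema lives in #762.
def model(dbt, session):
    dbt.config(
        submission_method="workflow_job",  # hypothetical, as above
        post_hook_tasks=[
            {
                "task_key": "notify_done",
                "depends_on": [{"task_key": "my_model"}],
                # Placeholder notebook path for the hook.
                "notebook_task": {"notebook_path": "/hooks/notify"},
            }
        ],
    )
    return dbt.ref("upstream_model")  # placeholder upstream model
```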
Describe alternatives you've considered
The current setup, via the dbt-databricks `all_purpose_cluster`/`job_cluster` submission methods. But these are somewhat limiting; for contrast with the proposal, the sketch below shows what that configuration looks like today.
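A rough picture of the current one-time submission configuration (`job_cluster_config` follows the Databricks cluster spec; all values are placeholders):

```python
# Current behavior: each `dbt run` submits a one-time run on a fresh
# job cluster (or an existing all-purpose cluster). Values are placeholders.
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",  # or "all_purpose_cluster"
        job_cluster_config={
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    )
    return dbt.ref("upstream_model")
```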
I've also considered running dbt Cloud jobs via Workflows (docs). However, I'd still like to run all of my jobs through dbt Cloud/dbt Core. This seems like it would make my setup more complicated: do I create a separate dbt Cloud job for every workflow I want to run? How would this work in development/CI? What if I want to run several models at once - do I run them each separately in Databricks?
Additional context
Who will this benefit?
Anyone using Python models in dbt-databricks.
Are you interested in contributing this feature?
Let us know if you want to write some code, and how we can help.
Yes - I'd be happy to help. I've got some time I could set aside for this in late August/September.