[CT-1095] Support Jobs Cluster for python models #444
Comments
@ChenyuLInx I think the way to square this circle would be by allowing users to configure:

```python
def model(dbt, session):
    dbt.config(cluster="1234-abcd-wxyz")
```

As far as I understand it, the names of interactive clusters are not secret, and there is no risk of checking these into version control. I'll ask the Databricks folks to make sure that's not a faulty assumption! Then I think the way to support this in the plugin would be with logic like:
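As a rough, hypothetical sketch of what that lookup could look like (the helper name and config key here are assumptions, modeled on how the per-model timeout is already read from the parsed model config):

```python
# Hypothetical sketch only: resolve the cluster for a python model, preferring a
# per-model config and falling back to the connection credentials.
def _get_cluster_id(parsed_model: dict, credentials) -> str:
    # `cluster` set via dbt.config(...) or the model's .yml config, if present
    model_cluster = parsed_model["config"].get("cluster")
    # otherwise use the cluster configured in the profile credentials
    return model_cluster or credentials.cluster
```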
Eventually, we probably do want to allow per-model configuration of this.

One note here: I think this is different from actually supporting Jobs Clusters, which are technically different (and less $$) in Databricks than all-purpose interactive clusters. I imagine those would require a different submission method, with much higher latency, since they are not, by definition, interactive.
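For illustration, a hedged sketch of what such a jobs-cluster submission could look like against the Databricks Jobs `runs/submit` endpoint. The run name, cluster sizing values, and file path below are placeholders, not settings from this thread:

```python
# Hypothetical sketch: submit a one-off run on a transient job cluster via the
# Databricks Jobs API (runs/submit) instead of an existing all-purpose cluster.
import requests


def submit_job_cluster_run(host: str, token: str, python_file: str) -> dict:
    response = requests.post(
        f"https://{host}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "run_name": "dbt python model run",
            "tasks": [
                {
                    "task_key": "dbt_python_model",
                    # a new (transient) cluster is created for this run and
                    # torn down afterwards; no existing_cluster_id is needed
                    "new_cluster": {
                        "spark_version": "11.3.x-scala2.12",
                        "node_type_id": "i3.xlarge",
                        "num_workers": 1,
                    },
                    "spark_python_task": {"python_file": python_file},
                }
            ],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # contains the run_id to poll for completion
```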
@jtcohen6 Yes, similar logic is used in a few places for different configs in Python models (looks like you copied a piece of the timeout code LOL). I think I might have some extra cycles to play with Jobs clusters soon; I'll take a look and report back.
you caught me...
@ChenyuLInx That sounds right to me. And actually faster than I expected! For long-running models performing complex transformations, that upfront cost (time) might be worth it, in exchange for the lower cost ($$) of per-minute compute. Should we reclassify this issue as more of a spike, or would the code to implement it actually be relatively straightforward?
@jtcohen6 It's pretty straightforward; we just need to figure out what options we want to provide users with. As in the example PR, I just copied some example config from a Databricks tutorial, and I am assuming users would want to configure that in their own project.

Since we still run jobs using the notebook, we still require a `user` for it. I feel like we should probably switch to uploading a dbfs file for the job_cluster submission method, so that users don't need to provide an extra `user`.

Although this brings up another thing: maybe submission_method should be a combination? We have 2 types of clusters (interactive, job_cluster) and 3 ways to upload the file (notebook, dbfs, command), so should we actually update the UX a little bit?
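As a hedged sketch of the dbfs-upload idea using the Databricks DBFS API (the target path layout and function name are assumptions, not an agreed design):

```python
# Hypothetical sketch: upload the compiled PySpark code to DBFS so a job
# cluster can run it directly, without needing a /Users/<user>/ notebook path.
import base64
import requests


def upload_model_to_dbfs(host: str, token: str, compiled_code: str, identifier: str) -> str:
    dbfs_path = f"/dbt_python_models/{identifier}.py"  # assumed layout
    requests.post(
        f"https://{host}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": dbfs_path,
            "contents": base64.b64encode(compiled_code.encode()).decode(),
            "overwrite": True,
        },
        timeout=30,
    ).raise_for_status()
    return f"dbfs:{dbfs_path}"  # form a spark_python_task would reference
```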
Is this already available if we call something like the above in adapters? If so, that would be great!
I like the idea. Actually, we had a request to provide a way to select endpoints for each model: databricks/dbt-databricks#59.
@ueshin I think this is not available now, but I'm definitely happy to change things to that format. This is how that cluster is currently provided when using the Jobs API. (We also did some refactoring of the submission code recently; I am going to open an issue in dbt-databricks about it shortly.) What would you expect users to provide in that cluster config if they want to use a job cluster? Should they just write the whole dictionary of configurations there?
@ChenyuLInx I'd think/hope this should just work for the code in:

```yml
models:
  - name: my_python_model
    config:
      cluster: 1234-abcd-happy
```

One note: It looks like
@ChenyuLInx @lostmygithubaccount Bringing back the main topic in this thread:
We're imagining something along those lines. If this means shipping the PySpark code up to dbfs first, that feels fine & doable. Makes sense that the user may wish to supply other / additional configuration! Do we need a whole dict of optional key-value pairs?
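For illustration, a hypothetical model-side sketch of such a pass-through dict; `job_cluster_config` and its keys are invented names for the sake of the example, not a settled interface:

```python
# Hypothetical sketch of the model-author side: hand dbt a pass-through dict of
# cluster settings that the plugin forwards to Databricks without validation.
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",   # assumed method name
        job_cluster_config={               # assumed pass-through key-value pairs
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    )
    return dbt.ref("upstream_model")
```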
Is the cluster already defined? Could/would we allow specifying Spark details (executor/worker hardware specs) as part of the job submission?
On this, are we then managing some filesystem? Would we have something like
I believe jobs clusters are transient — created, used, and then spun down.
This is definitely a possibility. It's something we've talked about for Dataproc, too. Do something sensible by default; otherwise, allow users to specify a bundle of key-value config pairs that dbt passes through without any validation.
Maybe? For the notebook method, we're using the
On subsequent runs within the same environment, dbt just updates (overwrites) that notebook. That feels closer to the experience of
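As a hedged sketch of that overwrite flow using the Databricks Workspace import API (the notebook path layout and function name are assumptions):

```python
# Hypothetical sketch of the notebook flow described above: re-import
# (overwrite) the same workspace notebook on every run of the model.
import base64
import requests


def upsert_model_notebook(host: str, token: str, user: str, identifier: str, compiled_code: str) -> None:
    requests.post(
        f"https://{host}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": f"/Users/{user}/dbt_python_models/{identifier}",  # assumed layout
            "content": base64.b64encode(compiled_code.encode()).decode(),
            "language": "PYTHON",
            "format": "SOURCE",
            "overwrite": True,  # subsequent runs replace the previous version
        },
        timeout=30,
    ).raise_for_status()
```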
Describe the feature
We are using the `cluster` provided in credentials to submit Python jobs to a given cluster. But the current setup means that Python and SQL models can only be submitted to that same cluster, so users can only use an all-purpose cluster if they want to run both Python and SQL jobs today. We want a way to specify a separate cluster for Python models, so that users can run Python models on a jobs cluster while keeping their SQL models running on a SQL endpoint to reduce running cost.
Describe alternatives you've considered
Not really
Who will this benefit?
Users currently running SQL models on a SQL endpoint.
Are you interested in contributing this feature?
Yes