[CT-1095] Support Jobs Cluster for python models #444

Closed
ChenyuLInx opened this issue Aug 30, 2022 · 13 comments · Fixed by #467
Labels
enhancement New feature or request python_models issues related to python model

Comments

@ChenyuLInx
Contributor

Describe the feature

We are using the cluster provided in credentials to submit Python jobs to a given cluster. But the current setup means that Python and SQL models can only be submitted to the same cluster, so users can only use an all-purpose cluster if they want to run both Python and SQL jobs.

We want a way to specify a separate cluster for Python models, so that users can run Python models on a jobs cluster while keeping their SQL models running on a SQL endpoint to reduce cost.

Describe alternatives you've considered

Not really

Who will this benefit?

Users currently running SQL models on a SQL endpoint

Are you interested in contributing this feature?

Yes

@ChenyuLInx ChenyuLInx added enhancement New feature or request triage refinement Product or leadership input needed python_models issues related to python model and removed triage labels Aug 30, 2022
@github-actions github-actions bot changed the title Support Jobs Cluster for python models [CT-1095] Support Jobs Cluster for python models Aug 30, 2022
@ChenyuLInx
Contributor Author

ChenyuLInx commented Aug 30, 2022

@jtcohen6 @Fleid This should be a relatively easy lift once you all decide how this separate cluster ID should be specified by the user.

EDIT: but if we are moving towards having multiple profiles per project it will be a larger lift.

@jtcohen6
Contributor

jtcohen6 commented Sep 14, 2022

@ChenyuLInx I think the way to square this circle would be by allowing users to configure endpoint in profiles.yml (still used for SQL models), and then adding a cluster config for a specific Python model. Just in the same way we're adding support for dataproc_cluster_name as a model-level config in dbt-bigquery.

def model(dbt, session):
    dbt.config(cluster="1234-abcd-wxyz")

As far as I understand it, the names of interactive clusters are not secret, and there is no risk of checking these into version control. I'll ask the Databricks folks to make sure that's not a faulty assumption!

Then I think the way to support this in the plugin would be with logic like:

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Eventually, we probably do want to allow per-model configuration of endpoint and cluster for SQL models, too. I think that would require a much bigger lift in our codebase — there isn't a simple use cluster / use warehouse mechanism, as in other adapters, and there's more distance between our standard connection logic and node configurations — but it feels technically possible.

One note here: I think this is different from actually supporting Jobs Clusters, which are technically different (and less $$) in Databricks than all-purpose interactive clusters. I imagine those would require a different submission method, with (I imagine) much higher latency — they are not, by definition, interactive.

(cc @Fleid @lostmygithubaccount)

@ChenyuLInx
Contributor Author

@jtcohen6 Yes, similar logic is already used in several places for different configs in Python models (looks like you copied a piece of timeout code LOL).

I think I might have some extra cycles to play with the jobs cluster soon; I will take a look and report back.

@jtcohen6
Contributor

looks like you copied a piece of timeout code LOL

you caught me...

@ChenyuLInx
Contributor Author

@jtcohen6 Turns out my understanding of jobs clusters was wrong (I think???). I looked up the docs (link1, link2), and it is just that you create a new cluster to run a job each time. I did a local test: running one job took ~290s, of which the actual job was 47s and the rest was cluster provisioning.
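
For reference, here is a minimal sketch (not the actual dbt-spark code) of what "create a new cluster to run a job each time" looks like against the Databricks Jobs runs/submit API; the host, token, notebook path, and new_cluster values below are placeholders:

import requests

# Placeholder workspace URL and personal access token.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "run_name": "dbt python model run",
        # A transient cluster is provisioned for this run and torn down afterwards;
        # that provisioning is where most of the ~290s above is spent.
        "new_cluster": {
            "spark_version": "10.4.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Users/<user>/<schema>/my_python_model"},
    },
)
response.raise_for_status()
run_id = response.json()["run_id"]
# Poll /api/2.0/jobs/runs/get?run_id=<run_id> until the run reaches a terminal state.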

@jtcohen6
Contributor

jtcohen6 commented Sep 16, 2022

@ChenyuLInx That sounds right to me. And actually faster than I expected! For long-running models performing complex transformations, that upfront cost (time) might be worth it, in exchange for the lower cost ($$) of per-minute compute.

Should we reclassify this issue as more of a spike, or would the code to implement it actually be relatively straightforward?

@jtcohen6 jtcohen6 removed the refinement Product or leadership input needed label Sep 16, 2022
@ChenyuLInx
Contributor Author

ChenyuLInx commented Sep 16, 2022

@jtcohen6 It's pretty straightforward; we just need to figure out what options we want to provide our users with. In the example PR, I just copied some example config from a Databricks tutorial. I am assuming users would want to configure that in project.yml, then define a submission method (job_cluster) to use it?

Since we still run jobs using a notebook, we still require a user for it. I feel like we should probably switch to uploading a dbfs file for the job_cluster submission method, so that users don't need to provide an extra user config.

Although this brings up another thing: maybe submission_method should be a combination? We have 2 types of clusters (interactive, job_cluster) and 3 ways to upload the file (notebook, dbfs, command). Should we actually update the UX a little bit?
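
To illustrate the dbfs upload idea above, here is a hypothetical sketch using the DBFS put API; the host, token, path, and compiled code are placeholders, and this is not what the plugin does today:

import base64
import requests

# Placeholder workspace URL, token, and compiled Python model body.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
compiled_code = "# compiled python model goes here"

response = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={
        # A workspace-wide path, so no per-user notebook directory is needed.
        "path": "/dbt-python-models/my_schema/my_python_model.py",
        "contents": base64.b64encode(compiled_code.encode()).decode(),
        "overwrite": True,
    },
)
response.raise_for_status()
# The job run could then point a spark_python_task at
# dbfs:/dbt-python-models/my_schema/my_python_model.py instead of using a notebook_task.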

@ueshin
Contributor

ueshin commented Sep 16, 2022

@jtcohen6

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters? If so, that would be great!

Eventually, we probably do want to allow per-model configuration of endpoint and cluster for SQL models, too.

I like the idea.

Actually we had a request to provide a way to select endpoints for each model. databricks/dbt-databricks#59.

@ChenyuLInx
Contributor Author

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters?

@ueshin I think this is not available now, but def happy to change things to that format. This is how that cluster is currently provided when using the Jobs API. (We also did some refactoring of the submission code recently; I am going to open an issue in dbt-databricks about it shortly.)

What would you expect users to provide in that cluster config if they want to use a job cluster? Should users just write the whole configuration dictionary there?

@jtcohen6
Contributor

jtcohen6 commented Sep 21, 2022

@jtcohen6

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters? If so, that would be great!

@ueshin I think this is not available now, but def happy to change things to that format.

@ChenyuLInx I'd think/hope this should just work for the code in dbt-databricks, if the user has specified cluster in their config:

models:
  - name: my_python_model
    config:
      cluster: 1234-abcd-happy

One note: It looks like dbt-databricks calls this cluster_id, instead of cluster. Could we shoot for consistency here? Perhaps cluster + cluster_id as supported aliases in both plugins?
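
A hypothetical sketch of that alias handling in the submission helper (neither plugin implements it this way today; resolve_cluster is an illustrative name):

# Accept both spellings, falling back to the connection-level credential.
def resolve_cluster(model_config: dict, credentials_cluster: str) -> str:
    return (
        model_config.get("cluster")
        or model_config.get("cluster_id")
        or credentials_cluster
    )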

@jtcohen6
Contributor

@ChenyuLInx @lostmygithubaccount Bringing back the main topic in this thread:

Since we still run jobs using a notebook, we still require a user for it. I feel like we should probably switch to uploading a dbfs file for the job_cluster submission method, so that users don't need to provide an extra user config.

Although this brings up another thing: maybe submission_method should be a combination? We have 2 types of clusters (interactive, job_cluster) and 3 ways to upload the file (notebook, dbfs, command). Should we actually update the UX a little bit?

We're imagining submission_method: job_cluster, configurable per-model, whereby dbt would ship up the Python code to run on a jobs cluster instead of an interactive cluster.

If this means shipping up the PySpark code to dbfs first, that feels fine & doable.

Makes sense that the user may wish to supply other / additional configuration! Do we need a whole dict of optional key-value pairs?

@lostmygithubaccount

Makes sense that the user may wish to supply other / additional configuration! Do we need a whole dict of optional key-value pairs?

is the cluster already defined? could/would we allow specifying Spark details (executor/worker hardware specs) as part of the job submission?

If this means shipping up the PySpark code to dbfs first, that feels fine & doable.

on this, are we then managing some filesystem? would we have something like dbfs://dbt/runs/<invocation_id>/<model_file>?

@jtcohen6
Contributor

is the cluster already defined?

I believe jobs clusters are transient — created, used, and then spun down.

could/would we allow specifying Spark details (executor/worker hardware specs) as part of the job submission?

This is definitely a possibility. It's something we've talked about for Dataproc, too. Do something sensible by default. Otherwise, allow the user to specify a bundle of key-value config pairs; dbt passes them through without any validation.
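
Something like this hypothetical sketch, where job_cluster_config is an illustrative config name rather than an existing setting:

# Sensible defaults for the transient cluster; placeholder values.
DEFAULT_NEW_CLUSTER = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
}

# Merge per-model overrides on top of the defaults and pass the result to the
# Jobs API as-is, without validating individual keys.
def build_new_cluster_spec(model_config: dict) -> dict:
    overrides = model_config.get("job_cluster_config") or {}
    return {**DEFAULT_NEW_CLUSTER, **overrides}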

on this, are we then managing some filesystem? would we have something like dbfs://dbt/runs/<invocation_id>/<model_file>?

Maybe? For the notebook method, we're using the <schema> to disambiguate (plus user value, since notebooks are in user-level workspaces):

work_dir = f"/Users/{self.credentials.user}/{self.schema}/"

On subsequent runs within the same environment, dbt just updates (overwrites) that notebook.

That feels closer to the experience of dbt run overwriting a table in your schema (environment). We could add in a totally unique identifier like <invocation_id>, although git + artifacts/logs have traditionally been a sufficient answer for preserving older versions of dbt code.
