[CT-1095] Support Jobs Cluster for python models #444

Closed
ChenyuLInx opened this issue Aug 30, 2022 · 13 comments · Fixed by #467
Labels
enhancement New feature or request python_models issues related to python model

Comments

@ChenyuLInx
Contributor

Describe the feature

We are using the cluster provided in credentials to submit Python jobs to a given cluster. But the current setup means that Python and SQL models can only be submitted to the same cluster, so users can only use an all-purpose cluster if they want to run both Python and SQL jobs.

We want a way to specify a separate cluster for Python models, so that users can run Python models on a jobs cluster while keeping their SQL models running on a SQL endpoint to reduce cost.

Describe alternatives you've considered

Not really

Who will this benefit?

Users currently running SQL models on a SQL endpoint

Are you interested in contributing this feature?

Yes

@ChenyuLInx ChenyuLInx added enhancement New feature or request triage refinement Product or leadership input needed python_models issues related to python model and removed triage labels Aug 30, 2022
@github-actions github-actions bot changed the title Support Jobs Cluster for python models [CT-1095] Support Jobs Cluster for python models Aug 30, 2022
@ChenyuLInx
Contributor Author

ChenyuLInx commented Aug 30, 2022

@jtcohen6 @Fleid This should be a relatively easy lift once you all decide how this separate cluster ID should be specified by the user.

EDIT: but if we are moving towards having multiple profiles per project it will be a larger lift.

@jtcohen6
Contributor

jtcohen6 commented Sep 14, 2022

@ChenyuLInx I think the way to square this circle would be by allowing users to configure endpoint in profiles.yml (still used for SQL models), and then adding a cluster config for a specific Python model. Just in the same way we're adding support for dataproc_cluster_name as a model-level config in dbt-bigquery.

def model(dbt, session):
    dbt.config(cluster="1234-abcd-wxyz")

As far as I understand it, the names of interactive clusters are not secret, and there is no risk of checking these into version control. I'll ask the Databricks folks to make sure that's not a faulty assumption!

Then I think the way to support this in the plugin would be with logic like:

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Eventually, we probably do want to allow per-model configuration of endpoint and cluster for SQL models, too. I think that would require a much bigger lift in our codebase — there isn't a simple use cluster / use warehouse mechanism, as in other adapters, and there's more distance between our standard connection logic and node configurations — but it feels technically possible.

One note here: I think this is different from actually supporting Jobs Clusters, which are technically different (and less $$) in Databricks than all-purpose interactive clusters. I imagine those would require a different submission method, with (I imagine) much higher latency — they are not, by definition, interactive.

(cc @Fleid @lostmygithubaccount)

@ChenyuLInx
Contributor Author

@jtcohen6 Yes, similar logic is already used in several places for different configs in Python models (looks like you copied a piece of timeout code LOL).

I think I might have some extra cycles to play with the jobs cluster soon; I will take a look and report back.

@jtcohen6
Contributor

looks like you copied a piece of timeout code LOL

you caught me...

@ChenyuLInx
Contributor Author

@jtcohen6 Turns out my understanding of jobs clusters was wrong (I think???). I looked up the docs (link1, link2), and it is just that you create a new cluster to run a job each time. I did a local test: running one job took ~290s, of which the actual job was 47s and the rest was cluster provisioning.
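
For reference, here is a minimal sketch (not the actual dbt-spark code) of what "create a new cluster to run a job each time" looks like against the Databricks Jobs runs/submit API; the host, token, notebook path, and new_cluster values below are placeholders:

import requests

# Placeholder workspace URL and personal access token.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "run_name": "dbt python model run",
        # A transient cluster is provisioned for this run and torn down afterwards;
        # that provisioning is where most of the ~290s above is spent.
        "new_cluster": {
            "spark_version": "10.4.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Users/<user>/<schema>/my_python_model"},
    },
)
response.raise_for_status()
run_id = response.json()["run_id"]
# Poll /api/2.0/jobs/runs/get?run_id=<run_id> until the run reaches a terminal state.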

@jtcohen6
Contributor

jtcohen6 commented Sep 16, 2022

@ChenyuLInx That sounds right to me. And actually faster than I expected! For long-running models performing complex transformations, that upfront cost (time) might be worth it, in exchange for the lower cost ($$) of per-minute compute.

Should we reclassify this issue as more of a spike, or would the code to implement it actually be relatively straightforward?

@jtcohen6 jtcohen6 removed the refinement Product or leadership input needed label Sep 16, 2022
@ChenyuLInx
Contributor Author

ChenyuLInx commented Sep 16, 2022

@jtcohen6 It's pretty straightforward; we just need to figure out what options we want to provide our users with. In the example PR, I just copied some example config from a Databricks tutorial. I am assuming users would want to configure that in project.yml, then define a submission method (job_cluster) to use it?

Since we still run jobs using a notebook, we still require a user for it. I feel like we should probably switch to uploading a dbfs file for the job_cluster submission method, so that users don't need to provide an extra user config.

Although this brings up another thing: maybe submission_method should be a combination? We have 2 types of clusters (interactive, job_cluster) and 3 ways to upload the file (notebook, dbfs, command). Should we actually update the UX a little bit?
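
To illustrate the dbfs upload idea above, here is a hypothetical sketch using the DBFS put API; the host, token, path, and compiled code are placeholders, and this is not what the plugin does today:

import base64
import requests

# Placeholder workspace URL, token, and compiled Python model body.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"
compiled_code = "# compiled python model goes here"

response = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={
        # A workspace-wide path, so no per-user notebook directory is needed.
        "path": "/dbt-python-models/my_schema/my_python_model.py",
        "contents": base64.b64encode(compiled_code.encode()).decode(),
        "overwrite": True,
    },
)
response.raise_for_status()
# The job run could then point a spark_python_task at
# dbfs:/dbt-python-models/my_schema/my_python_model.py instead of using a notebook_task.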

@ueshin
Contributor

ueshin commented Sep 16, 2022

@jtcohen6

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters? If so, that would be great!

Eventually, we probably do want to allow per-model configuration of endpoint and cluster for SQL models, too.

I like the idea.

Actually we had a request to provide a way to select endpoints for each model. databricks/dbt-databricks#59.

@ChenyuLInx
Contributor Author

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters?

@ueshin I think this is not available now, but def happy to change things to that format. This is how that cluster is currently provided when using the Jobs API. (We also did some refactoring of the submission code recently; I am going to open an issue in dbt-databricks about it shortly.)

What would you expect users to provide in that cluster config if they want to use a job cluster? Should users just write the whole configuration dictionary there?

@jtcohen6
Contributor

jtcohen6 commented Sep 21, 2022

@jtcohen6

self.cluster = self.parsed_model["config"].get("cluster", self.credentials.cluster)

Is this already available if we call something like the above in adapters? If so, that would be great!

@ueshin I think this is not available now, but def happy to change things to that format.

@ChenyuLInx I'd think/hope this should just work for the code in dbt-databricks, if the user has specified cluster in their config:

models:
  - name: my_python_model
    config:
      cluster: 1234-abcd-happy

One note: It looks like dbt-databricks calls this cluster_id, instead of cluster. Could we shoot for consistency here? Perhaps cluster + cluster_id as supported aliases in both plugins?
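
A hypothetical sketch of that alias handling in the submission helper (neither plugin implements it this way today; resolve_cluster is an illustrative name):

# Accept both spellings, falling back to the connection-level credential.
def resolve_cluster(model_config: dict, credentials_cluster: str) -> str:
    return (
        model_config.get("cluster")
        or model_config.get("cluster_id")
        or credentials_cluster
    )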

@jtcohen6
Contributor

@ChenyuLInx @lostmygithubaccount Bringing back the main topic in this thread:

Since we still run jobs using a notebook, we still require a user for it. I feel like we should probably switch to uploading a dbfs file for the job_cluster submission method, so that users don't need to provide an extra user config.

Although this brings up another thing: maybe submission_method should be a combination? We have 2 types of clusters (interactive, job_cluster) and 3 ways to upload the file (notebook, dbfs, command). Should we actually update the UX a little bit?

We're imagining submission_method: job_cluster, configurable per-model, whereby dbt would ship up the Python code to run on a jobs cluster instead of an interactive cluster.

If this means shipping up the PySpark code to dbfs first, that feels fine & doable.

Makes sense that the user may wish to supply other / additional configuration! Do we need a whole dict of optional key-value pairs?

@lostmygithubaccount

Makes sense that the user may wish to supply other / additional configuration! Do we need a whole dict of optional key-value pairs?

is the cluster already defined? could/would we allow specifying Spark details (executor/worker hardware specs) as part of the job submission?

If this means shipping up the PySpark code to dbfs first, that feels fine & doable.

on this, are we then managing some filesystem? would we have something like dbfs://dbt/runs/<invocation_id>/<model_file>?

@jtcohen6
Contributor

is the cluster already defined?

I believe jobs clusters are transient — created, used, and then spun down.

could/would we allow specifying Spark details (executor/worker hardware specs) as part of the job submission?

This is definitely a possibility. It's something we've talked about for Dataproc, too. Do something sensible by default. Otherwise, allow the user to specify a bundle of key-value config pairs; dbt passes them through without any validation.
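
Something like this hypothetical sketch, where job_cluster_config is an illustrative config name rather than an existing setting:

# Sensible defaults for the transient cluster; placeholder values.
DEFAULT_NEW_CLUSTER = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
}

# Merge per-model overrides on top of the defaults and pass the result to the
# Jobs API as-is, without validating individual keys.
def build_new_cluster_spec(model_config: dict) -> dict:
    overrides = model_config.get("job_cluster_config") or {}
    return {**DEFAULT_NEW_CLUSTER, **overrides}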

on this, are we then managing some filesystem? would we have something like dbfs://dbt/runs/<invocation_id>/<model_file>?

Maybe? For the notebook method, we're using the <schema> to disambiguate (plus user value, since notebooks are in user-level workspaces):

work_dir = f"/Users/{self.credentials.user}/{self.schema}/"

On subsequent runs within the same environment, dbt just updates (overwrites) that notebook.

That feels closer to the experience of dbt run overwriting a table in your schema (environment). We could add in a totally unique identifier like <invocation_id>, although git + artifacts/logs have traditionally been a sufficient answer for preserving older versions of dbt code.
