[ADAP-658] [Feature] Spark Connect as connection method #814

timvw · 2023-06-24T18:34:17Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt-spark functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I would like to be able to use dbt (Spark) via the Spark Connect API.

Describe alternatives you've considered

We could decide not to support this

Who will this benefit?

All users that have a Spark Connect endpoint available

Are you interested in contributing this feature?

Yes

Anything else?

https://spark.apache.org/docs/latest/spark-connect-overview.html

dataders · 2023-07-04T15:22:58Z

@timvw I agree this could unlock quite a bit for us over time. 👁️ @Fokko do you know much about this new feature?

Fokko · 2023-07-07T09:26:56Z

@dataders Thanks for pinging me. I worked with Databricks' Spark connect quite a bit, and it is great to see that it is now part of Spark Open Source. I think it makes a lot of sense to add this.

ssabdb · 2023-07-07T17:53:48Z

@Fokko - I would be interested in your take on my interpretation of spark-connect's suitability in #821?

I have no experience with spark connect, but if the objective is to support the execution of SQL from DBT I can see how this would work.

I'm not sure it would support python models as presently implemented but this is perhaps not the intent of this issue.

timvw · 2023-08-28T15:10:13Z

Closed (as the possibility to connect to Livy seems more favorable for now)

vakarisbk · 2023-10-04T18:06:32Z

Hi! I would like to reopen this discussion, as I have opened PR #899, which introduces support for Spark Connect SQL models (I probably should have raised this before the PR, but that's water under the bridge now :) ).

I believe it makes sense to introduce support for Spark Connect SQL models because it unlocks an additional way of using dbt with open source Spark without many code changes on the dbt side (the implementation is based on the existing Spark Session code). Currently the only way to run dbt with open source Spark in production is a Thrift connection, so adding at least one alternative would open up dbt to more users.

Livy as an alternative was also discussed in issue #821. Livy would work well for SQL models, but the Livy open source project is pretty much dead. Some cloud providers (AWS EMR, Google Cloud Dataproc, maybe some others) still expose Livy-compatible APIs, so users on those providers would benefit from dbt Livy support. There is also a fairly new open source project called Lighter, which aims to replace Livy and has a Livy-compatible API.

But I don't think the question should be Spark Connect OR Livy. I think we can support both, especially since supporting Spark Connect would probably not require a lot of additional effort, since the implementation is highly tied to Spark Session, which dbt already supports.
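To make the "highly tied to Spark Session" point concrete, here is a minimal sketch, assuming PySpark 3.4+ installed with Spark Connect support. `build_remote_url` and `open_session` are hypothetical helper names used only for illustration; they are not taken from dbt-spark or from PR #899. `SparkSession.builder.remote(...)` is the PySpark entry point for Spark Connect, and 15002 is the default Spark Connect port.

```python
from typing import Mapping, Optional


def build_remote_url(host: str = "localhost", port: int = 15002,
                     params: Optional[Mapping[str, str]] = None) -> str:
    """Build a Spark Connect connection string: sc://host:port/;key=value;..."""
    url = f"sc://{host}:{port}"
    if params:
        # Optional parameters (e.g. token, use_ssl) follow "/;" and are ";"-separated.
        url += "/;" + ";".join(f"{k}={v}" for k, v in params.items())
    return url


def open_session(remote_url: Optional[str] = None):
    """Return a SparkSession: in-process if remote_url is None, else Spark Connect."""
    # Imported lazily so the URL helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    if remote_url is not None:
        # The only difference from the in-process path is this one call.
        builder = builder.remote(remote_url)
    return builder.getOrCreate()
```

Given a running Spark Connect server, `open_session(build_remote_url()).sql("SELECT 1").show()` would run the query remotely; everything downstream of the session object could in principle stay the same, which is why the extra effort looks small.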

I would like to hear what dbt and the community think about introducing Spark connect SQL models and whether it's worth supporting this feature.

vakarisbk · 2023-10-04T18:13:11Z

Regarding Python models:

Livy would be much better suited for dbt Python models as it would stick to dbt philosophy of generating the code locally and then shipping it somewhere else to execute. And it would support running arbitrary Python code remotely, not just a subset of APIs that are supported by Spark Connect, but again the open source Livy project is pretty much dead.

Spark Connect, on the other hand, is a fairly good alternative. It is limited in that it only supports the DataFrame API and, as of the latest Spark release, the Pandas API on Spark and PyTorch, but maybe that's enough for most use cases? And there are always UDFs, which are also executed remotely AFAIK.
It also allows easier local development, as spinning up a local Spark Connect cluster is very easy.
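As a rough illustration of what "DataFrame API only" would mean in practice, here is a sketch of a dbt Python model that stays within that subset. The `model(dbt, session)` signature, `dbt.config` and `dbt.ref` are the standard dbt Python model interface; the upstream model name `orders` and the column `customer_id` are hypothetical, and whether dbt-spark would actually run such a model over Spark Connect is exactly what is being discussed here.

```python
def model(dbt, session):
    """dbt Python model restricted to DataFrame operations."""
    dbt.config(materialized="table")

    # dbt.ref returns a DataFrame. Under Spark Connect this would be a
    # client-side handle whose operations are sent to the server as a plan.
    orders = dbt.ref("orders")  # hypothetical upstream model

    # Only DataFrame API calls are used: no RDDs and no SparkContext access,
    # which Spark Connect does not expose.
    return orders.groupBy("customer_id").count()
```

Anything outside the DataFrame calls (plain Python statements, imports of locally installed packages) would run on the client machine rather than on the cluster.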

I think it would make sense to split the discussion on Python models on Spark Connect into a separate issue if anyone wants to continue discussing it.

ben-schreiber · 2024-02-06T14:07:25Z

@vakarisbk I agree 100%. I would also add two points:

Since the SparkSession used for executing SQL with Spark Connect is exactly the one we would use to execute Python, the additional work needed to support dbt Python models on Spark Connect as well is low-hanging fruit.
Based on what I've seen (and as you mentioned), the Livy project is an older technology that is dying, and Thrift supports only SQL. Additionally, Spark Connect seems to be the incoming generation of technology for remotely connecting to a Spark application.

ssabdb · 2024-02-09T00:30:20Z

I proposed #821 and agree with the recommendation to split them into two separate sets of requirements, one for spark connect as a method to support SQL and one for a means (spark connect or whatever) to implement python dbt models in OSS spark.

This ticket focuses on using Spark Connect as an alternative to the Thrift server method. Even though that would only support SQL, it would still bring advantages.

I've not tried it yet (I might if I get around to it), but it may well be possible to do this without any changes at all, just by setting

export SPARK_REMOTE="sc://localhost"

However, that would bring SQL support only, although it would still improve on the current basic Spark session implementation.
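A minimal sketch of why the environment variable alone might be enough, assuming PySpark 3.4+ with Spark Connect support on the client. As far as I understand PySpark's behaviour, `SparkSession.builder.getOrCreate()` returns a Spark Connect session when `SPARK_REMOTE` is set, with no `.remote(...)` call needed. `spark_remote` and `get_session` are hypothetical names for illustration, and like the suggestion above this is untested against dbt-spark itself.

```python
import os
from typing import Mapping, Optional


def spark_remote(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Spark Connect URL from SPARK_REMOTE, or None if unset or empty."""
    return env.get("SPARK_REMOTE") or None


def get_session():
    """Create a session the way a plain session-based connection might."""
    # Lazy import: pyspark is only needed when a session is actually created.
    from pyspark.sql import SparkSession

    # No .remote(...) call here. With SPARK_REMOTE exported, the builder is
    # expected to return a Spark Connect session instead of a local one.
    return SparkSession.builder.getOrCreate()
```

Checking `spark_remote()` before creating the session would let a caller log which mode it is about to use, for example a local in-process session when it returns `None`.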

@ben-schreiber to be clear, I think there would be a limitation of spark connect which is highlighted by @vakarisbk

"Livy would be much better suited for dbt Python models as it would stick to dbt philosophy of generating the code locally and then shipping it somewhere else to execute. And it would support running arbitrary Python code remotely, not just a subset of APIs that are supported by Spark Connect"

Or to put it another way, spark connect cannot run arbitrary python remotely - AFAIK, there's no way to access an available python interpreter, and no requirement for one to be available. That's different to the approach taken by the other connectors which have all the relevant bits of python executed on the remote server. Quite possibly that's an acceptable limitation but a potentially confusing one - packages would only be installed locally, for example, whilst the configuration makes it clear this is for remote installation.

I do share the concerns around Livy's aliveness as well.

ben-schreiber · 2024-02-11T07:55:51Z

@ssabdb Agreed that there is a limitation; I think this is the key point:

"Quite possibly that's an acceptable limitation but a potentially confusing one"

Additionally, since there are numerous ways to connect to and use Spark, I'm not sure a "one size fits all" approach to Python DBT models for OSS Spark is the correct one. In any event, let's leave the Python model discussion for a dedicated issue (#415 ?)

GeorgiiKolpakov · 2024-11-22T16:54:59Z

@ben-schreiber @vakarisbk
I'm curious what the resolution of this discussion is. Is it fine to introduce Spark Connect as a SQL-only solution to this issue?
If yes, what hurdles remain before the existing PR can be merged? I see a review with one comment, which has been resolved. Would it be possible to merge that PR and close this issue?

timvw added enhancement New feature or request triage labels Jun 24, 2023

github-actions bot changed the title ~~[Feature] Spark Connect as connection method~~ [ADAP-658] [Feature] Spark Connect as connection method Jun 24, 2023

dataders mentioned this issue Jul 4, 2023

[ADAP-667] [Feature] python model support via livy #821

Closed

3 tasks

dataders removed the triage label Jul 4, 2023

Fokko mentioned this issue Jul 7, 2023

[ADAP-677] [Feature] Add a BaseClass to the ConnectionWrapper #829

Closed

3 tasks

timvw closed this as completed Aug 28, 2023

vakarisbk linked a pull request Oct 3, 2023 that will close this issue

Add support for Spark Connect (SQL models) #899

Open

4 tasks

timvw reopened this Feb 6, 2024

Fleid added the help_wanted Extra attention is needed label Feb 22, 2024
