
Add connection method for running dbt against a Spark session #272

Closed
JCZuurmond opened this issue Dec 28, 2021 · 7 comments · Fixed by #279
Labels
enhancement New feature or request

Comments

@JCZuurmond
Collaborator

JCZuurmond commented Dec 28, 2021

Describe the feature

Add a connection method for running dbt against a Spark session.

Describe alternatives you've considered

N.A.

Additional context

I want to run dbt against a (local) Spark session for testing. This would allow me to set up a Spark session in CI, then run dbt against it. It is more lightweight than setting up a Thrift server (or similar).

I did a similar thing for soda-spark. The trick is to mock a Python database API - the connection - that sends the SQL to a pyspark.sql.SparkSession.

The implementation would roughly be:

  1. Implement a custom Connection that:
    1. Gets the spark session: SparkSession.builder.getOrCreate()
    2. Executes SQL: spark_session.sql(sql)
    3. Converts the results to the expected format.
  2. Add a new connection method called pyspark, session or local.
  3. Return the custom Connection in the open method.

The tricky thing here is to what extent the Python database API should be mocked. Does dbt-core, for example, access the underlying cursor of such a connection, or does it only use the execute method?
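
Roughly, the mocked connection could look like this (a minimal sketch, assuming pyspark is installed; the class names are illustrative, not necessarily what the eventual implementation would use):

```python
from typing import Any, List, Optional, Tuple

from pyspark.sql import SparkSession


class SessionCursor:
    """A minimal, PEP 249-style cursor that forwards SQL to a SparkSession."""

    def __init__(self, spark: SparkSession) -> None:
        self._spark = spark
        self._schema = None
        self._rows: Optional[List[Tuple[Any, ...]]] = None

    def execute(self, sql: str) -> None:
        # Run the statement on the Spark session and materialize the result.
        df = self._spark.sql(sql)
        self._schema = df.schema
        self._rows = [tuple(row) for row in df.collect()]

    def fetchall(self) -> Optional[List[Tuple[Any, ...]]]:
        return self._rows

    @property
    def description(self) -> Optional[List[Tuple]]:
        # The DB API expects 7-item tuples starting with (name, type_code, ...).
        if self._schema is None:
            return None
        return [
            (field.name, field.dataType.simpleString(), None, None, None, None, None)
            for field in self._schema.fields
        ]

    def close(self) -> None:
        self._rows = None


class SessionConnection:
    """A minimal, PEP 249-style connection backed by the active Spark session."""

    def __init__(self) -> None:
        self._spark = SparkSession.builder.getOrCreate()

    def cursor(self) -> SessionCursor:
        return SessionCursor(self._spark)

    def close(self) -> None:
        pass  # leave the Spark session running for the caller
```

The open method would then return such a connection; how much more of the DB API needs mocking depends on which cursor attributes dbt-core actually touches.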

Who will this benefit?

Me, and other (advanced) dbt-spark users who want to run dbt-spark programmatically.

Are you interested in contributing this feature?

Sure.

@jtcohen6
Contributor

jtcohen6 commented Jan 4, 2022

@JCZuurmond Thanks for opening! This is an interesting idea, and I see how it could be especially useful in testing (following the proposal laid out in dbt-labs/spark-utils#22). In particular:

It is more lightweight than setting up a Thrift server (or similar).

In practice:

  • Running a Spark cluster in a local Docker container, and connecting to it as if to an external database (such as via Thrift), feels both workable and more reflective of the actual experience of using dbt-spark with a real Spark cluster in development/production.
  • I'm really hesitant to add import pyspark anywhere within dbt-spark. As a rule, dbt does not process/compute data transformations itself, and pyspark does. That's a pretty significant guardrail, and I really hesitate to cross it.

I'm not exceptionally well versed in this stuff, so I do want to throw that up as a big caveat. Am I saying things that make sense? Maybe we could still add this functionality for ease of development/CI, but make exceptionally clear that it's for testing only, by wrapping it into a module separate from dbt-spark itself (but still within this repository)?

@jtcohen6 jtcohen6 removed the triage label Jan 4, 2022
@JCZuurmond
Collaborator Author

JCZuurmond commented Jan 13, 2022

Hi @jtcohen6, thanks for your response!

I completely get your concerns. Like you said, a fundamental assumption of dbt is that it does not run where the data and compute are. This issue goes against that, so we should be careful about just adding this functionality.

Still, to motivate my case:

dbt assumes there is data inside a data warehouse (the sources). The user defines transformations (models). Then we run dbt, which (i) compiles SQL and (ii) executes the SQL on a data warehouse. This set-up requires an integration with a data warehouse containing some data sources.

I like that dbt separates logic and data/compute! It makes sense to me (after dbt showed it to me).

However, if we want to add unit testing capabilities to dbt (see the discussion for more on this), this is not really possible, since the dbt set-up requires an integration, so you will be doing integration tests.

Even with a (lightweight) data warehouse in a Docker container that you spin up in CI, you still get integration-like tests: first set up a data warehouse, add some mock data sources, create a connection, run some dbt models against it, and verify the output.

My argument is that, for unit testing, we should be able to run dbt where the (mock) data and compute are.

With many data warehouses you can't do this (easily), but with Spark you can! So I would like to add this functionality so that I can unit test my dbt logic (as demonstrated in this PR).

We could implement this functionality first in a separate package intended only for unit testing of dbt-spark projects. Then it becomes clearer how it would look, and we can decide later whether to move some parts into this repo. I could have a first go at this at the end of this month.

@JCZuurmond
Collaborator Author

JCZuurmond commented Jan 28, 2022

@jtcohen6 : I have opened a PR to add this connection method. Let's tackle details about the exact implementation and how to communicate about this connection method there.

After giving it more thought, I think it is best to add the connection method described in this issue to this repo instead of to a separate package, as suggested in the previous comment. Main motivation: the pytest-dbt-core package does not have to be limited to dbt-spark per se. Other dbt adapters can - and should - also benefit from unit testing functionality.

Also, all Spark connection methods live in this package, so it makes sense to add this one here rather than implement some tricks to add it in the pytest-dbt-core package.

And I think it is really nice to use a Spark session for unit testing: no need to set up a database, pytest can be used directly, no custom scripts to spin up Docker images, etc. However, if people prefer to test their dbt logic with a warehouse in the cloud, a Docker container, or wherever else they like to host a warehouse, that package should not prevent them from doing so.
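
For illustration, a minimal pytest sketch of what that could look like (the fixture and view names are hypothetical; the idea is that dbt-spark's session method would attach to the same active session via SparkSession.builder.getOrCreate()):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # A local Spark session for the test run; no database or Docker required.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("dbt-unit-tests")
        .getOrCreate()
    )


def test_mock_source(spark: SparkSession) -> None:
    # Register mock source data; a dbt model run with the session connection
    # method in the same process would see this same active session.
    spark.createDataFrame([(1, "a")], "id INT, value STRING").createOrReplaceTempView("my_source")
    assert spark.table("my_source").count() == 1
```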

I would like to hear how you feel about adding the session connection method here - after having another look at it.

@jtcohen6
Contributor

jtcohen6 commented Feb 7, 2022

@JCZuurmond Thanks for your patience on this, and for continuing to take the initiative!

I agree that:

  • Testing functionality (shared utilities, suite of common functional tests, etc) should live independently of any one adapter plugin
  • Adapter-specific connection methods should live in that adapter plugin's repository
  • Apache Spark offers us a capability around unit testing that other databases simply cannot offer—and that's not a good reason to dismiss it out of hand, so long as we firmly tie that capability to unit testing only

All of which is to say, I think I've come around to your reasoning.

So, I think the right approach would be very much as you have it in #279:

  • pyspark is an optional ("extra") dependency
  • The session method is officially documented as serving non-production testing purposes only. In all likelihood, dbt Cloud will forbid use of the session method when setting up connections

@dasbipulkumar

@JCZuurmond @jtcohen6 I think this is not just for testing: it would be great to run dbt as a Spark job, which would remove the dependency on Thrift connections, which can be a bottleneck, a failure point, and a security concern.

@dasbipulkumar

@JCZuurmond Can I use it programmatically with an existing pyspark session?

@JCZuurmond
Collaborator Author

@dasbipulkumar: Yes, you can use an existing pyspark session. The active session is used.

About the use case you describe for the Spark session connector: you can of course (try to) use it, it is there to be (mis)used. Keep in mind what is discussed above: this connector goes against the dbt assumption that dbt runs somewhere other than where the compute is.

Keep in mind: dbt's user interface is a command line tool. If you are using it programmatically, you use entry points for which dbt does not guarantee a stable, versioned API.
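
For example, a rough sketch (assuming dbt-spark is installed with the pyspark extra and a profile that uses method: session; dbtRunner is the programmatic entry point of newer dbt-core versions and is subject to the stability caveat above):

```python
from dbt.cli.main import dbtRunner  # available in newer dbt-core versions only
from pyspark.sql import SparkSession

# Create (or reuse) the active session. With the session connection method,
# dbt-spark calls SparkSession.builder.getOrCreate() and so attaches to it.
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.createDataFrame([(1, "a")], "id INT, value STRING").createOrReplaceTempView("my_source")

# Invoke dbt in-process so it shares the same Spark session; this programmatic
# interface is not a guaranteed-stable API, as noted above.
result = dbtRunner().invoke(["run"])
print(result.success)
```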

JCZuurmond added a commit to godatadriven/pytest-dbt-core that referenced this issue Jul 22, 2022
- Run test examples from docs ([issue](#14), [PR](#17))
- Add target flag ([issue](#11), [PR](#13))
- Delete session module [is included in dbt-spark](dbt-labs/dbt-spark#272)
- Add Github templates