
Add connection method for running dbt against a Spark session #272

Closed
JCZuurmond opened this issue Dec 28, 2021 · 7 comments · Fixed by #279
Labels
enhancement New feature or request

Comments

@JCZuurmond
Collaborator

JCZuurmond commented Dec 28, 2021

Describe the feature

Add a connection method for running dbt against a Spark session.

Describe alternatives you've considered

N.A.

Additional context

I want to run dbt against a (local) Spark session for testing. This would allow me to set up a Spark session in CI, then run dbt against it. It is more lightweight than setting up a Thrift server (or similar).

I did a similar thing for soda-spark. The trick is to mock a Python database API - the connection - that sends the SQL to a pyspark.sql.SparkSession.

The implementation would roughly be:

  1. Implement a custom Connection that:
    1. Gets the spark session: SparkSession.builder.getOrCreate()
    2. Executes SQL: spark_session.sql(sql)
    3. Converts the results to the expected format.
  2. Add a new connection method called pyspark, session or local.
  3. Return the custom Connection in the open method.

The tricky thing here is to what extent the Python database API should be mocked. Does dbt-core, for example, access the underlying cursor of such a connection, or does it only use the execute method?
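
Roughly, the mocked connection could look like this (a minimal sketch, assuming pyspark is installed; the class names are illustrative, not necessarily what the eventual implementation would use):

```python
from typing import Any, List, Optional, Tuple

from pyspark.sql import SparkSession


class SessionCursor:
    """A minimal, PEP 249-style cursor that forwards SQL to a SparkSession."""

    def __init__(self, spark: SparkSession) -> None:
        self._spark = spark
        self._schema = None
        self._rows: Optional[List[Tuple[Any, ...]]] = None

    def execute(self, sql: str) -> None:
        # Run the statement on the Spark session and materialize the result.
        df = self._spark.sql(sql)
        self._schema = df.schema
        self._rows = [tuple(row) for row in df.collect()]

    def fetchall(self) -> Optional[List[Tuple[Any, ...]]]:
        return self._rows

    @property
    def description(self) -> Optional[List[Tuple]]:
        # The DB API expects 7-item tuples starting with (name, type_code, ...).
        if self._schema is None:
            return None
        return [
            (field.name, field.dataType.simpleString(), None, None, None, None, None)
            for field in self._schema.fields
        ]

    def close(self) -> None:
        self._rows = None


class SessionConnection:
    """A minimal, PEP 249-style connection backed by the active Spark session."""

    def __init__(self) -> None:
        self._spark = SparkSession.builder.getOrCreate()

    def cursor(self) -> SessionCursor:
        return SessionCursor(self._spark)

    def close(self) -> None:
        pass  # leave the Spark session running for the caller
```

The open method would then return such a connection; how much more of the DB API needs mocking depends on which cursor attributes dbt-core actually touches.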

Who will this benefit?

Me, and other (advanced) dbt-spark users who want to run dbt-spark programmatically.

Are you interested in contributing this feature?

Sure.

@jtcohen6
Contributor

jtcohen6 commented Jan 4, 2022

@JCZuurmond Thanks for opening! This is an interesting idea, and I see how it could be especially useful in testing (following the proposal laid out in dbt-labs/spark-utils#22). In particular:

It is more lightweight than setting up a Thrift server (or similar).

In practice:

  • Running a Spark cluster in a local Docker container, and connecting to it as if to an external database (such as via Thrift), feels both workable and more reflective of the actual experience of using dbt-spark with a real Spark cluster in development/production.
  • I'm really hesitant to add import pyspark anywhere within dbt-spark. As a rule, dbt does not process/compute data transformations itself, and pyspark does. That's a pretty significant guardrail, and I really hesitate to cross it.

I'm not exceptionally well versed in this stuff, so I do want to throw that up as a big caveat. Am I saying things that make sense? Maybe we could still add this functionality for ease of development/CI, but make exceptionally clear that it's for testing only, by wrapping it into a module separate from dbt-spark itself (but still within this repository)?

@jtcohen6 jtcohen6 removed the triage label Jan 4, 2022
@JCZuurmond
Collaborator Author

JCZuurmond commented Jan 13, 2022

Hi @jtcohen6, thanks for your response!

I completely get your concerns. Like you said, a fundamental assumption of dbt is that it does not run where the data and compute are. This issue goes against that, so we should be careful about just adding this functionality.

Still, to motivate my case:

dbt assumes there is data inside a data warehouse (the sources). The user defines transformations (models). Then we run dbt, which (i) compiles SQL and (ii) executes the SQL on a data warehouse. This set-up requires an integration with a data warehouse containing some data sources.

I like that dbt separates logic and data/compute! It makes sense to me (after dbt showed it to me).

However, if we want to add unit testing capabilities to dbt (see the discussion for more on this), this is not really possible, since the dbt set-up requires an integration, so you will be doing integration tests.

Even with a (lightweight) data warehouse in a Docker container that you spin up in CI, you still get integration-like tests: first set up a data warehouse, add some mock data sources, create a connection, run some dbt models against it, and verify the output.

My argument is that, for unit testing, we should be able to run dbt where the (mock) data and compute are.

With many data warehouses you can't do this (easily), but with Spark you can! So I would like to add this functionality so that I can unit test my dbt logic (as demonstrated in this PR).

We could implement this functionality first in a separate package intended only for unit testing of dbt-spark projects. Then it becomes clearer how it would look, and we can decide later whether to move some parts into this repo. I could have a first go at this at the end of this month.

@JCZuurmond
Collaborator Author

JCZuurmond commented Jan 28, 2022

@jtcohen6 : I have opened a PR to add this connection method. Let's tackle details about the exact implementation and how to communicate about this connection method there.

After giving it more thought, I think it is best to add the connection method described in this issue to this repo instead of to a separate package, as suggested in the previous comment. Main motivation: the pytest-dbt-core package does not have to be limited to dbt-spark per se. Other dbt adapters can - and should - also benefit from unit testing functionality.

Also, all Spark connection methods live in this package, so it makes sense to add this one here rather than implement some tricks to add it in the pytest-dbt-core package.

And I think it is really nice to use a Spark session for unit testing: no need to set up a database, pytest can be used directly, no custom scripts to spin up Docker images, etc. However, if people prefer to test their dbt logic with a warehouse in the cloud, a Docker container, or wherever else they like to host a warehouse, that package should not prevent them from doing so.
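
For illustration, a minimal pytest sketch of what that could look like (the fixture and view names are hypothetical; the idea is that dbt-spark's session method would attach to the same active session via SparkSession.builder.getOrCreate()):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # A local Spark session for the test run; no database or Docker required.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("dbt-unit-tests")
        .getOrCreate()
    )


def test_mock_source(spark: SparkSession) -> None:
    # Register mock source data; a dbt model run with the session connection
    # method in the same process would see this same active session.
    spark.createDataFrame([(1, "a")], "id INT, value STRING").createOrReplaceTempView("my_source")
    assert spark.table("my_source").count() == 1
```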

I would like to hear how you feel about adding the session connection method here - after having another look at it.

@jtcohen6
Contributor

jtcohen6 commented Feb 7, 2022

@JCZuurmond Thanks for your patience on this, and for continuing to take the initiative!

I agree that:

  • Testing functionality (shared utilities, suite of common functional tests, etc) should live independently of any one adapter plugin
  • Adapter-specific connection methods should live in that adapter plugin's repository
  • Apache Spark offers us a capability around unit testing that other databases simply cannot offer—and that's not a good reason to dismiss it out of hand, so long as we firmly tie that capability to unit testing only

All of which is to say, I think I've come around to your reasoning.

So, I think the right approach would be very much as you have it in #279:

  • pyspark is an optional ("extra") dependency
  • The session method is officially documented as serving non-production testing purposes only. In all likelihood, dbt Cloud will forbid use of the session method when setting up connections

@dasbipulkumar

@JCZuurmond @jtcohen6 I think this is not just for testing: it would be great to run dbt as a Spark job, which would remove the dependency on Thrift connections, which can be a bottleneck, a failure point, and a security concern.

@dasbipulkumar

@JCZuurmond Can I use it programmatically with an existing pyspark session?

@JCZuurmond
Collaborator Author

@dasbipulkumar: Yes, you can use an existing pyspark session. The active session is used.

About the use case you describe for the Spark session connector: you can of course (try to) use it, it is there to be (mis)used. Keep in mind what is discussed above: this connector goes against the dbt assumption that dbt runs somewhere other than where the compute is.

Keep in mind: dbt's user interface is a command line tool. If you are using it programmatically, you use entry points for which dbt does not guarantee a stable, versioned API.
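
For example, a rough sketch (assuming dbt-spark is installed with the pyspark extra and a profile that uses method: session; dbtRunner is the programmatic entry point of newer dbt-core versions and is subject to the stability caveat above):

```python
from dbt.cli.main import dbtRunner  # available in newer dbt-core versions only
from pyspark.sql import SparkSession

# Create (or reuse) the active session. With the session connection method,
# dbt-spark calls SparkSession.builder.getOrCreate() and so attaches to it.
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.createDataFrame([(1, "a")], "id INT, value STRING").createOrReplaceTempView("my_source")

# Invoke dbt in-process so it shares the same Spark session; this programmatic
# interface is not a guaranteed-stable API, as noted above.
result = dbtRunner().invoke(["run"])
print(result.success)
```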

JCZuurmond added a commit to godatadriven/pytest-dbt-core that referenced this issue Jul 22, 2022
- Run test examples from docs ([issue](#14), [PR](#17))
- Add target flag ([issue](#11), [PR](#13))
- Delete session module [is included in dbt-spark](dbt-labs/dbt-spark#272)
- Add Github templates