Add connection method for running dbt against a Spark session #272
Comments
@JCZuurmond Thanks for opening! This is an interesting idea, and I see how it could be especially useful in testing (following the proposal laid out in dbt-labs/spark-utils#22). In particular:
In practice:
I'm not exceptionally well versed in this stuff, so I do want to throw that up as a big caveat. Am I saying things that make sense? Maybe we could still add this functionality for ease of development/CI, but make exceptionally clear that it's for testing only, by wrapping it into a module separate from
Hi @jtcohen6, thanks for your response! I completely get your concerns. Like you said, a fundamental assumption of dbt is that it does not run where the data and compute are. This issue goes against that, so we should be careful about just adding this functionality. Still, to motivate my case:

dbt assumes there is data inside a data warehouse (the sources). The user defines transformations (models). Then we run dbt, which (i) compiles SQL and (ii) executes the SQL on a data warehouse. This set-up requires an integration with a data warehouse containing some data sources. I like that dbt separates logic from data/compute! It makes sense to me (after dbt showed it to me).

However, if we want to add unit testing capabilities to dbt (see the discussion for more on this), this is not really possible, since the dbt set-up requires an integration, so you will be doing integration tests. Even with a (lightweight) data warehouse in a Docker container that you spin up in CI, you still get integration-like tests: first set up a data warehouse, add some mock data sources, create a connection, run some dbt models against it and verify the output.

My argument is that for unit testing we should be able to run dbt where the (mock) data and compute are. With many data warehouses you can't do this (easily?), but with Spark you can! So, I would like to add this functionality so that I can unit test my dbt logic (as demonstrated in this PR). We could implement this functionality first in a separate package that is intended only for unit testing of
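To make the workflow argued for above concrete, here is a rough sketch of what such a unit test could look like, assuming the session-based connection method from this issue exists. `run_dbt` is a hypothetical helper (for example a pytest fixture) that invokes dbt in-process so that dbt shares the active Spark session; the model and source names are made up for illustration.

```python
# A rough sketch only; run_dbt is a hypothetical in-process helper, and the
# model/source names are invented for illustration.
from pyspark.sql import SparkSession


def test_my_model(run_dbt):
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Mock the source: register an in-memory DataFrame under the source's name.
    spark.createDataFrame(
        [(1, "2021-01-01"), (2, "2021-01-02")], ["id", "created_at"]
    ).createOrReplaceTempView("raw_events")

    # Compile and run the model in the same process, against the same session.
    run_dbt(["run", "--select", "my_model"])

    # Verify the output the model produced.
    assert spark.table("my_model").count() == 2
```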
@jtcohen6: I have opened a PR to add this connection method. Let's tackle the details of the exact implementation, and how to communicate about this connection method, over there.

After giving it more thought, I think it is best to add the connection method described in this issue to this repo instead of to a separate package, as suggested in my previous comment. Main motivation: that package

Also, all Spark connection methods are situated in this package, so it makes sense to add it here and not implement some tricks to add it in the

And I think that it is really nice to use a Spark session for unit testing: no need to set up a database, you can use pytest directly, no custom scripts to spin up Docker images, etc. However, if people prefer to test their dbt logic with a warehouse in the cloud, a Docker container, or wherever else they like to host a warehouse, that package should not prevent them from doing so.

I would like to hear how you feel about adding the
@JCZuurmond Thanks for your patience on this, and for continuing to take the initiative! I agree that:
All of which is to say, I think I've come around to your reasoning. So, I think the right approach would be very much as you have it in #279:
@JCZuurmond @jtcohen6 I think this would be valuable not just for testing: it would be great to run dbt as a Spark job, which would remove the dependency on the Thrift connection, which can be a bottleneck, a failure point, and a security concern.
@JCZuurmond Can I use it programmatically with an existing pyspark session?
@dasbipulkumar: Yes, you can use an existing pyspark session; the active session is used. About the use case you describe for the Spark session connector: you can of course (try to) use it, it is there to be (mis)used. Keep in mind what is discussed above: this connector goes against the dbt assumption that dbt runs somewhere other than where the compute is. Also keep in mind that dbt's user interface is a command-line tool; if you use it programmatically, you rely on entry points for which dbt does not guarantee a stable, versioned API.
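As a minimal illustration of "the active session is used": assuming the connection method obtains its session via `SparkSession.builder.getOrCreate()`, as sketched in this issue, a session created earlier in the same process is simply reused. The master and app name below are arbitrary.

```python
# Minimal sketch, assuming the connector resolves its session via getOrCreate().
from pyspark.sql import SparkSession

# A session created by your own job, before dbt is invoked in this process.
existing = (
    SparkSession.builder.master("local[*]").appName("my-spark-job").getOrCreate()
)

# What the connector would do internally: getOrCreate() returns the same session.
reused = SparkSession.builder.getOrCreate()

assert reused is existing  # no new connection or session is opened
```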
- Run test examples from docs ([issue](#14), [PR](#17))
- Add target flag ([issue](#11), [PR](#13))
- Delete session module ([is included in dbt-spark](dbt-labs/dbt-spark#272))
- Add GitHub templates
Describe the feature
Add a connection method for running dbt against a Spark session.
Describe alternatives you've considered
N.A.
Additional context
I want to run dbt against a (local) Spark session for testing. This would allow me to set up a Spark session in CI and then run dbt against it. It is more lightweight than setting up a Thrift server (or similar).
I did a similar thing for `soda-spark`. The trick is to mock a Python database API - the `connection` - that sends the SQL to a `pyspark.sql.SparkSession`.

The implementation would roughly be:

- A `Connection` that:
  - gets the Spark session with `SparkSession.builder.getOrCreate()`
  - sends the SQL to the session with `spark_session.sql(sql)`
- A connection method called `pyspark`, `session` or `local`
- Use the `Connection` in the `open` method

The tricky thing here is to what extent the Python database API should be mocked. Does `dbt-core`, for example, access the underlying `cursor` of such a connection? Or does it only use the `execute` method?
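A minimal sketch of what such a mocked database API could look like, assuming dbt only needs `cursor()`, `execute()`, `fetchall()`, `description`, and `close()` (which is exactly the open question above). The class and method bodies are illustrative, not the actual implementation.

```python
# Illustrative only: a DB-API-ish connection backed by a Spark session,
# assuming dbt touches nothing beyond the methods shown here.
from pyspark.sql import SparkSession


class SessionCursor:
    """Executes SQL on the Spark session and mimics a DB-API cursor."""

    def __init__(self, spark):
        self._spark = spark
        self._df = None

    def execute(self, sql, *parameters):
        # Hand the compiled SQL straight to the Spark session.
        self._df = self._spark.sql(sql)

    @property
    def description(self):
        # DB-API style column description: (name, type_code, ...).
        if self._df is None:
            return None
        return [
            (field.name, field.dataType.simpleString(), None, None, None, None, None)
            for field in self._df.schema.fields
        ]

    def fetchall(self):
        return [tuple(row) for row in self._df.collect()]

    def close(self):
        self._df = None


class SessionConnection:
    """A 'connection' that is really just the active (or a new local) Spark session."""

    def __init__(self):
        # Reuse the active session if there is one, otherwise create a local one.
        self._spark = SparkSession.builder.getOrCreate()

    def cursor(self):
        return SessionCursor(self._spark)

    def close(self):
        pass
```

The open question above would then decide how much more of the DB-API surface (for example `cancel`, `rollback`, or parameter binding) needs to be stubbed out.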
Who will this benefit?
Me, and some other (advanced) dbt Spark users who want to run `dbt-spark` programmatically.
Sure.