Upgrade to Spark 3.1.1 with testing #349

Merged (24 commits) on Jun 28, 2022
Changes from 19 commits
9 changes: 7 additions & 2 deletions .circleci/config.yml
@@ -33,7 +33,7 @@ jobs:
DBT_INVOCATION_ENV: circle
docker:
- image: fishtownanalytics/test-container:10
-   - image: godatadriven/spark:2
+   - image: godatadriven/spark:3.0
Contributor:

If we move to the apache image in docker-compose, I would suggest doing that here as well 👍🏻

@nssalian (Contributor, Author) on May 13, 2022:

Yes, that's the part that's being tested at the moment, since the tests are failing in Thrift. I was trying to eliminate possible reasons for failure. I'm trying with 3.0 right now.

Reply:

You might consider trying with 3.1 or 3.2. There were absolutely some bugs in Spark 3.0.0, but I'm admittedly somewhat distrustful of the whole of Spark 3.0.x.

Just throwing that out there in case you're still stuck.

environment:
WAIT_FOR: localhost:5432
command: >
@@ -44,9 +44,11 @@ jobs:
--conf spark.hadoop.javax.jdo.option.ConnectionPassword=dbt
--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
-   --conf spark.jars.packages=org.apache.hudi:hudi-spark-bundle_2.11:0.9.0
+   --conf spark.jars.packages=org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
--conf spark.driver.userClassPathFirst=true
+   --conf spark.driver.memory=2g
+   --conf spark.executor.memory=2g
--conf spark.hadoop.datanucleus.autoCreateTables=true
--conf spark.hadoop.datanucleus.schema.autoCreateTables=true
--conf spark.hadoop.datanucleus.fixedDatastore=false
@@ -55,6 +57,9 @@
--hiveconf hoodie.datasource.hive_sync.mode=hms
--hiveconf datanucleus.schema.autoCreateAll=true
--hiveconf hive.metastore.schema.verification=false
+   --hiveconf hive.metastore.sasl.enabled=true
+   --hiveconf hive.server2.thrift.port=10000
+   --hiveconf hive.server2.thrift.bind.host=localhost

- image: postgres:9.6.17-alpine
environment:
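Side note, not part of the diff: the Thrift server this job starts listens on localhost:10000, which is the same endpoint dbt-spark's `thrift` connection method talks to. A minimal smoke-test sketch, assuming PyHive is installed (dbt-spark's thrift method builds on it), the containers above are running, and a `dbt` username is acceptable:

```python
# Hypothetical smoke test, not part of this PR: confirm the Spark Thrift server
# started by this CI job is reachable on localhost:10000 before the test suite runs.
from pyhive import hive  # dbt-spark's `thrift` method builds on PyHive

conn = hive.Connection(host="localhost", port=10000, username="dbt")  # username is an assumption
cursor = conn.cursor()
cursor.execute("SELECT 1")   # trivial query to prove the session works end to end
print(cursor.fetchall())     # expected: [(1,)]
cursor.close()
conn.close()
```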
2 changes: 1 addition & 1 deletion README.md
@@ -26,7 +26,7 @@ more information, consult [the docs](https://docs.getdbt.com/docs/profile-spark)

## Running locally
A `docker-compose` environment starts a Spark Thrift server and a Postgres database as a Hive Metastore backend.
-   Note that this is spark 2 not spark 3 so some functionalities might not be available.
+   Note: Spark has moved to Spark 3 (formerly on Spark 2).
Comment:

You mean dbt / dbt-spark moved to Spark 3, right?

Contributor (Author):

I thought about this statement and I'm a bit confused. If dbt-spark moved to Spark 3 but the testing module is still on Spark 2, this line in the docs doesn't add up. Since this is a draft PR, I'll finish up the testing and clean this up in either this PR or a separate one. But @rvacaru, do you know the context behind why the compose file has spark:3 but the testing didn't move over?

Reply:

When I read it initially, the impression I got was more that the compose file had been moved to Spark 3. So maybe the same warning applies, given the library has been / is being updated for Spark 3 functionality?


The following command would start two docker containers
```
…
```
4 changes: 2 additions & 2 deletions docker-compose.yml
@@ -1,8 +1,8 @@
version: "3.7"
services:

-   dbt-spark2-thrift:
-     image: godatadriven/spark:3.0
+   dbt-spark3-thrift:
+     image: apache/spark:v3.1.3
ports:
- "10000:10000"
- "4040:4040"
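One practical note, not from the PR: after `docker-compose up`, the renamed `dbt-spark3-thrift` service can take a while before port 10000 actually accepts connections. A small, hypothetical wait helper for local runs:

```python
# Hypothetical helper, not part of this PR: block until the dbt-spark3-thrift
# service accepts TCP connections on port 10000 before running tests against it.
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return  # the port is open; the server is at least accepting connections
        except OSError:
            time.sleep(2)  # not up yet, retry shortly
    raise TimeoutError(f"{host}:{port} not reachable after {timeout} seconds")


wait_for_port("localhost", 10000)
```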
2 changes: 2 additions & 0 deletions docker/spark-defaults.conf
@@ -1,3 +1,5 @@
+   spark.driver.memory 2g
+   spark.executor.memory 2g
spark.hadoop.datanucleus.autoCreateTables true
spark.hadoop.datanucleus.schema.autoCreateTables true
spark.hadoop.datanucleus.fixedDatastore false
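Purely as an illustration, not part of the change: assuming a local pyspark installation that reads this `docker/spark-defaults.conf` (for example via `SPARK_CONF_DIR`), the new memory settings can be checked on a live session like so:

```python
# Illustrative check, not part of this PR: verify the bumped memory settings
# from spark-defaults.conf are visible on a SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("defaults-check").getOrCreate()
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory", "not set"))    # expected: 2g
print(conf.get("spark.executor.memory", "not set"))  # expected: 2g
spark.stop()
```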
5 changes: 3 additions & 2 deletions tests/functional/adapter/test_basic.py
@@ -64,7 +64,7 @@ def project_config_update(self):
}


-   #hese tests were not enabled in the dbtspec files, so skipping here.
+   # These tests were not enabled in the dbtspec files, so skipping here.
# Error encountered was: Error running query: java.lang.ClassNotFoundException: delta.DefaultSource
@pytest.mark.skip_profile('apache_spark', 'spark_session')
class TestSnapshotTimestampSpark(BaseSnapshotTimestamp):
@@ -79,5 +79,6 @@ def project_config_update(self):
}
}


class TestBaseAdapterMethod(BaseAdapterMethod):
    pass
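For anyone extending this file later, the pattern above (a dbt-tests-adapter base class, an optional `project_config_update` override, and the `skip_profile` marker) generalizes. A hypothetical sketch, not part of this PR, with an assumed incremental base case and file format:

```python
# Hypothetical example, not part of this PR: reuse the dbt-tests-adapter pattern
# shown above for another base test case, skipping profiles that cannot run it.
import pytest
from dbt.tests.adapter.basic.test_incremental import BaseIncremental


@pytest.mark.skip_profile("spark_session")  # assumption: skip the in-process session profile
class TestIncrementalSpark(BaseIncremental):
    @pytest.fixture(scope="class")
    def project_config_update(self):
        # assumed config: run the incremental models with an explicit file format
        return {"models": {"+file_format": "parquet"}}
```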