
Conversation

@shivsood
Contributor

What changes were proposed in this pull request?

This is a work-in-progress PR for a DataSourceV2-based connector for JDBC. The goal is an MVP for both the read and write paths based on the latest Data Source V2 APIs. The PR is not complete yet, but it is provided here for visibility into this work and for comments to set us in the right direction.

Another PR on related work is #21861. That uses the older V2 APIs, but some of the work there may still be relevant; I have asked the author to consider merging it if possible.
FYI @tengpeng @xianyin, who volunteered to contribute to this work going forward.

A Readme.md with the high-level work items has been added; find it at org/apache/spark/sql/execution/datasources/v2/jdbc/Readme.md.

The current PR implements the following (I will keep this updated as we make progress):

  • Scaffolding for read/write paths.
  • First draft implementation of the DataFrame write (append) flow. The connector name is "jdbcv2": df.write.format("jdbcv2").mode("append") appends to the table if it exists. Creating the table is not supported yet (see the usage sketch after this list).
  • E2E test cases added in MsSqlServerIntegrationSuite.scala
  • JDBCUtils is reused wherever easily possible; there is further scope for refactoring it to work for both the V1 and V2 flows.
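
A minimal usage sketch of the append flow described above, assuming a SparkSession `spark` and an existing SQL Server table; the `url` and `dbtable` option names are assumed to mirror the V1 JDBC options and are not confirmed by this PR:

```scala
// Hypothetical connection details; only format("jdbcv2") and mode("append")
// come from this PR, the option names mirror the V1 JDBC connector.
val url = "jdbc:sqlserver://host:1433;databaseName=test"

val df = spark.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")

df.write
  .format("jdbcv2")          // connector name introduced by this PR
  .option("url", url)
  .option("dbtable", "dbo.people")
  .mode("append")            // append only; create-table is not supported yet
  .save()
```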

How was this patch tested?

  • Validation with SQLServer 2017 only.
  • No unit test cases added for now.

The patch was mainly integration tested on the write (append) path.


shivsood added 5 commits July 15, 2019 21:30
…E2E test case added in MsSqlServerIntegrationSuite
…ourcev2.

- df.write.format("jdbcv2").mode("append") appends to the table if it exists; create table is not supported yet.
- Validation with SQLServer 2017 only.
- Added logging to help understand the flows.
- E2E test cases added in MsSqlServerIntegrationSuite.scala
@shivsood shivsood changed the title [SPARK-24907][SQL][WIP] DataSourceV2 based connector for JDBC [WIP][SPARK-24907][SQL] DataSourceV2 based connector for JDBC Jul 22, 2019
@shivsood
Contributor Author

The MVP read and write paths are in place now. I have a few issues/questions that I will add to org/apache/spark/sql/execution/datasources/v2/jdbc/Readme.md and use to start some discussion on the mailing list.
@tengpeng @xianyin @priyanka-gomatam please review and contribute as relevant. All of you should have contributor rights to this repo. Also please note #25291.

@rdblue @cloud-fan @gengliangwang @brkyvz: not ready for a complete review, but a directional review would help greatly (is this on the right track?).

…ionSuite.scala

Semantics are TRUNCATE TABLE and then overwrite with the new data. The existing table schema is preserved.

Overwrite (w/o truncate): scaffolding in place. Utils::CreateTable is a dummy and still needs to be implemented.
Semantics are DROP TABLE, CREATE TABLE with the newly passed schema, and then overwrite with the new data.
Problems
- The framework keeps calling WriteBuilder::truncate() even when the truncate option is not specified or truncate is explicitly set to false. Tested with truncate=false.
- Added a test: df.filter and then overwrite (w/o truncate) to write only the set of rows that match the filter. The framework still calls truncate. (A sketch of both overwrite flows follows below.)
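
A sketch of the two overwrite flows being exercised above, reusing `spark`, `url`, and `df` from the earlier append example; the `truncate` option name follows the V1 JDBC connector and is an assumption here:

```scala
// Overwrite with truncate: expected semantics are TRUNCATE TABLE then write,
// preserving the existing table schema.
df.write.format("jdbcv2")
  .option("url", url)
  .option("dbtable", "dbo.people")
  .option("truncate", "true")
  .mode("overwrite")
  .save()

// Overwrite without truncate: expected semantics are DROP TABLE, CREATE TABLE
// with the new schema, then write. The problem above: the framework still
// routes this through WriteBuilder::truncate().
df.filter(df("id") > 1).write.format("jdbcv2")
  .option("url", url)
  .option("dbtable", "dbo.people")
  .option("truncate", "false")
  .mode("overwrite")
  .save()
```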

Read path fixed to return the schema with pruned columns, as suggested in Scan::readSchema.
A select with pruned columns still does not work (see the read sketch below).
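
A minimal read sketch that exercises the column pruning described above; the option names are assumed to mirror V1, and `url` is the hypothetical URL from the write examples:

```scala
// Selecting a subset of columns should cause the scan to report a pruned
// schema via Scan::readSchema.
val people = spark.read
  .format("jdbcv2")
  .option("url", url)
  .option("dbtable", "dbo.people")
  .load()

people.select("name").show()   // pruned-column select; currently still failing
```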
…and CREATE TABLE. Reuses JDBCUtils to DROP and CREATE.

JDBCUtils had to be refactored to take a schema rather than a DataFrame. The functions that take a DataFrame are retained for V1 compatibility.
The V2 implementation is not e2e tested, as the framework continues to send truncate rather than overwrite.
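
A minimal sketch of the refactoring direction described above; the names and signatures are illustrative, not the actual JdbcUtils API:

```scala
import java.sql.Connection
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

object JdbcUtilsSketch {
  // V2-friendly variant: takes only the schema, so it can be called from a
  // WriteBuilder where no DataFrame is available.
  def createTable(conn: Connection, table: String, schema: StructType): Unit = {
    val columns = schema.fields
      .map(f => s"${f.name} ${f.dataType.sql}") // naive type mapping, for illustration
      .mkString(", ")
    val stmt = conn.createStatement()
    try stmt.executeUpdate(s"CREATE TABLE $table ($columns)")
    finally stmt.close()
  }

  // DataFrame overload retained for V1 compatibility; delegates to the
  // schema-based variant.
  def createTable(conn: Connection, table: String, df: DataFrame): Unit =
    createTable(conn, table, df.schema)
}
```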

V1 regression tests following the JDBCUtils change:
Unit tests (./build/mvn -pl :spark-sql_2.12 clean install) were run. Tests passed, with the usual failures that are also seen on the master branch.
Total number of tests run: 5896
Suites: completed 288, aborted 0
Tests: succeeded 5893, failed 3, canceled 1, ignored 45, pending 0

V1 integration tests (./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12) were run and all passed.
Run completed in 36 seconds, 352 milliseconds.
Total number of tests run: 22
Suites: completed 5, aborted 1
Tests: succeeded 22, failed 0, canceled 0, ignored 6, pending 0
@shivsood
Contributor Author

shivsood commented Aug 2, 2019

A first draft of the DataSourceV2-based JDBC connector is available now (PR #25211). The goal was an MVP implementation with support for batch read/write. I am looking forward to your review comments to help guide the direction. Note that I am still understanding/addressing some issues. The plan, status, and issues are captured in the Readme.md.

Summary of changes

  • The V2 connector changes are under sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc. The implementation heavily reuses the infrastructure provided by JDBCUtils.
  • The JDBCUtils file (sql/core/../datasources/jdbc/JdbcUtils.scala) is refactored (for a few functions) to suit V2 needs.
  • E2E test cases are in external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019
@baibaichen
Contributor

@shivsood any progress on this PR?

