
@cloud-fan cloud-fan commented Jul 29, 2019

What changes were proposed in this pull request?

This PR implements the basic catalog functionalities for JDBC v2, so that users can register a JDBC catalog and create/access JDBC tables directly.

To keep the PR small, it introduces a data source v1 fallback API so that we don't need to implement ScanBuilder and WriteBuilder for JDBC v2 right now.

Note that this PR is orthogonal to #25211, which implements ScanBuilder and WriteBuilder for JDBC v2.
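
As an illustration of the intended user experience, here is a minimal usage sketch. The catalog class and package name, the option keys, and the catalog name mydb are assumptions for this example, not part of the PR.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-v2-demo").getOrCreate()

// Register a JDBC catalog via the DS v2 catalog plugin config; the class name
// below is a placeholder for the JDBCTableCatalog added by this PR.
spark.conf.set("spark.sql.catalog.mydb",
  "org.apache.spark.sql.execution.datasources.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.mydb.url", "jdbc:postgresql://host:5432/mydb")

// Create and query JDBC tables directly through the registered catalog.
spark.sql("CREATE TABLE mydb.t1 (id INT, data STRING)")
spark.sql("SELECT * FROM mydb.t1").show()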

How was this patch tested?

A new test suite.

InsertIntoDataSourceCommand(fallbackToV1Relation(t, output), query, overwrite = false)

case OverwriteByExpression(
DataSourceV2Relation(t: DataSourceV1Table, output, _), Literal(true, _), query, false) =>
Contributor Author

The Data Source V1 API doesn't support static partition overwrite (e.g. INSERT OVERWRITE t PARTITION(a=1) SELECT ...). We must fully migrate JDBC to Data Source V2 before supporting it.

Here we only need to match the OverwriteByExpression(..., deleteExpr=Literal(true)), as this is the only thing DS v1 supports.
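
To make the fallback concrete, here is a rough sketch of how the match arm shown above could be completed. This is illustrative, not the exact PR code; it reuses the fallbackToV1Relation helper and DataSourceV1Table type from the diff and assumes the same rule shape.

// Illustrative completion of the case above: a full-table overwrite
// (deleteExpr == Literal(true)) is the only overwrite DS v1's InsertableRelation
// API can express, so it maps to InsertIntoDataSourceCommand with overwrite = true.
case OverwriteByExpression(
    DataSourceV2Relation(t: DataSourceV1Table, output, _), Literal(true, _), query, false) =>
  InsertIntoDataSourceCommand(fallbackToV1Relation(t, output), query, overwrite = true)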

Contributor

maybe also add a @TODO here?

@cloud-fan cloud-fan added the SQL label Jul 29, 2019

SparkQA commented Jul 29, 2019

Test build #108333 has finished for PR 25291 at commit f3257ee.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 29, 2019

Test build #108334 has finished for PR 25291 at commit 2689293.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

@shivsood
Contributor

@cloud-fan @brkyvz @rdblue @gengliangwang

Thanks for adding me here. I understand that #25211 is still relevant and I should continue that work (as of now it implements a basic read/write for JDBC V2). Can you please review #25211 and let me know if I am on the right path, or whether I should change something in view of #25291?

@cloud-fan
Contributor Author

IIUC #25211 is invisible to end-users: the way to use the JDBC data source is still the same. But it does have value in the long term: code cleanup, more optimization opportunities, better performance, etc.

This PR allows users to use the JDBC source in the data source v2 way, i.e. register a custom catalog and create/access tables directly (contrasted in the sketch below). It only implements the basic catalog functionalities.
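
To make the contrast concrete, a quick sketch (assuming an active SparkSession named spark; the connection URL, catalog name, and table name are placeholders):

// The existing DS v1 way of reading JDBC data, which #25211 leaves unchanged
// from the user's point of view.
val v1Df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "t1")
  .load()

// The DS v2 way this PR targets: once a JDBC catalog named "mydb" is registered,
// its tables are addressed by name directly.
val v2Df = spark.table("mydb.t1")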

In general we need both PRs, but this one is smaller and visible to end-users, so it should probably be reviewed first.

SparkQA commented Jul 29, 2019

Test build #108340 has finished for PR 25291 at commit acde8b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 30, 2019

Test build #108379 has finished for PR 25291 at commit cff3d44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 30, 2019

Test build #108387 has finished for PR 25291 at commit 8072fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

rdblue commented Aug 2, 2019

@cloud-fan, why does this convert from v2 plans to v1 plans?

I don't think that v1 plans should be supported. To use v1 plans, the existing v1 code paths and support should be used. There's no need to mix v2 into it.

The support that I think is the goal of this effort is to provide a v2 implementation that calls a v1 writer for append and read operations.

I suggest clarifying the goal of this and possibly discussing it in the next v2 sync.

brkyvz commented Aug 2, 2019

@rdblue The biggest issue is that V1 writers take logical plans and V2 writers take SparkPlans (which is much better IMO, because then you can actually see the executed query in the SQL Tab)

I can see us adding .buildForV1Writer as part of the WriteBuilder API, but that means that the V2WriteExec nodes also take LogicalPlans in addition to SparkPlans. Would that be gross? I don't know.

Having separate physical nodes seems reasonable to me, but it is certainly more work (mostly copy-paste though) than adding LogicalPlans to V2 nodes. What do you think?
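
A hypothetical sketch of that buildForV1Writer idea; the trait name, method signature, and helper below are illustrative assumptions, not an existing Spark API.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.sources.InsertableRelation

// Illustrative only: a write builder that hands back a DS v1 relation instead of
// a v2 write, so the physical node can feed it a DataFrame built from the input
// LogicalPlan (DS v1 writers consume plans, not SparkPlans).
trait WriteBuilderWithV1Fallback {
  def buildForV1Writer(): InsertableRelation
}

// How a physical node could use it: build a DataFrame from the logical plan it
// carries and pass it to the v1 relation for an append.
def appendWithV1Fallback(builder: WriteBuilderWithV1Fallback, data: DataFrame): Unit = {
  builder.buildForV1Writer().insert(data, overwrite = false)
}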

rdblue commented Aug 2, 2019

@brkyvz, I don't understand. I thought the idea was not to use v1 plans, but to use the v1 read/write API. Does the API require using a v1 plan?

brkyvz commented Aug 2, 2019

The API doesn't require V1 plans. V1 writers just need LogicalPlans as inputs rather than SparkPlans

rdblue commented Aug 3, 2019

@brkyvz, then why does this add analyzer rules that convert v2 plans to v1 plans? If we just need to pass a plan to the source, then we don't need to convert the whole plan back to v1, right?

brkyvz commented Aug 3, 2019

I wasn't talking about the current state of the PR. We don't need these analyzer rules that rewrite to V1 plans.

brkyvz commented Aug 3, 2019

FYI, I made an alternative proposal in #25348

@cloud-fan
Contributor Author

Note that this PR extends the data source v1 API to support part of the v2 functionality (the catalog APIs). As a result, this PR only touches the DS v1 code path. The only exception is the v2 strategy: table creation is a v2 API while table insertion is a v1 API, so I have to create a special physical plan that calls both v1 and v2 APIs.

#25348 goes in a different direction. It extends the DS v2 API to support some v1 functionality (writing a logical plan directly). As a result, it only touches the DS v2 code path. It updates the v2 physical plan to include the input logical plan in order to call v1 APIs.

I don't have a strong preference here. I picked the first direction because DS v1 is in sql/core which can access DS v2 directly. If we all agree that the second direction is better, I'm happy to go with that too.

rdblue commented Aug 5, 2019

@cloud-fan, can you be more specific about what you are proposing for the v1 update and separate it from JDBC? Mixing those concerns together doesn't seem like a good path forward to me.

@cloud-fan
Contributor Author

I put JDBC v2 here because I want to give a concrete example and prove that my proposal works. I can surely create a new PR for API changes only, as soon as we agree on the approach.

More specifically, JDBC v2 wants to

  1. implement catalog APIs (TableCatalog), so that users can register it as a catalog and access its tables directly.
  2. not implement the ScanBuilder/WriteBuilder machinery right now, as it's a lot of work for relatively little benefit.

The proposal here is:

  1. have a special version of the v2 Table interface (DataSourceV1Table), which can return a v1 BaseRelation.
  2. have a special version of the v2 TableCatalog interface (DataSourceV1TableCatalog), which can return a DataSourceV1Table (both interfaces are sketched below, after the "How it works" list).

How it works:

  1. JDBC v2 implements v2 APIs, so the analyzer treats it as a v2 source and uses v2 plans to process it.
  2. The v1 analyzer rule catches v2 plans and converts them to v1 plans if the table is a DataSourceV1Table.
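
A hypothetical sketch of the two interfaces from the proposal above. The trait names follow the classes listed in the test-build output earlier in this thread; the package paths and exact method signatures are assumptions.

import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}
import org.apache.spark.sql.sources.BaseRelation

// A v2 Table that doesn't implement the v2 read/write builders; it exposes a
// DS v1 BaseRelation that Spark falls back to for reads and writes.
trait DataSourceV1Table extends Table {
  def v1Relation: BaseRelation
}

// A v2 TableCatalog that advertises the fallback at the catalog level, so plans
// like CTAS (which don't hold a Table instance yet) can choose the v1 write path.
trait DataSourceV1TableCatalog extends TableCatalog {
  override def loadTable(ident: Identifier): DataSourceV1Table
}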

rdblue commented Aug 5, 2019

have a special version of the v2 Table interface (DataSourceV1Table), which can return a v1 BaseRelation

CatalogTableAsV2 wraps a CatalogTable so that it can be converted to an UnresolvedCatalogRelation. Why is a second way to do this necessary?

have a special version of the v2 TableCatalog interface (DataSourceV1TableCatalog), which can return a DataSourceV1Table

Any catalog can return a table instance with this type, if it is public. Why does this require changing the TableCatalog interface?

The v1 analyzer rule catches v2 plans and converts them to v1 plans, if the table is DataSourceV1Table

I don't think v2 plans should be converted to v1 plans. As we discussed in the last sync, v1 read and write interfaces can be called from v2, but v2 should always provide v2 behavior guarantees. Converting back to v1 plans is not the right way to add compatibility.

@cloud-fan
Contributor Author

CatalogTableAsV2 wraps a CatalogTable so that it can be converted to an UnresolvedCatalogRelation. Why is a second way to do this necessary?

This has nothing to do with CatalogTable. The DS v1 read/write API is BaseRelation.

Any catalog can return a table instance with this type, if it is public. Why does this require changing the TableCatalog interface?

Because the CTAS plan does not hold a Table instance. The TableCatalog needs to tell Spark that it falls back to DS v1.

rdblue commented Aug 5, 2019

I think the approach in #25348 is the right way to move forward because it calls the v1 API only when we can guarantee that the behavior is the same.


/**
* A special Data Source V2 `Table`, which doesn't need to implement the read/write capabilities.
* Spark will fallback the read/write requests to the v1 relation.
Contributor

consistent with upper case for wording V1 and V2?

_name = name
val map = options.asCaseSensitiveMap().asScala.toMap
// The `JDBCOptions` checks the existence of the table option. This is required by JDBC v1, but
// JDBC V2 only knows the table option when loading a table. Here we put a table option with a
Contributor

same upper case V1 and V2?

rdblue commented Aug 21, 2019

@cloud-fan, do you want to close this since #25348 was merged?

@cloud-fan
Contributor Author

This PR is for JDBC v2. I'll continue working on it after the v1 read fallback API is done.

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019