
@cloud-fan cloud-fan commented Jul 29, 2019

What changes were proposed in this pull request?

This PR implements the basic catalog functionalities for JDBC v2, so that users can register a JDBC catalog and create/access JDBC tables directly.

To keep the PR small, it introduces a data source v1 fallback API so that we don't need to implement ScanBuilder and WriteBuilder for JDBC v2 right now.

Note that this PR is orthogonal to #25211, which implements ScanBuilder and WriteBuilder for JDBC v2.
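
As an illustration of the intended user experience, here is a minimal usage sketch. The catalog class and package name, the option keys, and the catalog name mydb are assumptions for this example, not part of the PR.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-v2-demo").getOrCreate()

// Register a JDBC catalog via the DS v2 catalog plugin config; the class name
// below is a placeholder for the JDBCTableCatalog added by this PR.
spark.conf.set("spark.sql.catalog.mydb",
  "org.apache.spark.sql.execution.datasources.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.mydb.url", "jdbc:postgresql://host:5432/mydb")

// Create and query JDBC tables directly through the registered catalog.
spark.sql("CREATE TABLE mydb.t1 (id INT, data STRING)")
spark.sql("SELECT * FROM mydb.t1").show()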

How was this patch tested?

A new test suite.

InsertIntoDataSourceCommand(fallbackToV1Relation(t, output), query, overwrite = false)

case OverwriteByExpression(
DataSourceV2Relation(t: DataSourceV1Table, output, _), Literal(true, _), query, false) =>
Contributor Author

The Data Source V1 API doesn't support static partition overwrite (e.g. INSERT OVERWRITE t PARTITION(a=1) SELECT ...). We must fully migrate JDBC to Data Source V2 before supporting it.

Here we only need to match the OverwriteByExpression(..., deleteExpr=Literal(true)), as this is the only thing DS v1 supports.
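
To make the fallback concrete, here is a rough sketch of how the match arm shown above could be completed. This is illustrative, not the exact PR code; it reuses the fallbackToV1Relation helper and DataSourceV1Table type from the diff and assumes the same rule shape.

// Illustrative completion of the case above: a full-table overwrite
// (deleteExpr == Literal(true)) is the only overwrite DS v1's InsertableRelation
// API can express, so it maps to InsertIntoDataSourceCommand with overwrite = true.
case OverwriteByExpression(
    DataSourceV2Relation(t: DataSourceV1Table, output, _), Literal(true, _), query, false) =>
  InsertIntoDataSourceCommand(fallbackToV1Relation(t, output), query, overwrite = true)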

Contributor

maybe also add a @TODO here?

@cloud-fan cloud-fan added the SQL label Jul 29, 2019

SparkQA commented Jul 29, 2019

Test build #108333 has finished for PR 25291 at commit f3257ee.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 29, 2019

Test build #108334 has finished for PR 25291 at commit 2689293.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

@shivsood
Contributor

@cloud-fan @brkyvz @rdblue @gengliangwang

Thanks for adding me here. I understand that #25211 is still relevant and I should continue that work (as of now it implements a basic read/write for JDBC V2). Can you please review #25211 and let me know if I am on the right path, or whether I should change something in view of #25291?

@cloud-fan
Contributor Author

IIUC #25211 is invisible to end-users: the way to use the JDBC data source is still the same. But it does have value in the long term: code cleanup, more optimization opportunities, better performance, etc.

This PR allows users to use the JDBC source in the data source v2 way, i.e. register a custom catalog and create/access tables directly (contrasted in the sketch below). It only implements the basic catalog functionalities.
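
To make the contrast concrete, a quick sketch (assuming an active SparkSession named spark; the connection URL, catalog name, and table name are placeholders):

// The existing DS v1 way of reading JDBC data, which #25211 leaves unchanged
// from the user's point of view.
val v1Df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "t1")
  .load()

// The DS v2 way this PR targets: once a JDBC catalog named "mydb" is registered,
// its tables are addressed by name directly.
val v2Df = spark.table("mydb.t1")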

In general we need both PRs, but this one is smaller and visible to end-users, so it should probably be reviewed first.

SparkQA commented Jul 29, 2019

Test build #108340 has finished for PR 25291 at commit acde8b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 30, 2019

Test build #108379 has finished for PR 25291 at commit cff3d44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

SparkQA commented Jul 30, 2019

Test build #108387 has finished for PR 25291 at commit 8072fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV1TableAsSelectExec(
  • case class JDBCTable(
  • class JDBCTableCatalog extends DataSourceV1TableCatalog with Logging
  • trait DataSourceV1Table extends Table

rdblue commented Aug 2, 2019

@cloud-fan, why does this convert from v2 plans to v1 plans?

I don't think that v1 plans should be supported. To use v1 plans, the existing v1 code paths and support should be used. There's no need to mix v2 into it.

The support that I think is the goal of this effort is to provide a v2 implementation that calls a v1 writer for append and read operations.

I suggest clarifying the goal of this and possibly discussing it in the next v2 sync.

brkyvz commented Aug 2, 2019

@rdblue The biggest issue is that V1 writers take logical plans and V2 writers take SparkPlans (which is much better IMO, because then you can actually see the executed query in the SQL Tab)

I can see us adding .buildForV1Writer as part of the WriteBuilder API, but that means that the V2WriteExec nodes also take LogicalPlans in addition to SparkPlans. Would that be gross? I don't know.

Having separate physical nodes seems reasonable to me, but it is certainly more work (mostly copy-paste though) than adding LogicalPlans to V2 nodes. What do you think?
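
A hypothetical sketch of that buildForV1Writer idea; the trait name, method signature, and helper below are illustrative assumptions, not an existing Spark API.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.sources.InsertableRelation

// Illustrative only: a write builder that hands back a DS v1 relation instead of
// a v2 write, so the physical node can feed it a DataFrame built from the input
// LogicalPlan (DS v1 writers consume plans, not SparkPlans).
trait WriteBuilderWithV1Fallback {
  def buildForV1Writer(): InsertableRelation
}

// How a physical node could use it: build a DataFrame from the logical plan it
// carries and pass it to the v1 relation for an append.
def appendWithV1Fallback(builder: WriteBuilderWithV1Fallback, data: DataFrame): Unit = {
  builder.buildForV1Writer().insert(data, overwrite = false)
}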

rdblue commented Aug 2, 2019

@brkyvz, I don't understand. I thought the idea was not to use v1 plans, but to use the v1 read/write API. Does the API require using a v1 plan?

brkyvz commented Aug 2, 2019

The API doesn't require V1 plans. V1 writers just need LogicalPlans as inputs rather than SparkPlans

rdblue commented Aug 3, 2019

@brkyvz, then why does this add analyzer rules that convert v2 plans to v1 plans? If we just need to pass a plan to the source, then we don't need to convert the whole plan back to v1, right?

brkyvz commented Aug 3, 2019

I wasn't talking about the current state of the PR. We don't need these analyzer rules that rewrite to V1 plans.

brkyvz commented Aug 3, 2019

FYI, I made an alternative proposal in #25348

@cloud-fan
Contributor Author

Note that this PR extends the data source v1 API to support part of the v2 functionality (the catalog APIs). As a result, this PR only touches the DS v1 code path. The only exception is the v2 strategy: table creation is a v2 API while table insertion is a v1 API, so I have to create a special physical plan that calls both v1 and v2 APIs.

#25348 goes in a different direction. It extends the DS v2 API to support some v1 functionality (writing a logical plan directly). As a result, it only touches the DS v2 code path. It updates the v2 physical plan to include the input logical plan in order to call v1 APIs.

I don't have a strong preference here. I picked the first direction because DS v1 is in sql/core which can access DS v2 directly. If we all agree that the second direction is better, I'm happy to go with that too.

rdblue commented Aug 5, 2019

@cloud-fan, can you be more specific about what you are proposing for the v1 update and separate it from JDBC? Mixing those concerns together doesn't seem like a good path forward to me.

@cloud-fan
Contributor Author

I put JDBC v2 here because I want to give a concrete example and prove that my proposal works. I can surely create a new PR for API changes only, as soon as we agree on the approach.

More specifically, JDBC v2 wants to

  1. implement catalog APIs (TableCatalog), so that users can register it as a catalog and access its tables directly.
  2. not implement the ScanBuilder/WriteBuilder machinery right now, as it's a lot of work for relatively little benefit.

The proposal here is:

  1. have a special version of the v2 Table interface (DataSourceV1Table), which can return a v1 BaseRelation.
  2. have a special version of the v2 TableCatalog interface (DataSourceV1TableCatalog), which can return a DataSourceV1Table (both interfaces are sketched below, after the "How it works" list).

How it works:

  1. JDBC v2 implements v2 APIs, so the analyzer treats it as a v2 source and uses v2 plans to process it.
  2. The v1 analyzer rule catches v2 plans and converts them to v1 plans if the table is a DataSourceV1Table.
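
A hypothetical sketch of the two interfaces from the proposal above. The trait names follow the classes listed in the test-build output earlier in this thread; the package paths and exact method signatures are assumptions.

import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}
import org.apache.spark.sql.sources.BaseRelation

// A v2 Table that doesn't implement the v2 read/write builders; it exposes a
// DS v1 BaseRelation that Spark falls back to for reads and writes.
trait DataSourceV1Table extends Table {
  def v1Relation: BaseRelation
}

// A v2 TableCatalog that advertises the fallback at the catalog level, so plans
// like CTAS (which don't hold a Table instance yet) can choose the v1 write path.
trait DataSourceV1TableCatalog extends TableCatalog {
  override def loadTable(ident: Identifier): DataSourceV1Table
}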

rdblue commented Aug 5, 2019

have a special version of the v2 Table interface (DataSourceV1Table), which can return a v1 BaseRelation

CatalogTableAsV2 wraps a CatalogTable so that it can be converted to an UnresolvedCatalogRelation. Why is a second way to do this necessary?

have a special version of the v2 TableCatalog interface (DataSourceV1TableCatalog), which can return a DataSourceV1Table

Any catalog can return a table instance with this type, if it is public. Why does this require changing the TableCatalog interface?

The v1 analyzer rule catches v2 plans and converts them to v1 plans, if the table is DataSourceV1Table

I don't think v2 plans should be converted to v1 plans. As we discussed in the last sync, v1 read and write interfaces can be called from v2, but v2 should always provide v2 behavior guarantees. Converting back to v1 plans is not the right way to add compatibility.

@cloud-fan
Contributor Author

CatalogTableAsV2 wraps a CatalogTable so that it can be converted to an UnresolvedCatalogRelation. Why is a second way to do this necessary?

This has nothing to do with CatalogTable. The DS v1 read/write API is BaseRelation.

Any catalog can return a table instance with this type, if it is public. Why does this require changing the TableCatalog interface?

Because the CTAS plan does not hold a Table instance. The TableCatalog needs to tell Spark that it falls back to DS v1.

rdblue commented Aug 5, 2019

I think the approach in #25348 is the right way to move forward because it calls the v1 API only when we can guarantee that the behavior is the same.


/**
* A special Data Source V2 `Table`, which doesn't need to implement the read/write capabilities.
* Spark will fallback the read/write requests to the v1 relation.
Contributor

consistent with upper case for wording V1 and V2?

_name = name
val map = options.asCaseSensitiveMap().asScala.toMap
// The `JDBCOptions` checks the existence of the table option. This is required by JDBC v1, but
// JDBC V2 only knows the table option when loading a table. Here we put a table option with a
Contributor

same upper case V1 and V2?

rdblue commented Aug 21, 2019

@cloud-fan, do you want to close this since #25348 was merged?

@cloud-fan
Contributor Author

This PR is for JDBC v2. I'll continue working on it after the v1 read fallback API is done.

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019