[SPARK-26811][SQL] Add capabilities to v2.Table #24012
Conversation
Force-pushed from 22f3953 to 23746e7.
Test build #103161 has finished for PR 24012 at commit ….
@cloud-fan, can you take a look at this PR? It adds capabilities like we discussed.
@@ -25,7 +25,7 @@
  * {@link #newScanBuilder(DataSourceOptions)} that is used to create a scan for batch, micro-batch,
  * or continuous processing.
  */
-interface SupportsRead extends Table {
+public interface SupportsRead extends Table {
Shall we remove this interface as well? We can move newScanBuilder to Table and throw an exception by default. Tables that report the batch/stream scan capability should override newScanBuilder.
I like having this because it maintains separation between the read/write API and the catalog API. We could update the read and write API later, or add a new one by adding a different read trait, without changing how catalogs and tables work. So I think it is worth keeping SupportsRead and SupportsWrite.
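To make the trade-off concrete, here is a minimal Scala sketch of the shape under discussion. These are simplified stand-ins, not the actual Spark interfaces (the real ones are Java interfaces, and newScanBuilder takes DataSourceOptions): the catalog-facing Table only describes itself and its capabilities, while reading stays a separate mix-in that can evolve independently.

```scala
import java.util.{Set => JSet}

// Simplified stand-ins for the v2 API types; signatures here are illustrative only.
trait ScanBuilder
trait TableCapability

// The catalog-facing side: a table describes itself and reports what it can do.
trait Table {
  def name(): String
  def capabilities(): JSet[TableCapability]
}

// Reading is a separate mix-in, so the read/write API can be updated or
// replaced by a new trait without changing how catalogs and tables work.
trait SupportsRead extends Table {
  def newScanBuilder(options: Map[String, String]): ScanBuilder
}
```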
/**
 * Returns the set of capabilities for this table.
 */
Set<TableCapability> capabilities();
I don't think we will have tons of capabilities; maybe Array is good enough? Array is also more Java/Scala friendly.
I don't think it's a good idea to use an array when the storage should be a set, just because it is necessary to call asJava when returning it.
Force-pushed from 8636867 to 0d44757.
@cloud-fan, I've rebased to pick up the changes in master introduced by the move to …. Although I see your point with returning an Array of capabilities, I think it is better to return a Set. That's how Spark uses the data, and I see no reason to use the wrong kind of storage -- which we would no doubt coerce to a set -- just to avoid calling asJava.
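To illustrate "that's how Spark uses the data": capabilities are consumed as membership tests, which is the natural fit for a Set. A hypothetical caller-side helper is sketched below; the package path follows the pre-rename sources.v2 layout and may differ by Spark version.

```scala
import org.apache.spark.sql.sources.v2.{Table, TableCapability}

object CapabilityChecks {
  // Membership test against the table's advertised capabilities; with a Set
  // this is a direct contains() call, with an Array it would be a linear scan.
  def supportsTruncate(table: Table): Boolean =
    table.capabilities().contains(TableCapability.TRUNCATE)
}
```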
Test build #103455 has finished for PR 24012 at commit ….
Sounds good.
Retest this please.
Test build #103510 has finished for PR 24012 at commit ….
/**
 * When possible, this method should return the schema of the given `files`. When the format
 * does not support inference, or no valid files are given, this should return None. In these
 * cases Spark will require that the user specify the schema manually.
 */
def inferSchema(files: Seq[FileStatus]): Option[StructType]
}

object FileTable {
  private lazy val CAPABILITIES = Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
nit: this doesn't need to be a lazy val.
I'll fix this since I need to resolve conflicts.
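For reference, a sketch of what the fixed companion object would look like with the lazy modifier dropped; the import paths are assumed here, matching the pre-rename sources.v2 layout.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability._

object FileTable {
  // A plain val is enough: the set is tiny, cheap to build, and always needed,
  // so deferring initialization with `lazy` buys nothing here.
  private val CAPABILITIES = Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
}
```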
Force-pushed from 93c77f5 to 69e729e.
## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383, when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time `FileIndex`es would always be created for file writes, so we needed the option to decide whether to check file existence. After #23774, the option is not needed anymore, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan, I've fixed the commit conflict caused by 6d22ee3. As I noted on that commit, please do not commit non-functional changes that cause unnecessary conflicts. That problem delayed getting this work in by another day. I've also removed ….
Test build #103546 has finished for PR 24012 at commit ….
@cloud-fan, tests are passing on this so it is ready for another look. Thank you!
Thanks, merging to master!
Thank you for reviewing this, @cloud-fan!
Is there a documented plan for what the final API will look like? It's super confusing to have half of the capabilities via traits and half via enums.
#24129 is adding the streaming read/write capabilities. Eventually we should have all the capabilities via the enum.
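For illustration, once those streaming values exist in the enum, a streaming-capable table would advertise them through the same capabilities() hook as the batch ones. This is a hedged sketch: the enum value names (MICRO_BATCH_READ, STREAMING_WRITE) follow #24129's direction and are assumptions here, as is the package path.

```scala
import java.util.{Set => JSet}
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.TableCapability._

// Sketch only: a table mix-in that reports micro-batch read and streaming
// write support through the same capability set used for batch operations.
trait StreamingCapable {
  def capabilities(): JSet[TableCapability] =
    Set[TableCapability](MICRO_BATCH_READ, STREAMING_WRITE).asJava
}
```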
## What changes were proposed in this pull request?

This is a followup of #24012, to add the corresponding capabilities for streaming.

## How was this patch tested?

Existing tests.

Closes #24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, such as `V2WriteSupportCheck`, when the table does not support operations like truncation. Existing tests cover regressions; a new analysis suite, `V2WriteSupportCheckSuite`, was added for the new capability checks.

Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

It's a followup of apache#24012, to fix two pieces of documentation:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?

N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is a followup of apache#24012, to add the corresponding capabilities for streaming. Existing tests.

Closes apache#24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, such as `V2WriteSupportCheck`, when the table does not support operations like truncation.

How was this patch tested?

Existing tests cover regressions; a new analysis suite, `V2WriteSupportCheckSuite`, was added for the new capability checks.
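As a rough illustration of the analysis-time check the description refers to — simplified stand-ins rather than Spark's actual V2WriteSupportCheck rule — the idea is to reject a plan as soon as the operation it needs is missing from the table's capability set:

```scala
import java.util.{Set => JSet}

// Simplified stand-ins; the real types are the v2 Table interface and the
// TableCapability enum, and the real rule is V2WriteSupportCheck.
sealed trait Capability
case object BatchWrite extends Capability
case object Truncate extends Capability

trait Table {
  def name(): String
  def capabilities(): JSet[Capability]
}

object WriteSupportCheckSketch {
  // Fail during analysis, before anything executes, when the plan asks for
  // an operation the table does not advertise.
  def checkTruncate(table: Table): Unit = {
    if (!table.capabilities().contains(Truncate)) {
      throw new UnsupportedOperationException(
        s"Table ${table.name()} does not support truncate in batch mode.")
    }
  }
}
```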