[SPARK-26811][SQL] Add capabilities to v2.Table #24012

Closed
wants to merge 4 commits

Conversation

rdblue
Contributor

@rdblue rdblue commented Mar 7, 2019

What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability` values. Capabilities are used by analysis checks, such as `V2WriteSupportCheck`, to fail queries when the table does not support an operation, like truncation.
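
For context, a minimal Scala sketch of the check side is below; the object and method names (and the v2 package location) are illustrative assumptions, not the actual V2WriteSupportCheck code:

```scala
import org.apache.spark.sql.sources.v2.{Table, TableCapability}

// Illustrative sketch only: fail a truncate-based write when the target
// table does not report the TRUNCATE capability. The real check is an
// analyzer rule over logical plans; this shows just the capability lookup.
object ExampleTruncateCheck {
  def validate(table: Table): Unit = {
    if (!table.capabilities().contains(TableCapability.TRUNCATE)) {
      throw new UnsupportedOperationException(
        s"Table ${table.name()} does not support truncate")
    }
  }
}
```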

How was this patch tested?

Existing tests cover regressions; a new analysis suite, `V2WriteSupportCheckSuite`, was added for the new capability checks.

@rdblue rdblue force-pushed the SPARK-26811-add-capabilities branch from 22f3953 to 23746e7 on March 7, 2019 20:12

@rdblue rdblue changed the title from "[SPARK-26811][SQL] Add capabilities to v2.Table." to "[SPARK-26811][SQL] Add capabilities to v2.Table" on Mar 7, 2019

@SparkQA

SparkQA commented Mar 8, 2019

Test build #103161 has finished for PR 24012 at commit 8636867.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Mar 8, 2019

@cloud-fan, can you take a look at this PR? It adds capabilities like we discussed.

@@ -25,7 +25,7 @@
* {@link #newScanBuilder(DataSourceOptions)} that is used to create a scan for batch, micro-batch,
* or continuous processing.
*/
-interface SupportsRead extends Table {
+public interface SupportsRead extends Table {
Contributor

shall we remove this interface as well? We can move newScanBuilder to Table and throw an exception by default. Tables that report a batch/stream scan capability should override newScanBuilder.
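
A rough sketch of that alternative, for illustration only (a Scala rendering of the Java interface; package locations and parameter types are assumptions based on the API at the time):

```scala
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.reader.ScanBuilder
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative sketch of the suggested shape: newScanBuilder lives on Table
// itself, throws by default, and is overridden only by tables that report a
// read capability.
trait Table {
  def name(): String
  def capabilities(): java.util.Set[TableCapability]
  def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    throw new UnsupportedOperationException(s"Table ${name()} does not support reads")
}
```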

Contributor Author

I like having this because it maintains separation between the read/write API and the catalog API. We could update the read and write API later or add a new one by adding a different read trait, without changing how catalogs and tables work. So I think it is worth keeping SupportsRead and SupportsWrite.
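
To illustrate how the pieces fit together under the design that was kept, here is a sketch of a read-only table; the class name, schema, and package locations are illustrative assumptions:

```scala
import java.util
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.v2.{SupportsRead, Table, TableCapability}
import org.apache.spark.sql.sources.v2.reader.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative sketch: a table opts into scans by mixing in SupportsRead and
// advertising the matching BATCH_READ capability via the new method.
class ExampleTable extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = new StructType().add("id", "long")
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    throw new UnsupportedOperationException("scan construction elided in this sketch")
}
```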

/**
* Returns the set of capabilities for this table.
*/
Set<TableCapability> capabilities();
Contributor

I don't think we will have tons of capabilities, so maybe an Array is good enough? An Array is also more Java/Scala friendly.

Contributor Author

I don't think it's a good idea to use an array when the storage should be a set, just to avoid calling asJava when returning it.

@rdblue rdblue force-pushed the SPARK-26811-add-capabilities branch from 8636867 to 0d44757 on March 13, 2019 21:05
@rdblue
Contributor Author

rdblue commented Mar 13, 2019

@cloud-fan, I've rebased to pick up the changes in master introduced by the move to CaseInsensitiveStringMap. I think this is ready to go.

Although I see your point with returning an Array of capabilities, I think it is better to return a Set. That's how Spark uses the data and I see no reason to use the wrong kind of storage -- which we would no doubt coerce to a set -- just to avoid calling asJava in a few places.
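
For reference, the conversion being discussed is a single wrapper call on the Scala side; this sketch assumes the standard JavaConverters import and a made-up enclosing object:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.TableCapability._

// Illustrative: keep the natural Scala Set internally and expose the
// java.util.Set required by the interface with one asJava call.
object ExampleCapabilities {
  private val scalaCapabilities: Set[TableCapability] =
    Set(BATCH_READ, BATCH_WRITE, TRUNCATE)

  val javaCapabilities: java.util.Set[TableCapability] = scalaCapabilities.asJava
}
```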

@SparkQA

SparkQA commented Mar 13, 2019

Test build #103455 has finished for PR 24012 at commit 0d44757.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

sounds good.

@rdblue
Contributor Author

rdblue commented Mar 14, 2019

Retest this please.

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103510 has finished for PR 24012 at commit 93c77f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * When possible, this method should return the schema of the given `files`. When the format
 * does not support inference, or no valid files are given, this method should return None. In
 * these cases Spark will require the user to specify the schema manually.
*/
def inferSchema(files: Seq[FileStatus]): Option[StructType]
}

object FileTable {
private lazy val CAPABILITIES = Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
Contributor

nit: this doesn't need to be a lazy val

Contributor Author

I'll fix this since I need to resolve conflicts.
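
For reference, the fixed declaration would simply drop `lazy` (a sketch; the import locations are assumptions):

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.TableCapability._

// Illustrative sketch of the fix: the set is cheap to build, so a plain val
// is enough; no lazy initialization needed.
object FileTable {
  private val CAPABILITIES: java.util.Set[TableCapability] =
    Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
}
```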

@rdblue rdblue force-pushed the SPARK-26811-add-capabilities branch from 93c77f5 to 69e729e on March 15, 2019 18:27
rdblue referenced this pull request Mar 15, 2019
## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383 when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time `FileIndex`es would always be created for file writes, so we needed the option to decide whether to check file existence.

After #23774, the option is not needed anymore, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@rdblue
Contributor Author

rdblue commented Mar 15, 2019

@cloud-fan, I've fixed the commit conflict caused by 6d22ee3. As I noted on that commit, please do not commit non-functional changes that cause unnecessary conflicts. That problem delayed getting this work in by another day.

I've also removed lazy from that capabilities val.

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103546 has finished for PR 24012 at commit 69e729e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Mar 16, 2019

@cloud-fan, tests are passing on this so it is ready for another look. Thank you!

@cloud-fan
Contributor

thanks, merging to master!

@rdblue
Contributor Author

rdblue commented Mar 18, 2019

Thank you for reviewing this, @cloud-fan!

@rdblue rdblue deleted the SPARK-26811-add-capabilities branch March 18, 2019 16:11
@rxin
Contributor

rxin commented Apr 15, 2019

Is there a plan documented for what the final API will look like? It's super confusing to have half of the capabilities via traits and half via enums.

@cloud-fan
Contributor

#24129 is adding streaming read/write capability. Eventually we should have all the capabilities via enum.
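
For reference, a sketch of how the capabilities split out once that follow-up lands; the streaming constant names below are assumptions about what apache#24129 adds and may not match the released enum exactly:

```scala
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.TableCapability._

// Illustrative grouping only: batch capabilities named in this PR plus the
// streaming capabilities expected from the follow-up (names are assumptions).
object CapabilityGroups {
  val batch: Set[TableCapability] = Set(BATCH_READ, BATCH_WRITE, TRUNCATE)
  val streaming: Set[TableCapability] =
    Set(MICRO_BATCH_READ, CONTINUOUS_READ, STREAMING_WRITE)
}
```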

cloud-fan added a commit that referenced this pull request Apr 26, 2019
## What changes were proposed in this pull request?

This is a followup of #24012 , to add the corresponding capabilities for streaming.

## How was this patch tested?

existing tests

Closes #24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
This adds a new method, `capabilities` to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, `V2WriteSupportCheck`, when the table does not support operations, like truncation.

Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks.

Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mccheah pushed a commit to palantir/spark that referenced this pull request May 24, 2019
## What changes were proposed in this pull request?

It's a followup of apache#24012, to fix 2 documentation issues:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?
N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
This is a followup of apache#24012 , to add the corresponding capabilities for streaming.

existing tests

Closes apache#24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>