[SPARK-26811][SQL] Add capabilities to v2.Table #24012
Conversation
Force-pushed from 22f3953 to 23746e7.
Test build #103161 has finished for PR 24012 at commit ….
@cloud-fan, can you take a look at this PR? It adds capabilities like we discussed.
@@ -25,7 +25,7 @@
  * {@link #newScanBuilder(DataSourceOptions)} that is used to create a scan for batch, micro-batch,
  * or continuous processing.
  */
-interface SupportsRead extends Table {
+public interface SupportsRead extends Table {
Shall we remove this interface as well? We can move newScanBuilder to Table and throw an exception by default. Tables that report the batch/stream scan capability should override newScanBuilder.
I like having this because it maintains separation between the read/write API and the catalog API. We could update the read and write API later, or add a new one by adding a different read trait, without changing how catalogs and tables work. So I think it is worth keeping SupportsRead and SupportsWrite.
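To make the trade-off concrete, here is a minimal Scala sketch of the shape under discussion. These are simplified stand-ins, not the actual Spark interfaces (the real ones are Java interfaces, and newScanBuilder takes DataSourceOptions): the catalog-facing Table only describes itself and its capabilities, while reading stays a separate mix-in that can evolve independently.

```scala
import java.util.{Set => JSet}

// Simplified stand-ins for the v2 API types; signatures here are illustrative only.
trait ScanBuilder
trait TableCapability

// The catalog-facing side: a table describes itself and reports what it can do.
trait Table {
  def name(): String
  def capabilities(): JSet[TableCapability]
}

// Reading is a separate mix-in, so the read/write API can be updated or
// replaced by a new trait without changing how catalogs and tables work.
trait SupportsRead extends Table {
  def newScanBuilder(options: Map[String, String]): ScanBuilder
}
```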
/**
 * Returns the set of capabilities for this table.
 */
Set<TableCapability> capabilities();
I don't think we will have tons of capabilities; maybe Array is good enough? Array is also more Java/Scala friendly.
I don't think it's a good idea to use an array when the storage should be a set, just because it is necessary to call asJava when returning it.
Force-pushed from 8636867 to 0d44757.
@cloud-fan, I've rebased to pick up the changes in master introduced by the move to …. Although I see your point with returning an Array of capabilities, I think it is better to return a Set. That's how Spark uses the data, and I see no reason to use the wrong kind of storage -- which we would no doubt coerce to a set -- just to avoid calling asJava.
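To illustrate "that's how Spark uses the data": capabilities are consumed as membership tests, which is the natural fit for a Set. A hypothetical caller-side helper is sketched below; the package path follows the pre-rename sources.v2 layout and may differ by Spark version.

```scala
import org.apache.spark.sql.sources.v2.{Table, TableCapability}

object CapabilityChecks {
  // Membership test against the table's advertised capabilities; with a Set
  // this is a direct contains() call, with an Array it would be a linear scan.
  def supportsTruncate(table: Table): Boolean =
    table.capabilities().contains(TableCapability.TRUNCATE)
}
```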
Test build #103455 has finished for PR 24012 at commit ….
Sounds good.
Retest this please.
Test build #103510 has finished for PR 24012 at commit ….
/**
 * When possible, this method should return the schema of the given `files`. When the format
 * does not support inference, or no valid files are given, this should return None. In these
 * cases Spark will require that the user specify the schema manually.
 */
def inferSchema(files: Seq[FileStatus]): Option[StructType]
}

object FileTable {
  private lazy val CAPABILITIES = Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
nit: this doesn't need to be a lazy val.
I'll fix this since I need to resolve conflicts.
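For reference, a sketch of what the fixed companion object would look like with the lazy modifier dropped; the import paths are assumed here, matching the pre-rename sources.v2 layout.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability._

object FileTable {
  // A plain val is enough: the set is tiny, cheap to build, and always needed,
  // so deferring initialization with `lazy` buys nothing here.
  private val CAPABILITIES = Set(BATCH_READ, BATCH_WRITE, TRUNCATE).asJava
}
```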
Force-pushed from 93c77f5 to 69e729e.
## What changes were proposed in this pull request?

The data source option check_files_exist was introduced in #23383, when the file source V2 framework was implemented. In that PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time `FileIndex`es would always be created for file writes, so we needed the option to decide whether to check file existence. After #23774, the option is not needed anymore, since DataFrame writes won't create an unnecessary FileIndex. This PR removes the option.

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan, I've fixed the commit conflict caused by 6d22ee3. As I noted on that commit, please do not commit non-functional changes that cause unnecessary conflicts. That problem delayed getting this work in by another day. I've also removed ….
Test build #103546 has finished for PR 24012 at commit ….
@cloud-fan, tests are passing on this so it is ready for another look. Thank you!
Thanks, merging to master!
Thank you for reviewing this, @cloud-fan!
Is there a documented plan for what the final API will look like? It's super confusing to have half of the capabilities via traits and half via enums.
#24129 is adding the streaming read/write capabilities. Eventually we should have all the capabilities via the enum.
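For illustration, once those streaming values exist in the enum, a streaming-capable table would advertise them through the same capabilities() hook as the batch ones. This is a hedged sketch: the enum value names (MICRO_BATCH_READ, STREAMING_WRITE) follow #24129's direction and are assumptions here, as is the package path.

```scala
import java.util.{Set => JSet}
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.TableCapability
import org.apache.spark.sql.sources.v2.TableCapability._

// Sketch only: a table mix-in that reports micro-batch read and streaming
// write support through the same capability set used for batch operations.
trait StreamingCapable {
  def capabilities(): JSet[TableCapability] =
    Set[TableCapability](MICRO_BATCH_READ, STREAMING_WRITE).asJava
}
```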
## What changes were proposed in this pull request?

This is a followup of #24012, to add the corresponding capabilities for streaming.

## How was this patch tested?

Existing tests.

Closes #24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, such as `V2WriteSupportCheck`, when the table does not support operations like truncation. Existing tests cover regressions; a new analysis suite, `V2WriteSupportCheckSuite`, was added for the new capability checks.

Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

It's a followup of apache#24012, to fix two pieces of documentation:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?

N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is a followup of apache#24012, to add the corresponding capabilities for streaming. Existing tests.

Closes apache#24129 from cloud-fan/capability.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, such as `V2WriteSupportCheck`, when the table does not support operations like truncation.

How was this patch tested?

Existing tests cover regressions; a new analysis suite, `V2WriteSupportCheckSuite`, was added for the new capability checks.
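As a rough illustration of the analysis-time check the description refers to — simplified stand-ins rather than Spark's actual V2WriteSupportCheck rule — the idea is to reject a plan as soon as the operation it needs is missing from the table's capability set:

```scala
import java.util.{Set => JSet}

// Simplified stand-ins; the real types are the v2 Table interface and the
// TableCapability enum, and the real rule is V2WriteSupportCheck.
sealed trait Capability
case object BatchWrite extends Capability
case object Truncate extends Capability

trait Table {
  def name(): String
  def capabilities(): JSet[Capability]
}

object WriteSupportCheckSketch {
  // Fail during analysis, before anything executes, when the plan asks for
  // an operation the table does not advertise.
  def checkTruncate(table: Table): Unit = {
    if (!table.capabilities().contains(Truncate)) {
      throw new UnsupportedOperationException(
        s"Table ${table.name()} does not support truncate in batch mode.")
    }
  }
}
```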