[SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements #38823
Conversation
Can one of the admins verify this patch?
Does the SQL standard say anything about the restrictions of the generation expression? Can we allow ...
also cc @huaxingao @sunchao
 *
 * Override this method to allow defining generated columns in create/replace table statements.
 */
default boolean supportsGeneratedColumnsOnCreation() {
We can just call this method supportsGeneratedColumns, so that we don't need to add a new API if we want to extend generated columns to other statements in the future.
Couldn't this be an issue, though, if we add the syntax to another statement (for example, ALTER TABLE ADD COLUMN) and the catalog/data source (whatever we land on) hasn't updated its implementation yet to accommodate it?
Checking the data type makes sense. Other restrictions we can enforce ...
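For illustration only, the data-type restriction being discussed could boil down to something like the following once the generation expression has been analyzed (a sketch with a hypothetical helper name, not the PR's actual code):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Sketch only: verify that the analyzed generation expression produces the
// declared column type. A real implementation might also accept safe upcasts.
def matchesDeclaredType(analyzedExpr: Expression, declaredType: DataType): Boolean =
  analyzedExpr.dataType == declaredType
```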
dongjoon-hyun left a comment:
Thank you for the update. Only one question about the error method remains.
+1, LGTM.
if (GeneratedColumn.hasGeneratedColumns(newSchema)) {
  throw QueryCompilationErrors.generatedColumnsUnsupported(
    Seq(tableDesc.identifier.catalog.get, tableDesc.identifier.database.get,
      tableDesc.identifier.table))
}
Could this be more specific, since it is a V1 table?
I can add a new error class for it if you think it's necessary.
Looks good for backport.
This is a long-standing problem. There are quite a few features that only have a DS v2 API but no implementation, e.g. UPDATE, MERGE, DELETE. I think we should make the testing in-memory catalog/table a built-in connector, so that we can document these SQL syntaxes with valid examples. cc @huaxingao
dtenedor left a comment:
Thanks for adding the extra test cases! If we're banning subquery expressions, the validation checks become much simpler now.
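As a rough illustration of why banning subqueries simplifies validation, the check can be a single scan over the analyzed expression tree (a sketch with a hypothetical helper name; not the actual code in GeneratedColumn.scala):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// Sketch only: reject generation expressions that contain any subquery.
def containsSubquery(expr: Expression): Boolean =
  expr.collectFirst { case _: SubqueryExpression => () }.isDefined
```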
thanks, merging to master/3.4!
…s in create/replace table statements
### What changes were proposed in this pull request?
Enables creating generated columns in CREATE/REPLACE TABLE statements by specifying a generation expression for a column with `GENERATED ALWAYS AS expr`.
For example, the following will be supported:
```sql
CREATE TABLE default.example (
time TIMESTAMP,
date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
)
```
To be more specific, this PR:
1. Adds parser support for `GENERATED ALWAYS AS expr` in create/replace table statements. Generation expressions are temporarily stored in the field's metadata, and then are parsed/verified in `DataSourceV2Strategy` and used to instantiate v2 [Column](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java).
2. Adds `TableCatalog::capabilities()` and `TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS`. This will be used to determine whether to allow specifying generation expressions or whether to throw an exception (see the catalog sketch below).
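A minimal sketch of what opting in could look like on the catalog side, assuming the `capabilities()` method and capability constant described above; the catalog class itself is hypothetical and other `TableCatalog` methods are omitted:

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{TableCatalog, TableCatalogCapability}

// Sketch only: a catalog advertising support for generated columns in
// CREATE/REPLACE TABLE via capabilities().
abstract class MyGeneratedColumnCatalog extends TableCatalog {
  override def capabilities(): util.Set[TableCatalogCapability] =
    util.EnumSet.of(TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS)
}
```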
### Why are the changes needed?
`GENERATED ALWAYS AS` is part of the SQL standard. These changes will allow defining generated columns in create/replace table statements in Spark SQL.
### Does this PR introduce _any_ user-facing change?
Using `GENERATED ALWAYS AS expr` in CREATE/REPLACE TABLE statements will no longer throw a parsing error. When used with a table catalog that supports it, the query should proceed; when used with a catalog that does not, an analysis exception will be thrown.
Previous behavior:
```
spark-sql> CREATE TABLE default.example (
> time TIMESTAMP,
> date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
> )
> ;
Error in query:
Syntax error at or near 'GENERATED'(line 3, pos 14)
== SQL ==
CREATE TABLE default.example (
time TIMESTAMP,
date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
--------------^^^
)
```
New behavior:
```
AnalysisException: [UNSUPPORTED_FEATURE.TABLE_OPERATION] The feature is not supported: Table `my_tab` does not
support creating generated columns with GENERATED ALWAYS AS expressions. Please check the current catalog and
namespace to make sure the qualified table name is expected, and also check the catalog implementation which is
configured by "spark.sql.catalog".
```
### How was this patch tested?
Adds unit tests
Closes #38823 from allisonport-db/parser-support.
Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bd6b751)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ere is a char/varchar column in the schema

### What changes were proposed in this pull request?
#38823 added support for defining generated columns in create table statements. This included generation expression validation. This validation currently erroneously fails when there are char or varchar columns anywhere in the table schema since the checkAnalysis fails here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala#L123. This PR replaces any char/varchar columns in the schema with a string before analysis.

### Why are the changes needed?
This should not fail.
```
CREATE TABLE default.example (
  name VARCHAR(64),
  tstamp TIMESTAMP,
  tstamp_date DATE GENERATED ALWAYS AS (CAST(tstamp as DATE))
)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Adds a unit test.

Closes #41868 from allisonport-db/validateGeneratedColumns-charvarchar.

Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
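A minimal sketch of the replace-char/varchar-with-string idea described above; this is standalone illustration code, not the PR's implementation (which may rely on Spark's internal utilities), and it ignores nested struct/array/map types:

```scala
import org.apache.spark.sql.types.{CharType, StringType, StructType, VarcharType}

// Sketch only: map top-level CHAR/VARCHAR fields in a schema to STRING so that
// analyzing the generation expression against this schema does not trip the
// char/varchar length checks.
def replaceCharVarcharWithString(schema: StructType): StructType =
  StructType(schema.fields.map { field =>
    field.dataType match {
      case _: CharType | _: VarcharType => field.copy(dataType = StringType)
      case _ => field
    }
  })
```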