
Conversation

@allisonport-db
Contributor

@allisonport-db allisonport-db commented Nov 28, 2022

What changes were proposed in this pull request?

Enables creating generated columns in CREATE/REPLACE TABLE statements by specifying a generation expression for a column with GENERATED ALWAYS AS expr.

For example the following will be supported:

CREATE TABLE default.example (
    time TIMESTAMP,
    date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
)

To be more specific, this PR:

  1. Adds parser support for GENERATED ALWAYS AS expr in create/replace table statements. Generation expressions are temporarily stored in the field's metadata, then parsed/verified in DataSourceV2Strategy and used to instantiate the v2 Column.
  2. Adds TableCatalog::capabilities() and TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS. This capability determines whether specifying generation expressions is allowed or an exception is thrown (a sketch of a catalog opting in follows this list).
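
For a connector, opting in could look roughly like the sketch below. This is a minimal sketch, not code from this PR: it assumes capabilities() returns a java.util.Set of TableCatalogCapability, reuses Spark's DelegatingCatalogExtension as a base, and MyCatalog is a hypothetical name.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, TableCatalogCapability}

// Hypothetical catalog that advertises support for generated columns in
// CREATE/REPLACE TABLE by reporting the new capability.
class MyCatalog extends DelegatingCatalogExtension {
  override def capabilities(): util.Set[TableCatalogCapability] = {
    // Without this capability, GENERATED ALWAYS AS in CREATE/REPLACE TABLE
    // statements is rejected with the AnalysisException shown under "New behavior".
    util.EnumSet.of(TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS)
  }
}
```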

Why are the changes needed?

GENERATED ALWAYS AS is part of the SQL standard. These changes allow defining generated columns in CREATE/REPLACE TABLE statements in Spark SQL.

Does this PR introduce any user-facing change?

Using GENERATED ALWAYS AS expr in CREATE/REPLACE TABLE statements will no longer throw a parsing error. When used with a supporting table catalog, the query proceeds; when used with a non-supporting catalog, an AnalysisException is thrown.

Previous behavior:

spark-sql> CREATE TABLE default.example (
         >     time TIMESTAMP,
         >     date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
         > )
         > ;
Error in query: 
Syntax error at or near 'GENERATED'(line 3, pos 14)

== SQL ==
CREATE TABLE default.example (
    time TIMESTAMP,
    date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
--------------^^^
)

New behavior:

AnalysisException: [UNSUPPORTED_FEATURE.TABLE_OPERATION] The feature is not supported: Table `my_tab` does not 
support creating generated columns with GENERATED ALWAYS AS expressions. Please check the current catalog and 
namespace to make sure the qualified table name is expected, and also check the catalog implementation which is 
configured by "spark.sql.catalog".

How was this patch tested?

Adds unit tests

@github-actions github-actions bot added the CORE label Nov 28, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@allisonport-db allisonport-db changed the title [SPARK-41290] Support defining generated columns in create table (support GENERATED ALWAYS AS syntax) [SPARK-41290] Support GENERATED ALWAYS AS expressions in create/replace table statements Nov 30, 2022
@allisonport-db allisonport-db changed the title [SPARK-41290] Support GENERATED ALWAYS AS expressions in create/replace table statements [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements Nov 30, 2022
@allisonport-db allisonport-db marked this pull request as ready for review November 30, 2022 07:46
@cloud-fan
Contributor

Does the SQL standard say anything about restrictions on the generation expression? Can we allow GENERATED AS rand()? I think at least Catalyst should have a rule to check the expression and make sure its data type is the same as the column type.

@cloud-fan
Contributor

also cc @huaxingao @sunchao

*
* Override this method to allow defining generated columns in create/replace table statements.
*/
default boolean supportsGeneratedColumnsOnCreation() {
Member

We can just call this method supportsGeneratedColumns, so that we don't need to add a new API if we want to extend generated columns to other statements in the future.

Contributor Author

Couldn't this be an issue, though, if we add the syntax to another statement (for example, ALTER TABLE ADD COLUMN) and the catalog/data source (whatever we land on) hasn't updated its implementation yet to accommodate it?

@github-actions github-actions bot removed the CORE label Dec 15, 2022
@allisonport-db
Contributor Author

Does the SQL standard say anything about restrictions on the generation expression? Can we allow GENERATED AS rand()? I think at least Catalyst should have a rule to check the expression and make sure its data type is the same as the column type.

Checking the data type makes sense.

Other restrictions we can enforce (a rough sketch of such checks follows this list):

  • no UDFs
  • only deterministic functions
  • no subqueries
  • no window functions
  • no aggregate functions
  • no generator functions
  • no references to other generated columns (no interdependence)
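
For illustration, here is a rough Scala sketch of the kind of checks such a Catalyst rule could run on a resolved generation expression. This is not the rule added by this PR, and the helper name checkGenerationExpression is hypothetical; the UDF and cross-column checks are omitted.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.types.DataType

// Rough sketch: reject a resolved generation expression that violates the
// restrictions listed above, given the declared column type.
def checkGenerationExpression(expr: Expression, columnType: DataType): Unit = {
  require(expr.deterministic, "generation expression must be deterministic (rules out rand())")
  require(!SubqueryExpression.hasSubquery(expr), "subqueries are not allowed")
  require(!expr.exists(_.isInstanceOf[AggregateExpression]), "aggregate functions are not allowed")
  require(!expr.exists(_.isInstanceOf[WindowExpression]), "window functions are not allowed")
  require(!expr.exists(_.isInstanceOf[Generator]), "generator functions are not allowed")
  require(expr.dataType == columnType, "expression data type must match the declared column type")
}
```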

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for the update. Only one question about the error method remains.
+1, LGTM.

Comment on lines +140 to +144
if (GeneratedColumn.hasGeneratedColumns(newSchema)) {
  throw QueryCompilationErrors.generatedColumnsUnsupported(
    Seq(tableDesc.identifier.catalog.get, tableDesc.identifier.database.get,
      tableDesc.identifier.table))
}
Member

Could this be more specific, since it is a V1 table?

Contributor Author

I can add a new error class for it if you think it's necessary?

@viirya
Member

viirya commented Feb 24, 2023

Looks good for backport.

@cloud-fan
Contributor

Are we going to document this new SQL syntax somewhere?

This is a long-standing problem. There are quite a few features that only have a DS v2 API but no built-in implementation, e.g. UPDATE, MERGE, DELETE. I think we should make the testing in-memory catalog/table a built-in connector, so that we can document these SQL syntaxes with valid examples. cc @huaxingao

Contributor

@dtenedor dtenedor left a comment

Thanks for adding the extra test cases! If we're banning subquery expressions, the validation checks become much simpler now.

@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in bd6b751 Feb 27, 2023
cloud-fan pushed a commit that referenced this pull request Feb 27, 2023
…s in create/replace table statements

Closes #38823 from allisonport-db/parser-support.

Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bd6b751)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…s in create/replace table statements

yaooqinn pushed a commit that referenced this pull request Jul 6, 2023
…ere is a char/varchar column in the schema

### What changes were proposed in this pull request?

#38823 added support for defining generated columns in CREATE TABLE statements, including generation expression validation. This validation currently fails erroneously when there are char or varchar columns anywhere in the table schema, since checkAnalysis fails here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala#L123.

This PR replaces any char/varchar columns in the schema with a string before analysis.
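
A minimal sketch of the idea (not the exact code in this PR), assuming Spark's existing CharVarcharUtils.replaceCharVarcharWithString helper; the function name normalizeSchemaForAnalysis is hypothetical.

```scala
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types.StructType

// Replace CHAR/VARCHAR column types with STRING so that analyzing the
// generation expression does not trip checkAnalysis on char/varchar columns.
def normalizeSchemaForAnalysis(schema: StructType): StructType = {
  StructType(schema.fields.map { field =>
    field.copy(dataType = CharVarcharUtils.replaceCharVarcharWithString(field.dataType))
  })
}
```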

### Why are the changes needed?

The following statement should not fail:
```
CREATE TABLE default.example (
    name VARCHAR(64),
    tstamp TIMESTAMP,
    tstamp_date DATE GENERATED ALWAYS AS (CAST(tstamp as DATE))
)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adds a unit test.

Closes #41868 from allisonport-db/validateGeneratedColumns-charvarchar.

Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Jul 6, 2023
…ere is a char/varchar column in the schema

viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…ere is a char/varchar column in the schema
