
Conversation

@jackye1995
Contributor

Continuation of #2556; this will need to be rebased after that is merged.

Here I propose the following syntax:

ALTER TABLE table REPLACE IDENTIFIER FIELDS (field1, field2, ...)

-- drop all
ALTER TABLE table DROP IDENTIFIER FIELDS

IDENTIFIER and FIELDS are new keywords. Because we already have FIELD as a keyword, I think adding FIELDS will be beneficial for other batch operations in the future. It also makes the statement clearer than saying ALTER TABLE table REPLACE IDENTIFIER.

The reason for having a separate drop statement to drop all is that in SQL a ( ... ) clause typically has at least one element. We could do ALTER TABLE table REPLACE IDENTIFIER FIELDS () instead, but that feels awkward to me; I'm not sure how others feel about it.

package org.apache.iceberg.types;

import java.util.List;
import java.util.Locale;
Contributor Author

The contents of this file will be rebased out.

nonReserved
: ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
| DISTRIBUTED | LOCALLY | UNORDERED
| DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS
Contributor Author

Also added missing keywords that I forgot to add in #2365.

@jackye1995
Contributor Author

@openinx @rdblue @yyanyy could you take a look? Thanks!

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier REPLACE ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #replaceIdentifierFields
Contributor

Do we now have SET ROW_IDENTIFIER? If not, I'm wondering whether it would seem odd that a user can only "replace" even when the table doesn't have row identifiers yet.

Member

+1 for @yyanyy's comment.

Contributor Author

Yeah, SET makes sense to me. I was trying to reuse the REPLACE keyword, but it does feel awkward.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    iceberg.table.updateSchema()
      .setIdentifierFields(scala.collection.JavaConverters.seqAsJavaList(fields))
Contributor

I'm not familiar with this code base, so I'm curious: is it possible that the input fields is null here and we hit an NPE? If it's nullable and NPE-safe, do we want replace with an empty input to behave the same as drop?

Contributor Author

Yeah, that is what I was debating with myself. Technically we can do it, but I have never seen a (...) clause with no elements in SQL, so I am hesitant to support that.

If we go with the SET route, I can add an UNSET for removing fields. How do you feel about that?

import org.junit.Assert;
import org.junit.Test;

public class TestAlterTableSchema extends SparkExtensionsTestBase {
Contributor

great to see this!

@openinx
Member

openinx commented May 8, 2021

I would like to invite @aokolnychyi to review this patch !

@jackye1995
Contributor Author

@yyanyy I changed to use SET and UNSET instead of REPLACE and DROP, please see if that feels better, thanks!

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
Contributor

Should fields be a list of multipartIdentifier since we support nested fields?

nonReserved
: ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
| DISTRIBUTED | LOCALLY | UNORDERED
| DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS | SET | UNSET
Contributor

Looks like REPLACE is here, but should not be?

Contributor Author

We forgot to add REPLACE in an older PR, so I added it here.

| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
| ALTER TABLE multipartIdentifier UNSET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #unsetIdentifierFields
Contributor

What is the behavior of UNSET IDENTIFIER FIELDS? That isn't obvious to me so I'm wondering if we should change it.

Contributor

What about DROP instead of UNSET since this is dropping identifier fields from the set?

override protected def run(): Seq[InternalRow] = {
  catalog.loadTable(ident) match {
    case iceberg: SparkTable =>
      val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
Contributor

Nit: it's better to omit parens with Java APIs when they would be omitted in a Scala API. Since this is a getter, I would omit the parens on schema and identifierFieldNames.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
    fields.map(f => identifierFieldNames.remove(f))
Contributor

This should not assume that it is safe to modify identifierFieldNames. The class is free to return an immutable collection, and arguably should return one. We don't want to rely on the existing behavior.
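To make the defensive-copy point concrete, here is a minimal sketch using plain java.util collections (not the actual Iceberg API; the `drop` helper is a name made up here):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DropIdentifierFieldsSketch {
  // Hypothetical helper, not the PR's actual code: computes the identifier
  // field names left after a drop, without mutating the input collection.
  static Set<String> drop(Set<String> currentIdentifierFields, Set<String> toDrop) {
    // Copy first: a getter is free to return an immutable collection,
    // so we must not call remove() on it directly.
    Set<String> updated = new LinkedHashSet<>(currentIdentifierFields);
    updated.removeAll(toDrop);
    return updated;
  }

  public static void main(String[] args) {
    Set<String> current = Set.of("user_id", "region"); // Set.of is immutable
    Set<String> updated = drop(current, Set.of("region"));
    System.out.println(updated);        // only user_id remains
    System.out.println(current.size()); // 2: the original set is untouched
  }
}
```

Calling `remove` on the immutable `Set.of` result would throw `UnsupportedOperationException`, which is exactly the failure mode the comment warns about.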

@github-actions github-actions bot added the core label Jun 4, 2021
@jackye1995
Contributor Author

@rdblue updated based on the suggestions to use DROP. Also fixed a bug along the way regarding nested fields in Schema.

return identifierFieldIds()
    .stream()
    .map(id -> findField(id).name())
    .map(id -> lazyIdToName().get(id))
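The difference matters for nested fields: a leaf-only lookup loses the parent path, while an id-to-name index keeps the fully qualified name. A minimal illustration with made-up ids and plain java.util (not the actual Schema API):

```java
import java.util.Map;

public class QualifiedNameSketch {
  // Hypothetical id-to-name index, standing in for Schema's lazyIdToName():
  // each field id maps to its fully qualified, dotted name.
  static final Map<Integer, String> ID_TO_NAME = Map.of(
      1, "user_id",
      2, "location",
      3, "location.zip");

  // Leaf-name lookup, standing in for findField(id).name(): the parent path
  // is lost, so nested fields come back ambiguous.
  static String leafName(int id) {
    String full = ID_TO_NAME.get(id);
    int dot = full.lastIndexOf('.');
    return dot < 0 ? full : full.substring(dot + 1);
  }

  public static void main(String[] args) {
    System.out.println(leafName(3));       // zip
    System.out.println(ID_TO_NAME.get(3)); // location.zip
  }
}
```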
Contributor

Good catch. It took me a minute to realize what the difference was.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
    fields.map(f => identifierFieldNames.remove(f))
Contributor

@jackye1995, looks like you didn't see my earlier comment:

This should not assume that it is safe to modify identifierFieldNames. The Schema class is free to return an immutable collection, and arguably should return one. We don't want to rely on the existing behavior that it is modifiable.

Can you copy the identifier field names collection and modify the copy instead?

I think this should also validate that all of the fields listed in the DROP call are actually identifier fields. It would be an error if I ran DROP IDENTIFIER FIELDS (manager_id) but the rows are currently identified by user_id.
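A rough sketch of that validation, with a hypothetical helper rather than the PR's actual code (the error wording here is illustrative):

```java
import java.util.Set;

public class ValidateDropSketch {
  // Hypothetical validation helper: distinguishes "not a table field at all"
  // from "a field, but not currently an identifier field", so each case
  // gets its own error message.
  static void validateDrop(String name, Set<String> tableFields, Set<String> identifierFields) {
    if (!tableFields.contains(name)) {
      throw new IllegalArgumentException(
          String.format("Cannot drop identifier field: %s (not found in table)", name));
    }
    if (!identifierFields.contains(name)) {
      throw new IllegalArgumentException(
          String.format("Cannot drop identifier field: %s (not an identifier field)", name));
    }
  }

  public static void main(String[] args) {
    // Rows identified by user_id; dropping manager_id should fail clearly.
    try {
      validateDrop("manager_id", Set.of("user_id", "manager_id"), Set.of("user_id"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```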

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields
Contributor

You might consider allowing more flexible syntax here, like SET IDENTIFIER FIELD user_id since FIELD will make more sense to users. Should the parens be optional?

Could you move the fields to a fieldList rule so that we don't have two copies of it?
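One way to factor that out, sketched in the grammar's ANTLR style (the `fieldList` rule name is made up here, and the exact tokens would follow the surrounding grammar):

```
fieldList
    : fields+=multipartIdentifier (',' fields+=multipartIdentifier)*
    ;
```

Both the SET and DROP alternatives could then reference `fieldList`, and making the parens optional becomes a change in one place, e.g. `('(' fieldList ')' | fieldList)`.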

Contributor Author

I think we have the following options to choose from:

  • ADD IDENTIFIER FIELD
  • ADD IDENTIFIER FIELDS
  • SET IDENTIFIER FIELD
  • SET IDENTIFIER FIELDS
  • DROP IDENTIFIER FIELD
  • DROP IDENTIFIER FIELDS

I chose to implement the batch operations because that allows a single commit to add or drop multiple fields. Personally I don't feel it is hard for users to specify SET IDENTIFIER FIELDS (user_id), given that all Hive DDLs are in this batch format. Do you think it is worth also having a clause for single-field modifications in this case?

If we want a way to add new fields without needing to know the current fields, I can also add an ADD IDENTIFIER FIELDS.

Contributor

My comment was based on similar reasoning for other DDL. For example, it is better to support both ADD COLUMNS and ADD COLUMN interchangeably so that users don't get needless syntax errors when they use a plural or singular term with the "wrong" number of columns.

I think the problem with extending that to this case is that it isn't clear that SET IDENTIFIER FIELD x actually replaces all existing identifier fields, which I think is why you don't want the ADD IDENTIFIER FIELD syntax. If there is confusion now, there would be more confusion for users, so maybe it is better to only support SET IDENTIFIER FIELDS (...). Though I would make the parens optional.

. . . given that all Hive DDLs are in this batch format

I don't think that we want to take inspiration from Hive very much. Better to use Postgres or Trino as reference points.

If we want to have a way to only add new fields without the need to know current fields, I can also add a ADD IDENTIFIER FIELDS.

This wasn't my intent and I would not want to have confusion, so let's not allow using FIELD interchangeably.

Contributor Author

I would make the parens optional

Do you mean making the parens optional even when there are multiple fields, like:

SET IDENTIFIER FIELDS first_name, last_name

Or only for the case when there is a single field?

Contributor

Yeah, I would make it possible to use that syntax. There's no need for users to remember parens when we don't actually need them to parse correctly.

Contributor Author

Okay that makes sense. I will update with all the other comments, thanks!

}

@Test
public void testDropIdentifierFields() {
Contributor

It would be good to have a test for dropping an identifier field that isn't an identifier field, and for dropping a field that isn't a table field. The error messages should probably be different for those cases?

}

@Test
public void testSetIdentifierFields() {
Contributor

It would be good to have a test for using an identifier field that is not defined in the table.

@jackye1995
Contributor Author

@rdblue sorry, I was stuck with some internal work and did not get time to update this. The latest code addresses most of the comments.

I have been thinking about the syntax over the last few days. I think I was biased toward the Hive DDL syntax, but as you pointed out, if we prefer the other direction, it might be better to use a syntax like:

ALTER TABLE table WRITE IDENTIFIED BY first_name, last_name;

And we would not support any other syntax like add and drop.

This avoids the confusion, and it should be clear that we are replacing all the identifier fields. It is closer to the existing sort order syntax, and it does not need parentheses. What do you think?


for (name <- fields) {
  Preconditions.checkArgument(schema.findField(name) != null,
    "Cannot complete drop identifier fields operation: field %s not found", name)
Contributor

Nit: you can omit "complete" and "operation" to be more concise. "Cannot drop identifier field: %s (not found)" says basically the same thing.

@rdblue rdblue merged commit 1460743 into apache:master Jun 22, 2021
@rdblue
Contributor

rdblue commented Jun 22, 2021

@jackye1995, sorry, I didn't read your last comment before looking at the updates. I approved and merged this.

I like your idea to change this to WRITE IDENTIFIED BY. I'd like to hear what @aokolnychyi thinks about it. I'm happy with the current syntax as well.

@jackye1995
Contributor Author

Okay, let me create an issue to discuss that.

chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR solve?

Merges upstream/master to pull in recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing an empty dataset instead of erroring, apache#2960
> Flink: add uidPrefix to operators to make tracking multiple Iceberg sink jobs in the UI easier, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415

## How was this MR tested?

UT