
Conversation

@jackye1995
Contributor

Continuation of #2556; this will need to be rebased after that is merged.

Here I propose the following syntax:

ALTER TABLE table REPLACE IDENTIFIER FIELDS (field1, field2, ...)

-- drop all
ALTER TABLE table DROP IDENTIFIER FIELDS

IDENTIFIER and FIELDS are new keywords. Because we already have FIELD as a keyword, I think adding FIELDS will be beneficial for other batch operations in the future. It also makes the statement clearer than saying ALTER TABLE table REPLACE IDENTIFIER.

The reason for having a separate drop statement to drop all is that in SQL a ( ... ) clause typically has at least one element. We could do ALTER TABLE table REPLACE IDENTIFIER FIELDS () instead, but that feels awkward to me; I'm not sure how others feel about it.

package org.apache.iceberg.types;

import java.util.List;
import java.util.Locale;
Contributor Author

The contents of this file will be rebased out.

nonReserved
: ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
| DISTRIBUTED | LOCALLY | UNORDERED
| DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS
Contributor Author

Also added missing keywords that I forgot to add in #2365.

@jackye1995
Contributor Author

@openinx @rdblue @yyanyy could you take a look? Thanks!

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier REPLACE ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #replaceIdentifierFields
Contributor

Do we now have SET ROW_IDENTIFIER? If not, I'm wondering whether it would seem odd that a user can only "replace" even when the table doesn't have row identifiers yet.

Member

+1 for @yyanyy's comment.

Contributor Author

Yeah, SET makes sense to me. I was trying to reuse the REPLACE keyword, but it does feel awkward.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    iceberg.table.updateSchema()
      .setIdentifierFields(scala.collection.JavaConverters.seqAsJavaList(fields))
Contributor

I'm not familiar with this code base, so I'm curious: is it possible that the input fields is null here and we hit an NPE? If it's nullable and NPE-safe, do we want replace with an empty input to behave the same as drop?

Contributor Author

Yeah, that is what I was debating with myself. Technically we can do it, but I have never seen a (...) clause with no elements in SQL, so I am hesitant to support that.

If we go with the SET route, I can add an UNSET for removing fields. How do you feel about that?

import org.junit.Assert;
import org.junit.Test;

public class TestAlterTableSchema extends SparkExtensionsTestBase {
Contributor

great to see this!

@openinx
Member

openinx commented May 8, 2021

I would like to invite @aokolnychyi to review this patch !

@jackye1995
Contributor Author

@yyanyy I changed to use SET and UNSET instead of REPLACE and DROP, please see if that feels better, thanks!

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
Contributor

Should fields be a list of multipartIdentifier since we support nested fields?

nonReserved
: ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
| DISTRIBUTED | LOCALLY | UNORDERED
| DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS | SET | UNSET
Contributor

Looks like REPLACE is here, but should not be?

Contributor Author

We forgot to add REPLACE in an older PR, so I added it here.

| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
| ALTER TABLE multipartIdentifier UNSET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #unsetIdentifierFields
Contributor

What is the behavior of UNSET IDENTIFIER FIELDS? That isn't obvious to me so I'm wondering if we should change it.

Contributor

What about DROP instead of UNSET since this is dropping identifier fields from the set?

override protected def run(): Seq[InternalRow] = {
  catalog.loadTable(ident) match {
    case iceberg: SparkTable =>
      val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
Contributor

Nit: it's better to omit parens with Java APIs when they would be omitted in a Scala API. Since this is a getter, I would omit the parens on schema and identifierFieldNames.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
    fields.map(f => identifierFieldNames.remove(f))
Contributor

This should not assume that it is safe to modify identifierFieldNames. The class is free to return an immutable collection, and arguably should return one. We don't want to rely on the existing behavior.
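To make the defensive-copy point concrete, here is a minimal sketch using plain java.util collections (not the actual Iceberg API; the `drop` helper is a name made up here):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DropIdentifierFieldsSketch {
  // Hypothetical helper, not the PR's actual code: computes the identifier
  // field names left after a drop, without mutating the input collection.
  static Set<String> drop(Set<String> currentIdentifierFields, Set<String> toDrop) {
    // Copy first: a getter is free to return an immutable collection,
    // so we must not call remove() on it directly.
    Set<String> updated = new LinkedHashSet<>(currentIdentifierFields);
    updated.removeAll(toDrop);
    return updated;
  }

  public static void main(String[] args) {
    Set<String> current = Set.of("user_id", "region"); // Set.of is immutable
    Set<String> updated = drop(current, Set.of("region"));
    System.out.println(updated);        // only user_id remains
    System.out.println(current.size()); // 2: the original set is untouched
  }
}
```

Calling `remove` on the immutable `Set.of` result would throw `UnsupportedOperationException`, which is exactly the failure mode the comment warns about.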

@github-actions github-actions bot added the core label Jun 4, 2021
@jackye1995
Contributor Author

@rdblue updated based on the suggestions to use DROP. Also fixed a bug along the way regarding nested fields in Schema.

return identifierFieldIds()
    .stream()
    .map(id -> findField(id).name())
    .map(id -> lazyIdToName().get(id))
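The difference matters for nested fields: a leaf-only lookup loses the parent path, while an id-to-name index keeps the fully qualified name. A minimal illustration with made-up ids and plain java.util (not the actual Schema API):

```java
import java.util.Map;

public class QualifiedNameSketch {
  // Hypothetical id-to-name index, standing in for Schema's lazyIdToName():
  // each field id maps to its fully qualified, dotted name.
  static final Map<Integer, String> ID_TO_NAME = Map.of(
      1, "user_id",
      2, "location",
      3, "location.zip");

  // Leaf-name lookup, standing in for findField(id).name(): the parent path
  // is lost, so nested fields come back ambiguous.
  static String leafName(int id) {
    String full = ID_TO_NAME.get(id);
    int dot = full.lastIndexOf('.');
    return dot < 0 ? full : full.substring(dot + 1);
  }

  public static void main(String[] args) {
    System.out.println(leafName(3));       // zip
    System.out.println(ID_TO_NAME.get(3)); // location.zip
  }
}
```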
Contributor

Good catch. It took me a minute to realize what the difference was.

catalog.loadTable(ident) match {
  case iceberg: SparkTable =>
    val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
    fields.map(f => identifierFieldNames.remove(f))
Contributor

@jackye1995, looks like you didn't see my earlier comment:

This should not assume that it is safe to modify identifierFieldNames. The Schema class is free to return an immutable collection, and arguably should return one. We don't want to rely on the existing behavior that it is modifiable.

Can you copy the identifier field names collection and modify the copy instead?

I think this should also validate that all of the fields listed in the DROP call are actually identifier fields. It would be an error if I ran DROP IDENTIFIER FIELDS (manager_id) but the rows are currently identified by user_id.
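A rough sketch of that validation, with a hypothetical helper rather than the PR's actual code (the error wording here is illustrative):

```java
import java.util.Set;

public class ValidateDropSketch {
  // Hypothetical validation helper: distinguishes "not a table field at all"
  // from "a field, but not currently an identifier field", so each case
  // gets its own error message.
  static void validateDrop(String name, Set<String> tableFields, Set<String> identifierFields) {
    if (!tableFields.contains(name)) {
      throw new IllegalArgumentException(
          String.format("Cannot drop identifier field: %s (not found in table)", name));
    }
    if (!identifierFields.contains(name)) {
      throw new IllegalArgumentException(
          String.format("Cannot drop identifier field: %s (not an identifier field)", name));
    }
  }

  public static void main(String[] args) {
    // Rows identified by user_id; dropping manager_id should fail clearly.
    try {
      validateDrop("manager_id", Set.of("user_id", "manager_id"), Set.of("user_id"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```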

| ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform #dropPartitionField
| ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
| ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
| ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields
Contributor

You might consider allowing more flexible syntax here, like SET IDENTIFIER FIELD user_id since FIELD will make more sense to users. Should the parens be optional?

Could you move the fields to a fieldList rule so that we don't have two copies of it?
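One way to factor that out, sketched in the grammar's ANTLR style (the `fieldList` rule name is made up here, and the exact tokens would follow the surrounding grammar):

```
fieldList
    : fields+=multipartIdentifier (',' fields+=multipartIdentifier)*
    ;
```

Both the SET and DROP alternatives could then reference `fieldList`, and making the parens optional becomes a change in one place, e.g. `('(' fieldList ')' | fieldList)`.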

Contributor Author

I think we have the following options to choose from:

  • ADD IDENTIFIER FIELD
  • ADD IDENTIFIER FIELDS
  • SET IDENTIFIER FIELD
  • SET IDENTIFIER FIELDS
  • DROP IDENTIFIER FIELD
  • DROP IDENTIFIER FIELDS

I chose to implement the batch operations because that allows a single commit to add or drop multiple fields. Personally I don't feel it is hard for users to specify SET IDENTIFIER FIELDS (user_id), given that all Hive DDLs are in this batch format. Do you think it is worth also having a clause for single-field modifications in this case?

If we want a way to add new fields without needing to know the current fields, I can also add an ADD IDENTIFIER FIELDS.

Contributor

My comment was based on similar reasoning for other DDL. For example, it is better to support both ADD COLUMNS and ADD COLUMN interchangeably so that users don't get needless syntax errors when they use a plural or singular term with the "wrong" number of columns.

I think the problem with extending that to this case is that it isn't clear that SET IDENTIFIER FIELD x actually replaces all existing identifier fields, which I think is why you don't want the ADD IDENTIFIER FIELD syntax. If there is confusion now, there would be more confusion for users, so maybe it is better to only support SET IDENTIFIER FIELDS (...). Though I would make the parens optional.

. . . given that all Hive DDLs are in this batch format

I don't think that we want to take inspiration from Hive very much. Better to use Postgres or Trino as reference points.

If we want to have a way to only add new fields without the need to know current fields, I can also add a ADD IDENTIFIER FIELDS.

This wasn't my intent and I would not want to have confusion, so let's not allow using FIELD interchangeably.

Contributor Author

I would make the parens optional

Do you mean making the parens optional even when there are multiple fields, like:

SET IDENTIFIER FIELDS first_name, last_name

Or only for the case when there is a single field?

Contributor

Yeah, I would make it possible to use that syntax. There's no need for users to remember parens when we don't actually need them to parse correctly.

Contributor Author

Okay that makes sense. I will update with all the other comments, thanks!

}

@Test
public void testDropIdentifierFields() {
Contributor

It would be good to have a test for dropping an identifier field that isn't an identifier field, and for dropping a field that isn't a table field. The error messages should probably be different for those cases?

}

@Test
public void testSetIdentifierFields() {
Contributor

It would be good to have a test for using an identifier field that is not defined in the table.

@jackye1995
Contributor Author

@rdblue sorry, I was stuck with some internal work and did not get time to update this. The latest code addresses most of the comments.

I have been thinking about the syntax over the last few days. I think I was biased toward the Hive DDL syntax, but as you pointed out, if we prefer the other direction, it might be better to use a syntax like:

ALTER TABLE table WRITE IDENTIFIED BY first_name, last_name;

And we would not support any other syntax like add and drop.

This avoids the confusion, and it should be clear that we are replacing all the identifier fields. It is closer to the existing sort order syntax, and it does not need parentheses. What do you think?


for (name <- fields) {
  Preconditions.checkArgument(schema.findField(name) != null,
    "Cannot complete drop identifier fields operation: field %s not found", name)
Contributor

Nit: you can omit "complete" and "operation" to be more concise. "Cannot drop identifier field: %s (not found)" says basically the same thing.

@rdblue rdblue merged commit 1460743 into apache:master Jun 22, 2021
@rdblue
Contributor

rdblue commented Jun 22, 2021

@jackye1995, sorry, I didn't read your last comment before looking at the updates. I approved and merged this.

I like your idea to change this to WRITE IDENTIFIED BY. I'd like to hear what @aokolnychyi thinks about it. I'm happy with the current syntax as well.

@jackye1995
Contributor Author

Okay, let me create an issue to discuss that.

chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR solve?

Merges upstream/master to pull in recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing an empty dataset instead of erroring, apache#2960
> Flink: add uidPrefix to operators to make tracking multiple Iceberg sink jobs in the UI easier, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415

## How was this MR tested?

UT