Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Schema for a branch should return table schema #9131

Merged
merged 1 commit into from
Dec 5, 2023

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Nov 22, 2023

When retrieving the schema for branch we should always return the table schema instead of the snapshot schema. This is because the table schema is the schema that will be used when the branch will be created. We should only return the schema of the snapshot when we have a tag.
Below is an example that shows the weird schema behavior when describing a table.

-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;
spark-sql (default)> describe default.t;
s                       string                      --> this schema comes from top-level table metadata
spark-sql (default)> describe default.t.branch_b1;
s                       string                      --> this is the same schema, but comes from the snapshot with the record ('foo')
-- alter the table schema and now the definitions diverge
spark-sql (default)> alter table t add column i int;
spark-sql (default)> describe default.t;
s                       string
i                       int
spark-sql (default)> describe default.t.branch_b1;
s                       string
-- insert into the branch and the schema changes back to the top-level schmea
spark-sql (default)> insert into default.t.branch_b1 values ('bar');

spark-sql (default)> describe default.t.branch_b1;
s                       string
i                       int

.containsExactly(
new GenericRowWithSchema(new Object[] {1}, null),
new GenericRowWithSchema(new Object[] {2}, null),
new GenericRowWithSchema(new Object[] {3}, null));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use SimpleRecord like the rest of the tests do instead?

Copy link
Contributor Author

@nastra nastra Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately that doesn't work, because SimpleRecord expects the data field to be populated. The particular error is [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name data cannot be resolved. Did you mean one of the following? [id].

@nastra nastra force-pushed the branch-schema branch 2 times, most recently from d1a7ff8 to 761fdae Compare November 23, 2023 10:40
@@ -171,7 +171,7 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep
SparkTable sparkTable = (SparkTable) table;

Preconditions.checkArgument(
sparkTable.snapshotId() == null,
sparkTable.snapshotId() == null && sparkTable.branch() == null,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether we actually want to fix this as part of this PR or a separate PR, but in the Iceberg sync we briefly talked about making sure that SELECT * from ns.table.branch_x VERSION AS OF ... shouldn't be supported and should throw an error, which is what this check is doing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied this and moved this to #9219

@@ -173,6 +173,10 @@ public Long snapshotId() {
return snapshotId;
}

public String branch() {
return branch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't introduced by this commit, but branch should be final right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is effectively final as it's only set once. However, it's not marked as final due to the way the different constructors in SparkTable are called

.containsExactly(
new SimpleRecord(1, null), new SimpleRecord(2, null), new SimpleRecord(3, null));

// writing new records into the branch should work with the re-introduced column
Copy link
Contributor

@rdblue rdblue Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is an appropriate place for the write test. It should be a new test case because this case tests the schema that is used when reading.

In addition, the test case should test writing when the current snapshot for a branch has a different schema than the table schema. With the column added back, the schemas are the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this to a separate test and also used a different schema

When retrieving the schema for branch we should always return the table
schema instead of the snapshot schema. This is because the table schema
is the schema that will be used when the branch will be created.
We should only return the schema of the snapshot when we have a tag.
@nastra
Copy link
Contributor Author

nastra commented Dec 5, 2023

Thanks for reviewing this @rdblue, I've applied your feedback and also moved the VERSION AS OF handling to #9219

@nastra nastra merged commit a4d4756 into apache:main Dec 5, 2023
45 checks passed
@nastra nastra deleted the branch-schema branch December 5, 2023 11:03
lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
When retrieving the schema for branch we should always return the table
schema instead of the snapshot schema. This is because the table schema
is the schema that will be used when the branch will be created.
We should only return the schema of the snapshot when we have a tag.
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
When retrieving the schema for branch we should always return the table
schema instead of the snapshot schema. This is because the table schema
is the schema that will be used when the branch will be created.
We should only return the schema of the snapshot when we have a tag.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants