Core: Schema for a branch should return table schema #9131

nastra · 2023-11-22T14:13:54Z

When retrieving the schema for a branch, we should always return the table schema instead of the snapshot schema, because the table schema is the schema that will be used when the branch is created. We should only return the snapshot's schema when we have a tag.
Below is an example that shows the weird schema behavior when describing a table.

-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;

spark-sql (default)> describe default.t;
s                       string                      --> this schema comes from top-level table metadata
spark-sql (default)> describe default.t.branch_b1;
s                       string                      --> this is the same schema, but comes from the snapshot with the record ('foo')
-- alter the table schema and now the definitions diverge
spark-sql (default)> alter table t add column i int;

spark-sql (default)> describe default.t;
s                       string
i                       int
spark-sql (default)> describe default.t.branch_b1;
s                       string

-- insert into the branch and the schema changes back to the top-level schema
spark-sql (default)> insert into default.t.branch_b1 values ('bar');

spark-sql (default)> describe default.t.branch_b1;
s                       string
i                       int
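
The rule in the description (table schema for a branch, snapshot schema for a tag) can be sketched in a few lines of Java. This is a toy model under stated assumptions, not Iceberg's implementation: RefType, schemaFor, and the list-of-strings schema representation are hypothetical names chosen only for illustration.

```java
import java.util.List;

/** Toy model of the schema-selection rule; none of these names are Iceberg APIs. */
public class BranchSchemaSketch {

  /** Hypothetical stand-in for the kind of named reference being resolved. */
  public enum RefType { BRANCH, TAG }

  /**
   * Picks the schema to report for a reference.
   * A branch reports the current table schema; a tag reports the schema
   * of the snapshot it points to.
   */
  public static List<String> schemaFor(
      RefType type, List<String> tableSchema, List<String> snapshotSchema) {
    if (type == RefType.BRANCH) {
      return tableSchema;
    }
    return snapshotSchema;
  }

  public static void main(String[] args) {
    // Mirrors the example above: the snapshot was written before column i was added.
    List<String> tableSchema = List.of("s string", "i int");
    List<String> snapshotSchema = List.of("s string");

    System.out.println(schemaFor(RefType.BRANCH, tableSchema, snapshotSchema)); // prints [s string, i int]
    System.out.println(schemaFor(RefType.TAG, tableSchema, snapshotSchema));    // prints [s string]
  }
}
```

Under this rule, describing the branch after the ALTER TABLE reports both columns right away, without first needing an insert into the branch.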

rdblue · 2023-11-22T17:50:09Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java

+        .containsExactly(
+            new GenericRowWithSchema(new Object[] {1}, null),
+            new GenericRowWithSchema(new Object[] {2}, null),
+            new GenericRowWithSchema(new Object[] {3}, null));


Can you use SimpleRecord like the rest of the tests do instead?

Unfortunately that doesn't work, because SimpleRecord expects the data field to be populated. The particular error is: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name data cannot be resolved. Did you mean one of the following? [id].


nastra · 2023-11-23T13:54:44Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java

@@ -171,7 +171,7 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep
      SparkTable sparkTable = (SparkTable) table;

      Preconditions.checkArgument(
-          sparkTable.snapshotId() == null,
+          sparkTable.snapshotId() == null && sparkTable.branch() == null,


I'm not sure whether we want to fix this as part of this PR or in a separate PR. In the Iceberg sync we briefly discussed that SELECT * from ns.table.branch_x VERSION AS OF ... shouldn't be supported and should throw an error, which is what this check does.

Probably a separate PR.

I've applied this and moved it to #9219.
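
As a rough illustration of the check discussed above, the sketch below rejects a VERSION AS OF request when the identifier already selects a snapshot or a branch. It is a self-contained approximation: the class name, the checkTimeTravelAllowed helper, and the error message are hypothetical, and the real code uses Preconditions.checkArgument on a SparkTable rather than plain arguments.

```java
/** Hypothetical stand-alone version of the precondition shown in the diff. */
public class BranchTimeTravelCheck {

  /**
   * Throws if the identifier already pins a snapshot or a branch,
   * because combining that with VERSION AS OF would be ambiguous.
   */
  public static void checkTimeTravelAllowed(Long snapshotId, String branch, String version) {
    // Same condition as the diff: both selectors must be absent.
    boolean allowed = snapshotId == null && branch == null;
    if (!allowed) {
      // Illustrative message only, not the message used by the actual code.
      throw new IllegalArgumentException(
          "Cannot combine a snapshot or branch selector in the identifier with VERSION AS OF "
              + version);
    }
  }

  public static void main(String[] args) {
    // Plain table identifier: time travel is allowed.
    checkTimeTravelAllowed(null, null, "1234");

    // Identifier like ns.table.branch_x: time travel is rejected.
    try {
      checkTimeTravelAllowed(null, "x", "1234");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```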

rdblue · 2023-12-04T19:45:16Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java

@@ -173,6 +173,10 @@ public Long snapshotId() {
    return snapshotId;
  }

+  public String branch() {
+    return branch;


This wasn't introduced by this commit, but branch should be final, right?

It is effectively final, as it's only set once. However, it's not marked as final because of the way the different constructors in SparkTable are called.

rdblue · 2023-12-04T19:49:28Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java

+        .containsExactly(
+            new SimpleRecord(1, null), new SimpleRecord(2, null), new SimpleRecord(3, null));
+
+    // writing new records into the branch should work with the re-introduced column


I don't think this is an appropriate place for the write test. It should be a new test case because this case tests the schema that is used when reading.

In addition, the test case should test writing when the current snapshot for a branch has a different schema than the table schema. With the column added back, the schemas are the same.

I've moved this to a separate test and also used a different schema


nastra · 2023-12-05T11:02:02Z

Thanks for reviewing this @rdblue, I've applied your feedback and also moved the VERSION AS OF handling to #9219


github-actions bot added the core label Nov 22, 2023

nastra force-pushed the branch-schema branch from 02b9447 to 8da34e8 Compare November 22, 2023 16:22

github-actions bot added the spark label Nov 22, 2023

nastra mentioned this pull request Nov 22, 2023

Spark SQL DESCRIBE not showing proper schema on a branch #9026

Closed

rdblue reviewed Nov 22, 2023

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java

nastra force-pushed the branch-schema branch 2 times, most recently from d1a7ff8 to 761fdae Compare November 23, 2023 10:40

nastra commented Nov 23, 2023


rdblue approved these changes Dec 4, 2023


rdblue reviewed Dec 4, 2023

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java (outdated)

nastra force-pushed the branch-schema branch from 761fdae to ceb6256 Compare December 5, 2023 07:54

nastra merged commit a4d4756 into apache:main Dec 5, 2023
45 checks passed

nastra deleted the branch-schema branch December 5, 2023 11:03

nastra mentioned this pull request Dec 5, 2023

Spark: Don't allow branch_ usage with VERSION AS OF #9219

Merged

namrathamyske mentioned this pull request Feb 16, 2024

branch schema affected by main table schema #9737

Closed

szehon-ho mentioned this pull request May 20, 2024

[Spec] Add Iceberg Materialized View Spec #10280

Closed
