Conversation

@ConeyLiu
Contributor

This is the first part of #4842. It adds the schema id to DataFile/DeleteFile/ManifestFile, which can be used to evaluate filter expressions based on the schema.

@ConeyLiu
Contributor Author

Hi @rdblue @kbendick, could you help review this when you are free? Thanks a lot.

@rdblue
Contributor

rdblue commented May 30, 2022

@ConeyLiu, thanks for breaking this up, but this is still far too large for a single PR. There's no need to update all of the writers in a single PR. Instead, this should focus on core and API classes and just read null values for the new field. You can add writes later.

@github-actions github-actions bot removed the ORC label May 31, 2022
@ConeyLiu
Contributor Author

ConeyLiu commented May 31, 2022

Hi @rdblue, thanks for the review. I have reverted the changes for the writers. Please take another look, thanks a lot.

Member

@szehon-ho szehon-ho left a comment

Actually, I'm now realizing this is a change to the spec; I initially thought it was something we could derive from existing metadata.

Is this something we can do or need to wait until V3?

@rdblue
Contributor

rdblue commented Jun 7, 2022

Is this something we can do or need to wait until V3?

@szehon-ho, this is something we can do because it is backward compatible. Older readers will ignore new fields, so we can add it safely. And if we don't have a schema ID for a column, then we just return null and skip the optimization.
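
For illustration, a minimal sketch of that fallback (the helper and the SchemaEvaluator shape below are assumptions for illustration, not the PR's actual code): when a manifest entry carries no schema ID, the reader keeps the file and skips the pruning.

import java.util.Map;
import org.apache.iceberg.Schema;

// Hypothetical shape of the schema-based check; the PR's real class may differ.
interface SchemaEvaluator {
  boolean eval(Schema writeSchema);
}

final class SchemaPruning {
  private SchemaPruning() {
  }

  // Keep the file when its write-time schema is unknown (old manifests have
  // no schema_id); otherwise let the evaluator decide whether it can match.
  static boolean keepFile(Integer schemaId, Map<Integer, Schema> schemasById, SchemaEvaluator evaluator) {
    if (schemaId == null || schemaId < 0) {
      return true; // no schema_id recorded: skip the optimization
    }
    Schema writeSchema = schemasById.get(schemaId);
    return writeSchema == null || evaluator.eval(writeSchema);
  }
}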

@ConeyLiu
Contributor Author

ConeyLiu commented Jun 9, 2022

A little busy recently. Will address the comments tomorrow.

@ConeyLiu
Contributor Author

Thanks, @rdblue @szehon-ho for the review. Comments have been addressed.

Member

@szehon-ho szehon-ho left a comment

I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?

@ConeyLiu
Contributor Author

I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?

It seems we would need to implement customized V1Writer/V2Writer classes and those Indexed classes. Do you have any suggestions for this?

@szehon-ho
Member

Yeah, there is some limited discussion in this related issue, but I guess no good conclusions: #2542.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

By the way, the change mostly looks good, and it's good that you made the new field optional to avoid issues like the one mentioned. I just don't know enough about the serialization/deserialization code to be sure there are no other problems with previously serialized metadata, so I was hoping to have a test to verify it. Though I can leave it to @rdblue to approve if he is confident, and we can tackle the backward-compat tests on the side.
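
To make that concrete, here is a minimal, self-contained Avro sketch of the resolution rule such a test would rely on (this is not Iceberg's actual test writer, and the record is trimmed to one field for illustration): a record written without the optional schema_id field resolves to its null default when read with the new schema.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class OptionalFieldCompatExample {
  public static void main(String[] args) throws Exception {
    // Writer schema: the "old" layout without schema_id.
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"}]}");
    // Reader schema: the "new" layout with an optional schema_id defaulting to null.
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"},"
            + "{\"name\":\"schema_id\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    GenericRecord record = new GenericData.Record(oldSchema);
    record.put("file_path", "s3://bucket/file.parquet");

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(record, encoder);
    encoder.flush();

    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord read = new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, decoder);

    // The field missing from the old data resolves to its null default.
    System.out.println(read.get("schema_id")); // prints: null
  }
}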

@ConeyLiu
Contributor Author

Thanks @szehon-ho for the review and suggestion.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

I will add the test later.

ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
Integer sortOrderId = 2;

String fileName = String.format("OldManifestFileV%s.avro", formatVersion);
Contributor Author

@ConeyLiu ConeyLiu Jun 29, 2022

The OldManifestFileV1.avro/OldManifestFileV2.avro files were written with the previous DataFile spec (the file instance is https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java#L74). Hi @szehon-ho, I think this test covers your concern.

Member

It's really great to have this test. I was initially thinking that we would want a TestV1Writer / TestV2Writer that writes all the optional fields as null, instead of checking in the old-version Avro files. Is that possible? I can also take a look myself to see if it is possible.

Contributor Author

I am not sure I have fully understood your approach. In this PR, we not only add a new field to DataFile but also change the StructType returned by DataFile.getType. Wouldn't we need to customize DataFile/V1Metadata/V2Metadata to write a DataFile with the old spec?

Member

Yeah, I was thinking of making a test version of those writers; not sure if it's possible. Anyway, I guess this works too; the only issue is it won't be very debuggable if something goes wrong.

@ConeyLiu
Contributor Author

Hi @rdblue @szehon-ho, I am sorry for the late update. The compatibility test has been added. Hopefully you can take another look when you are free.

public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
PartitionSpec spec, StructLike partition, EncryptionKeyMetadata keyMetadata,
SortOrder sortOrder, int... equalityFieldIds) {
this(appender, format, location, spec, partition, keyMetadata, sortOrder, -1, equalityFieldIds);
Member

It's not great to have to add a new constructor to all of these. Since we already have WriterFactory, which abstracts this and should be what gets called, I wonder if we can just change that interface? cc @rdblue @aokolnychyi

Contributor Author

This is a public class and constructor; I think we should keep compatibility.

Member

I'm not a huge fan of it; I think we can have a builder like SparkAppenderFactory (where we did some refactoring to avoid one constructor per new argument).
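
For reference, a hedged sketch of that builder direction (hypothetical class and method names, not an existing Iceberg API): each optional argument becomes a fluent setter, so adding schemaId later would not require another public constructor overload.

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.SortOrder;

// Hypothetical fluent builder: optional arguments become methods instead of
// new constructor overloads on the public writer class.
class DeleteWriterBuilder {
  private FileFormat format;
  private String location;
  private PartitionSpec spec;
  private SortOrder sortOrder = SortOrder.unsorted();
  private int schemaId = -1; // sentinel: write-time schema unknown
  private int[] equalityFieldIds = new int[0];

  DeleteWriterBuilder format(FileFormat newFormat) {
    this.format = newFormat;
    return this;
  }

  DeleteWriterBuilder location(String newLocation) {
    this.location = newLocation;
    return this;
  }

  DeleteWriterBuilder spec(PartitionSpec newSpec) {
    this.spec = newSpec;
    return this;
  }

  DeleteWriterBuilder sortOrder(SortOrder newSortOrder) {
    this.sortOrder = newSortOrder;
    return this;
  }

  DeleteWriterBuilder schemaId(int newSchemaId) {
    this.schemaId = newSchemaId;
    return this;
  }

  DeleteWriterBuilder equalityFieldIds(int... ids) {
    this.equalityFieldIds = ids;
    return this;
  }
}

A caller would then chain only the arguments it needs, e.g. new DeleteWriterBuilder().format(FileFormat.PARQUET).schemaId(3).equalityFieldIds(1, 2).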

Contributor Author

Should we do it in this patch or a separate patch?

Member

@szehon-ho szehon-ho left a comment

Hey, I'm sorry for the late responses; I'm definitely not the most familiar with this part of the code to review it. I was chatting with @dramaticlly and he could possibly help take a look at the idea of making a cleaner backward compatibility test in general, over at #2542. I don't want to block the change until that's there, but for me it's more reassuring to have those tests overall :)

Also, I noticed that the spec-id and schema are already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written with, even after rewriteManifests. Wondering at a high level: is that adequate for the optimization you are planning?

@github-actions github-actions bot removed the ALIYUN label Aug 9, 2022
@ConeyLiu
Contributor Author

ConeyLiu commented Aug 9, 2022

Hi @szehon-ho, thanks for reviewing this again.

Also, I noticed that the spec-id and schema are already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written with, even after rewriteManifests. Wondering at a high level: is that adequate for the optimization you are planning?

What happens after a data file is rewritten? It seems like the schema of the manifest file is the current table schema, and the manifest entries in the same manifest file could have different schemas after a rewrite.

@szehon-ho
Member

@ConeyLiu that's a good question. I think (I may be wrong) rewriteDataFiles groups files by partition/partition spec and may not preserve the old schemas, i.e., all the data files are rewritten with the latest schema of that partition spec.

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles, we would have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written with a later schema.

@ConeyLiu
Contributor Author

ConeyLiu commented Aug 10, 2022

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles, we would have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written with a later schema.

You are correct. The data file is written with the new spec after a rewrite; we cannot benefit from the schema evaluation because we lose the original schema information.

As far as I can tell, it seems to be the right one that the manifest was written with, even after rewriteManifests.

In RewriteManifests, we use the current table partition spec or the specified spec by spec ID. I think the schema used in the current spec is not the same as the original schema of the old manifest file, because we rewrite the partition spec when updating the table schema. Please correct me if I am wrong.

@szehon-ho
Member

Yeah, you are right; it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a spec.

@chenjunjiedada
Collaborator

chenjunjiedada commented May 4, 2023

@szehon-ho @rdblue Any update here?

Member

@szehon-ho szehon-ho left a comment

Yeah, the general direction makes sense to me. But as it's changing the spec, I would love to get another opinion as well. Pinged @aokolnychyi on this in case he has time.

@chenjunjiedada chenjunjiedada requested a review from aokolnychyi May 5, 2023 17:58
@aokolnychyi
Contributor

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

@ConeyLiu
Contributor Author

ConeyLiu commented May 6, 2023

Thanks @szehon-ho @aokolnychyi

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

We don't need to change the existing evaluators. We just need to create a new SchemaEvaluator from the filter expression and schema, and use it to evaluate the file. You can refer to this:

return CloseableIterable.filter(
    open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
    entry -> {
      boolean keep = entry != null;
      if (keep && schemaEvaluator != null && entry.file().schemaId() > -1) {
        // evaluate based on the schemaId
        keep = schemaEvaluator.eval(schemasById.get(entry.file().schemaId()));
      }

      return keep &&
          evaluator.eval(entry.file().partition()) &&
          metricsEvaluator.eval(entry.file()) &&
          inPartitionSet(entry.file());
    });
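
For intuition, a simplified sketch of what such a schema-based check can conclude (the PR introduces SchemaEvaluator; the body below is an illustrative assumption, not the actual implementation): a column that did not exist in a file's write-time schema reads entirely as null, so a null-rejecting predicate such as col = 5 on that column can never match, and the file can be pruned.

import java.util.Set;
import org.apache.iceberg.Schema;

// Simplified illustration: prune a file when a null-rejecting filter
// references a column absent from the file's write-time schema.
class SchemaEvaluatorSketch {
  private final Set<String> referencedColumns; // columns the filter references

  SchemaEvaluatorSketch(Set<String> referencedColumns) {
    this.referencedColumns = referencedColumns;
  }

  // Returns true if the file could contain matching rows and must be kept.
  boolean eval(Schema writeSchema) {
    for (String column : referencedColumns) {
      if (writeSchema.findField(column) == null) {
        return false; // the column never existed in this file: all values are null
      }
    }
    return true;
  }
}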

@ConeyLiu
Contributor Author

Hi @rdblue @szehon-ho @aokolnychyi, do you have any time to look at this again?

CONTENT,
FILE_PATH,
FILE_FORMAT,
SCHEMA_ID,
Contributor Author

@ConeyLiu ConeyLiu Jul 19, 2023

@szehon-ho, if you are concerned about this order, we could move SCHEMA_ID to the end to align with the other new fields. Then we wouldn't need to update too many of the get/put methods in BaseFile, and some of the Spark/Flink UTs wouldn't need updating either.

Contributor Author

Moved it to the last position.

@ConeyLiu ConeyLiu changed the title from "Core: Add schema_id to ContentFile/ManifestFile" to "Core: Add schema_id to ContentFile" Jul 19, 2023
Contributor Author

These two manifest files were written with the old spec. They are used to test that a reader for the current spec can read files written with the old spec.

  EQUALITY_IDS,
- SORT_ORDER_ID);
+ SORT_ORDER_ID,
+ SCHEMA_ID);
Contributor Author

Put it in the last position to reduce the code changes.
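
A small sketch of why appending matters (the layout below is illustrative, not BaseFile's actual code): positional get/put methods switch on field ordinals, so inserting SCHEMA_ID in the middle would shift every later case, while appending it leaves the existing ordinals, and the tests that depend on them, untouched.

// Illustrative positional accessor in the style of BaseFile (assumed layout).
class PositionalFileSketch {
  private String content;
  private String filePath;
  private String fileFormat;
  // (the real BaseFile has many more fields here)
  private Integer schemaId; // new field appended at the end

  Object get(int pos) {
    switch (pos) {
      case 0: return content;
      case 1: return filePath;
      case 2: return fileFormat;
      // SCHEMA_ID appended last: the cases above keep their ordinals
      case 3: return schemaId;
      default: throw new UnsupportedOperationException("Unknown position: " + pos);
    }
  }
}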

@ConeyLiu
Contributor Author

ConeyLiu commented Sep 6, 2023

Hi @rdblue @szehon-ho @aokolnychyi @RussellSpitzer @nastra, could you help review this? This should be useful for tables with frequent column additions or deletions.

@ConeyLiu ConeyLiu changed the title from "Core: Add schema_id to ContentFile" to "API, Core: Add schema_id to ContentFile" Sep 6, 2023
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 10, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
