[LI] Add feature for Spark ORC reader to ignore field ids in files by using a new table property #134

rzhang10 · 2022-12-13T23:25:31Z

Adds a new table property, "read.orc.ignore.field-ids.enabled", that makes the Spark ORC reader ignore field IDs in the file schema even when the file contains them. This feature is useful for LI-Iceberg when reading Gobblin dual Hive/Iceberg tables that share Iceberg-written files.

Integration-tested via spark-shell on the cluster. Setting the table property with
ALTER TABLE xxx.xxx SET TBLPROPERTIES ('read.orc.ignore.field-ids.enabled' = 'true');
makes the table readable.
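To illustrate what the flag is meant to change, here is a minimal, self-contained sketch of ID-based versus name-based column resolution. It is a toy model and not the `OrcIterable` code from this PR: `IgnoreFieldIdsSketch`, `FileColumn`, and `resolve` are hypothetical names, and the assumption that the property defaults to `false` when unset is mine, not something stated above.

```java
import java.util.List;
import java.util.Map;

// Illustrative model only; this is not the OrcIterable code from the PR.
class IgnoreFieldIdsSketch {
    // Property name as given in the PR description.
    static final String IGNORE_FIELD_IDS = "read.orc.ignore.field-ids.enabled";

    // Hypothetical stand-in for a top-level column in an ORC file schema.
    static final class FileColumn {
        final String name;
        final Integer fieldId; // null when the file carries no ID for this column

        FileColumn(String name, Integer fieldId) {
            this.name = name;
            this.fieldId = fieldId;
        }
    }

    // Assumption: the property is treated as false when it is not set.
    static boolean ignoreFieldIds(Map<String, String> tableProps) {
        return Boolean.parseBoolean(tableProps.getOrDefault(IGNORE_FIELD_IDS, "false"));
    }

    // Returns the index of the file column that matches the expected table
    // column, or -1 if none matches. With the flag on, embedded IDs are
    // ignored and columns are matched by name; with it off, only IDs are used.
    // (Simplification: no name-based fallback for files that lack IDs.)
    static int resolve(List<FileColumn> fileColumns, int expectedId,
                       String expectedName, Map<String, String> tableProps) {
        boolean byName = ignoreFieldIds(tableProps);
        for (int i = 0; i < fileColumns.size(); i++) {
            FileColumn column = fileColumns.get(i);
            if (byName) {
                if (column.name.equals(expectedName)) {
                    return i;
                }
            } else if (column.fieldId != null && column.fieldId == expectedId) {
                return i;
            }
        }
        return -1;
    }
}
```

In this model, a file whose embedded IDs do not line up with the table schema's IDs resolves to -1 for every column while the flag is off, and resolves by column name once the property is set to 'true'.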

rdsr · 2022-12-14T00:08:36Z

@rzhang10 is it possible to add unit tests here?

rzhang10 · 2022-12-14T00:13:46Z

> @rzhang10 is it possible to add unit tests here?

No, I think it's quite hard to do, because inside the Iceberg codebase it's hard to create a Hive table with Iceberg files (for that I would need to repeat, in the unit test, all the custom logic Gobblin currently performs).

I think we can rely on integration testing on the cluster.

rdsr · 2022-12-14T01:48:03Z

> > @rzhang10 is it possible to add unit tests here?
>
> No, I feel it's quite hard to do, because it's hard inside Iceberg codebase to create a hive table with iceberg files..(for that I will need to repeat all the logic gobblin is currently doing customly in the unit test).
>
> I think we can rely on integration testing on the cluster.

Would it make sense to:

1. Write with Iceberg.
2. Open the file header and remove the field-ids.
3. Register the file with the Hive table.
4. Query?

If this is far too complex, then we can probably leave it out. Your call, @rzhang10.

orc/src/main/java/org/apache/iceberg/orc/OrcIterable.java

yiqiangin

LGTM

rzhang10 added 2 commits December 6, 2022 11:01

Revert "Add logic to derive partition column id from partition.column…

0afe788

….ids pro… (linkedin#122)" This reverts commit 21c3a80.

[LI] Add feature for Spark ORC reader to ignore field ids in files by…

ce0a738

… using a new table property

rzhang10 changed the title ~~Spark orc ignore field ids read~~ [LI] Add feature for Spark ORC reader to ignore field ids in files by using a new table property Dec 13, 2022

github-actions bot added CORE ORC SPARK labels Dec 13, 2022

yiqiangin reviewed Dec 14, 2022

orc/src/main/java/org/apache/iceberg/orc/OrcIterable.java (outdated)

yiqiangin approved these changes Dec 14, 2022

Address comments

d1da4db

rzhang10 merged commit 976277d into linkedin:li-0.11.x Dec 14, 2022

rzhang10 mentioned this pull request Jan 26, 2023

[LI] Bug fix: Remove ids from fileSchema before feeding it into apply… #136

Merged
