ORC: Change union read schema from hive to trino #85

rzhang10 · 2021-11-11T00:38:20Z

This rb changes the ORC Spark reader and vectorized reader code paths so that a union data type of [int, string] is read as a Spark schema/data of [tag, field0, field1] instead of the previous [tag_0, tag_1].
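
To make the target layout concrete, here is a minimal sketch in plain Java. `UnionRowSketch`, `fieldNames`, and `toRow` are hypothetical names used only for illustration, not Iceberg APIs, and the sketch assumes the branch that is not selected is left null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Iceberg: models only the read-side struct layout.
final class UnionRowSketch {
  private UnionRowSketch() {}

  // Field names of the struct that stands in for a union with `branches` member types.
  static List<String> fieldNames(int branches) {
    List<String> names = new ArrayList<>();
    names.add("tag");
    for (int i = 0; i < branches; i++) {
      names.add("field" + i);
    }
    return names;
  }

  // One row per union value: the tag first, then one slot per branch.
  // Assumption: only the slot of the selected branch is populated.
  static Object[] toRow(int tag, Object value, int branches) {
    if (tag < 0 || tag >= branches) {
      throw new IllegalArgumentException("tag out of range: " + tag);
    }
    Object[] row = new Object[branches + 1];
    row[0] = tag;
    row[tag + 1] = value;
    return row;
  }

  public static void main(String[] args) {
    System.out.println(fieldNames(2));                        // [tag, field0, field1]
    System.out.println(Arrays.toString(toRow(0, 42, 2)));     // [0, 42, null]
    System.out.println(Arrays.toString(toRow(1, "abc", 2)));  // [1, null, abc]
  }
}
```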

…t reading ORC

autumnust

Overall looks good but I still think we should gate empty union cases.

autumnust · 2021-11-11T18:08:15Z

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

  }

-  private static <T> T visitUnion(Type type, TypeDescription union, OrcSchemaWithTypeVisitor<T> visitor) {
+  protected T visitUnion(Type type, TypeDescription union, OrcSchemaWithTypeVisitor<T> visitor) {


why is this change needed?

+1 on why this change

This is just to align with the code pattern of the other visitor methods (see visitRecord). The change also enables subclass overriding, which might be necessary in the future, since OrcSchemaWithTypeVisitor is a very generic visitor.
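
A simplified sketch of the difference, using hypothetical stand-in classes (`SketchVisitor`, `DescribingVisitor`) rather than the real visitor: a private static helper cannot be overridden, while a protected instance method reached through the static entry point can.

```java
// Hypothetical, simplified stand-ins; not the real OrcSchemaWithTypeVisitor.
class SketchVisitor<T> {
  // The static entry point dispatches to instance methods, so subclasses can customize them.
  static <T> T visit(String category, SketchVisitor<T> visitor) {
    if ("union".equals(category)) {
      return visitor.visitUnion(category);
    }
    return visitor.primitive(category);
  }

  // Protected and non-static: a subclass may override the union handling.
  protected T visitUnion(String category) {
    return null;
  }

  protected T primitive(String category) {
    return null;
  }
}

final class DescribingVisitor extends SketchVisitor<String> {
  @Override
  protected String visitUnion(String category) {
    return "struct<tag, field0, ...>";
  }

  @Override
  protected String primitive(String category) {
    return category;
  }

  public static void main(String[] args) {
    System.out.println(SketchVisitor.visit("union", new DescribingVisitor()));
    System.out.println(SketchVisitor.visit("int", new DescribingVisitor()));
  }
}
```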

autumnust · 2021-11-11T19:56:15Z

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java

    if (orcField != null && orcField.type.getCategory().equals(TypeDescription.Category.UNION)) {
      orcType = TypeDescription.createUnion();
-      for (Types.NestedField nestedField : type.asStructType().fields()) {
+      List<Types.NestedField> nestedFields = type.asStructType().fields();


Maybe add a precondition check on the list size before accessing the second element onward?

Same as the comment below: I don't think it's necessary here, since the length is already known to be at least 2. This is an internal implementation detail.
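
For reference, the kind of guard being suggested could look like the sketch below. It uses plain Java with a hypothetical `UnionStructGuard` helper and field names as strings; it is not the code in ORCSchemaUtil.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical guard, shown only to illustrate the reviewer's suggestion.
final class UnionStructGuard {
  private UnionStructGuard() {}

  // Returns the branch fields (everything after the leading tag field),
  // failing fast if the struct cannot represent a union.
  static List<String> branchFields(List<String> structFieldNames) {
    if (structFieldNames.size() < 2) {
      throw new IllegalArgumentException(
          "Union struct needs a tag and at least one branch, got: " + structFieldNames);
    }
    return structFieldNames.subList(1, structFieldNames.size());
  }

  public static void main(String[] args) {
    System.out.println(branchFields(Arrays.asList("tag", "field0", "field1")));
  }
}
```

The trade-off discussed in this thread is whether a clearer failure message is worth a check on an invariant the author considers guaranteed.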

autumnust · 2021-11-11T20:27:34Z

spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkOrcReaders.java

        long batchOffsetInFile) {
      UnionColumnVector unionColumnVector = (UnionColumnVector) vector;
      List<Types.NestedField> fields = structType.fields();
-      assert fields.size() == unionColumnVector.fields.length;


Shall we keep this kind of assertion on the field size?

I don't think it's necessary, since I already know for sure the length would match; this is an internal implementation detail.
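
If a size check were kept, it would presumably need adjusting for the new layout. The sketch below assumes the struct carries exactly one extra field (the tag) compared with the number of children in the union vector; `UnionSizeCheck` is a hypothetical name and the relationship is an inference from the [tag, field0, field1] layout, not something confirmed in this thread.

```java
// Hypothetical invariant check; assumes struct fields = tag + one field per union child.
final class UnionSizeCheck {
  private UnionSizeCheck() {}

  static void checkSizes(int structFieldCount, int unionChildCount) {
    if (structFieldCount != unionChildCount + 1) {
      throw new IllegalStateException(String.format(
          "Expected %d struct fields (tag + %d branches) but found %d",
          unionChildCount + 1, unionChildCount, structFieldCount));
    }
  }

  public static void main(String[] args) {
    checkSizes(3, 2); // [tag, field0, field1] against a two-child union
    System.out.println("sizes match");
  }
}
```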

funcheetah · 2021-11-11T20:40:08Z

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java

      Map<Integer, OrcField> mapping) {
    TypeDescription orcType;
    OrcField orcField = mapping.getOrDefault(fieldId, null);
+    // this branch means the iceberg struct schema actually correspond to an underlying union


nit: corresponds

funcheetah · 2021-11-11T20:46:47Z

spark/src/test/java/org/apache/iceberg/spark/data/TestSparkOrcUnions.java

-  }
-
-  @Test
-  public void testSingleComponentUnion() throws IOException {


Why is the test for a single-component union removed? How is [ "string" ] represented in the new schema?

We already discussed this before: we won't support this schema, and users should never use a single-type union because it doesn't make sense.

Given that it is a valid Avro schema, do we have a way to prevent users from creating this kind of schema in the first place? What is the behavior in the current implementation if a user does create it?

Preventing it comes down to the user's due diligence not to create such a schema. We can just tell users this case is not supported and the behavior is undefined.

Actually, I have second thoughts on this. @autumnust, when Gobblin writes/converts an ORC dataset and encounters an Avro [null, string] type, does it write just a string type to the ORC file, or does it write a union? My intuition is that you write a single string, since an ORC schema is nullable by itself.

So Gobblin does the transformation: https://github.com/apache/gobblin/blob/5bc22a5502ddeb810a35b0fb996dd9dfc8c81121/gobblin-utility/src/main/java/org/apache/gobblin/util/orc/AvroOrcSchemaConverter.java#L62

So a default-value representation in Avro should not end up as a single-element union.
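
A rough sketch of the kind of transformation described here, under the assumption that null branches are dropped and a single remaining branch is written as a plain type. `NullableUnionSketch` and `toOrcType` are hypothetical; types are modeled as strings instead of using the Avro or ORC APIs, and this is not the Gobblin converter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of unwrapping a nullable Avro union when targeting ORC.
final class NullableUnionSketch {
  private NullableUnionSketch() {}

  static String toOrcType(List<String> avroBranches) {
    List<String> nonNull = new ArrayList<>();
    for (String branch : avroBranches) {
      if (!"null".equals(branch)) {
        nonNull.add(branch);
      }
    }
    if (nonNull.isEmpty()) {
      // Choice made for this sketch only: a union of nothing but null is rejected.
      throw new IllegalArgumentException("Union has no non-null branch: " + avroBranches);
    }
    if (nonNull.size() == 1) {
      // ORC columns are nullable by themselves, so no union wrapper is needed.
      return nonNull.get(0);
    }
    return "uniontype<" + String.join(",", nonNull) + ">";
  }

  public static void main(String[] args) {
    System.out.println(toOrcType(Arrays.asList("null", "string"))); // string
    System.out.println(toOrcType(Arrays.asList("int", "string")));  // uniontype<int,string>
  }
}
```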

funcheetah · 2021-11-11T20:49:27Z

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

  }

-  private static <T> T visitUnion(Type type, TypeDescription union, OrcSchemaWithTypeVisitor<T> visitor) {
+  protected T visitUnion(Type type, TypeDescription union, OrcSchemaWithTypeVisitor<T> visitor) {


+1 on why this change

funcheetah · 2021-11-11T20:55:54Z

spark/src/test/java/org/apache/iceberg/spark/data/TestSparkOrcUnions.java

        Types.NestedField.optional(0, "c1", Types.StructType.of(
-            Types.NestedField.optional(1, "tag_0", Types.IntegerType.get()),
-            Types.NestedField.optional(2, "tag_1",
+            Types.NestedField.optional(100, "tag", Types.IntegerType.get()),


Could you explain how the value 100 is determined for the id field here?

it's just a random id I picked.

funcheetah

LGTM. Regarding the single-item union, I think we need to verify its behavior in the current implementation. If we do not want to support it, we need a way to document/communicate that it is not supported.

* [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC * Change Hive type to Iceberg type conversion for union

* Hive Catalog: Add a hive catalog that does not override existing Hive metadata (#10) Add custom hive catalog to not override existing Hive metadata Fail early with a proper exception if the metadata file is not existing Simplify CustomHiveCatalog (#22) * Shading: Add a iceberg-runtime shaded module (#12) * ORC: Add test for reading files without Iceberg IDs (#16) * Hive Metadata Scan: Support reading tables with only Hive metadata (#23, #24, #25, #26) - Support for non string partition columns (#24) - Support for Hive tables without avro.schema.literal (#25) - Hive Metadata Scan: Notify ScanEvent listeners on planning (#35) - Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (#37) - Hive Metadata Scan: Return empty statistics (#49) - Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (#50) - Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (#51) Co-authored-by: Ratandeep Ratti <rratti@linkedin.com> Co-authored-by: Kuai Yu <kuyu@linkedin.com> Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com> * Row level filtering: Allow table scans to pass a row level filter for ORC files - ORC: Support NameMapping with row-level filtering (#53) * Hive: Made Predicate Pushdown dynamic based on the Hive Version * Hive: Fix uppercase bug and determine catalog from table properties (#38) * Hive: Return lowercase fieldname from IcebergRecordStructField * Hive: Determine catalog from table property * Hive: Fix schema not forwarded to SerDe on MR jobs (#45) (#47) * Hive: Use Hive table location in HiveIcebergSplit * Hive: Fix schema not passed to Serde * Hive: Refactor tests for tables with unqualified location URI Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> * Hive Metadata Scan: Support case insensitive name mapping (#52) * Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies (#57) Hive Metadata Scan: Fix Hive primitive to Avro logical type 
conversion (#58) Hive Metadata Scan: Fix support for Hive timestamp type (#61) Co-authored-by: Raymond Zhang <razhang@linkedin.com> Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (#67) * Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time * Trigger CI (cherry picked from commit b90e838) * Stop using serdeToFileFormat to unblock formats other than Avro or Orc (#64) * Stop using serdeToFileFormat to unblock formats other than Avro or Orc * Fix style check * Do not delete metadata location when HMS has been successfully updated (#68) (cherry picked from commit 766407e) * Support reading Avro complex union types (#73) Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> * [#2039] Support default value semantic for AVRO (#75) (cherry picked from commit c18f4c4) * Support hive non string partition cols (#78) * Support non-string hive type partition columns in LegacyHiveTableScan * Leverage eval against partition filter expression to filter non-string columns * Support default value read for ORC format in spark (#76) * Support default value read for ORC format in spark * Refactor common code for ReadBuilder for both non-vectorized and vectorized read * Fix code style issue * Add special handling of ROW_POSITION metadata column * Add corner case check for partition field * Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type contants such as array/map/struct * Support nested type default value for vectorized read * Support deeply nested type default value for vectorized read * Support reading ORC complex union types (#74) * Support reading orc complex union types * add more tests * support union in VectorizedSparkOrcReaders and improve tests * support union in VectorizedSparkOrcReaders and improve tests - continued * fix checkstyle Co-authored-by: Wenye Zhang 
<wyzhang@wyzhang-mn1.linkedin.biz> * Support avro.schema.literal/hive union types in Hive legacy table to Iceberg conversion (#80) * Fix ORC schema visitors to support reading ORC files with deeply nest… (#81) * Fix ORC schema visitors to support reading ORC files with deeply nested union type schema * Added test for vectorized read * Disable avro validation for default values Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com> * Fix spark avro reader reading union schema data (#83) * Fix spark avro reader to read correctly structured nested data values * Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union * Avro: Change union read schema from hive to trino (#84) * [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro * ORC: Change union read schema from hive to trino (#85) * [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC * Change Hive type to Iceberg type conversion for union * Recorder hive table properties to align the avro.schema.literal placement contract (#86) * [#2039] Support default value semantic for AVRO (cherry picked from commit c18f4c4) * reverting commits 2c59857 and f362aed (#88) Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz> * logically patching PR 2328 on HiveMetadataPreservingTableOperations * Support timestamp as partition type (#91) * Support timestamp in partition types * Address comment * Separate classes under hive legacy package to new hivelink module (#87) * separate class under legacy to new hiveberg module * fix build * remove hiveberg dependency in iceberg-spark2 module * Revert "remove hiveberg dependency in iceberg-spark2 module" This reverts commit 2e8b743. 
* rename hiveberg module to hivelink Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> * [LI] Align default value validation align with avro semantics in terms of nullable (nested) fields (#92) * Align default value validation align with avro semantics in terms of nullable (nested) fields * Allow setting null as default value for nested fields in record default * [LI][Spark][Avro] read avro union using decoder instead of directly returning v… (#94) * [LI][Spark] read avro union using decoder instead of directly returning value * Add a comment for the schema * Improve the logging when the deserailzed index is invalid to read the symbol from enum (#96) * Move custom hive catalog to hivelink-core (#99) * Handle non-nullable union of single type for Avro (#98) * Handle non-nullable union of single type Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> * Handle null default in nested type default value situations (#100) * Move 'Hive Metadata Scan: Support case insensitive name mapping' (PR 52) to hivelink-core (#102) * Remove activeSparkSession (#103) * Disable default value preserving (#106) * Disable default value preserving * [LI][Avro] Do not reorder elements inside a Avro union schema (#93) * handle single type union properly in AvroSchemaVisitor for deep nested schema (#107) * Handle non-nullable union of single type for ORC spark non-vectorized reader (#104) * Handle single type union for non-vectorized reader * [Avro] Retain the type of field while copying the default values. (#109) * Retain the type of field while copying the default values. 
* [Hivelink] Refactor support hive non string partition cols to rid of … (#110) * [Hivelink] Refactor support hive non string partition cols to rid of Iceberg-oss code changes * Release automation overhaul: Sonatype Nexus, Shipkit and GH Actions (#101) * Add scm and developer info (#111) * [Core] Fix and refactor schema parser (#112) * [Core] Fix/Refactor SchemaParser to fix multiple bugs * Enhance the UT for testing required fields with default values (#113) * Enhance the UT for testing required fields with default values * Addressed review comments * Addressed review comment * Support single type union for ORC-vectorization reader (#114) * Support single type union for ORC-vectorization reader * Support single type union for ORC-vectorization reader Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz> * Refactor HMS code upon cherry-pick * Check for schema corruption and fix it on commit (#117) * Check for schema corruption and fix it on commit * ORC: Handle query where select and filter only uses default value col… (#118) * ORC: Handle query where select and filter only use default value columns * Set ORC columns and fix case-sensitivity issue with schema check (#119) * Hive: Return null for currentSnapshot() (#121) * Hive: Return null for currentSnapshot() * Handle snapshots() * Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes (#120) * Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes * Add logic to derive partition column id from partition.column.ids pro… (#122) * Add logic to derive partition column id from partition.column.ids property * Do not push down filter to ORC for union type schema (#123) * Bug fix: MergeHiveSchemaWithAvro should retain avro properties for li… (#125) * Bug fix: MergeHiveSchemaWithAvro should retain avro properties for list and map when they are nullable * LinkedIn rebase draft * Refactor hivelink 1 * Make hivelink module test all pass * Make spark 2.4 module work * Fix mr module 
* Make spark 3.1 module work * Fix TestSparkMetadataColumns * Minor fix for spark 2.4 * Update default spark version to 3.1 * Update java ci to only run spark 2.4 and 3.1 * Minor fix HiveTableOperations * Adapt github CI to 0.14.x branch * Fix mr module checkstyle * Fix checkstyle for orc module * Fix spark2.4 checkstyle * Refactor catalog loading logic using CatalogUtil * Minor change to CI/release Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> Co-authored-by: Ratandeep Ratti <rratti@linkedin.com> Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu> Co-authored-by: Kuai Yu <kuyu@linkedin.com> Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com> Co-authored-by: Sushant Raikar <sraikar@linkedin.com> Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com> Co-authored-by: Wenye Zhang <wyzhang@linkedin.com> Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz> Co-authored-by: Lei Sun <lesun@linkedin.com> Co-authored-by: Jiefan <jiefli@linkedin.com> Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com> Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com> Co-authored-by: Yiqiang Ding <yiqding@linkedin.com> Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz> Co-authored-by: Jack Moseley <jmoseley@linkedin.com>


* Rebase LI-Iceberg changes on top of Apache Iceberg 1.0.0 release * Hive Catalog: Add a hive catalog that does not override existing Hive metadata (#10) Add custom hive catalog to not override existing Hive metadata Fail early with a proper exception if the metadata file is not existing Simplify CustomHiveCatalog (#22) * Shading: Add a iceberg-runtime shaded module (#12) * ORC: Add test for reading files without Iceberg IDs (#16) * Hive Metadata Scan: Support reading tables with only Hive metadata (#23, #24, #25, #26) - Support for non string partition columns (#24) - Support for Hive tables without avro.schema.literal (#25) - Hive Metadata Scan: Notify ScanEvent listeners on planning (#35) - Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (#37) - Hive Metadata Scan: Return empty statistics (#49) - Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (#50) - Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (#51) Co-authored-by: Ratandeep Ratti <rratti@linkedin.com> Co-authored-by: Kuai Yu <kuyu@linkedin.com> Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com> * Row level filtering: Allow table scans to pass a row level filter for ORC files - ORC: Support NameMapping with row-level filtering (#53) * Hive: Made Predicate Pushdown dynamic based on the Hive Version * Hive: Fix uppercase bug and determine catalog from table properties (#38) * Hive: Return lowercase fieldname from IcebergRecordStructField * Hive: Determine catalog from table property * Hive: Fix schema not forwarded to SerDe on MR jobs (#45) (#47) * Hive: Use Hive table location in HiveIcebergSplit * Hive: Fix schema not passed to Serde * Hive: Refactor tests for tables with unqualified location URI Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> * Hive Metadata Scan: Support case insensitive name mapping (#52) * Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies 
(#57) Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (#58) Hive Metadata Scan: Fix support for Hive timestamp type (#61) Co-authored-by: Raymond Zhang <razhang@linkedin.com> Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (#67) * Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time * Trigger CI (cherry picked from commit b90e838) * Stop using serdeToFileFormat to unblock formats other than Avro or Orc (#64) * Stop using serdeToFileFormat to unblock formats other than Avro or Orc * Fix style check * Do not delete metadata location when HMS has been successfully updated (#68) (cherry picked from commit 766407e) * Support reading Avro complex union types (#73) Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> * [#2039] Support default value semantic for AVRO (#75) (cherry picked from commit c18f4c4) * Support hive non string partition cols (#78) * Support non-string hive type partition columns in LegacyHiveTableScan * Leverage eval against partition filter expression to filter non-string columns * Support default value read for ORC format in spark (#76) * Support default value read for ORC format in spark * Refactor common code for ReadBuilder for both non-vectorized and vectorized read * Fix code style issue * Add special handling of ROW_POSITION metadata column * Add corner case check for partition field * Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type contants such as array/map/struct * Support nested type default value for vectorized read * Support deeply nested type default value for vectorized read * Support reading ORC complex union types (#74) * Support reading orc complex union types * add more tests * support union in VectorizedSparkOrcReaders and improve tests * support union in VectorizedSparkOrcReaders and improve tests - continued * fix 
checkstyle Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> * Support avro.schema.literal/hive union types in Hive legacy table to Iceberg conversion (#80) * Fix ORC schema visitors to support reading ORC files with deeply nest… (#81) * Fix ORC schema visitors to support reading ORC files with deeply nested union type schema * Added test for vectorized read * Disable avro validation for default values Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com> * Fix spark avro reader reading union schema data (#83) * Fix spark avro reader to read correctly structured nested data values * Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union * Avro: Change union read schema from hive to trino (#84) * [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro * ORC: Change union read schema from hive to trino (#85) * [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC * Change Hive type to Iceberg type conversion for union * Recorder hive table properties to align the avro.schema.literal placement contract (#86) * [#2039] Support default value semantic for AVRO (cherry picked from commit c18f4c4) * reverting commits 2c59857 and f362aed (#88) Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz> * logically patching PR 2328 on HiveMetadataPreservingTableOperations * Support timestamp as partition type (#91) * Support timestamp in partition types * Address comment * Separate classes under hive legacy package to new hivelink module (#87) * separate class under legacy to new hiveberg module * fix build * remove hiveberg dependency in iceberg-spark2 module * Revert "remove hiveberg dependency in iceberg-spark2 module" This reverts commit 2e8b743. 
* rename hiveberg module to hivelink
Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* [LI] Align default value validation align with avro semantics in terms of nullable (nested) fields (#92)
* Align default value validation align with avro semantics in terms of nullable (nested) fields
* Allow setting null as default value for nested fields in record default
* [LI][Spark][Avro] read avro union using decoder instead of directly returning v… (#94)
* [LI][Spark] read avro union using decoder instead of directly returning value
* Add a comment for the schema
* Improve the logging when the deserailzed index is invalid to read the symbol from enum (#96)
* Move custom hive catalog to hivelink-core (#99)
* Handle non-nullable union of single type for Avro (#98)
* Handle non-nullable union of single type
Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* Handle null default in nested type default value situations (#100)
* Move 'Hive Metadata Scan: Support case insensitive name mapping' (PR 52) to hivelink-core (#102)
* Remove activeSparkSession (#103)
* Disable default value preserving (#106)
* Disable default value preserving
* [LI][Avro] Do not reorder elements inside a Avro union schema (#93)
* handle single type union properly in AvroSchemaVisitor for deep nested schema (#107)
* Handle non-nullable union of single type for ORC spark non-vectorized reader (#104)
* Handle single type union for non-vectorized reader
* [Avro] Retain the type of field while copying the default values. (#109)
* Retain the type of field while copying the default values.
* [Hivelink] Refactor support hive non string partition cols to rid of … (#110)
* [Hivelink] Refactor support hive non string partition cols to rid of Iceberg-oss code changes
* Release automation overhaul: Sonatype Nexus, Shipkit and GH Actions (#101)
* Add scm and developer info (#111)
* [Core] Fix and refactor schema parser (#112)
* [Core] Fix/Refactor SchemaParser to fix multiple bugs
* Enhance the UT for testing required fields with default values (#113)
* Enhance the UT for testing required fields with default values
* Addressed review comments
* Addressed review comment
* Support single type union for ORC-vectorization reader (#114)
* Support single type union for ORC-vectorization reader
* Support single type union for ORC-vectorization reader
Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz>
* Refactor HMS code upon cherry-pick
* Check for schema corruption and fix it on commit (#117)
* Check for schema corruption and fix it on commit
* ORC: Handle query where select and filter only uses default value col… (#118)
* ORC: Handle query where select and filter only use default value columns
* Set ORC columns and fix case-sensitivity issue with schema check (#119)
* Hive: Return null for currentSnapshot() (#121)
* Hive: Return null for currentSnapshot()
* Handle snapshots()
* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes (#120)
* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes
* Add logic to derive partition column id from partition.column.ids pro… (#122)
* Add logic to derive partition column id from partition.column.ids property
* Do not push down filter to ORC for union type schema (#123)
* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for li… (#125)
* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for list and map when they are nullable
* LinkedIn rebase draft
* Refactor hivelink 1
* Make hivelink module test all pass
* Make spark 2.4 module work
* Fix mr module
* Make spark 3.1 module work
* Fix TestSparkMetadataColumns
* Minor fix for spark 2.4
* Update default spark version to 3.1
* Update java ci to only run spark 2.4 and 3.1
* Minor fix HiveTableOperations
* Adapt github CI to 0.14.x branch
* Fix mr module checkstyle
* Fix checkstyle for orc module
* Fix spark2.4 checkstyle
* Refactor catalog loading logic using CatalogUtil
* Minor change to CI/release
Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
Co-authored-by: Ratandeep Ratti <rratti@linkedin.com>
Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu>
Co-authored-by: Kuai Yu <kuyu@linkedin.com>
Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com>
Co-authored-by: Sushant Raikar <sraikar@linkedin.com>
Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com>
Co-authored-by: Wenye Zhang <wyzhang@linkedin.com>
Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com>
Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com>
Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz>
Co-authored-by: Lei Sun <lesun@linkedin.com>
Co-authored-by: Jiefan <jiefli@linkedin.com>
Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com>
Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com>
Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>
Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz>
Co-authored-by: Jack Moseley <jmoseley@linkedin.com>
* Add flink 1.14 artifacts for release
Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
Co-authored-by: Ratandeep Ratti <rratti@linkedin.com>
Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu>
Co-authored-by: Kuai Yu <kuyu@linkedin.com>
Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com>
Co-authored-by: Sushant Raikar <sraikar@linkedin.com>
Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com>
Co-authored-by: Wenye Zhang <wyzhang@linkedin.com>
Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com>
Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com>
Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz>
Co-authored-by: Lei Sun <lesun@linkedin.com>
Co-authored-by: Jiefan <jiefli@linkedin.com>
Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com>
Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com>
Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>
Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz>
Co-authored-by: Jack Moseley <jmoseley@linkedin.com>

[LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC

01c1663

github-actions bot added ORC SPARK labels Nov 11, 2021

autumnust requested changes Nov 11, 2021

funcheetah reviewed Nov 11, 2021

rzhang10 mentioned this pull request Nov 11, 2021

Avro: Change union read schema from hive to trino #84

Merged

rzhang10 changed the title ~~ORC: Refactor union schema from hive to trino~~ ORC: Change union read schema from hive to trino Nov 15, 2021

github-actions bot added the HIVE label Dec 6, 2021

Change Hive type to Iceberg type conversion for union

efac40c

rzhang10 force-pushed the orc_refactor_uion_schema_from_hive_to_trino branch from b2fbf75 to efac40c on December 6, 2021 22:39

funcheetah approved these changes Dec 7, 2021

rzhang10 merged commit 5003d3f into linkedin:li-0.11.x Dec 7, 2021

3 participants

rzhang10 commented Nov 11, 2021 (edited)