Fix support for Hive timestamp type #61

rzhang10 · 2021-03-28T20:25:11Z

This PR fixes support for the Hive timestamp type so that it is correctly converted to the timestamp (timestamp without time zone) type in Iceberg. This will allow PR #60 to read Hive tables with timestamp columns.

shenodaguirguis

LGTM. It would be great if you could elaborate in the PR description on why/how not adding ADJUST_TO_UTC_PROP=false causes issues.
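
To make the conversion and the flag discussed above concrete, here is a minimal, stdlib-only sketch. It is not the code from this PR: the class `TimestampMappingSketch`, its methods, and the map used as a stand-in for an Avro schema are all hypothetical. The property name `adjust-to-utc` is assumed to match the string behind Iceberg's `AvroSchemaUtil.ADJUST_TO_UTC_PROP`, and the strict check that rejects a missing flag is an assumed failure mode, shown only to illustrate why an explicit `false` might matter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only; not the PR's implementation and not Iceberg's API.
final class TimestampMappingSketch {
  // Assumption: mirrors the string value of AvroSchemaUtil.ADJUST_TO_UTC_PROP.
  static final String ADJUST_TO_UTC_PROP = "adjust-to-utc";

  // Hypothetical stand-in for the Avro schema produced for a Hive `timestamp` column.
  static Map<String, Object> hiveTimestampToAvro() {
    Map<String, Object> schema = new LinkedHashMap<>();
    schema.put("type", "long");
    schema.put("logicalType", "timestamp-millis");
    // Hive timestamps carry no time zone, so the flag is set explicitly to false.
    schema.put(ADJUST_TO_UTC_PROP, Boolean.FALSE);
    return schema;
  }

  // Maps the modeled Avro schema to an Iceberg type name:
  // "timestamptz" (with zone) or "timestamp" (without zone).
  static String avroToIcebergType(Map<String, Object> schema) {
    Object logical = schema.get("logicalType");
    if (!"timestamp-millis".equals(logical) && !"timestamp-micros".equals(logical)) {
      throw new IllegalArgumentException("Not a timestamp schema: " + schema);
    }
    Object adjustToUtc = schema.get(ADJUST_TO_UTC_PROP);
    if (!(adjustToUtc instanceof Boolean)) {
      // Assumed failure mode: a strict reader rejects a missing or malformed flag.
      throw new IllegalArgumentException(
          "Invalid value for " + ADJUST_TO_UTC_PROP + ": " + adjustToUtc);
    }
    return ((Boolean) adjustToUtc) ? "timestamptz" : "timestamp";
  }

  public static void main(String[] args) {
    System.out.println(avroToIcebergType(hiveTimestampToAvro())); // prints "timestamp"
  }
}
```

In this model, removing the flag from the map makes `avroToIcebergType` throw, while setting it to `true` yields `timestamptz`. Whether the real reader fails or silently picks a default when the property is absent depends on the Iceberg version in use, which is the detail the review comment asks to have documented.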

hive-metastore/src/test/java/org/apache/iceberg/hive/legacy/TestMergeHiveSchemaWithAvro.java

shardulm94 · 2021-03-30T00:09:34Z

Thanks @rzhang10! Merged. Can you update the PR description with why/how the fix works as @shenodaguirguis suggested in #61 (review)?

…sistencies (linkedin#57)
Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (linkedin#58)
Hive Metadata Scan: Fix support for Hive timestamp type (linkedin#61)
Co-authored-by: Raymond Zhang <razhang@linkedin.com>
Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>

* Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies (#57)
  Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (#58)
  Hive Metadata Scan: Fix support for Hive timestamp type (#61)
  Co-authored-by: Raymond Zhang <razhang@linkedin.com>
  Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
* Hive Catalog: Additional logging for HiveMetadataPreservingTableOperations (#62)
* Stop using serdeToFileFormat to unblock formats other than Avro or Orc (#64)
* Stop using serdeToFileFormat to unblock formats other than Avro or Orc
* Fix style check
* Spark: Allow reading timestamp without timezone
  Cherry-picked PR 48: Spark: Allow reading timestamp without time zone
  Fix style and refactor read-timestamp-without-zone option to constant after rebase
  Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
  Co-authored-by: nagarathnam200 <nagarathnam200@gmail.com>

…sistencies (linkedin#57)
Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (linkedin#58)
Hive Metadata Scan: Fix support for Hive timestamp type (linkedin#61)
Co-authored-by: Raymond Zhang <razhang@linkedin.com>
Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (linkedin#67)
* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time
* Trigger CI
(cherry picked from commit b90e838)

* Hive Catalog: Add a hive catalog that does not override existing Hive metadata (#10)
  Add custom hive catalog to not override existing Hive metadata
  Fail early with a proper exception if the metadata file is not existing
  Simplify CustomHiveCatalog (#22)
* Shading: Add a iceberg-runtime shaded module (#12)
* ORC: Add test for reading files without Iceberg IDs (#16)
* Hive Metadata Scan: Support reading tables with only Hive metadata (#23, #24, #25, #26)
  - Support for non string partition columns (#24)
  - Support for Hive tables without avro.schema.literal (#25)
  - Hive Metadata Scan: Notify ScanEvent listeners on planning (#35)
  - Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (#37)
  - Hive Metadata Scan: Return empty statistics (#49)
  - Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (#50)
  - Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (#51)
  Co-authored-by: Ratandeep Ratti <rratti@linkedin.com>
  Co-authored-by: Kuai Yu <kuyu@linkedin.com>
  Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com>
* Row level filtering: Allow table scans to pass a row level filter for ORC files
  - ORC: Support NameMapping with row-level filtering (#53)
* Hive: Made Predicate Pushdown dynamic based on the Hive Version
* Hive: Fix uppercase bug and determine catalog from table properties (#38)
* Hive: Return lowercase fieldname from IcebergRecordStructField
* Hive: Determine catalog from table property
* Hive: Fix schema not forwarded to SerDe on MR jobs (#45) (#47)
* Hive: Use Hive table location in HiveIcebergSplit
* Hive: Fix schema not passed to Serde
* Hive: Refactor tests for tables with unqualified location URI
  Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
* Hive Metadata Scan: Support case insensitive name mapping (#52)
* Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies (#57)
  Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (#58)
  Hive Metadata Scan: Fix support for Hive timestamp type (#61)
  Co-authored-by: Raymond Zhang <razhang@linkedin.com>
  Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
  Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (#67)
* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time
* Trigger CI
  (cherry picked from commit b90e838)
* Stop using serdeToFileFormat to unblock formats other than Avro or Orc (#64)
* Stop using serdeToFileFormat to unblock formats other than Avro or Orc
* Fix style check
* Do not delete metadata location when HMS has been successfully updated (#68)
  (cherry picked from commit 766407e)
* Support reading Avro complex union types (#73)
  Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* [#2039] Support default value semantic for AVRO (#75)
  (cherry picked from commit c18f4c4)
* Support hive non string partition cols (#78)
* Support non-string hive type partition columns in LegacyHiveTableScan
* Leverage eval against partition filter expression to filter non-string columns
* Support default value read for ORC format in spark (#76)
* Support default value read for ORC format in spark
* Refactor common code for ReadBuilder for both non-vectorized and vectorized read
* Fix code style issue
* Add special handling of ROW_POSITION metadata column
* Add corner case check for partition field
* Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type contants such as array/map/struct
* Support nested type default value for vectorized read
* Support deeply nested type default value for vectorized read
* Support reading ORC complex union types (#74)
* Support reading orc complex union types
* add more tests
* support union in VectorizedSparkOrcReaders and improve tests
* support union in VectorizedSparkOrcReaders and improve tests - continued
* fix checkstyle
  Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* Support avro.schema.literal/hive union types in Hive legacy table to Iceberg conversion (#80)
* Fix ORC schema visitors to support reading ORC files with deeply nest… (#81)
* Fix ORC schema visitors to support reading ORC files with deeply nested union type schema
* Added test for vectorized read
* Disable avro validation for default values
  Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com>
* Fix spark avro reader reading union schema data (#83)
* Fix spark avro reader to read correctly structured nested data values
* Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union
* Avro: Change union read schema from hive to trino (#84)
* [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro
* ORC: Change union read schema from hive to trino (#85)
* [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC
* Change Hive type to Iceberg type conversion for union
* Recorder hive table properties to align the avro.schema.literal placement contract (#86)
* [#2039] Support default value semantic for AVRO
  (cherry picked from commit c18f4c4)
* reverting commits 2c59857 and f362aed (#88)
  Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz>
* logically patching PR 2328 on HiveMetadataPreservingTableOperations
* Support timestamp as partition type (#91)
* Support timestamp in partition types
* Address comment
* Separate classes under hive legacy package to new hivelink module (#87)
* separate class under legacy to new hiveberg module
* fix build
* remove hiveberg dependency in iceberg-spark2 module
* Revert "remove hiveberg dependency in iceberg-spark2 module"
  This reverts commit 2e8b743.
* rename hiveberg module to hivelink
  Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* [LI] Align default value validation align with avro semantics in terms of nullable (nested) fields (#92)
* Align default value validation align with avro semantics in terms of nullable (nested) fields
* Allow setting null as default value for nested fields in record default
* [LI][Spark][Avro] read avro union using decoder instead of directly returning v… (#94)
* [LI][Spark] read avro union using decoder instead of directly returning value
* Add a comment for the schema
* Improve the logging when the deserailzed index is invalid to read the symbol from enum (#96)
* Move custom hive catalog to hivelink-core (#99)
* Handle non-nullable union of single type for Avro (#98)
* Handle non-nullable union of single type
  Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
* Handle null default in nested type default value situations (#100)
* Move 'Hive Metadata Scan: Support case insensitive name mapping' (PR 52) to hivelink-core (#102)
* Remove activeSparkSession (#103)
* Disable default value preserving (#106)
* Disable default value preserving
* [LI][Avro] Do not reorder elements inside a Avro union schema (#93)
* handle single type union properly in AvroSchemaVisitor for deep nested schema (#107)
* Handle non-nullable union of single type for ORC spark non-vectorized reader (#104)
* Handle single type union for non-vectorized reader
* [Avro] Retain the type of field while copying the default values. (#109)
* Retain the type of field while copying the default values.
* [Hivelink] Refactor support hive non string partition cols to rid of … (#110)
* [Hivelink] Refactor support hive non string partition cols to rid of Iceberg-oss code changes
* Release automation overhaul: Sonatype Nexus, Shipkit and GH Actions (#101)
* Add scm and developer info (#111)
* [Core] Fix and refactor schema parser (#112)
* [Core] Fix/Refactor SchemaParser to fix multiple bugs
* Enhance the UT for testing required fields with default values (#113)
* Enhance the UT for testing required fields with default values
* Addressed review comments
* Addressed review comment
* Support single type union for ORC-vectorization reader (#114)
* Support single type union for ORC-vectorization reader
* Support single type union for ORC-vectorization reader
  Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz>
* Refactor HMS code upon cherry-pick
* Check for schema corruption and fix it on commit (#117)
* Check for schema corruption and fix it on commit
* ORC: Handle query where select and filter only uses default value col… (#118)
* ORC: Handle query where select and filter only use default value columns
* Set ORC columns and fix case-sensitivity issue with schema check (#119)
* Hive: Return null for currentSnapshot() (#121)
* Hive: Return null for currentSnapshot()
* Handle snapshots()
* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes (#120)
* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes
* Add logic to derive partition column id from partition.column.ids pro… (#122)
* Add logic to derive partition column id from partition.column.ids property
* Do not push down filter to ORC for union type schema (#123)
* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for li… (#125)
* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for list and map when they are nullable
* LinkedIn rebase draft
* Refactor hivelink 1
* Make hivelink module test all pass
* Make spark 2.4 module work
* Fix mr module
* Make spark 3.1 module work
* Fix TestSparkMetadataColumns
* Minor fix for spark 2.4
* Update default spark version to 3.1
* Update java ci to only run spark 2.4 and 3.1
* Minor fix HiveTableOperations
* Adapt github CI to 0.14.x branch
* Fix mr module checkstyle
* Fix checkstyle for orc module
* Fix spark2.4 checkstyle
* Refactor catalog loading logic using CatalogUtil
* Minor change to CI/release
  Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
  Co-authored-by: Ratandeep Ratti <rratti@linkedin.com>
  Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu>
  Co-authored-by: Kuai Yu <kuyu@linkedin.com>
  Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com>
  Co-authored-by: Sushant Raikar <sraikar@linkedin.com>
  Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com>
  Co-authored-by: Wenye Zhang <wyzhang@linkedin.com>
  Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz>
  Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com>
  Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com>
  Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz>
  Co-authored-by: Lei Sun <lesun@linkedin.com>
  Co-authored-by: Jiefan <jiefli@linkedin.com>
  Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com>
  Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com>
  Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>
  Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz>
  Co-authored-by: Jack Moseley <jmoseley@linkedin.com>

* Rebase LI-Iceberg changes on top of Apache Iceberg 1.0.0 release
* Make spark 3.1 module work * Fix TestSparkMetadataColumns * Minor fix for spark 2.4 * Update default spark version to 3.1 * Update java ci to only run spark 2.4 and 3.1 * Minor fix HiveTableOperations * Adapt github CI to 0.14.x branch * Fix mr module checkstyle * Fix checkstyle for orc module * Fix spark2.4 checkstyle * Refactor catalog loading logic using CatalogUtil * Minor change to CI/release Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> Co-authored-by: Ratandeep Ratti <rratti@linkedin.com> Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu> Co-authored-by: Kuai Yu <kuyu@linkedin.com> Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com> Co-authored-by: Sushant Raikar <sraikar@linkedin.com> Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com> Co-authored-by: Wenye Zhang <wyzhang@linkedin.com> Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz> Co-authored-by: Lei Sun <lesun@linkedin.com> Co-authored-by: Jiefan <jiefli@linkedin.com> Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com> Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com> Co-authored-by: Yiqiang Ding <yiqding@linkedin.com> Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz> Co-authored-by: Jack Moseley <jmoseley@linkedin.com> * Add flink 1.14 artifacts for release Co-authored-by: Shardul Mahadik <smahadik@linkedin.com> Co-authored-by: Ratandeep Ratti <rratti@linkedin.com> Co-authored-by: Shardul Mahadik <shardul.m@somaiya.edu> Co-authored-by: Kuai Yu <kuyu@linkedin.com> Co-authored-by: Walaa Eldin Moustafa <wmoustafa@linkedin.com> Co-authored-by: Sushant Raikar <sraikar@linkedin.com> Co-authored-by: ZihanLi58 <48699939+ZihanLi58@users.noreply.github.com> Co-authored-by: Wenye Zhang <wyzhang@linkedin.com> 
Co-authored-by: Wenye Zhang <wyzhang@wyzhang-mn1.linkedin.biz> Co-authored-by: Shenoda Guirguis <sguirguis@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@linkedin.com> Co-authored-by: Shenoda Guirguis <sguirgui@sguirgui-mn1.linkedin.biz> Co-authored-by: Lei Sun <lesun@linkedin.com> Co-authored-by: Jiefan <jiefli@linkedin.com> Co-authored-by: yiqiangin <103528904+yiqiangin@users.noreply.github.com> Co-authored-by: Malini Mahalakshmi Venkatachari <maluchari@gmail.com> Co-authored-by: Yiqiang Ding <yiqding@linkedin.com> Co-authored-by: Yiqiang Ding <yiqding@yiqding-mn1.linkedin.biz> Co-authored-by: Jack Moseley <jmoseley@linkedin.com>

Fix support for Hive timestamp type

5d0dde2

shenodaguirguis approved these changes Mar 29, 2021

shardulm94 reviewed Mar 29, 2021

hive-metastore/src/test/java/org/apache/iceberg/hive/legacy/TestMergeHiveSchemaWithAvro.java (outdated, resolved)

rzhang10 and others added 2 commits March 29, 2021 14:47

Address pr comments

29efeeb

Remove unnecessary assignment

b8118a6

shardulm94 merged commit 3d8a057 into linkedin:master Mar 30, 2021
