[MINOR] Fix logical type issue for timestamp columns #17601
base: branch-0.x
Conversation
lokeshj1703 left a comment:
@linliu-code Thanks for working on this! The PR contains a few changes that are not part of https://github.com/apache/hudi/pull/14161/files. Can we add a description of how the fix works for older Hudi tables? The original PR also mentions a limitation:
However, we used the InternalSchema system to do various operations such as fix null ordering, reorder, and add columns. At the time, InternalSchema only had a single Timestamp type. When converting back to avro, this was assumed to be micros.
Is this limitation fixed for older Hudi tables?
hudi-common/src/avro/test/java/org/apache/parquet/schema/TestAvroSchemaRepair.java
public static boolean hasTimestampMillisField(Schema tableSchema) {
  switch (tableSchema.getType()) {
    case RECORD:
      for (Schema.Field field : tableSchema.getFields()) {
        if (hasTimestampMillisField(field.schema())) {
          return true;
        }
      }
      return false;
    case ARRAY:
      return hasTimestampMillisField(tableSchema.getElementType());
    case MAP:
      return hasTimestampMillisField(tableSchema.getValueType());
    case UNION:
      return hasTimestampMillisField(AvroSchemaUtils.getNonNullTypeFromUnion(tableSchema));
    default:
      return tableSchema.getType() == Schema.Type.LONG
          && (tableSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis
              || tableSchema.getLogicalType() instanceof LogicalTypes.LocalTimestampMillis);
  }
}

/**
 * Check if LogicalTypes.LocalTimestampMillis is supported in the current Avro version.
 *
 * @return true if LocalTimestampMillis is available, false otherwise
 */
public static boolean isLocalTimestampMillisSupported() {
  try {
    return Arrays.stream(LogicalTypes.class.getDeclaredClasses())
        .anyMatch(c -> c.getSimpleName().equals("LocalTimestampMillis"));
  } catch (Exception e) {
    return false;
  }
}
It seems like these APIs are not used. Should we remove these?
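For illustration, a minimal usage sketch of the check above. The containing class (HoodieAvroUtils here) and the schema/field names are assumptions for the example, not taken from the PR.

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class TimestampMillisCheckExample {
  public static void main(String[] args) {
    // A long column annotated with the timestamp-millis logical type.
    Schema millisTs = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema record = SchemaBuilder.record("trip").fields()
        .requiredString("id")
        .name("event_time").type(millisTs).noDefault()
        .endRecord();
    // HoodieAvroUtils is an assumed location for the helper; this prints "true"
    // because event_time carries the timestamp-millis logical type.
    System.out.println(HoodieAvroUtils.hasTimestampMillisField(record));
  }
}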
public static Option<Schema> findNestedFieldSchema(Schema schema, String fieldName) {
  if (StringUtils.isNullOrEmpty(fieldName)) {
    return Option.empty();
  }
  String[] parts = fieldName.split("\\.");
  for (String part : parts) {
    Schema.Field foundField = getNonNullTypeFromUnion(schema).getField(part);
    if (foundField == null) {
      throw new HoodieAvroSchemaException(fieldName + " not a field in " + schema);
    }
    schema = foundField.schema();
  }
  return Option.of(getNonNullTypeFromUnion(schema));
}

public static Option<Schema.Type> findNestedFieldType(Schema schema, String fieldName) {
  return findNestedFieldSchema(schema, fieldName).map(Schema::getType);
}
These APIs are not used anywhere.
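For context, a small hypothetical usage sketch of these helpers. The schema, the field names, and the assumption that they live in AvroSchemaUtils are illustrative only.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hudi.avro.AvroSchemaUtils; // assumed package of the helpers
import org.apache.hudi.common.util.Option;

public class NestedFieldLookupExample {
  public static void main(String[] args) {
    // Record with a nested record field addressed as "driver.rating".
    Schema tripSchema = SchemaBuilder.record("trip").fields()
        .name("driver").type(SchemaBuilder.record("driver").fields()
            .requiredDouble("rating")
            .endRecord()).noDefault()
        .endRecord();
    // Resolves the dot-separated path against the (non-null) schema at each level.
    Option<Schema> ratingSchema = AvroSchemaUtils.findNestedFieldSchema(tripSchema, "driver.rating");
    Option<Schema.Type> ratingType = AvroSchemaUtils.findNestedFieldType(tripSchema, "driver.rating");
    // ratingType is Schema.Type.DOUBLE; a non-existent path such as "driver.age"
    // throws HoodieAvroSchemaException.
    System.out.println(ratingType);
  }
}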
// NOTE: Those are not supported in Avro 1.8.2 (used by Spark 2)
// Only add conversions if they're available
Should we validate the fix and the added tests with Spark 2? I am not sure if CI covers it by default.
Right now we only make the conversion for Spark 3.4+.
  return Instant.ofEpochSecond(epochSeconds, nanoAdjustment);
}

public static Instant nanosToInstant(long nanosFromEpoch) {
These are unused
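The body of nanosToInstant is truncated in the diff above. As a reference only, a minimal sketch of how such a conversion is commonly written with java.time.Instant (an assumption, not necessarily the PR's actual code):

public static Instant nanosToInstant(long nanosFromEpoch) {
  // Split nanoseconds since the epoch into whole seconds plus a non-negative nano adjustment;
  // floorDiv/floorMod keep the adjustment in [0, 1e9) even for negative (pre-epoch) inputs.
  long epochSeconds = Math.floorDiv(nanosFromEpoch, 1_000_000_000L);
  long nanoAdjustment = Math.floorMod(nanosFromEpoch, 1_000_000_000L);
  return Instant.ofEpochSecond(epochSeconds, nanoAdjustment);
}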
Change Logs
PR #9743 added more schema evolution functionality and schema processing. However, we used the InternalSchema system for various operations such as fixing null ordering, reordering columns, and adding columns. At the time, InternalSchema had only a single Timestamp type, and when converting back to Avro it was assumed to be micros. Therefore, if the schema provider had any timestamp-millis columns, the processed schema would end up with those columns typed as micros.
PR #13711, which updated column stats with better support for logical types, fixed these schema issues as well as additional issues with handling and converting timestamps during ingestion.
This PR adds functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns undergo an evolution from timestamp-micros to timestamp-millis. This would normally be a lossy evolution that is not supported, but it is safe here because the data is actually still timestamp-millis; it is only mislabeled as micros in the Parquet and table schemas.
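To make the mislabeling concrete, a small illustrative snippet (values and names are hypothetical, not from the PR): both annotations sit on an Avro LONG, so only the logical type tells a reader how to scale the stored number.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampMislabelExample {
  public static void main(String[] args) {
    // What the writer intended: a long holding milliseconds since the epoch.
    Schema intendedMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    // What the InternalSchema round-trip produced: the same long labeled as microseconds.
    Schema mislabeledMicros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));
    System.out.println(intendedMillis.getLogicalType().getName());   // timestamp-millis
    System.out.println(mislabeledMicros.getLogicalType().getName()); // timestamp-micros

    long storedValue = 1_700_000_000_000L; // written as milliseconds
    System.out.println(Instant.ofEpochMilli(storedValue));                  // 2023-11-14T22:13:20Z (correct)
    System.out.println(Instant.EPOCH.plus(storedValue, ChronoUnit.MICROS)); // 1970-01-20T16:13:20Z (off by 1000x)
  }
}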
Impact
When reading from a Hudi table with the Spark or Hive reader, if the table schema has a column as timestamp-millis but the data schema has it as timestamp-micros, we assume the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths, so as a table is rewritten the Parquet files become correct. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary and then clustering the entire table.
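A simplified sketch of the correction described above (not the actual reader code; the method and variable names are made up): when the table schema says millis but the data schema says micros, the stored long is kept as millis instead of being scaled down.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampReadRepairSketch {
  // Hypothetical helper: interpret a stored long according to both schemas.
  static Instant readTimestamp(long storedValue, Schema tableFieldSchema, Schema dataFieldSchema) {
    boolean tableSaysMillis = tableFieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    boolean dataSaysMicros = dataFieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMicros;
    if (tableSaysMillis && dataSaysMicros) {
      // Affected column: the value was written as millis and only mislabeled as micros.
      return Instant.ofEpochMilli(storedValue);
    }
    if (dataSaysMicros) {
      // Genuine micros column: scale as usual.
      return Instant.EPOCH.plus(storedValue, ChronoUnit.MICROS);
    }
    // Millis in both the table and data schemas.
    return Instant.ofEpochMilli(storedValue);
  }
}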
Risk level (write none, low medium or high below)
High. Extensive testing was done and functional tests were added.
Documentation Update
#14100
Contributor's checklist