@linliu-code
Collaborator

@linliu-code linliu-code commented Dec 15, 2025

Change Logs

PR #9743 added more schema evolution functionality and schema processing. However, it used the InternalSchema system for operations such as fixing null ordering, reordering columns, and adding columns. At the time, InternalSchema had only a single Timestamp type, and when converting back to Avro this type was assumed to be micros. Therefore, if the schema provider had any millis columns, the processed schema ended up with those columns as micros.
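To make the effect of the mislabeling concrete, here is a minimal sketch in plain Python (not Hudi code; `decode_timestamp` and the sample value are hypothetical). It decodes the same stored long once with the unit the source declared and once with the unit the processed schema ended up with.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_timestamp(raw: int, unit: str) -> datetime:
    """Decode a raw epoch offset (stored as a long) using its declared unit."""
    if unit == "millis":
        return EPOCH + timedelta(milliseconds=raw)
    if unit == "micros":
        return EPOCH + timedelta(microseconds=raw)
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical value written by a source that declares timestamp-millis.
raw = 1_765_756_800_000  # 2025-12-15T00:00:00Z in epoch milliseconds

correct = decode_timestamp(raw, "millis")     # unit the data was written with
mislabeled = decode_timestamp(raw, "micros")  # unit the processed schema claims

print(correct.isoformat())     # a date in December 2025
print(mislabeled.isoformat())  # a date in January 1970
```

The stored long is identical in both cases; only the declared unit differs, which is why the values read back as dates a few weeks after the epoch.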

PR #13711, which updated column stats with better support for logical types, fixed these schema issues along with additional issues in the handling and conversion of timestamps during ingestion.

This PR aims to add functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns will undergo evolution from timestamp-micros to timestamp-millis. This is normally a lossy evolution that is not supported, but it is safe here because the data is actually still timestamp-millis; it is only mislabeled as micros in the parquet and table schemas.
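The difference between a genuine micros-to-millis evolution and the relabeling described here can be sketched as follows. This is an illustration in plain Python, not the repair code in this PR; both function names and both sample values are made up for the example.

```python
def rescale_micros_to_millis(raw_micros: int) -> int:
    """Genuine micros -> millis evolution: divides by 1000 and drops sub-millisecond digits."""
    return raw_micros // 1000


def relabel_micros_as_millis(raw: int) -> int:
    """Repair for affected columns: the stored long is already millis, so only the label changes."""
    return raw


# A column that truly holds microseconds loses precision when evolved to millis.
true_micros = 1_765_756_800_123_456
rescaled = rescale_micros_to_millis(true_micros)
lost_micros = true_micros - rescaled * 1000  # the sub-millisecond part that is discarded

# An affected column holds millis but is labeled micros; relabeling keeps every digit.
affected_raw = 1_765_756_800_123
repaired = relabel_micros_as_millis(affected_raw)
```

In the first case 456 microseconds are discarded, which is why that evolution is normally rejected. In the second case the stored value is untouched, so nothing can be lost.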

Impact

When reading from a Hudi table with the Spark or Hive reader, if the table schema declares a column as millis but the data schema declares it as micros, we assume that the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths. As a table is rewritten, the rewritten parquet files will be correct. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary and then clustering the entire table.
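The detection rule above can be sketched as a comparison between the table schema and the schema found in a data file. This is a simplified model in plain Python and not the actual reader code: `Field`, `columns_to_repair`, and the string type labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    name: str
    logical_type: str  # e.g. "timestamp-millis", "timestamp-micros", "long"


def columns_to_repair(table_schema: List[Field], file_schema: List[Field]) -> List[str]:
    """Return columns the table declares as millis but the data file declares as micros."""
    table_types = {f.name: f.logical_type for f in table_schema}
    return [
        f.name
        for f in file_schema
        if f.logical_type == "timestamp-micros"
        and table_types.get(f.name) == "timestamp-millis"
    ]


table_schema = [
    Field("id", "long"),
    Field("created_at", "timestamp-millis"),
    Field("updated_at", "timestamp-micros"),
]
file_schema = [
    Field("id", "long"),
    Field("created_at", "timestamp-micros"),  # mislabeled by the earlier bug
    Field("updated_at", "timestamp-micros"),  # genuinely micros, left alone
]

affected = columns_to_repair(table_schema, file_schema)
print(affected)
```

Only `created_at` is flagged: `updated_at` is micros in both schemas, so it is read as usual.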

Risk level (write none, low, medium or high below)

High. Extensive testing was done and functional tests were added.

Documentation Update

#14100

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@linliu-code linliu-code changed the base branch from master to branch-0.x December 15, 2025 20:34
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from ac2916a to 5ef5773 Compare December 15, 2025 20:40
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Dec 15, 2025
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 12 times, most recently from 408cc29 to 0c4e026 Compare December 16, 2025 02:51
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 0c4e026 to 79c4a88 Compare December 16, 2025 03:22
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 13 times, most recently from 87efa59 to 4440884 Compare January 3, 2026 07:46
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 4440884 to 996a1ed Compare January 3, 2026 08:06
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 4 times, most recently from 7595450 to e44e6d2 Compare January 3, 2026 18:50
@lokeshj1703
Collaborator

@linliu-code We can add support for Spark 3.2 in this PR as well if it is not too difficult. The fix should ideally work for Spark 3.2 too.
Also, since local timestamp is not supported in older Hudi versions, it is better to throw UnsupportedOperationException for any conversions related to it. We can remove the supporting logic for local timestamp but still keep it in the enums to preserve ordinal numbering.
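A rough sketch of that suggestion, written in Python for brevity rather than in the project's Java: the enum members stay in place so that later ordinals do not shift, while any conversion involving local timestamps fails fast. `LogicalKind`, its members, and `to_avro_logical_type` are hypothetical names, and `NotImplementedError` stands in for Java's `UnsupportedOperationException`.

```python
from enum import IntEnum


class LogicalKind(IntEnum):
    TIMESTAMP_MILLIS = 0
    TIMESTAMP_MICROS = 1
    LOCAL_TIMESTAMP_MILLIS = 2  # kept only so the ordinals below do not shift
    LOCAL_TIMESTAMP_MICROS = 3  # kept only so the ordinals below do not shift
    DATE = 4


_UNSUPPORTED = {
    LogicalKind.LOCAL_TIMESTAMP_MILLIS,
    LogicalKind.LOCAL_TIMESTAMP_MICROS,
}


def to_avro_logical_type(kind: LogicalKind) -> str:
    """Map an enum member to a logical type name, rejecting local timestamps."""
    if kind in _UNSUPPORTED:
        raise NotImplementedError(f"{kind.name} is not supported in this version")
    return kind.name.lower().replace("_", "-")
```

With this shape, `to_avro_logical_type(LogicalKind.TIMESTAMP_MILLIS)` returns `"timestamp-millis"`, `LogicalKind.DATE` keeps the value 4, and passing either local timestamp member raises.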

@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 5 times, most recently from 6973034 to 103e3b4 Compare January 4, 2026 01:56
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 103e3b4 to e35e8f9 Compare January 4, 2026 02:08
@hudi-bot
Collaborator

hudi-bot commented Jan 4, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
