
Conversation

@goutamvenkat-anyscale
Contributor

Description

  1. Upgrades Pyiceberg to 0.10
  2. Performs Schema Evolution for Iceberg Append Writes


Additional information

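For context, a usage sketch of the new behavior (illustrative only, not the docstring example added in this PR; the catalog configuration and table name below are placeholders):

import pandas as pd
import ray

# Hypothetical SQL catalog configuration; keys and URI are placeholders.
catalog_kwargs = {"name": "default", "type": "sql", "uri": "sqlite:///warehouse.db"}

# First append writes only the "id" column (the table is assumed to already
# exist in the catalog).
ray.data.from_pandas(pd.DataFrame({"id": [1, 2]})).write_iceberg(
    table_identifier="db.demo_table", catalog_kwargs=catalog_kwargs
)

# Second append introduces a new "name" column; with this change the table
# schema evolves automatically instead of the write failing.
ray.data.from_pandas(pd.DataFrame({"id": [3], "name": ["c"]})).write_iceberg(
    table_identifier="db.demo_table", catalog_kwargs=catalog_kwargs
)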

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner December 5, 2025 22:42
@goutamvenkat-anyscale goutamvenkat-anyscale added the "data" (Ray Data-related issues) and "go" (add ONLY when ready to merge, run all tests) labels Dec 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement to the Iceberg datasink by adding support for schema evolution. The changes are well-structured, and the refactoring of IcebergDatasink makes it more robust and easier to understand. The addition of comprehensive tests for schema evolution is also a great contribution.

I have a couple of suggestions for minor optimizations in the on_write_complete method to improve efficiency by reducing redundant operations. Overall, this is an excellent pull request.

Comment on lines 91 to 94
"""
Update the table schema to accommodate incoming data using union-by-name semantics.
property_as_bool = PropertyUtil.property_as_bool
This is called from the driver after reconciling all schemas.
Contributor


Let's make it clear that this can only be called from the driver

(Also think about how we can assert that it's only called from the driver)

Contributor Author


is_driver = ray.get_runtime_context().worker.mode != WORKER_MODE should work
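A minimal sketch of that assertion (the import location of WORKER_MODE is an assumption and may differ across Ray versions):

import ray
from ray._private.worker import WORKER_MODE  # assumed import location


def _assert_on_driver() -> None:
    # Worker processes run with mode == WORKER_MODE; the driver does not.
    mode = ray.get_runtime_context().worker.mode
    assert mode != WORKER_MODE, "_update_schema must only be called from the driver"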

@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/iceberg_schema_thing_v1 branch from bb2f196 to ac28c03 Compare December 6, 2025 04:14

@cursor cursor bot left a comment


Bug: Callback skipped for small datasets in all_inputs_done

The _on_first_input_callback is only invoked in _add_input_inner() but not in all_inputs_done(). When processing small datasets where all bundles don't meet the min_rows_per_bundle threshold during normal processing, the bundles are deferred to all_inputs_done(), which calls _add_bundled_input() directly without invoking the callback. For IcebergDatasink, this means on_write_start() (which handles schema evolution) is never called for small datasets, potentially causing write failures when incoming data has new columns.

python/ray/data/_internal/execution/operators/map_operator.py#L566-L576

def all_inputs_done(self):
    self._block_ref_bundler.done_adding_bundles()
    if self._block_ref_bundler.has_bundle():
        # Handle any leftover bundles in the bundler.
        (
            _,
            bundled_input,
        ) = self._block_ref_bundler.get_next_bundle()
        self._add_bundled_input(bundled_input)
    super().all_inputs_done()

python/ray/data/_internal/planner/plan_write_op.py#L136-L145

if not isinstance(datasink, _FileDatasink):
    if isinstance(datasink, IcebergDatasink):
        # Iceberg needs the schema for schema evolution, use deferred callback
        def on_first_input(bundle: RefBundle):
            schema: Optional["pa.Schema"] = _get_pyarrow_schema_from_bundle(
                bundle
            )
            datasink.on_write_start(schema)

        map_op.set_on_first_input_callback(on_first_input)
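One possible direction for a fix, sketched here rather than taken from the eventual patch (the _first_input_callback_fired flag is hypothetical), is to fire the deferred callback for the leftover bundle as well:

def all_inputs_done(self):
    self._block_ref_bundler.done_adding_bundles()
    if self._block_ref_bundler.has_bundle():
        # Handle any leftover bundles in the bundler.
        _, bundled_input = self._block_ref_bundler.get_next_bundle()
        # Hypothetical fix: invoke the deferred first-input callback here too,
        # so small datasets still trigger on_write_start() before the write.
        if self._on_first_input_callback and not self._first_input_callback_fired:
            self._on_first_input_callback(bundled_input)
            self._first_input_callback_fired = True
        self._add_bundled_input(bundled_input)
    super().all_inputs_done()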



Contributor

@alexeykudinkin alexeykudinkin left a comment


LGTM, minor comments

@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly enhances Ray Data's integration with Apache Iceberg by upgrading the underlying Pyiceberg library and introducing robust schema evolution capabilities. This allows users to append data to Iceberg tables with varying schemas, as the system will automatically adapt the table's schema to accommodate new columns and promote types as needed, simplifying data ingestion workflows.

Highlights

  • Pyiceberg Upgrade: The Pyiceberg library has been upgraded to version 0.10.0, bringing the latest features and improvements from the Iceberg community.
  • Iceberg Schema Evolution: Ray Data's Iceberg datasink now supports automatic schema evolution for append writes. New columns in incoming data are automatically added to the table schema, and type promotion across data blocks is handled through schema reconciliation on the driver.
  • Improved Write Logic: The internal write mechanism has been refactored to collect schemas from all data blocks on workers, reconcile them on the driver, and perform an atomic commit that includes both schema updates and data file appends.
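A rough sketch of that driver-side reconciliation (illustrative only; the helper name is made up, and the promote_options argument assumes a recent PyArrow release):

import pyarrow as pa


def _reconcile_and_evolve(table, worker_schemas):
    # Unify the PyArrow schemas returned by the write tasks; "permissive"
    # promotion allows widening such as int32 -> int64.
    unified = pa.unify_schemas(worker_schemas, promote_options="permissive")
    # Evolve the Iceberg table schema on the driver using union-by-name
    # semantics before committing the new data files.
    with table.update_schema() as update:
        update.union_by_name(unified)
    return unified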
Changelog
  • ci/lint/pydoclint-baseline.txt
    • Removed outdated docstring linting errors for IcebergDatasink.
  • python/ray/data/_internal/datasource/iceberg_datasink.py
    • Updated imports, removing uuid and packaging.version.
    • Modified IcebergDatasink to return a tuple of DataFile objects and pa.Schema objects from the write method, enabling schema collection from all blocks.
    • Refactored __init__, __getstate__, and __setstate__ to manage the _table object more effectively for pickling and unpickling.
    • Introduced _reload_table and _update_schema methods for managing table metadata and performing schema evolution.
    • Added _append_and_commit for encapsulating the transactional append logic.
    • Reworked on_write_start to initialize the table and perform an initial schema evolution based on the first data bundle.
    • Refactored the write method to ensure the table is reloaded on workers if not present and to return schemas from all non-empty blocks.
    • Completely re-implemented on_write_complete to collect all data files and schemas from workers, reconcile them using unify_schemas with type promotion, and then perform an atomic transaction to update the table schema (if necessary) and commit the new data files.
  • python/ray/data/dataset.py
    • Updated the write_iceberg docstring to clearly state that schema evolution is automatically enabled.
    • Added a new example demonstrating how schema evolution works when appending data with new columns.
    • Modified the IcebergDatasink instantiation to use explicit keyword arguments.
  • python/ray/data/tests/test_iceberg.py
    • Added new imports for Catalog and Table from pyiceberg.
    • Corrected an assertion in test_read_basic for string type instead of large_string.
    • Introduced a new TestSchemaEvolution class with helper functions (clean_table, _create_typed_dataframe, _write_to_iceberg, _read_from_iceberg, _verify_schema) to thoroughly test schema evolution.
    • Added specific test cases for test_schema_evolution_add_column, test_multiple_schema_evolutions, and test_column_order_independence.
  • python/requirements/ml/data-test-requirements.txt
    • Upgraded pyiceberg[sql-sqlite] from 0.9.0 to 0.10.0.
  • python/requirements_compiled.txt
    • Upgraded pyiceberg from 0.9.0 to 0.10.0.
    • Added pyroaring==1.0.3 as a new dependency, likely due to the Pyiceberg upgrade.
Activity
  • The author requested a summary of the pull request.
  • A bot identified a bug where only the first block's schema was captured, potentially leading to incorrect type promotion or missing column detection.
  • A bot identified a high-severity bug regarding shared write UUIDs potentially causing file collisions across workers.
  • A bot suggested an optimization for schema reconciliation by combining unify_schemas calls.
  • A bot suggested making _reload_table() conditional in on_write_complete to avoid redundant network calls.
  • A reviewer suggested a type hint change for schema in on_write_start.
  • A reviewer suggested assigning incoming_schema to _target_schema.
  • A reviewer commented on ensuring _update_schema is only called from the driver, with the author suggesting a runtime context check for assertion.
  • A reviewer questioned the necessity of retrying transactions, and the author confirmed retries could be removed as schema updates were no longer worker-side.
  • A reviewer asked about reloading the table on workers, and the author clarified it's for handling unpickled _table=None.
  • A reviewer inquired if schema updates happen before writing, which the author confirmed.
  • A reviewer suggested renaming _update_schema to _try_update_schema.
  • A reviewer asked for elaboration on a condition, leading to a discussion about schema evolution for new columns and type promotion, with the author eventually stating the function could be removed.
  • A reviewer suggested a specific way to unify schemas.
  • A reviewer suggested updating the schema within the write transaction.
  • A reviewer commented "We can't do that" in response to a suggestion, and the author noted the alternative is explicit schema passing by the user.
  • A bot identified a high-severity bug where the schema update transaction in on_write_complete was never committed.
  • A bot identified a medium-severity bug regarding duplicate on_write_start calls for file-based datasinks.
  • A bot identified a low-severity bug about premature table reload before transaction commit.
  • A bot identified a high-severity bug where TensorDtype was not handled correctly in schema conversion, leading to incorrect pyarrow.string() type.
  • A reviewer suggested always invoking on_write_start in the callback.
  • A reviewer suggested making all operations immutable and passing them as constructor arguments.
  • A reviewer suggested passing only the schema, not the input, to _on_start.
  • A bot identified a high-severity bug where schema evolution limited to the first bundle might fail if subsequent blocks introduce new columns. The author acknowledged this as a known limitation.
  • A bot identified a medium-severity bug regarding tensor type inconsistency in schema conversion functions.
  • A reviewer commented that running a Ray Data pipeline from a Ray task (not uncommon) would fail.

Comment on lines 86 to 100
(Diff excerpt: "-" lines are from the previous revision of the file, "+" lines are the new _update_schema method.)

+ def _update_schema(self, incoming_schema: "pa.Schema") -> None:
+     """
+     Update the table schema to accommodate incoming data using union-by-name semantics.
-     property_as_bool = PropertyUtil.property_as_bool
+     .. warning::
+         This method must only be called from the driver process.
+         It performs schema evolution which requires exclusive table access.
-     catalog = self._get_catalog()
-     table = catalog.load_table(self.table_identifier)
-     self._txn = table.transaction()
-     self._io = self._txn._table.io
-     self._table_metadata = self._txn.table_metadata
-     self._uuid = uuid.uuid4()

-     if unsupported_partitions := [
-         field
-         for field in self._table_metadata.spec().fields
-         if not field.transform.supports_pyarrow_transform
-     ]:
-         raise ValueError(
-             f"Not all partition types are supported for writes. Following partitions cannot be written using pyarrow: {unsupported_partitions}."
-         )

-     self._manifest_merge_enabled = property_as_bool(
-         self._table_metadata.properties,
-         TableProperties.MANIFEST_MERGE_ENABLED,
-         TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
-     )
+     Args:
+         incoming_schema: The PyArrow schema to merge with the table schema
+     """
+     with self._table.update_schema() as update:
+         update.union_by_name(incoming_schema)
+     # Succeeded, reload to get latest table version and exit.
+     self._reload_table()
Contributor


This is used in 1 place, let's inline

Comment on lines 185 to 186
# Reload table to get latest metadata
self._reload_table()
Contributor


Why do we need to reload?

Contributor Author


Can remove this.

        )
        assert rows_same(result_df, expected)

    def test_multiple_schema_evolutions(self, clean_table):
Contributor


Let's add a test promoting type (as separate test)

Contributor Author


Added test_schema_evolution_type_promotion
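For reference, such a test might look roughly like this (a sketch only; the fixture contract and helper names are placeholders, not the ones added in the PR):

import pandas as pd
import ray


def test_schema_evolution_type_promotion_sketch(clean_table):
    catalog_kwargs, table_identifier = clean_table  # hypothetical fixture contract
    # First append writes "value" as int32.
    ray.data.from_pandas(
        pd.DataFrame({"id": [1], "value": pd.array([10], dtype="int32")})
    ).write_iceberg(table_identifier=table_identifier, catalog_kwargs=catalog_kwargs)
    # Appending int64 values should promote the column type rather than fail.
    ray.data.from_pandas(
        pd.DataFrame({"id": [2], "value": pd.array([20], dtype="int64")})
    ).write_iceberg(table_identifier=table_identifier, catalog_kwargs=catalog_kwargs)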

@alexeykudinkin alexeykudinkin merged commit 91cf075 into ray-project:master Dec 10, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/iceberg_schema_thing_v1 branch December 10, 2025 05:10

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)


Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects
