Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Aug 25, 2025

Why are these changes needed?

Find all diverging schemas, coalesce them if possible, and do so recursively in the presence of structs.
Perform a single pass to gather stats for all columns across all schemas.
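
A minimal sketch of the single-pass idea, assuming schemas are simplified to lists of (column name, type name) pairs rather than real PyArrow schemas. The ColAgg name comes from the review below; its fields and the gather_column_stats helper are hypothetical and only illustrate the approach, not the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ColAgg:
    """Per-column statistics (hypothetical fields, for illustration only)."""

    types: set = field(default_factory=set)  # distinct type names seen
    count: int = 0  # number of schemas containing this column


def gather_column_stats(schemas):
    """Visit every (column, type) pair exactly once across all schemas."""
    stats = {}
    for schema in schemas:
        for name, type_name in schema:
            agg = stats.setdefault(name, ColAgg())
            agg.types.add(type_name)
            agg.count += 1
    return stats


# Simplified stand-ins for schemas: lists of (column name, type name) pairs.
schemas = [
    [("id", "int64"), ("name", "string")],
    [("id", "int64"), ("name", "null")],
    [("id", "double")],
]
stats = gather_column_stats(schemas)

# Columns seen with more than one type are the ones that need coalescing.
diverging = sorted(name for name, agg in stats.items() if len(agg.types) > 1)
print(diverging)  # ['id', 'name']
```

Only the columns reported as diverging would then need any type reconciliation; columns with a single observed type can be passed through unchanged.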

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Goutam V <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner August 25, 2025 09:33
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the performance of unify_schemas by refactoring it to use a single pass for gathering column statistics. The new implementation is not only faster but also more readable and maintainable. The use of a ColAgg dataclass to hold column statistics is a clean approach. I've found one potential issue with override precedence that could lead to incorrect type unification in some cases. Otherwise, this is an excellent improvement.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Aug 25, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Aug 25, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner August 27, 2025 20:25
Contributor

@srinathk10 srinathk10 left a comment

It may be good to add a unify_schemas test case covering many schemas (10) and wide schemas (10k columns), with CI checking that it all gets done in < 1 sec.
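
A rough sketch of what such a timing check could look like, assuming a stand-in function in place of the real unify_schemas and schemas simplified to lists of (name, type) pairs. The names unify_schemas_stub and run_wide_schema_check are hypothetical; the 10 schemas, 10k columns, and 1 second budget are taken from the comment above.

```python
import time


def unify_schemas_stub(schemas):
    """Stand-in for the real function: merges columns in a single pass."""
    merged = {}
    for schema in schemas:
        for name, type_name in schema:
            merged.setdefault(name, type_name)
    return list(merged.items())


def run_wide_schema_check(num_schemas=10, num_columns=10_000, budget_s=1.0):
    """Build many wide schemas, unify them, and enforce a time budget."""
    schemas = [
        [(f"col_{i}", "int64") for i in range(num_columns)]
        for _ in range(num_schemas)
    ]
    start = time.perf_counter()
    unified = unify_schemas_stub(schemas)
    elapsed = time.perf_counter() - start

    assert len(unified) == num_columns
    assert elapsed < budget_s, f"unification took {elapsed:.3f}s"
    return elapsed


elapsed = run_wide_schema_check()
```

Wall-clock assertions can be flaky on shared CI machines, so a real test would likely want a generous budget relative to the typical runtime.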

Signed-off-by: Goutam V. <goutam@anyscale.com>
Comment on lines +380 to +384
schemas[0].remove_metadata()
schemas_to_unify = [schemas[0]]
for schema in schemas[1:]:
    schema.remove_metadata()
    if not schema.equals(schemas[0]):
Contributor

nit: Let's actually use a set here (later we can raise a PR in PyArrow to start caching the hashes)

Contributor Author

I'll use dict.fromkeys() instead to preserve ordering.
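
For illustration, a small sketch of why dict.fromkeys() preserves ordering where a set does not promise to. Plain strings stand in for schema objects here; that substitution is an assumption made only to keep the example self-contained.

```python
# Plain strings stand in for hashable schema objects.
schemas = ["schema_b", "schema_a", "schema_b", "schema_c", "schema_a"]

# dict keys keep insertion order (guaranteed since Python 3.7), so this
# drops duplicates while keeping the first occurrence of each item.
deduped = list(dict.fromkeys(schemas))
print(deduped)  # ['schema_b', 'schema_a', 'schema_c']

# A set removes the same duplicates, but its iteration order is not
# guaranteed to match the input order.
same_members = set(deduped) == set(schemas)
```

Both approaches require the items to be hashable.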

Contributor Author

Actually, Spark schemas are dicts, and they're unhashable. This fails test_raydp at the line: df = ds.to_spark(spark)

Contributor

Input has to be PA schema, right?

Contributor Author

If you look at this stack trace:


[2025-09-10T22:10:49Z] _____________________________ test_raydp_roundtrip _____________________________
[2025-09-10T22:10:49Z]
[2025-09-10T22:10:49Z] spark = <pyspark.sql.session.SparkSession object at 0x7f086c7c2190>
[2025-09-10T22:10:49Z]
[2025-09-10T22:10:49Z]     def test_raydp_roundtrip(spark):
[2025-09-10T22:10:49Z]         spark_df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["one", "two"])
[2025-09-10T22:10:49Z]         rows = [(r.one, r.two) for r in spark_df.take(3)]
[2025-09-10T22:10:49Z]         ds = ray.data.from_spark(spark_df)
[2025-09-10T22:10:49Z]         values = [(r["one"], r["two"]) for r in ds.take(6)]
[2025-09-10T22:10:49Z]         assert values == rows
[2025-09-10T22:10:49Z] >       df = ds.to_spark(spark)
[2025-09-10T22:10:49Z]
[2025-09-10T22:10:49Z] python/ray/data/tests/test_raydp.py:30:
[2025-09-10T22:10:49Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-09-10T22:10:49Z] /rayci/python/ray/data/dataset.py:5594: in to_spark
[2025-09-10T22:10:49Z]     schema = self.schema()
[2025-09-10T22:10:49Z] /rayci/python/ray/data/dataset.py:3459: in schema
[2025-09-10T22:10:49Z]     base_schema = self._plan.schema(fetch_if_missing=False)
[2025-09-10T22:10:49Z] /rayci/python/ray/data/_internal/plan.py:395: in schema
[2025-09-10T22:10:49Z]     schema = self._logical_plan.dag.infer_schema()
[2025-09-10T22:10:49Z] /rayci/python/ray/data/_internal/logical/operators/from_operators.py:77: in infer_schema
[2025-09-10T22:10:49Z]     return unify_ref_bundles_schema(self._input_data)
[2025-09-10T22:10:49Z] /rayci/python/ray/data/_internal/util.py:791: in unify_ref_bundles_schema
[2025-09-10T22:10:49Z]     return unify_schemas_with_validation(schemas_to_unify)
[2025-09-10T22:10:49Z] /rayci/python/ray/data/_internal/util.py:775: in unify_schemas_with_validation
[2025-09-10T22:10:49Z]     return unify_schemas(schemas_to_unify, promote_types=True)
[2025-09-10T22:10:49Z] /rayci/python/ray/data/_internal/arrow_ops/transform_pyarrow.py:325: in unify_schemas
[2025-09-10T22:10:49Z]     schemas_to_unify = list(dict.fromkeys(schemas))
[2025-09-10T22:10:49Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-09-10T22:10:49Z]
[2025-09-10T22:10:49Z] >   ???
[2025-09-10T22:10:49Z] E   TypeError: unhashable type: 'dict'

It seems that the schema becomes a dict. infer_schema() seems to be the one that converts it.
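
A minimal sketch of the failure mode and one possible workaround, assuming dict-shaped schemas like those in the trace. The helper dedupe_by_equality is hypothetical: it mirrors the equality-based loop from the diff under review by comparing with == instead of hashing, and is not claimed to be the fix that was merged.

```python
def dedupe_by_equality(items):
    """Order-preserving de-duplication that relies on == instead of hash().

    Quadratic in the number of distinct items, which is usually small
    when the items are schemas.
    """
    unique = []
    for item in items:
        if not any(item == seen for seen in unique):
            unique.append(item)
    return unique


# Dict-shaped schemas (hypothetical contents) cannot be used as dict keys.
dict_schemas = [{"one": "bigint"}, {"one": "bigint"}, {"two": "string"}]

try:
    list(dict.fromkeys(dict_schemas))
    hashing_failed = False
except TypeError:
    # dicts are unhashable, so dict.fromkeys() rejects them.
    hashing_failed = True

unique = dedupe_by_equality(dict_schemas)
print(unique)  # [{'one': 'bigint'}, {'two': 'string'}]
```

The trade-off is that equality-based de-duplication avoids the hashability requirement at the cost of pairwise comparisons.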

Comment on lines +349 to +350
# If we raise only on non tensor errors, it fails to unify PythonObjectType and pyarrow primitives.
# Look at test_pyarrow_conversion_error_handling for an example.
Contributor Author

@alexeykudinkin just fyi

Contributor

Ack. What do exceptions look like in these cases?

I want to limit the scope of it as much as possible

Contributor Author

pyarrow.lib.ArrowTypeError: Unable to merge: Field my_data has incompatible types: string vs extension<ray.data.arrow_pickled_object<ArrowPythonObjectType>>

Contributor

@alexeykudinkin alexeykudinkin left a comment

LGTM, minor comments

if not (pyarrow.types.is_list(t) and pyarrow.types.is_null(t.value_type)):
    return t
# Let PyArrow handle other cases
return None
Contributor

What does this mean?

Contributor Author

At this phase, it will error out because Arrow can't handle the case and we can't reconcile either. I'll clarify the comment.

@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 12, 2025 05:27
@alexeykudinkin alexeykudinkin merged commit df5951e into ray-project:master Sep 12, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/fix_schema_unification branch September 12, 2025 17:36
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

4 participants