[Data] Fixing handling of renames in projection pushdown (#58033) #58037

alexeykudinkin · 2025-10-23T07:12:11Z

Description

Cherry-pick of #58033

## Description

This change correctly handles pushing renaming projections down into read ops that support projections (such as Parquet reads).

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

gemini-code-assist

Code Review

This pull request refactors the projection pushdown logic to correctly handle column renames, which is a great improvement. The core idea is to distinguish simple projections (selects/renames) from complex ones and push down the rename map to the data source for simple cases, avoiding an extra MapBatches operator. The changes are well-structured, introducing a collapse_transitive_map utility for chained renames and updating the ParquetDatasource and logical operators accordingly. The logic seems sound, but I have one suggestion to improve the robustness and clarity of a new helper function.
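To make the chained-rename case concrete, here is a minimal sketch of what collapsing a transitive rename map could look like. The function name `collapse_rename_chain` is hypothetical and written only for illustration; it is not Ray's `collapse_transitive_map`, whose actual behavior (for example, whether intermediate keys are kept) may differ. The sketch assumes that a name produced by one rename and consumed by another is an intermediate and should not appear in the result.

```python
from typing import Dict, Optional


def collapse_rename_chain(rename_map: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Hypothetical helper: resolve chained renames to their final targets.

    Assumption: {"a": "b", "b": "c"} means "a" was renamed to "b" and then
    "b" to "c", so the source only needs to know {"a": "c"}.
    """
    if not rename_map:
        return {}

    targets = set(rename_map.values())
    collapsed: Dict[str, str] = {}

    for source, target in rename_map.items():
        # A key that is also some other entry's target is an intermediate
        # name; it does not exist in the underlying data, so skip it.
        if source in targets:
            continue

        # Follow the chain to its end, guarding against cycles.
        seen = {source}
        while target in rename_map and target not in seen:
            seen.add(target)
            target = rename_map[target]

        collapsed[source] = target

    return collapsed


print(collapse_rename_chain({"a": "b", "b": "c", "x": "y"}))
```

With a map collapsed this way, a reader that supports projections would only need the original column names and their final names, rather than replaying each rename step.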

gemini-code-assist · 2025-10-23T07:13:58Z

python/ray/data/_internal/datasource/parquet_datasource.py

+def _combine_rename_map(
+    prev_column_rename_map: Optional[Dict[str, str]],
+    new_column_rename_map: Optional[Dict[str, str]],
+):
+    if not prev_column_rename_map:
+        combined = new_column_rename_map
+    elif not new_column_rename_map:
+        combined = prev_column_rename_map
+    else:
+        combined = prev_column_rename_map | new_column_rename_map
+
+    return collapse_transitive_map(combined)


This function is missing a return type hint. Based on its usage, it should be -> Dict[str, str].

Additionally, the combined variable can be None when prev_column_rename_map is empty or None and new_column_rename_map is None. While collapse_transitive_map currently handles None input by returning {}, relying on this implicit behavior can be brittle. It's safer and clearer to ensure a dictionary is always passed.

I suggest adding the type hint and making the None handling explicit by using combined or {}.

Suggested change

-def _combine_rename_map(
-    prev_column_rename_map: Optional[Dict[str, str]],
-    new_column_rename_map: Optional[Dict[str, str]],
-):
-    if not prev_column_rename_map:
-        combined = new_column_rename_map
-    elif not new_column_rename_map:
-        combined = prev_column_rename_map
-    else:
-        combined = prev_column_rename_map | new_column_rename_map
-
-    return collapse_transitive_map(combined)
+def _combine_rename_map(
+    prev_column_rename_map: Optional[Dict[str, str]],
+    new_column_rename_map: Optional[Dict[str, str]],
+) -> Dict[str, str]:
+    if not prev_column_rename_map:
+        combined = new_column_rename_map
+    elif not new_column_rename_map:
+        combined = prev_column_rename_map
+    else:
+        combined = prev_column_rename_map | new_column_rename_map
+
+    return collapse_transitive_map(combined or {})
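
The effect of the explicit `combined or {}` guard can be sketched in isolation. In the snippet below, `strict_collapse` is a hypothetical stand-in that deliberately does not tolerate `None`, chosen to show what the guard protects against; it is not Ray's `collapse_transitive_map`. The helper is renamed `combine_rename_map` to make clear it is an illustration, and the dict `|` operator it uses requires Python 3.9 or later.

```python
from typing import Dict, Optional


def strict_collapse(rename_map: Dict[str, str]) -> Dict[str, str]:
    # Hypothetical stand-in: copies the map and raises TypeError on None.
    return dict(rename_map)


def combine_rename_map(
    prev_column_rename_map: Optional[Dict[str, str]],
    new_column_rename_map: Optional[Dict[str, str]],
) -> Dict[str, str]:
    if not prev_column_rename_map:
        combined = new_column_rename_map
    elif not new_column_rename_map:
        combined = prev_column_rename_map
    else:
        # On key conflicts, the right-hand (newer) map wins.
        combined = prev_column_rename_map | new_column_rename_map

    # Always hand a dict to the callee, even when both inputs were None.
    return strict_collapse(combined or {})


print(combine_rename_map(None, None))
print(combine_rename_map({"a": "b"}, None))
print(combine_rename_map({"a": "b"}, {"a": "z", "c": "d"}))
```

Without the `or {}` guard, the first call would pass `None` straight through and fail inside the stand-in, which is the brittleness the review comment describes.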

alexeykudinkin requested a review from a team as a code owner October 23, 2025 07:12

alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 23, 2025

gemini-code-assist bot reviewed Oct 23, 2025

aslonnie enabled auto-merge (squash) October 23, 2025 11:31

aslonnie disabled auto-merge October 23, 2025 11:32

aslonnie merged commit 0e6b21a into releases/2.51.0 Oct 23, 2025
7 checks passed

aslonnie deleted the ak/prj-pdwn-fix-cp branch October 23, 2025 11:33
