[Data] - Fix Pushdown Optimizations with Hive Partitioning #58723
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini summary

/gemini review
Code Review
This pull request correctly addresses the issue of pushdown optimizations with Hive partitioning by distinguishing between data columns and partition columns. The logic to prevent predicate pushdown for expressions involving partition columns is sound and well-implemented across the datasource, logical operator, and optimizer rule. The new tests are comprehensive and cover many combinations of operations, ensuring the fix is robust.
I have a couple of suggestions for minor improvements: one to cache the partition column set for better performance and another to make the test helper ColumnTracker more robust. Overall, this is a great contribution.
Summary of Changes

This pull request resolves a critical bug in Ray Data's handling of Hive-partitioned Parquet datasets, specifically concerning projection and predicate pushdown optimizations. Previously, these optimizations would incorrectly attempt to push down operations on partition columns, which do not exist in the physical Parquet files, leading to errors. The changes introduce logic to differentiate between data columns and partition columns, ensuring that only operations on data columns are pushed down to the underlying read layer (PyArrow), while predicates on partition columns are handled at a higher logical level within Ray Data. This ensures correct and efficient processing of Hive-partitioned data.
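A minimal sketch of the split the summary describes, assuming a hypothetical helper (`split_projection` is illustrative, not the PR's actual function):

```python
from typing import List, Tuple


def split_projection(
    requested: List[str], partition_cols: List[str]
) -> Tuple[List[str], List[str]]:
    # Partition columns live in the directory layout (e.g.,
    # ".../year=2024/part-0.parquet"), not in the Parquet files, so only
    # the data columns can be handed to the reader.
    partition_set = set(partition_cols)
    data_cols = [c for c in requested if c not in partition_set]
    part_cols = [c for c in requested if c in partition_set]
    return data_cols, part_cols


data_cols, part_cols = split_projection(["id", "value", "year"], ["year", "month"])
assert data_cols == ["id", "value"] and part_cols == ["year"]
```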
Code Review
This pull request fixes an issue with predicate and projection pushdown for Hive-partitioned Parquet datasets. The core change is to prevent pushdown for operations on partition columns, as they don't exist in the physical file schema. The logic is correctly implemented by checking for partition column references in ParquetDatasource.apply_predicate and propagating the decision to not push down through the logical plan execution. The addition of get_column_references is a useful utility, and the extensive new tests for Hive partitioning are excellent and cover many complex scenarios.
I have two suggestions for improvement:
- In ParquetDatasource, caching the set of partition columns to avoid re-computation (see the sketch after this comment).
- In the new test file, refactoring the long apply_operation helper function to improve readability and reduce code duplication.
Overall, this is a solid contribution that addresses an important correctness issue in data processing.
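For illustration, the caching suggestion could look something like this, sketched against a stand-in class (the names below are hypothetical, not the PR's code):

```python
import functools
from typing import List


class ParquetDatasourceSketch:
    """Illustrative stand-in for ParquetDatasource, not the actual class."""

    def __init__(self, partition_columns: List[str]):
        self._partition_columns = partition_columns

    @functools.cached_property
    def _partition_column_set(self) -> frozenset:
        # Built once on first access and memoized on the instance, so
        # repeated pushdown checks don't rebuild the set every time.
        return frozenset(self._partition_columns)
```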
/gemini review
Code Review
This pull request introduces a fix for pushdown optimizations (projection and predicate) when using Hive partitioning in Parquet datasources. The core problem is that partition columns don't exist in the physical file schema, causing pushdown attempts to fail. The solution correctly separates data columns from partition columns. Predicates on partition columns are no longer pushed down to the file reader, and projections are handled by filtering out partition columns before reading. The changes are logical and well-implemented, and they are supported by an excellent, comprehensive new test suite that validates numerous combinations of operations. I have one suggestion for a minor performance improvement by caching.
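The reference check the reviews describe can be pictured on a toy expression tree; Ray Data's actual Expr classes and the get_column_references utility differ, so the following is only a sketch:

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class Col:
    name: str


@dataclass
class BinOp:
    left: Union["Col", "BinOp", int]
    right: Union["Col", "BinOp", int]


def column_references(expr) -> Set[str]:
    # Walk the tree and collect every column name; literals contribute nothing.
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinOp):
        return column_references(expr.left) | column_references(expr.right)
    return set()


predicate = BinOp(Col("year"), 2024)
partition_cols = {"year", "month"}
# A non-empty intersection means the predicate touches a partition column
# and can't be handed to the file reader as-is.
assert column_references(predicate) & partition_cols == {"year"}
```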
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
```python
# No partition columns in projection map:
# - If _partition_columns is None: user only specified data columns during
#   initialization, so include all partition columns (return None)
# - If _partition_columns is not None: partition columns were requested during
#   initialization but are missing from _projection_map, which means projection
#   pushdown excluded them, so exclude all partition columns (return [])
return None if self._partition_columns is None else []
```
I think the polymorphism of self._partition_columns (that it can be None, an empty list, or a non-empty list) leads to complex logic like this and makes the attribute harder to reason about.
Would it be simpler if we refactored the code so that self._partition_columns always refers to the selected partition columns (i.e., can't be None)?
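For illustration, one shape such a refactor could take (the class and names below are hypothetical, not the PR's code):

```python
from typing import List, Optional


class DatasourceSketch:
    """Illustrative only, not the PR's refactor."""

    def __init__(self, columns: Optional[List[str]], all_partition_cols: List[str]):
        # Normalize at construction time: _partition_columns always holds
        # the selected partition columns and is never None, so downstream
        # code avoids three-way (None / [] / non-empty) branching.
        if columns is None:
            self._partition_columns = list(all_partition_cols)
        else:
            allowed = set(all_partition_cols)
            self._partition_columns = [c for c in columns if c in allowed]
```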
Good point
```python
# pushdown excluded them, so exclude all partition columns (return [])
return None if self._partition_columns is None else []


def _get_data_columns(self) -> Optional[List[str]]:
```
Nit: This seems like the counterpart to _get_partition_columns_from_projection, but the naming is inconsistent.
```diff
- def _get_data_columns(self) -> Optional[List[str]]:
+ def _get_data_columns_from_projection(self) -> Optional[List[str]]:
```
This method is from the parent mixin class and is used in other datasources.
I'll rename _get_partition_columns_from_projection to _get_partition_columns for better alignment.
```python
# If datasource is unchanged (e.g., predicate references partition columns
# that can't be pushed down), return self unchanged so Filter operator remains
if predicated_datasource is self._datasource:
    return self
```
This logic has the implicit assumption that if _DatasourceProjectionPushdownMixin.apply_predicate returns itself unchanged, then the filter can't be pushed down.
Is there a way we can make this more explicit?
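One way to make the outcome explicit would be to return a flag alongside the datasource instead of relying on identity comparison; the sketch below is hypothetical and self-contained, not the mixin's actual signature:

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class SourceSketch:
    # Hypothetical stand-in for the datasource, not Ray Data's class.
    partition_columns: FrozenSet[str]
    predicate: Optional[str] = None


def apply_predicate(
    src: SourceSketch, predicate: str, refs: FrozenSet[str]
) -> Tuple[SourceSketch, bool]:
    # Return an explicit pushed_down flag so callers don't have to infer
    # the result from "is the returned datasource the same object?".
    if refs & src.partition_columns:
        return src, False
    return replace(src, predicate=predicate), True


src = SourceSketch(partition_columns=frozenset({"year"}))
_, pushed = apply_predicate(src, "year = 2024", frozenset({"year"}))
assert pushed is False
```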
Actually I can nuke this line
```python
if referenced_cols & partition_cols:
    # Don't push down predicates on partition columns
    return self
```
Dumb question -- can't we still push down the data columns, and just filter the partition values in the reader implementation? Like, is it necessary to disable predicate pushdown altogether?
We can, but then we have to handle cases like ds.filter((col('data_col') > 5) & (col('partition_col') == 2)), where the BinaryExpr has to be split up.
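That split can be pictured on a toy expression tree: flatten the top-level AND and bucket each leaf by the columns it touches (Ray Data's real Expr types differ; this is only a sketch):

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass
class Leaf:
    column: str  # the single column this comparison references


@dataclass
class And:
    left: Union["Leaf", "And"]
    right: Union["Leaf", "And"]


def split_conjunction(
    expr: Union[Leaf, And], partition_cols: Set[str]
) -> Tuple[List[Leaf], List[Leaf]]:
    # Flatten nested ANDs, then bucket each leaf by whether it references
    # a partition column. ORs mixing both kinds can't be split this way.
    if isinstance(expr, And):
        dl, pl = split_conjunction(expr.left, partition_cols)
        dr, pr = split_conjunction(expr.right, partition_cols)
        return dl + dr, pl + pr
    if expr.column in partition_cols:
        return [], [expr]
    return [expr], []


data_part, part_part = split_conjunction(
    And(Leaf("data_col"), Leaf("partition_col")), {"partition_col"}
)
assert [leaf.column for leaf in data_part] == ["data_col"]
assert [leaf.column for leaf in part_part] == ["partition_col"]
```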
Alright I handled it.
```python
    ],
    ids=lambda ops: "_".join(ops) if isinstance(ops, tuple) else ops,
)
def test_hive_partitioned_parquet_operations(
```
How long does this test take to run? I think this test adds ~30 parameterizations of E2E tests, and that might substantially add to the runtime of our test suite.
Would it be difficult to write some or all of these as unit tests? If it is difficult, what would we need to refactor?
This test is super quick. It takes 5 secs for all combos
OOC is this on a laptop or on a devbox?
On a laptop
… expr Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini review

/gemini summary
Code Review
This pull request introduces a significant improvement for query optimization on Hive-partitioned Parquet datasets. By splitting predicates into data-column and partition-column components, it enables both partition pruning (for partition columns) and predicate pushdown to PyArrow (for data columns). The new logic is well-structured and accompanied by extensive tests.
I've identified two main issues: a critical correctness bug in how conservative partition pruning interacts with the filter pushdown logic, and a potential bug in type checking that could lead to incorrect pruning. Please see the detailed comments for more information.
Summary of Changes

This pull request addresses a long-standing issue where Ray Data's pushdown optimizations failed for Hive-partitioned Parquet datasets due to the distinct nature of partition columns (which are not present in physical files). The solution involves a sophisticated mechanism to analyze and split user-defined predicates, allowing data-related conditions to be pushed down to the underlying Parquet reader (PyArrow) for row-level filtering, while partition-related conditions are used for early file-level pruning. This ensures both correctness and significant performance improvements when querying partitioned data.
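File-level pruning from Hive-style paths can be sketched as follows; the parsing here is deliberately simplified, and Ray Data's actual partitioning utilities are more general:

```python
from typing import Dict, List


def parse_hive_path(path: str) -> Dict[str, str]:
    # Extract key=value segments from a Hive-style directory layout.
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths: List[str], column: str, value: str) -> List[str]:
    # Keep only files whose directory encodes the matching partition value,
    # so non-matching files are never read at all.
    return [p for p in paths if parse_hive_path(p).get(column) == value]


paths = [
    "s3://bucket/table/year=2023/part-0.parquet",
    "s3://bucket/table/year=2024/part-0.parquet",
]
assert prune(paths, "year", "2024") == [paths[1]]
```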
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini review
Code Review
This pull request introduces a significant optimization for reading Hive-partitioned Parquet datasets by enabling predicate and projection pushdown for partition columns. The core logic involves splitting filter predicates into data and partition components, allowing for simultaneous partition pruning and row-level filtering. The implementation is well-structured and includes comprehensive tests. I've identified one critical potential correctness issue regarding filter removal and have a couple of medium-severity suggestions to improve logging and code style.
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Description
When a dataset is Hive-partitioned, the partition columns don't reside in the physical schema of the Parquet files, so projection and predicate pushdown for that subset of columns can't be applied at the read layer. We filter those columns out before pushing down.
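For reference, the kind of end-to-end usage this fix targets (the path and column names are placeholders, and the expression style follows the API used elsewhere in this thread; exact imports may vary across Ray versions):

```python
import ray
from ray.data.expressions import col

# Partition column `year` comes from the directory layout
# (.../year=2024/...), not from the Parquet files themselves.
ds = ray.data.read_parquet("s3://bucket/table")
ds = ds.filter((col("value") > 5) & (col("year") == 2024))
ds = ds.select_columns(["value", "year"])
```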
Related issues
Fixes #58714