[Data] - Fix Pushdown Optimizations with Hive Partitioning #58723
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini summary

/gemini review
Code Review
This pull request correctly addresses the issue of pushdown optimizations with Hive partitioning by distinguishing between data columns and partition columns. The logic to prevent predicate pushdown for expressions involving partition columns is sound and well-implemented across the datasource, logical operator, and optimizer rule. The new tests are comprehensive and cover many combinations of operations, ensuring the fix is robust.
I have a couple of suggestions for minor improvements: one to cache the partition column set for better performance and another to make the test helper ColumnTracker more robust. Overall, this is a great contribution.
Summary of Changes

This pull request resolves a critical bug in Ray Data's handling of Hive-partitioned Parquet datasets, specifically concerning projection and predicate pushdown optimizations. Previously, these optimizations would incorrectly attempt to push down operations on partition columns, which do not exist in the physical Parquet files, leading to errors. The changes introduce logic to differentiate between data columns and partition columns, ensuring that only operations on data columns are pushed down to the underlying read layer (PyArrow), while predicates on partition columns are handled at a higher logical level within Ray Data. This ensures correct and efficient processing of Hive-partitioned data.
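A minimal sketch of the split the summary describes, assuming a hypothetical helper (`split_projection` is illustrative, not the PR's actual function):

```python
from typing import List, Tuple


def split_projection(
    requested: List[str], partition_cols: List[str]
) -> Tuple[List[str], List[str]]:
    # Partition columns live in the directory layout (e.g.,
    # ".../year=2024/part-0.parquet"), not in the Parquet files, so only
    # the data columns can be handed to the reader.
    partition_set = set(partition_cols)
    data_cols = [c for c in requested if c not in partition_set]
    part_cols = [c for c in requested if c in partition_set]
    return data_cols, part_cols


data_cols, part_cols = split_projection(["id", "value", "year"], ["year", "month"])
assert data_cols == ["id", "value"] and part_cols == ["year"]
```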
Code Review
This pull request fixes an issue with predicate and projection pushdown for Hive-partitioned Parquet datasets. The core change is to prevent pushdown for operations on partition columns, as they don't exist in the physical file schema. The logic is correctly implemented by checking for partition column references in ParquetDatasource.apply_predicate and propagating the decision to not push down through the logical plan execution. The addition of get_column_references is a useful utility, and the extensive new tests for Hive partitioning are excellent and cover many complex scenarios.
I have two suggestions for improvement:
- In ParquetDatasource, caching the set of partition columns to avoid re-computation (see the sketch after this comment).
- In the new test file, refactoring the long apply_operation helper function to improve readability and reduce code duplication.
Overall, this is a solid contribution that addresses an important correctness issue in data processing.
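For illustration, the caching suggestion could look something like this, sketched against a stand-in class (the names below are hypothetical, not the PR's code):

```python
import functools
from typing import List


class ParquetDatasourceSketch:
    """Illustrative stand-in for ParquetDatasource, not the actual class."""

    def __init__(self, partition_columns: List[str]):
        self._partition_columns = partition_columns

    @functools.cached_property
    def _partition_column_set(self) -> frozenset:
        # Built once on first access and memoized on the instance, so
        # repeated pushdown checks don't rebuild the set every time.
        return frozenset(self._partition_columns)
```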
/gemini review
Code Review
This pull request introduces a fix for pushdown optimizations (projection and predicate) when using Hive partitioning in Parquet datasources. The core problem is that partition columns don't exist in the physical file schema, causing pushdown attempts to fail. The solution correctly separates data columns from partition columns. Predicates on partition columns are no longer pushed down to the file reader, and projections are handled by filtering out partition columns before reading. The changes are logical and well-implemented, and they are supported by an excellent, comprehensive new test suite that validates numerous combinations of operations. I have one suggestion for a minor performance improvement by caching.
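The reference check the reviews describe can be pictured on a toy expression tree; Ray Data's actual Expr classes and the get_column_references utility differ, so the following is only a sketch:

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class Col:
    name: str


@dataclass
class BinOp:
    left: Union["Col", "BinOp", int]
    right: Union["Col", "BinOp", int]


def column_references(expr) -> Set[str]:
    # Walk the tree and collect every column name; literals contribute nothing.
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinOp):
        return column_references(expr.left) | column_references(expr.right)
    return set()


predicate = BinOp(Col("year"), 2024)
partition_cols = {"year", "month"}
# A non-empty intersection means the predicate touches a partition column
# and can't be handed to the file reader as-is.
assert column_references(predicate) & partition_cols == {"year"}
```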
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
```python
# No partition columns in projection map:
# - If _partition_columns is None: user only specified data columns during
#   initialization, so include all partition columns (return None)
# - If _partition_columns is not None: partition columns were requested during
#   initialization but are missing from _projection_map, which means projection
#   pushdown excluded them, so exclude all partition columns (return [])
return None if self._partition_columns is None else []
```
I think the polymorphism of self._partition_columns (that it can be None, an empty list, or a non-empty list) leads to complex logic like this and makes the attribute harder to reason about.
Would it be simpler if we refactored the code so that self._partition_columns always refers to the selected partition columns (i.e., can't be None)?
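For illustration, one shape such a refactor could take (the class and names below are hypothetical, not the PR's code):

```python
from typing import List, Optional


class DatasourceSketch:
    """Illustrative only, not the PR's refactor."""

    def __init__(self, columns: Optional[List[str]], all_partition_cols: List[str]):
        # Normalize at construction time: _partition_columns always holds
        # the selected partition columns and is never None, so downstream
        # code avoids three-way (None / [] / non-empty) branching.
        if columns is None:
            self._partition_columns = list(all_partition_cols)
        else:
            allowed = set(all_partition_cols)
            self._partition_columns = [c for c in columns if c in allowed]
```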
Good point
```python
# pushdown excluded them, so exclude all partition columns (return [])
return None if self._partition_columns is None else []


def _get_data_columns(self) -> Optional[List[str]]:
```
Nit: This seems like the counterpart to _get_partition_columns_from_projection, but the naming is inconsistent.
```diff
- def _get_data_columns(self) -> Optional[List[str]]:
+ def _get_data_columns_from_projection(self) -> Optional[List[str]]:
```
This method is from the parent mixin class and is used in other datasources.
I'll rename _get_partition_columns_from_projection to _get_partition_columns for better alignment.
```python
# If datasource is unchanged (e.g., predicate references partition columns
# that can't be pushed down), return self unchanged so Filter operator remains
if predicated_datasource is self._datasource:
    return self
```
This logic has the implicit assumption that if _DatasourceProjectionPushdownMixin.apply_predicate returns itself unchanged, then the filter can't be pushed down.
Is there a way we can make this more explicit?
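One way to make the outcome explicit would be to return a flag alongside the datasource instead of relying on identity comparison; the sketch below is hypothetical and self-contained, not the mixin's actual signature:

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class SourceSketch:
    # Hypothetical stand-in for the datasource, not Ray Data's class.
    partition_columns: FrozenSet[str]
    predicate: Optional[str] = None


def apply_predicate(
    src: SourceSketch, predicate: str, refs: FrozenSet[str]
) -> Tuple[SourceSketch, bool]:
    # Return an explicit pushed_down flag so callers don't have to infer
    # the result from "is the returned datasource the same object?".
    if refs & src.partition_columns:
        return src, False
    return replace(src, predicate=predicate), True


src = SourceSketch(partition_columns=frozenset({"year"}))
_, pushed = apply_predicate(src, "year = 2024", frozenset({"year"}))
assert pushed is False
```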
Actually I can nuke this line
```python
if referenced_cols & partition_cols:
    # Don't push down predicates on partition columns
    return self
```
Dumb question -- can't we still push down the data columns, and just filter the partition values in the reader implementation? Like, is it necessary to disable predicate pushdown altogether?
We can, but then we have to handle cases like ds.filter((col('data_col') > 5) & (col('partition_col') == 2)), where the BinaryExpr has to be split up.
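That split can be pictured on a toy expression tree: flatten the top-level AND and bucket each leaf by the columns it touches (Ray Data's real Expr types differ; this is only a sketch):

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass
class Leaf:
    column: str  # the single column this comparison references


@dataclass
class And:
    left: Union["Leaf", "And"]
    right: Union["Leaf", "And"]


def split_conjunction(
    expr: Union[Leaf, And], partition_cols: Set[str]
) -> Tuple[List[Leaf], List[Leaf]]:
    # Flatten nested ANDs, then bucket each leaf by whether it references
    # a partition column. ORs mixing both kinds can't be split this way.
    if isinstance(expr, And):
        dl, pl = split_conjunction(expr.left, partition_cols)
        dr, pr = split_conjunction(expr.right, partition_cols)
        return dl + dr, pl + pr
    if expr.column in partition_cols:
        return [], [expr]
    return [expr], []


data_part, part_part = split_conjunction(
    And(Leaf("data_col"), Leaf("partition_col")), {"partition_col"}
)
assert [leaf.column for leaf in data_part] == ["data_col"]
assert [leaf.column for leaf in part_part] == ["partition_col"]
```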
Alright I handled it.
```python
    ],
    ids=lambda ops: "_".join(ops) if isinstance(ops, tuple) else ops,
)
def test_hive_partitioned_parquet_operations(
```
How long does this test take to run? I think this test adds ~30 parameterizations of E2E tests, and that might substantially add to the runtime of our test suite.
Would it be difficult to write some or all of these as unit tests? If it is difficult, what would we need to refactor?
This test is super quick. It takes 5 secs for all combos
OOC is this on a laptop or on a devbox?
On a laptop
… expr Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini review

/gemini summary
Code Review
This pull request introduces a significant improvement for query optimization on Hive-partitioned Parquet datasets. By splitting predicates into data-column and partition-column components, it enables both partition pruning (for partition columns) and predicate pushdown to PyArrow (for data columns). The new logic is well-structured and accompanied by extensive tests.
I've identified two main issues: a critical correctness bug in how conservative partition pruning interacts with the filter pushdown logic, and a potential bug in type checking that could lead to incorrect pruning. Please see the detailed comments for more information.
Summary of Changes

This pull request addresses a long-standing issue where Ray Data's pushdown optimizations failed for Hive-partitioned Parquet datasets due to the distinct nature of partition columns (which are not present in physical files). The solution involves a sophisticated mechanism to analyze and split user-defined predicates, allowing data-related conditions to be pushed down to the underlying Parquet reader (PyArrow) for row-level filtering, while partition-related conditions are used for early file-level pruning. This ensures both correctness and significant performance improvements when querying partitioned data.
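File-level pruning from Hive-style paths can be sketched as follows; the parsing here is deliberately simplified, and Ray Data's actual partitioning utilities are more general:

```python
from typing import Dict, List


def parse_hive_path(path: str) -> Dict[str, str]:
    # Extract key=value segments from a Hive-style directory layout.
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths: List[str], column: str, value: str) -> List[str]:
    # Keep only files whose directory encodes the matching partition value,
    # so non-matching files are never read at all.
    return [p for p in paths if parse_hive_path(p).get(column) == value]


paths = [
    "s3://bucket/table/year=2023/part-0.parquet",
    "s3://bucket/table/year=2024/part-0.parquet",
]
assert prune(paths, "year", "2024") == [paths[1]]
```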
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini review
Code Review
This pull request introduces a significant optimization for reading Hive-partitioned Parquet datasets by enabling predicate and projection pushdown for partition columns. The core logic involves splitting filter predicates into data and partition components, allowing for simultaneous partition pruning and row-level filtering. The implementation is well-structured and includes comprehensive tests. I've identified one critical potential correctness issue regarding filter removal and have a couple of medium-severity suggestions to improve logging and code style.
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Description
When a dataset is Hive-partitioned, the partition columns don't reside in the physical schema of the Parquet files, so projection and predicate pushdown for that subset of columns can't be applied at the read layer. We filter those columns out before pushing down.
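For reference, the kind of end-to-end usage this fix targets (the path and column names are placeholders, and the expression style follows the API used elsewhere in this thread; exact imports may vary across Ray versions):

```python
import ray
from ray.data.expressions import col

# Partition column `year` comes from the directory layout
# (.../year=2024/...), not from the Parquet files themselves.
ds = ray.data.read_parquet("s3://bucket/table")
ds = ds.filter((col("value") > 5) & (col("year") == 2024))
ds = ds.select_columns(["value", "year"])
```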
Related issues
Fixes #58714