Pyarrow filter pushdowns #735

Merged (2 commits) on Jun 19, 2024

Michael-J-Ward
Contributor

Which issue does this PR close?

Closes #703.

Rationale for this change

The conversion for IsNull had a bug.

datafusion-python users requested pyarrow predicate pushdown support for temporal types.

What changes are included in this PR?

IsNull bug
The conversion incorrectly passed the column expression as an argument to the pyarrow method `is_null`. The call failed silently, and the predicate was excluded from the plan.

That argument should instead be a boolean scalar for `nan_is_null`. There is currently no way for users to pass it in, so please suggest how I might expose it.

Temporal Scalars
Similar to #731, I used `ScalarValue::to_pyarrow` for the scalar conversion. pyarrow filters can now accept any scalar type that already has an upstream conversion.

Are there any user-facing changes?

A bugfix and expanded functionality.

Additional Context

I tested the predicate pushdown in two separate ways.

  1. Ensuring that explain plan contains the appropriate string.
  2. Ensuring that a query on a partitioned dataset doesn't touch the file.

Both of these seem non-ideal. If you have a suggestion for more efficiently testing this, please share!

The conversion was incorrectly passing in the expression itself as the `nan_is_null` argument. This caused the pushdown to silently fail.
Member

@andygrove left a comment

LGTM. Checking the explain plan is a good approach IMO. We do this extensively in DataFusion and Comet.

Development

Successfully merging this pull request may close these issues.

Missing pushdowns for pyarrow datasets