Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: scan partitioned tables with datafusion #1303

Merged
merged 7 commits into from
Apr 30, 2023

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Apr 20, 2023

Description

This PR builds on #1293 and tries to address some of the issues we have seen with scanning
partitioned tables with datafusion. And while all our tests (mostly - more on that later) pass,
the fix involves some behaviours that we may or may not want to adopt. Specifically, Datafusion
appends partition columns at the end of the schema fields, while we have been reporting them
as leading columns.

In recent datafusion versions also changed the default for dictionary encoding partition columns
to be opt in. My thinking was that for the vast majority of tables keeping dictionary encoding for
partition columns would be the desired behaviour. (@wjones127, do you have an opinion on that?).
This was also a root cause or at least related to the second failing test.

I did have to comment out some caeses within out file pruning tests where we create expression with
nulls, as I have thus far not been able to create an expression that datafusion is happy with. I'll
keep trying, but have some work on expression parsing for handling user inputs planned as well.
There already is a draft PR open (#1267), which does not contain that yet, but where I plan to
address this.

cc @cmackenzie1

Related Issue(s)

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Apr 20, 2023
@roeap roeap removed request for xianwill and mosyp April 21, 2023 06:11
@Blajda Blajda mentioned this pull request Apr 24, 2023
rtyler
rtyler previously approved these changes Apr 25, 2023
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All things considered I think this is fine to land. Are there some issues filed with Data Fusion that can be dropped in these comments for the missing kernels?

@wjones127
Copy link
Collaborator

wjones127 commented Apr 28, 2023

Specifically, Datafusion appends partition columns at the end of the schema fields, while we have been reporting them as leading columns.

Eventually, I think our desired behavior is going to be put the partition columns where they are in the Delta table schema.

In recent datafusion versions also changed the default for dictionary encoding partition columns to be opt in. My thinking was that for the vast majority of tables keeping dictionary encoding for partition columns would be the desired behaviour.

Looking at the PR for that apache/datafusion#5545 (comment), they make a good case that dictionary encoding partition columns is really only helpful for string columns. So perhaps we should only use them for string / binary columns (any maybe dates too)? I think dictionary encoding most columns makes sense. The ones I would exclude are ones that are so small that dictionary arrays are a waste of space, such as boolean and i8. And if we wanted to get fancy we could choose the dictionary index bit width based on the known number of unique values (that is, i8 if 256 or fewer, i16 if there are much more). 1

Footnotes

  1. Eventually, I think we'd like them to use the new REE (run-end encoded) arrays. But that's far in the future.

wjones127
wjones127 previously approved these changes Apr 28, 2023
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions for comments but otherwise seems good.

rust/src/operations/transaction/state.rs Outdated Show resolved Hide resolved
rust/src/operations/transaction/state.rs Outdated Show resolved Hide resolved
@rtyler rtyler dismissed stale reviews from wjones127 and themself via b60995c April 29, 2023 19:42
@github-actions
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@rtyler rtyler enabled auto-merge (rebase) April 29, 2023 19:43
cmackenzie1 and others added 5 commits April 30, 2023 10:58
This commit adds a unit test demonstrating the issue described
in delta-io#1292.
This commit adds a test case to demonstate being unable
to query a partitioned table using `>=` as type coercion
fails.
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 force-pushed the df-scan-partitioned branch from ef7649d to b905626 Compare April 30, 2023 17:58
@rtyler rtyler merged commit 9ad6276 into delta-io:main Apr 30, 2023
@roeap roeap deleted the df-scan-partitioned branch April 30, 2023 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
4 participants