fix: scan partitioned tables with datafusion #1303

roeap · 2023-04-20T22:56:08Z

Description

This PR builds on #1293 and tries to address some of the issues we have seen with scanning
partitioned tables with datafusion. And while all our tests (mostly - more on that later) pass,
the fix involves some behaviours that we may or may not want to adopt. Specifically, Datafusion
appends partition columns at the end of the schema fields, while we have been reporting them
as leading columns.

Recent datafusion versions also changed the default for dictionary encoding partition columns
to be opt-in. My thinking was that for the vast majority of tables, keeping dictionary encoding for
partition columns would be the desired behaviour (@wjones127, do you have an opinion on that?).
This was also a root cause of, or at least related to, the second failing test.

I did have to comment out some cases within our file pruning tests where we create expressions with
nulls, as I have thus far not been able to create an expression that datafusion is happy with. I'll
keep trying, but I also have some work planned on expression parsing for handling user inputs.
There is already a draft PR open (#1267), which does not contain that yet, but where I plan to
address this.

cc @cmackenzie1

Related Issue(s)

rtyler

All things considered, I think this is fine to land. Are there any issues filed with DataFusion for the missing kernels that can be dropped in these comments?

wjones127 · 2023-04-28T02:56:11Z

> Specifically, Datafusion appends partition columns at the end of the schema fields, while we have been reporting them as leading columns.

Eventually, I think our desired behavior is going to be to put the partition columns where they are in the Delta table schema.
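To make the ordering question concrete, here is a minimal sketch of mapping a scan output (file columns first, partition columns appended last) back to table schema order. The function name `projection_for_schema_order` and the use of plain column names are assumptions for illustration only; this is not the delta-rs or DataFusion API.

```rust
/// Hypothetical helper (not part of delta-rs or DataFusion).
///
/// `scan_columns` is the column order produced by the scan, where the
/// partition columns come last. `table_schema` is the column order declared
/// in the Delta table schema. Returns, for each schema column, its index in
/// the scan output, or `None` if a schema column is missing from the scan.
fn projection_for_schema_order(
    scan_columns: &[&str],
    table_schema: &[&str],
) -> Option<Vec<usize>> {
    table_schema
        .iter()
        // Look up where each schema column sits in the scan output.
        .map(|name| scan_columns.iter().position(|c| c == name))
        // Collecting into Option<Vec<_>> yields None if any lookup failed.
        .collect()
}
```

Applying the returned indices as a projection over the scan output would report the columns in table schema order, regardless of where the scan placed the partition columns.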

> Recent datafusion versions also changed the default for dictionary encoding partition columns to be opt-in. My thinking was that for the vast majority of tables, keeping dictionary encoding for partition columns would be the desired behaviour.

Looking at the PR for that, apache/datafusion#5545 (comment), ~~they make a good case that dictionary encoding partition columns is really only helpful for string columns. So perhaps we should only use them for string / binary columns (and maybe dates too)?~~ I think dictionary encoding most columns makes sense. The ones I would exclude are ones that are so small that dictionary arrays are a waste of space, such as boolean and i8. And if we wanted to get fancy, we could choose the dictionary index bit width based on the known number of unique values (that is, i8 if 256 or fewer, i16 if there are many more). ¹

¹ Eventually, I think we'd like them to use the new REE (run-end encoded) arrays. But that's far in the future.
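The index-width idea above could look roughly like the following sketch. `ColumnType`, `KeyWidth`, and `dictionary_key_width` are hypothetical stand-ins, not arrow or delta-rs types, and the thresholds are an assumption: the sketch uses unsigned keys so that 8 bits cover 256 distinct values (a signed i8 key can only address 128 entries, since dictionary keys are non-negative). It is not necessarily what this PR ended up implementing.

```rust
/// Simplified stand-in for a partition column's data type
/// (hypothetical; not arrow's `DataType`).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Boolean,
    Int8,
    Int32,
    Int64,
    Utf8,
    Binary,
    Date,
}

/// Width of the unsigned integer used as the dictionary key (hypothetical).
#[derive(Debug, PartialEq)]
enum KeyWidth {
    U8,
    U16,
    U32,
}

/// Decide whether to dictionary-encode a partition column and, if so,
/// which key width suffices for the known number of distinct values.
/// Returns `None` for types too small to benefit from a dictionary.
fn dictionary_key_width(ty: ColumnType, distinct_values: usize) -> Option<KeyWidth> {
    match ty {
        // Values are already one byte or less; a dictionary only adds overhead.
        ColumnType::Boolean | ColumnType::Int8 => None,
        // Pick the narrowest unsigned key that can index every distinct value.
        _ if distinct_values <= 1 << 8 => Some(KeyWidth::U8),
        _ if distinct_values <= 1 << 16 => Some(KeyWidth::U16),
        _ => Some(KeyWidth::U32),
    }
}
```

For a Delta table, the number of distinct partition values is known from the log before scanning, which is what would make this kind of up-front choice possible.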

wjones127

Some minor suggestions for comments but otherwise seems good.

rust/src/operations/transaction/state.rs

github-actions · 2023-04-29T19:42:29Z

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

roeap requested review from houqp, xianwill, wjones127, fvaleye, rtyler and mosyp as code owners April 20, 2023 22:56

github-actions bot added the binding/rust and rust labels Apr 20, 2023

roeap removed request for xianwill and mosyp April 21, 2023 06:11

Blajda mentioned this pull request Apr 24, 2023

feat: delete operation #1176

Merged

rtyler previously approved these changes Apr 25, 2023

wjones127 previously approved these changes Apr 28, 2023

rust/src/operations/transaction/state.rs (outdated, resolved)

rust/src/operations/transaction/state.rs (outdated, resolved)

rtyler dismissed stale reviews from wjones127 and themself via b60995c April 29, 2023 19:42

rtyler enabled auto-merge (rebase) April 29, 2023 19:43

cmackenzie1 and others added 5 commits April 30, 2023 10:58

issue-1292: sql projection test case

3d31c37

This commit adds a unit test demonstrating the issue described in delta-io#1292.

issue-1291: add test case

69fa6b8

This commit adds a test case to demonstrate being unable to query a partitioned table using `>=` as type coercion fails.

fix: scan partitioned tables with datafusion

0b916b4

Update rust/src/operations/transaction/state.rs

90ab1b2

Co-authored-by: Will Jones <willjones127@gmail.com>

Update rust/src/operations/transaction/state.rs

b905626

Co-authored-by: Will Jones <willjones127@gmail.com>

wjones127 force-pushed the df-scan-partitioned branch from ef7649d to b905626 Compare April 30, 2023 17:58

fix: omit some more types from dictionary encoding

de856cc

wjones127 reviewed Apr 30, 2023

rust/src/operations/transaction/state.rs (outdated, resolved)

fix: encode ints again.

f4e0cc2

wjones127 approved these changes Apr 30, 2023

rtyler merged commit 9ad6276 into delta-io:main Apr 30, 2023

roeap deleted the df-scan-partitioned branch April 30, 2023 20:57

roeap mentioned this pull request May 1, 2023

issue-1292,issue-1291: test cases to reproduce issue #1293

Closed
