Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File skipping according to documentation #2427

Closed
t1g0rz opened this issue Apr 20, 2024 · 2 comments · Fixed by #2428
Closed

File skipping according to documentation #2427

t1g0rz opened this issue Apr 20, 2024 · 2 comments · Fixed by #2428
Labels
bug Something isn't working

Comments

@t1g0rz
Copy link
Contributor

t1g0rz commented Apr 20, 2024

Environment

Delta-rs version: 0.16.4

Binding: python

Description

Here documentation says:

Ensure the transaction log stores metadata stats for all the columns that benefit from file skipping.

And here guides us to explicitly specify columns for file skipping:

It takes some time to compute column statistics when writing files, and it isn’t worth the effort if you cannot use the column for file skipping.
Suppose you have a table column containing a long string of arbitrary text. It’s unlikely that this column would ever provide any data-skipping benefits. So, you can just avoid the overhead of collecting the statistics for this particular column.

I couldn't find a way to exclude some columns from file skipping purposes. In my case, I have quite wide tables (> 200 columns), and out of these, 195 will never be used for file skipping.
An unintended consequence of including all these columns in statistics calculation is the explosive growth of the delta log size because it writes very long strings of min/max/nulls. This indirectly creates all conditions for issue #2301 and this from delta slack.

I found the delta.dataSkippingNumIndexedCols setting here and I wonder if it's possible to explicitly specify column names. Or should I reorder the table to have skipping columns first and then set delta.dataSkippingNumIndexedCols?

@t1g0rz t1g0rz added the bug Something isn't working label Apr 20, 2024
@ion-elgreco
Copy link
Collaborator

@t1g0rz I'm not sure if we are properly respecting these configurations yet.

To select only columns this should be the one: delta.dataSkippingStatsColumns.

But again I'm not sure if we use this yet correctly, I'll take a deeper look into this tomorrow

@ion-elgreco
Copy link
Collaborator

@t1g0rz I checked, but we don't respect it yet 😄

Buttt I am working on a fix 🕺

ion-elgreco added a commit that referenced this issue May 11, 2024
…2428)

# Description
All of the Rust and Python write actions will now properly adhere to the
configuration regarding the amount of columns stats have to be collected
for. Either by dataSkippingNumIndexedCols or dataSkippingStatsColumns.

# Related Issue(s)
- closes #2427

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants