z_order does not skip unnecessary files #3159

Closed
aldder opened this issue Jan 24, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@aldder

aldder commented Jan 24, 2025

Environment

Delta-rs version:
0.24.0

Binding:

Environment:

  • Cloud provider:
  • OS: Win11
  • Other:

Bug

What happened:
I don't know whether this should be classified as a bug or is intentional behavior. Unfortunately, I have not been able to find any more information about it.

When running a z_order optimization, it processes all the files in the Delta table, even though some of them could be skipped to save execution time and memory.

How to reproduce it:

from deltalake import DeltaTable, write_deltalake
import pandas as pd

write_deltalake(
    'tmp', 
    pd.DataFrame({'A': ['a'], 'B': [1]}), 
    mode='append', 
    partition_by=['A']
)
write_deltalake(
    'tmp', 
    pd.DataFrame({'A': ['b'], 'B': [2]}), 
    mode='append', 
    partition_by=['A']
)

dt = DeltaTable('tmp')
res = dt.optimize.z_order(columns=['B'])
print(res['numFilesRemoved'], res['numFilesAdded'])
print(dt.files())

write_deltalake(
    'tmp', 
    pd.DataFrame({'A': ['a'], 'B': [3]}), 
    mode='append', 
    partition_by=['A']
)

dt = DeltaTable('tmp')
res = dt.optimize.z_order(columns=['B'])
print(res['numFilesRemoved'], res['numFilesAdded'])
print(dt.files())

Output:

2 2
['A=a/part-00001-e4abd22a-cba1-4851-8fae-10b448496e42-c000.zstd.parquet', 
 'A=b/part-00001-e28594fd-4ad4-4497-a783-d185d0b94103-c000.zstd.parquet']
3 2
['A=b/part-00001-cdb63b7c-da93-4606-964d-47780998249c-c000.zstd.parquet', 
 'A=a/part-00001-d53c052c-5357-47bd-9a92-572c08ca7be3-c000.zstd.parquet']

What you expected to happen:
In this simple case, the first optimization removes and re-adds every file, even though there is only one file per partition.

In the second optimization, only the files in partition 'a' should need processing, since the files in partition 'b' have already been optimized and no new writes have occurred there.

I know about the partition_filters parameter of the z_order method, but sometimes it is not possible to determine which partitions to optimize, since writes and optimizations may run in separate, independent processes.
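
To illustrate the separate-process case, here is a rough sketch of how an optimizer job might guess which partitions were appended to, using only the paths returned by dt.files(). It relies on a heuristic that is an assumption, not documented delta-rs behavior: a partition holding more than one file has probably received writes since its last optimization. The function name and the sample file names are hypothetical, and the path parsing assumes Hive-style partition directories like those in the output above.

```python
from collections import Counter
from typing import Iterable, List


def partitions_with_multiple_files(paths: Iterable[str]) -> List[str]:
    """Return partition directories that contain more than one data file.

    Heuristic only: one file in a partition does not prove that the file
    is Z-ordered, and several files do not prove that it is not.
    """
    # Everything before the last "/" is the partition directory, e.g. "A=a".
    counts = Counter(p.rsplit("/", 1)[0] for p in paths if "/" in p)
    return sorted(part for part, n in counts.items() if n > 1)


# Placeholder file names shaped like the dt.files() output in this issue.
files = [
    "A=a/part-00001-first.zstd.parquet",
    "A=a/part-00001-second.zstd.parquet",
    "A=b/part-00001-only.zstd.parquet",
]
print(partitions_with_multiple_files(files))  # ['A=a']
```

The resulting directory names would still have to be translated into partition_filters tuples before being passed to z_order.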

@aldder aldder added the bug Something isn't working label Jan 24, 2025
@FrankPortman

Disclaimer: I am not a maintainer or even contributor to delta-rs.

My understanding of Z-Ordering is that this behavior is exactly correct. Unlike Liquid Clustering, it is not an inherently incremental operation. It will always run on whatever where clause you give it. In this case it correctly runs on the entire table because you did not specify anything. I can't imagine "fixing" this is in scope for delta-rs, because this is fundamental Delta behavior.

It would be on you to supply the partition(s) to run on, even if you do that by dynamically detecting the data you just wrote and only supplying those partition(s) to the where clause.
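
A minimal sketch of that approach, assuming the writer knows the partition values it just wrote. The helper names are hypothetical. It also assumes that partition_filters accepts an ("column", "in", [values]) tuple with string values, as described in the deltalake documentation for partition filters; check this against the version you run.

```python
from typing import Iterable, List, Tuple


def partition_filters_for(
    values: Iterable[object], column: str
) -> List[Tuple[str, str, List[str]]]:
    """Build a partition filter covering the distinct values just written."""
    # Values are stringified on the assumption that filters expect strings.
    distinct = sorted({str(v) for v in values})
    if not distinct:
        return []
    return [(column, "in", distinct)]


def z_order_written_partitions(table_uri, written_values, partition_column, z_columns):
    """Z-order only the partitions touched by the most recent write."""
    # Imported lazily so the helper above stays usable without deltalake.
    from deltalake import DeltaTable

    filters = partition_filters_for(written_values, partition_column)
    if not filters:
        return None  # nothing was written, so there is nothing to optimize
    return DeltaTable(table_uri).optimize.z_order(
        columns=z_columns, partition_filters=filters
    )
```

For the reproduction in this issue, the call after the third write would be z_order_written_partitions('tmp', ['a'], 'A', ['B']), which limits the optimization to partition A=a.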

@aldder aldder closed this as completed Feb 4, 2025