vacuum is very slow on Cloudflare R2 #1366

Closed
djouallah opened this issue May 14, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@djouallah

djouallah commented May 14, 2023

Environment

Version: 0.9
Binding: Python
Environment: Cloudflare R2


Bug

Edit: the issue is with vacuum; it is very slow for a delete operation.

I am running a cloud function to vacuum and optimize a small Delta table in Cloudflare R2. The table currently has 45 partitions (one per day), and every day I insert 288 new small files.

The function takes nearly 10 minutes to finish. That seems very slow, and I am not sure it will scale later when the table grows.

Here is the code I use:

import os

from deltalake import DeltaTable

delta_path = 's3://delta/scada'

# Credentials and the R2 endpoint are read from environment variables.
storage_options = {
    "Region": "us-east-1",
    "AWS_ACCESS_KEY_ID": os.environ.get("aws_access_key_id_secret"),
    "AWS_SECRET_ACCESS_KEY": os.environ.get("aws_secret_access_key_secret"),
    "AWS_ENDPOINT_URL": os.environ.get("endpoint_url_secret"),
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}


def compaction(request):
    dt = DeltaTable(delta_path, storage_options=storage_options)
    dt.optimize()  # compact the small files
    # Delete unreferenced files older than 24 hours; the retention check is
    # disabled so a period shorter than the table's configured minimum is accepted.
    dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)
    return 'done'
@djouallah added the bug label May 14, 2023
@Blajda
Collaborator

Blajda commented May 15, 2023

Hi @djouallah,
Do you have any insight into which operation takes the longest?
For optimize, we are looking to support parallel processing, which is outlined in #1171.
Since R2 uses the S3 API, vacuum should be deleting multiple files within a single API call, so I suspect it's not the limiting factor here.

@djouallah changed the title from "Optimize and vacuum are very slow on Cloudflare R2" to "vacuum is very slow on Cloudflare R2" May 15, 2023
@djouallah
Author

@Blajda it turns out the issue is vacuum! I removed it from the cloud function, and the run time went from 10 minutes to 1 minute.

@djouallah
Author

Now I am testing vacuum alone, and it timed out after 5 minutes. A delete should not take that long, should it?

@Blajda
Collaborator

Blajda commented May 16, 2023

How many files do you have under s3://delta/scada? Vacuum recursively lists all files under the delta root and reconciles them with the log. To determine whether the scan is the cause of the issue, you can set dry_run to True.
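One way to check whether the listing or the deletes dominate is to time a dry run and a real run separately. A minimal sketch, assuming `dt` is the `DeltaTable` from the original post; the helper name `time_vacuum` is made up for illustration, and it relies only on the `vacuum` keyword arguments already used in this thread:

```python
import time


def time_vacuum(dt, retention_hours, dry_run):
    """Run dt.vacuum once and return (elapsed_seconds, files).

    `dt` is expected to be a deltalake.DeltaTable (or any object with the
    same vacuum signature). With dry_run=True nothing is deleted, so the
    elapsed time covers only listing the table root and reconciling it
    with the log.
    """
    start = time.perf_counter()
    files = dt.vacuum(
        retention_hours=retention_hours,
        dry_run=dry_run,
        enforce_retention_duration=False,
    )
    return time.perf_counter() - start, files
```

Calling `time_vacuum(dt, 24, dry_run=True)` and then `time_vacuum(dt, 24, dry_run=False)` gives two timings to compare; if the dry run is fast and the real run is slow, the deletes rather than the scan are the likely bottleneck.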

@djouallah
Author

I already vacuumed and optimized today, so I don't have many files left, but I produce 288 new files daily anyway.

Running this with only 68 files is rather fast; it took 3 seconds:
dt.vacuum(retention_hours=1, dry_run=True, enforce_retention_duration=False)

@roeap
Copy link
Collaborator

roeap commented May 16, 2023

Do you have any insight into memory consumption? In my experience, whenever we have seen such a difference once we get to a certain size, it often relates to memory limitations.

> vacuum should be deleting multiple files within a single API

Unfortunately, this is not yet implemented on the object-store side, but there is some previous work in apache/arrow-rs#4060.
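For context, the S3 `DeleteObjects` call accepts up to 1,000 keys per request, so a batched vacuum would first split its candidate paths into groups of that size. A rough sketch of just the grouping step; `chunk_keys` is a hypothetical helper, not part of deltalake or object_store, and whether R2 applies the same limit is an assumption here:

```python
from itertools import islice

# Per-request key limit of the S3 DeleteObjects API.
S3_DELETE_OBJECTS_MAX_KEYS = 1000


def chunk_keys(keys, size=S3_DELETE_OBJECTS_MAX_KEYS):
    """Yield successive lists of at most `size` keys from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, 1,152 keys would go out as two requests (1,000 and 152 keys) rather than 1,152 single deletes.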

@djouallah
Author

Memory seems fine. For that run, I think it deleted a total of 288 x 4 days of files.
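Taking the numbers in this thread at face value gives a rough per-delete cost. This is a back-of-the-envelope estimate only: it assumes the 5-minute run was spent almost entirely on deletes issued one after another, which the thread does not confirm (that run timed out).

```python
files_per_day = 288        # new small files written each day
days = 4                   # days of files removed in that run
timeout_seconds = 5 * 60   # the run timed out after 5 minutes

files_to_delete = files_per_day * days
seconds_per_delete = timeout_seconds / files_to_delete

print(files_to_delete)               # 1152
print(round(seconds_per_delete, 2))  # 0.26
```

Roughly a quarter of a second per object would be consistent with one HTTP round trip per file, that is, with deletes being sent one at a time.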


wjones127 added a commit that referenced this issue May 21, 2023
# Description

We don't yet have batch deletes in object store (but will soon). In the
meantime, we can at least issue multiple requests in parallel. Set the
default at 10. I think that should be reasonable for now; later we can
optimize to try to find the right rate to avoid rate-limiting.

# Related Issue(s)

* helps with #1366

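The idea in the commit message above, issuing single-object deletes concurrently with a cap of 10, can be illustrated in Python. This is only a sketch of the technique, not the delta-rs implementation (which is written in Rust); `delete_one` is a hypothetical callable that removes a single object, and `delete_concurrently` is a made-up name.

```python
from concurrent.futures import ThreadPoolExecutor

# Mirrors the default mentioned in the commit message.
DEFAULT_CONCURRENCY = 10


def delete_concurrently(paths, delete_one, max_workers=DEFAULT_CONCURRENCY):
    """Call delete_one(path) for every path, at most max_workers at a time.

    Returns the list of paths once every delete has succeeded. The first
    exception raised by delete_one propagates to the caller.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order and re-raises worker exceptions
        # when its results are consumed.
        for _ in pool.map(delete_one, paths):
            pass
    return paths
```

A fixed cap keeps the number of in-flight requests bounded, which is the simplest guard against the rate limiting the commit message mentions; the right value for a given store would have to be measured.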
@djouallah
Author

djouallah commented Jun 10, 2023

Thank you, vacuum went from 5 minutes to 26 seconds. Beautiful work.

