Support s3 object store without dynamodb lock #974

Closed
mpetri opened this issue Nov 29, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@mpetri

mpetri commented Nov 29, 2022

Description

In the Rust crate, is it possible to support S3-based delta lakes without the need to pull in and use the DynamoDB lock client? I understand the need for the lock client (after reading the paper), but if I know I will only ever have one writer for the delta lake, I don't really need the locking mechanism.

Could I achieve this by manually creating an object store (with an S3 backend) and passing it to deltalake?

@mpetri mpetri added the enhancement New feature or request label Nov 29, 2022
@wjones127
Collaborator

Yes, we recently added an option in the S3 backend called AWS_S3_ALLOW_UNSAFE_RENAME that allows building the S3 storage backend without any lock configured. I haven't tested yet whether it works without compiling the DynamoDB dependencies, though; I'll need to check on that.
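
For reference, a minimal sketch of setting that option through storage options, assuming the builder's `with_storage_options` API in delta-rs; the bucket URI is a placeholder:

```rust
use std::collections::HashMap;

use deltalake::{DeltaTableBuilder, DeltaTableError};

#[tokio::main]
async fn main() -> Result<(), DeltaTableError> {
    let mut storage_options = HashMap::new();
    // Skip the DynamoDB lock entirely. Only safe when there is a single writer.
    storage_options.insert(
        "AWS_S3_ALLOW_UNSAFE_RENAME".to_string(),
        "true".to_string(),
    );

    let table = DeltaTableBuilder::from_uri("s3://my-bucket/my-table")
        .with_storage_options(storage_options)
        .load()
        .await?;

    println!("loaded table at version {}", table.version());
    Ok(())
}
```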

@roeap
Collaborator

roeap commented Nov 29, 2022

@mpetri - just had a quick scan of our code, and you should be able to pass in a custom object store using the DeltaTableBuilder option with_object_store. You could then pull in object_store as a separate crate with the AWS feature. Unfortunately you would have to write a thin wrapper, since we are calling the *_if_not_exists methods, which will raise "not implemented" in the object_store crate (a rough sketch of such a wrapper follows below).

That said, we should probably look into providing a feature that allows compiling for single-writer scenarios.
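
To make the wrapper idea concrete, here is a rough sketch, assuming the object_store 0.5-era trait surface; `UnsafeRenameStore` is a hypothetical name, and the fallback from `copy_if_not_exists` to a plain `copy` is only safe with exactly one writer:

```rust
use std::fmt::Display;
use std::ops::Range;

use async_trait::async_trait;
use bytes::Bytes;
use futures::stream::BoxStream;
use object_store::path::Path;
use object_store::{GetResult, ListResult, MultipartId, ObjectMeta, ObjectStore, Result};
use tokio::io::AsyncWrite;

/// Wraps an inner store and downgrades `copy_if_not_exists` to a plain
/// `copy`. This loses the atomicity that concurrent commits rely on, so
/// it is only safe with a single writer.
#[derive(Debug)]
struct UnsafeRenameStore<T: ObjectStore>(T);

impl<T: ObjectStore> Display for UnsafeRenameStore<T> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "UnsafeRenameStore({})", self.0)
    }
}

#[async_trait]
impl<T: ObjectStore> ObjectStore for UnsafeRenameStore<T> {
    async fn put(&self, location: &Path, bytes: Bytes) -> Result<()> {
        self.0.put(location, bytes).await
    }

    async fn put_multipart(
        &self,
        location: &Path,
    ) -> Result<(MultipartId, Box<dyn AsyncWrite + Unpin + Send>)> {
        self.0.put_multipart(location).await
    }

    async fn abort_multipart(&self, location: &Path, id: &MultipartId) -> Result<()> {
        self.0.abort_multipart(location, id).await
    }

    async fn get(&self, location: &Path) -> Result<GetResult> {
        self.0.get(location).await
    }

    async fn get_range(&self, location: &Path, range: Range<usize>) -> Result<Bytes> {
        self.0.get_range(location, range).await
    }

    async fn head(&self, location: &Path) -> Result<ObjectMeta> {
        self.0.head(location).await
    }

    async fn delete(&self, location: &Path) -> Result<()> {
        self.0.delete(location).await
    }

    async fn list(&self, prefix: Option<&Path>) -> Result<BoxStream<'_, Result<ObjectMeta>>> {
        self.0.list(prefix).await
    }

    async fn list_with_delimiter(&self, prefix: Option<&Path>) -> Result<ListResult> {
        self.0.list_with_delimiter(prefix).await
    }

    async fn copy(&self, from: &Path, to: &Path) -> Result<()> {
        self.0.copy(from, to).await
    }

    // S3 has no native conditional copy, so fall back to an unconditional
    // copy: a last-writer-wins "rename" that is racy with multiple writers.
    async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
        self.0.copy(from, to).await
    }
}
```

You could then wrap the S3 store from object_store's aws module in this type and hand it to the builder via with_object_store.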

@mpetri
Author

mpetri commented Dec 1, 2022

I'm currently blocked on the other bug I reported (can't compile the crate with S3 support), so I might give this a try, thanks.

Should I keep this issue open? It seems like a valid request.

@wjones127
Collaborator

Yes, please keep it open.

@cmackenzie1
Contributor

Giving a bump to this FR as I am using an S3-compatible object store (Cloudflare R2) and would like some way to support concurrent writes across processes - currently this is managed via a single process and a Mutex.

Perhaps we could replace the locking implementation with a trait, similar to tokio::sync::Mutex, as it would likely need to be held across .await points if using an external service for locking - for example, etcd, Cloudflare Durable Objects, or ZooKeeper (a hypothetical sketch of such a trait follows below).
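
One hypothetical shape for such a trait; the names `LogLock`, `Guard`, and `LockError` are illustrative, not part of delta-rs:

```rust
use async_trait::async_trait;

/// Illustrative error type for lock acquisition/release failures.
#[derive(Debug, thiserror::Error)]
pub enum LockError {
    #[error("lock acquisition timed out")]
    Timeout,
    #[error("lock provider error: {0}")]
    Provider(String),
}

/// A distributed lock that can be held across `.await` points, analogous
/// to `tokio::sync::Mutex` but backed by an external service such as
/// DynamoDB, etcd, ZooKeeper, or a Cloudflare Durable Object.
#[async_trait]
pub trait LogLock: Send + Sync {
    /// Opaque guard; the lock is held for as long as the guard lives.
    type Guard: Send;

    /// Acquire the lock for the given table URI, waiting if necessary.
    async fn acquire(&self, table_uri: &str) -> Result<Self::Guard, LockError>;

    /// Explicitly release the lock (implementations may also release on drop).
    async fn release(&self, guard: Self::Guard) -> Result<(), LockError>;
}
```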

@wjones127
Collaborator

Do you care about something that works across S3-compatible APIs? Or just about R2?

If you care specifically about R2, I think the better solution is to support it through the object store rather than have some separate locking mechanism. Unlike S3, R2 has support for conditional PutObject (docs). I think that could be used to implement a workable rename_if_not_exists operation (or maybe the same headers are supported in Copy / Replace operations?). A rough sketch of the idea follows at the end of this comment.

(Though also note that R2 doesn't work well right now because their multi-part upload doesn't seem to be compatible with S3.)

If S3 ever comes out with support for atomic rename_if_not_exists or copy_if_not_exists, then the whole lock client thing will be moot. GCS and Azure Blob Storage don't need any locking client because they support these operations out of the box.
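
To illustrate, a rough sketch of a put-if-absent primitive over R2's conditional writes, assuming R2 honors `If-None-Match: *` on PutObject as its conditional-header docs suggest (request signing is elided; the URL is presumed pre-signed):

```rust
use reqwest::{Client, StatusCode};

/// Attempt to create the object only if the key does not already exist.
/// Returns Ok(true) if the object was created and Ok(false) if it already
/// existed (HTTP 412 Precondition Failed).
async fn put_if_not_exists(
    client: &Client,
    presigned_url: &str,
    body: Vec<u8>,
) -> Result<bool, reqwest::Error> {
    let resp = client
        .put(presigned_url)
        // Conditional write: the server rejects the PUT with 412 if an
        // object already exists at this key.
        .header("If-None-Match", "*")
        .body(body)
        .send()
        .await?;

    match resp.status() {
        s if s.is_success() => Ok(true),
        StatusCode::PRECONDITION_FAILED => Ok(false),
        // Any other status: surface 4xx/5xx as an error.
        _ => resp.error_for_status().map(|_| false),
    }
}
```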

@cmackenzie1
Contributor

I am mostly just interested in R2 - let me check with the R2 team to see if CopyObject supports those conditional headers.

I figured switching to a trait would "plug in" better to the existing locking that uses DynamoDB, but I am fine with either approach. S3-compatible providers all have their own quirks, so a trait seemed like the most straightforward approach that lets the user deal with those.

> (Though also note that R2 doesn't work well right now because their multi-part upload doesn't seem to be compatible with S3.)

Kind of an aside, but can you send me details on the issue you are referencing there? I am in the Slack and would be interested in hearing about it so I can pass the feedback on to the R2 team.

@wjones127
Collaborator

@cmackenzie1 I need confirmation from the R2 team, but the implementation in object-store-rs is based on the one in Arrow C++, and I think there's an issue where they don't support non-equal part sizes: apache/arrow#34363 (comment)

@cmackenzie1
Contributor

I followed up with the R2 team, and they confirmed that it is still the case that their S3-compatible multipart uploads require all parts to be the same size (except the last).

For the CopyObject operation, they do support the headers listed here: x-amz-copy-source-*

@roeap
Collaborator

roeap commented Jan 28, 2024

Closing this, as you can now bypass the requirement (while still being safe) by specifying the respective option on the object store. Most recently this was re-added here.

@roeap roeap closed this as completed Jan 28, 2024