feat: support bulk deletes in object_store #4060
Conversation
object_store/src/lib.rs (Outdated)

    /// this method to customize the parallelism or provide a progress indicator.
    ///
    /// This method may create multiple threads to perform the deletions in parallel.
    async fn delete_all(&self, locations: Vec<Path>) -> Result<()> {
What do you think about making this instead something like
    - async fn delete_all(&self, locations: Vec<Path>) -> Result<()> {
    + async fn delete_all(&self, locations: Vec<Path>) -> Result<BoxStream<'_, Result<()>>> {
This would allow for granular error and progress reporting. One could even go so far as to make the input also a BoxStream 🤔
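As a rough usage sketch, a caller could drive progress reporting off the returned stream. The helper name and the delete_all signature below follow the suggestion above rather than any merged API:

    use futures::StreamExt;

    async fn delete_with_progress<S: ObjectStore>(store: &S, locations: Vec<Path>) -> Result<()> {
        let mut results = store.delete_all(locations).await?;
        let (mut deleted, mut failed) = (0usize, 0usize);
        while let Some(result) = results.next().await {
            match result {
                Ok(()) => deleted += 1,
                Err(e) => {
                    failed += 1;
                    eprintln!("delete failed: {e}");
                }
            }
            println!("progress: {deleted} deleted, {failed} failed");
        }
        Ok(())
    }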
I think to implement Azure blob store and GCS, we'll need to do work upstream: they both used the …
✅ I've run the integration tests against AWS S3 and Cloudflare R2. Both pass 👍 (although Cloudflare fails the …)
Thank you for this, I left some comments
    @@ -129,6 +153,43 @@ struct MultipartPart {
        part_number: usize,
    }

    #[derive(Deserialize)]
    #[serde(rename_all = "PascalCase", rename = "DeleteResult")]
    struct BatchDeleteResponse {
Given we seem to have enabled a feature to get this to deserialize correctly, could we perhaps get a basic test of this deserialization logic, i.e. that given a payload with mixed successes and failures, it deserializes correctly?
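A minimal sketch of what such a test might look like, assuming the response body is collected into an enum of Deleted/Error children; the field layout and enum here are illustrative, not necessarily the structs in this PR:

    #[derive(Debug, serde::Deserialize)]
    #[serde(rename_all = "PascalCase", rename = "DeleteResult")]
    struct BatchDeleteResponse {
        // Collects the interleaved <Deleted> and <Error> children in order.
        #[serde(rename = "$value")]
        content: Vec<DeleteObjectResult>,
    }

    #[derive(Debug, serde::Deserialize)]
    enum DeleteObjectResult {
        #[serde(rename_all = "PascalCase")]
        Deleted { key: String },
        #[serde(rename_all = "PascalCase")]
        Error { key: String, code: String, message: String },
    }

    #[test]
    fn deserializes_mixed_success_and_failure() {
        // Two successful deletions with a failure in between.
        let body = "<DeleteResult>\
            <Deleted><Key>a.parquet</Key></Deleted>\
            <Error><Key>b.parquet</Key><Code>AccessDenied</Code><Message>Access Denied</Message></Error>\
            <Deleted><Key>c.parquet</Key></Deleted>\
            </DeleteResult>";

        let response: BatchDeleteResponse = quick_xml::de::from_str(body).unwrap();
        assert_eq!(response.content.len(), 3);
    }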
Done.
object_store/src/aws/client.rs (Outdated)

    let inner_body = paths
        .iter()
        .map(|path| format!("<Object><Key>{}</Key></Object>", path))
I think this might run into weirdness due to XML escaping; could we use quick-xml to serialize this payload instead of string formatting? I think Path allows '&', for example, so a test of this would be superb...
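For illustration, serializing the payload via serde and quick-xml might look roughly like this; the struct names and the test are assumptions for the sketch, not the PR's actual types:

    use serde::Serialize;

    #[derive(Serialize)]
    #[serde(rename = "Delete")]
    struct BatchDeleteRequest {
        // Serializes as one <Object> element per path.
        #[serde(rename = "Object")]
        objects: Vec<ObjectIdentifier>,
    }

    #[derive(Serialize)]
    #[serde(rename_all = "PascalCase")]
    struct ObjectIdentifier {
        key: String,
    }

    #[test]
    fn escapes_xml_special_characters() {
        let request = BatchDeleteRequest {
            objects: vec![ObjectIdentifier {
                key: "a & b < c.txt".to_string(),
            }],
        };
        // quick-xml escapes reserved characters in text content, unlike the
        // naive format! approach above.
        let body = quick_xml::se::to_string(&request).unwrap();
        assert!(body.contains("<Key>a &amp; b &lt; c.txt</Key>"));
    }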
I've added a test with characters special to XML
object_store/src/util.rs (Outdated)

    @@ -170,6 +178,22 @@ fn merge_ranges(
        ret
    }

    /// Common implementation for delete_all
    #[allow(dead_code)]
    pub(crate) fn delete_all_helper<'a>(
I wonder if we could follow the pattern in #4220 of using extension traits for this
TBH, I think the helper turned out not to be necessary, so I removed it entirely.
object_store/src/aws/mod.rs (Outdated)

    fn delete_stream<'a>(
        &'a self,
        locations: BoxStream<'a, Path>,
    ) -> BoxStream<'a, BoxFuture<'a, Result<Vec<Result<Path>>>>> {
This signature, whilst it provides maximum flexibility to the upstreams, is kind of obnoxious to use. What do you think of returning BoxStream<'a, Result<Path>> and letting the individual stores control the concurrency, much like we do for coalesce_ranges? This would have the advantage of letting them choose an appropriate value. It would also avoid overheads for stores that don't support bulk deletes, FWIW.
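For illustration, a default implementation under that proposed signature could fall back to per-object deletes with bounded concurrency, leaving stores with native bulk support to override it. A sketch only, assuming it sits inside the ObjectStore trait with futures' StreamExt in scope:

    fn delete_stream<'a>(
        &'a self,
        locations: BoxStream<'a, Path>,
    ) -> BoxStream<'a, Result<Path>> {
        locations
            .map(|location| async move {
                // Fallback: one request per object.
                self.delete(&location).await?;
                Ok(location)
            })
            // The fan-out here is an arbitrary illustrative choice; each store
            // could pick a value appropriate to its backend.
            .buffered(10)
            .boxed()
    }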
I think this signature is fine, but how would I implement LimitStore with it?
Neither signature would really let you do this meaningfully, as neither exposes the granularity of the requests. I personally think this is fine; I suspect we may add request limiting to ClientOptions eventually to handle this.
object_store/src/lib.rs (Outdated)

    let paths = flatten_list_stream(storage, None).await.unwrap();

    for f in &paths {
        storage.delete(f).await.unwrap();
    }
    storage
        .delete_stream(futures::stream::iter(paths).boxed())
Thinking about this use-case, I wonder if the input should be a fallible stream; what do you think? You could then feed the output of the list operation directly into the bulk delete API.
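A rough sketch of that chaining with a fallible input stream, assuming list returns Result<BoxStream<'_, Result<ObjectMeta>>> as it did at the time; the function, store, and prefix names are placeholders:

    use futures::{StreamExt, TryStreamExt};

    async fn delete_prefix(store: &dyn ObjectStore, prefix: &Path) -> Result<()> {
        // Feed the listing straight into the bulk delete, no intermediate Vec.
        let locations = store
            .list(Some(prefix))
            .await?
            .map_ok(|meta| meta.location)
            .boxed();

        store
            .delete_stream(locations)
            .try_collect::<Vec<Path>>()
            .await?;
        Ok(())
    }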
It makes the API a little funky, but that might be fine. In most of my downstream use cases I'm probably passing in a Vec or iterator, so I'm already going to wrap it in futures::stream::iter. Having to add a .map(Ok) seems fine.
I mean the alternative would be to just pass in Vec<Path>, but I think if we're going to go the route of providing a streaming interface we should at least make it usable. I would be happy with either tbh; Vec<Path> does have the advantage of being simpler... Up to you 😄
Looking really nice, thank you for sticking with this. I think this is good to go, but left one final thought on the signature that I'd be interested to hear your take on.
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
object_store/src/aws/mod.rs (Outdated)

    ) -> BoxStream<'a, Result<Path>> {
        locations
            .chunks(1_000)
            .map(move |locations| async {
                let locations: Vec<Path> =
                    locations.into_iter().collect::<Result<_>>()?;
I think it would be better to use try_chunks, as it will short-circuit on the first error
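A self-contained sketch of that short-circuiting behaviour; plain Strings stand in for Path so the example runs on its own:

    use futures::{executor::block_on, stream, StreamExt, TryStreamExt};

    fn main() {
        block_on(async {
            // A fallible stream standing in for the incoming paths; the second
            // item is an error.
            let locations = stream::iter(vec![
                Ok::<_, String>("a".to_string()),
                Err("boom".to_string()),
                Ok("b".to_string()),
            ]);

            let mut chunks = locations.try_chunks(1_000);

            // try_chunks short-circuits: the first poll yields an error that
            // carries the items buffered so far ("a") together with the
            // underlying error, rather than burying the Err inside a chunk.
            let err = chunks.next().await.unwrap().unwrap_err();
            assert_eq!(err.0, vec!["a".to_string()]);
            assert_eq!(err.1, "boom".to_string());
        });
    }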
object_store/src/lib.rs (Outdated)

    @@ -1578,9 +1578,15 @@ mod tests {
    }

    async fn delete_fixtures(storage: &DynObjectStore) {
        let paths = flatten_list_stream(storage, None).await.unwrap();
        // let paths = flatten_list_stream(storage, None).await.unwrap();
    // let paths = flatten_list_stream(storage, None).await.unwrap();
Looking nice 😄
Shall we merge it?
I'll do some final clean ups later today, then it should be ready.
Which issue does this PR close?
Closes #2615.
Rationale for this change
Provides methods for quickly deleting large numbers of objects, such as when dropping a Parquet table.
What changes are included in this PR?
Introduces two new methods on ObjectStore, each with a default implementation. One provides the bulk deletion method; the other provides the number of objects that can be deleted in one underlying call. The latter can be used if users want to control the parallelism themselves or implement progress tracking.
Are there any user-facing changes?
Adds new APIs, with inline documentation.
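For illustration, the two described methods might look roughly like this; the names, the batch-size default, and the fallback loop are assumptions for the sketch, not the merged API:

    use async_trait::async_trait;

    #[async_trait]
    pub trait ObjectStore: Send + Sync + 'static {
        /// Existing single-object delete.
        async fn delete(&self, location: &Path) -> Result<()>;

        /// How many objects a single underlying call can delete; callers can
        /// use this to batch work themselves or report progress.
        /// Illustrative name and default.
        fn max_delete_batch_size(&self) -> usize {
            1
        }

        /// Bulk delete, with a default that falls back to one request per
        /// object. Illustrative name and default.
        async fn delete_all(&self, locations: Vec<Path>) -> Result<()> {
            for location in &locations {
                self.delete(location).await?;
            }
            Ok(())
        }
    }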