-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38700: [C++][FS][Azure] Implement DeleteDir()
#38793
Conversation
|
@Tom-Newton @felipecrv You may want to review this. |
cpp/src/arrow/filesystem/azurefs.cc
Outdated
if (!hierarchical_namespace_enabled) { | ||
// Without hierarchical namespace enabled Azure blob storage has no directories. | ||
// Therefore we can't, and don't need to delete one. | ||
return Status::OK(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no actual directory but there could be blobs that are considered part of this implied directory. I think in this case we should delete those blobs.
I think that will require listing blobs for the prefix (internal::EnsureTrailingSlash(location.path)
) then iterating through the result and deleting all those blobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Tom-Newton here. Azure API might have an endpoint to delete all blobs with a certain prefix, so we don't necessarily have to loop from the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it makes sense. I'll do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It seems that Azure Blob Storage doesn't provide an API that deletes blobs by prefix.
I'll implement it with the list and delete approach.
TEST_F(AzuriteFileSystemTest, DeleteContainerDirSuccess) { | ||
auto container_name = RandomContainerName(); | ||
ASSERT_OK(fs_->CreateDir(container_name)); | ||
ASSERT_OK(fs_->DeleteDir(container_name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound);
at the end? Personally I would probably also add an assertion that the container does exist before deleting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll add them.
ASSERT_OK(fs_->DeleteDir(container_name)); | ||
} | ||
|
||
TEST_F(AzuriteFileSystemTest, DeleteDirSuccess) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.
Would be nice to add a non-empty test case for hierarchical namespace too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.
OK. I'll add it.
Would be nice to add a non-empty test case for hierarchical namespace too.
AzureHierarchicalNamespaceFileSystemTest.DeleteDirSuccess
is for the case.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
if (!hierarchical_namespace_enabled) { | ||
// Without hierarchical namespace enabled Azure blob storage has no directories. | ||
// Therefore we can't, and don't need to delete one. | ||
return Status::OK(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Tom-Newton here. Azure API might have an endpoint to delete all blobs with a certain prefix, so we don't necessarily have to loop from the client.
const auto path = | ||
internal::ConcatAbstractPath(PreexistingContainerName(), RandomDirectoryName()); | ||
// There is only virtual directory without hierarchical namespace | ||
// support. So the DeleteDir() does nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it delete all the blobs that start with path/to/dir/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make sense. I'll implement it and add a test for the case.
7e25d13
to
7212e6a
Compare
Updated:
|
cpp/src/arrow/filesystem/azurefs.cc
Outdated
return Status::IOError("Failed to delete a blob: ", blob_item.Name, | ||
": " + container_client.GetUrl()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would save this on a Status
variable and let all the loops continue, then check the saved status at the end.
This way we don't waste the effort made to list the blobs and a caller retrying after we return an error at the end is more likely to have less work to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could cancel the outer loop listing blobs if an error was detected, but I think GetResponse
should be called on all the deferred responses we have sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to continue the loop to avoid wasting the list effort. Because we already submitted all delete requests by SubmitBatch()
. This loop just checks the results.
I think
GetResponse
should be called on all the deferred responses we have sent.
Why? Is it for avoiding a resource leak? I think that not calling GetResponse()
doesn't leak any resource. Because I think that GetResponse()
just returns a sub response returned by SubmitBatch()
. I think that all sub responses are managed by SubmitBatch()
response and the response is already read.
See also a SubmitBatch()
response example: https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#sample-response
Anyway, I'll check all deferred responses to show all failed blob names in error message. It'll help debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I thought GetResponse
would block if necessary, but all the futures are waited for during SubmitBatch
.
It could block, but in the specific case of batched requests, they are all at ready state ->
https://github.com/Azure/azure-sdk-for-cpp/blob/4a32d7266cfac8bfc0eb87feb56011361a36f43c/sdk/storage/azure-storage-blobs/src/blob_batch.cpp#L248
This a bit disappointing because it reduces the parallelism that could be exploited, but surely reduces the risk of API misuse.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
auto list_response = container_client.ListBlobs(options); | ||
while (list_response.HasPage() && !list_response.Blobs.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the ListBlobs
errors handled here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry. I missed it. I'll add try
/catch
for it.
It's an Azurite bug: Azure/Azurite#2302 Can we skip the test only on macOS? |
Updated:
I'll merge this once CI is finished with green. |
When running Nothing is wrong with the buffers passed to the function, so it must be a false positive coming from OpenSSL code.
|
@felipecrv Oh... Could you open a new issue for this with full log? I'll merge this for now because our CI doesn't have ASAN on macOS. (We have it for Linux but it doesn't report this error: https://github.com/apache/arrow/actions/runs/6970987063/job/18970050943?pr=38793 ) We can work on it as a separated task. |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7da7895. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change `DeleteDir()` deletes the given directory recursively like other filesystem implementations. ### What changes are included in this PR? * Container can be deleted with/without hierarchical namespace support. * Directory can be deleted with hierarchical namespace support. * Directory can't be deleted without hierarchical namespace support. But blobs under the given path can be deleted. So these blobs are deleted and the given virtual directory is also deleted. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: apache#38700 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
DeleteDir()
deletes the given directory recursively like other filesystem implementations.What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
DeleteDir()
#38700