GH-38700: [C++][FS][Azure] Implement DeleteDir() #38793

Merged
merged 7 commits into from
Nov 24, 2023

Conversation

kou
Member

@kou kou commented Nov 20, 2023

Rationale for this change

DeleteDir() deletes the given directory recursively like other filesystem implementations.

What changes are included in this PR?

  • Container can be deleted with/without hierarchical namespace support.
  • Directory can be deleted with hierarchical namespace support.
  • A directory can't be deleted without hierarchical namespace support, because only virtual directories exist there. But blobs under the given path can be deleted, so these blobs are deleted, which also removes the given virtual directory.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

⚠️ GitHub issue #38700 has been automatically assigned in GitHub to PR creator.

@kou
Member Author

kou commented Nov 20, 2023

@Tom-Newton @felipecrv You may want to review this.

Comment on lines 755 to 759
if (!hierarchical_namespace_enabled) {
  // Without hierarchical namespace enabled Azure blob storage has no directories.
  // Therefore we can't, and don't need to delete one.
  return Status::OK();
}
Contributor

There is no actual directory but there could be blobs that are considered part of this implied directory. I think in this case we should delete those blobs.

I think that will require listing blobs for the prefix (internal::EnsureTrailingSlash(location.path)) then iterating through the result and deleting all those blobs.

Contributor

I agree with @Tom-Newton here. Azure API might have an endpoint to delete all blobs with a certain prefix, so we don't necessarily have to loop from the client.

Member Author

Ah, it makes sense. I'll do it.

Member Author

Hmm. It seems that Azure Blob Storage doesn't provide an API that deletes blobs by prefix.

I'll implement it with the list and delete approach.
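
The list-and-delete approach can be sketched roughly as follows. This is only an illustration on an in-memory stand-in, not the Azure SDK and not the code in this PR: `FakeContainer`, `ListBlobsWithPrefix` and `DeleteVirtualDir` are hypothetical names, and `EnsureTrailingSlash` here is a local helper modeled on the one mentioned in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a flat-namespace container:
// blob name -> contents. This is not the Azure SDK.
using FakeContainer = std::map<std::string, std::string>;

// Local helper: appends '/' unless the path is empty or already ends with one.
std::string EnsureTrailingSlash(const std::string& path) {
  if (path.empty() || path.back() == '/') return path;
  return path + '/';
}

// Returns the names of all blobs whose name starts with `prefix`.
std::vector<std::string> ListBlobsWithPrefix(const FakeContainer& container,
                                             const std::string& prefix) {
  std::vector<std::string> names;
  // std::map is ordered, so all matching names are contiguous from lower_bound().
  for (auto it = container.lower_bound(prefix); it != container.end(); ++it) {
    if (it->first.compare(0, prefix.size(), prefix) != 0) break;
    names.push_back(it->first);
  }
  return names;
}

// Deletes every blob under the virtual directory `dir` and returns the count.
std::size_t DeleteVirtualDir(FakeContainer& container, const std::string& dir) {
  const auto names = ListBlobsWithPrefix(container, EnsureTrailingSlash(dir));
  for (const auto& name : names) container.erase(name);
  return names.size();
}
```

The trailing slash matters in this sketch: deleting `dir` uses the prefix `dir/`, so a sibling blob such as `dir2/c.txt` is left alone.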

TEST_F(AzuriteFileSystemTest, DeleteContainerDirSuccess) {
  auto container_name = RandomContainerName();
  ASSERT_OK(fs_->CreateDir(container_name));
  ASSERT_OK(fs_->DeleteDir(container_name));
Contributor

Should we add an arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound); at the end? Personally, I would probably also add an assertion that the container exists before deleting it.

Member Author

OK. I'll add them.

ASSERT_OK(fs_->DeleteDir(container_name));
}

TEST_F(AzuriteFileSystemTest, DeleteDirSuccess) {
Contributor

I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.

Would be nice to add a non-empty test case for hierarchical namespace too.

Member Author

> I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.

OK. I'll add it.

> Would be nice to add a non-empty test case for hierarchical namespace too.

AzureHierarchicalNamespaceFileSystemTest.DeleteDirSuccess is for the case.

const auto path =
    internal::ConcatAbstractPath(PreexistingContainerName(), RandomDirectoryName());
// There is only virtual directory without hierarchical namespace
// support. So the DeleteDir() does nothing.
Contributor

Shouldn't it delete all the blobs that start with path/to/dir/?

Member Author

It makes sense. I'll implement it and add a test for the case.

@kou kou force-pushed the cpp-azurefs-delete-dir branch from 7e25d13 to 7212e6a Compare November 21, 2023 07:31
@kou
Member Author

kou commented Nov 21, 2023

Updated:

  • Add more tests.
  • Delete blobs under the given directory when hierarchical namespace isn't enabled (list blobs and delete them approach)

Comment on lines 1052 to 1053
return Status::IOError("Failed to delete a blob: ", blob_item.Name,
                       ": " + container_client.GetUrl());
Contributor

@felipecrv felipecrv Nov 21, 2023

I would save this on a Status variable and let all the loops continue, then check the saved status at the end.

This way we don't waste the effort made to list the blobs and a caller retrying after we return an error at the end is more likely to have less work to do.

Contributor

You could cancel the outer loop listing blobs if an error was detected, but I think GetResponse should be called on all the deferred responses we have sent.

Member Author

We don't need to continue the loop to avoid wasting the listing effort, because we have already submitted all delete requests with SubmitBatch(). This loop just checks the results.

> I think GetResponse should be called on all the deferred responses we have sent.

Why? Is it to avoid a resource leak? I don't think that skipping GetResponse() leaks any resource, because GetResponse() should just return a sub-response that SubmitBatch() already received. As far as I can tell, all sub-responses are managed by the SubmitBatch() response, and that response has already been read.
See also a SubmitBatch() response example: https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#sample-response

Anyway, I'll check all deferred responses so that the error message shows all failed blob names. It'll help with debugging.
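
As a rough illustration of reporting every failed blob instead of returning at the first failure, the sketch below folds per-blob outcomes into one message. `DeleteResult`, `SimpleStatus` and `SummarizeBatch` are hypothetical stand-ins, not Arrow's `Status` or the Azure SDK types, and the message format is only an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical outcome of one delete sub-request in a batch.
struct DeleteResult {
  std::string blob_name;
  bool deleted;
};

// Minimal stand-in for a status object: an empty message means success.
struct SimpleStatus {
  std::string message;
  bool ok() const { return message.empty(); }
};

// Walks all results (no early return) and names every blob that failed.
SimpleStatus SummarizeBatch(const std::vector<DeleteResult>& results,
                            const std::string& container_url) {
  std::string failed;
  for (const auto& result : results) {
    if (result.deleted) continue;
    if (!failed.empty()) failed += ", ";
    failed += result.blob_name;
  }
  if (failed.empty()) return {};
  return {"Failed to delete blobs: " + failed + ": " + container_url};
}
```

A caller that retries after such an error can see from the message which blobs are still present.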

Contributor

You're right. I thought GetResponse would block if necessary, but all the futures are waited for during SubmitBatch.

It could block, but in the specific case of batched requests, they are all in the ready state:
https://github.com/Azure/azure-sdk-for-cpp/blob/4a32d7266cfac8bfc0eb87feb56011361a36f43c/sdk/storage/azure-storage-blobs/src/blob_batch.cpp#L248

This is a bit disappointing because it reduces the parallelism that could be exploited, but it surely reduces the risk of API misuse.

Comment on lines 1030 to 1031
auto list_response = container_client.ListBlobs(options);
while (list_response.HasPage() && !list_response.Blobs.empty()) {
Contributor

How are the ListBlobs errors handled here?

Member Author

Oh, sorry. I missed it. I'll add try/catch for it.
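
One possible shape for that try/catch is sketched below, with the paged listing modeled as a plain callable. `ListOutcome`, `ListAllPages` and `fetch_page` are hypothetical, and catching `std::exception` is an assumption made for the sketch; real code would catch whatever exception type the SDK call throws.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result of walking all listing pages.
struct ListOutcome {
  bool ok;
  std::string error;               // set only when ok is false
  std::vector<std::string> names;  // names collected before any failure
};

// `fetch_page(i)` is a hypothetical callable returning page i of blob names.
// An empty page means there are no more pages. It may throw on failure.
ListOutcome ListAllPages(
    const std::function<std::vector<std::string>(std::size_t)>& fetch_page) {
  assert(fetch_page && "fetch_page must be callable");
  ListOutcome out{true, "", {}};
  try {
    for (std::size_t page = 0;; ++page) {
      const auto names = fetch_page(page);
      if (names.empty()) break;
      out.names.insert(out.names.end(), names.begin(), names.end());
    }
  } catch (const std::exception& e) {
    // Convert the exception into an error value instead of letting it escape.
    out.ok = false;
    out.error = std::string("Failed to list blobs: ") + e.what();
  }
  return out;
}
```

The try block wraps the whole loop because fetching any later page can fail, not only the first call.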

@kou
Member Author

kou commented Nov 22, 2023

SubmitBatch() failed only on macOS:

https://github.com/Azure/azure-sdk-for-cpp/blob/4a32d7266cfac8bfc0eb87feb56011361a36f43c/sdk/storage/azure-storage-blobs/src/blob_batch.cpp#L247-L266

'fs_->DeleteDir(directory_path)' failed with IOError: Failed to delete blobs in a directory: nvpq8dvsiy28kxtqwqa8qxqgddkka1ck: http://127.0.0.1:10000/devstoreaccount1/vai1bmd4cf5fd5s15ag8xybj1pc12z9a Azure Error: 400 One of the request inputs is not valid.

It's an Azurite bug: Azure/Azurite#2302

Can we skip the test only on macOS?

@kou
Member Author

kou commented Nov 23, 2023

Updated:

  • I've skipped the test only on macOS.
  • I've added a missing try/catch for GetResponse(). I confirmed that it may throw an exception when a subrequest fails.
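
A minimal sketch of guarding each deferred response separately, so that one throwing sub-response does not hide the others. `DeferredDelete` and `CollectFailedBlobs` are hypothetical; the `get_response` callable merely stands in for a deferred response whose retrieval throws when the sub-request failed.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical deferred result of one delete sub-request.
struct DeferredDelete {
  std::string blob_name;
  std::function<void()> get_response;  // throws if the sub-request failed
};

// Checks every deferred response and returns the names of the failed blobs.
std::vector<std::string> CollectFailedBlobs(
    const std::vector<DeferredDelete>& deferred) {
  std::vector<std::string> failed;
  for (const auto& item : deferred) {
    assert(item.get_response && "get_response must be callable");
    try {
      item.get_response();
    } catch (const std::exception&) {
      // Record the failure and keep checking the remaining responses.
      failed.push_back(item.blob_name);
    }
  }
  return failed;
}
```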

I'll merge this once CI finishes green.

@felipecrv
Contributor

> I've skipped the test only on macOS.

When running arrow-azurefs-test on macOS (ARM) I get an ASAN error coming from OpenSSL's libcrypto.dylib.

Nothing is wrong with the buffers passed to the function, so it must be a false positive coming from OpenSSL code.

==73846==ERROR: AddressSanitizer: container-overflow on address 0x000104d4cfa0 at pc 0x0001026ef258 bp 0x00016fdfc060 sp 0x00016fdfb820
READ of size 64 at 0x000104d4cfa0 thread T0
    #0 0x1026ef254 in wrap_memcpy+0x13c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x1b254) (BuildId: f0a7ac5c49bc3abc851181b6f92b308a32000000200000000100000000000b00)
    #1 0x100ef1908 in hmac_setkey+0x64 (libcrypto.3.dylib:arm64+0x1c9908) (BuildId: 07025f7028533fc886375706991455ee32000000200000000100000000000d00)
    #2 0x100e1ca88 in EVP_Q_mac+0x12c (libcrypto.3.dylib:arm64+0xf4a88) (BuildId: 07025f7028533fc886375706991455ee32000000200000000100000000000d00)
    #3 0x100e2a29c in HMAC+0x94 (libcrypto.3.dylib:arm64+0x10229c) (BuildId: 07025f7028533fc886375706991455ee32000000200000000100000000000d00)
    ...

@kou
Member Author

kou commented Nov 24, 2023

@felipecrv Oh... Could you open a new issue for this with the full log? I'll merge this for now because our CI doesn't have ASAN on macOS. (We have it on Linux, but it doesn't report this error: https://github.com/apache/arrow/actions/runs/6970987063/job/18970050943?pr=38793 )

We can work on it as a separate task.

@kou kou merged commit 7da7895 into apache:main Nov 24, 2023
35 checks passed
@kou kou deleted the cpp-azurefs-delete-dir branch November 24, 2023 21:58
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7da7895.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change

`DeleteDir()` deletes the given directory recursively like other filesystem implementations.

### What changes are included in this PR?

* Container can be deleted with/without hierarchical namespace support.
* Directory can be deleted with hierarchical namespace support.
* Directory can't be deleted without hierarchical namespace support. But blobs under the given path can be deleted. So these blobs are deleted and the given virtual directory is also deleted.
    
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: apache#38700

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
[C++][FS][Azure] Implement DeleteDir()
4 participants