Skip to content

Commit

Permalink
GH-44256: [C++][FS][Azure] Fix edgecase where GetFileInfo incorrectly…
Browse files Browse the repository at this point in the history
… returns NotFound on flat namespace and Azurite (#44302)

### Rationale for this change
Fix a bug where `GetFileInfo` incorrectly returns `FileType::NotFound` on flat namespace and Azurite.

### What changes are included in this PR?
Fix by detecting the exact edgecase and doing an extra listing operation to disambiguate. 

### Are these changes tested?
Yes, updated automated test

### Are there any user-facing changes?
Only a bug fix.

* GitHub Issue: #44256

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
  • Loading branch information
Tom-Newton authored Oct 4, 2024
1 parent 7817e3c commit 3fb7777
Show file tree
Hide file tree
Showing 2 changed files with 35 additions and 0 deletions.
21 changes: 21 additions & 0 deletions cpp/src/arrow/filesystem/azurefs.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1793,6 +1793,8 @@ class AzureFileSystem::Impl {
// BlobPrefixes. A BlobPrefix always ends with kDelimiter ("/"), so we can
// distinguish between a directory and a file by checking if we received a
// prefix or a blob.
// This strategy allows us to implement GetFileInfo with just 1 blob storage
// operation in almost every case.
if (!list_response.BlobPrefixes.empty()) {
// Ensure the returned BlobPrefixes[0] string doesn't contain more characters than
// the requested Prefix. For instance, if we request with Prefix="dir/abra" and
Expand All @@ -1814,6 +1816,25 @@ class AzureFileSystem::Impl {
info.set_mtime(
std::chrono::system_clock::time_point{blob.Details.LastModified});
return info;
} else if (blob.Name[options.Prefix.Value().length()] < internal::kSep) {
// First list result did not indicate a directory and there is definitely no
// exactly matching blob. However, there may still be a directory that we
// initially missed because the first list result came before
// `options.Prefix + internal::kSep` lexigraphically.
// For example the flat namespace storage account has the following blobs:
// - container/dir.txt
// - container/dir/file.txt
// GetFileInfo(container/dir) should return FileType::Directory but in this
// edge case `blob = "dir.txt"`, so without further checks we would incorrectly
// return FileType::NotFound.
// Therefore we make an extra list operation with the trailing slash to confirm
// whether the path is a directory.
options.Prefix = internal::EnsureTrailingSlash(location.path);
auto list_with_trailing_slash_response = container_client.ListBlobs(options);
if (!list_with_trailing_slash_response.Blobs.empty()) {
info.set_type(FileType::Directory);
return info;
}
}
}
info.set_type(FileType::NotFound);
Expand Down
14 changes: 14 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2065,6 +2065,20 @@ void TestAzureFileSystem::TestGetFileInfoObjectWithNestedStructure() {
ASSERT_OK(output->Write(lorem_ipsum));
ASSERT_OK(output->Close());

// . is immediately before "/" lexicographically, ensure that this doesn't
// cause unexpected issues. NOTE: Its seems real Azure blob storage doesn't
// allow blob names to end in `.`
ASSERT_OK_AND_ASSIGN(output, fs()->OpenOutputStream(
data.ContainerPath("test-object-dir/some_other_dir.a"),
/*metadata=*/{}));
ASSERT_OK(output->Write(lorem_ipsum));
ASSERT_OK(output->Close());
ASSERT_OK_AND_ASSIGN(output,
fs()->OpenOutputStream(data.ContainerPath(kObjectName + ".a"),
/*metadata=*/{}));
ASSERT_OK(output->Write(lorem_ipsum));
ASSERT_OK(output->Close());

AssertFileInfo(fs(), data.ContainerPath(kObjectName), FileType::File);
AssertFileInfo(fs(), data.ContainerPath(kObjectName) + "/", FileType::NotFound);
AssertFileInfo(fs(), data.ContainerPath("test-object-dir"), FileType::Directory);
Expand Down

0 comments on commit 3fb7777

Please sign in to comment.