GH-38699: [C++][FS][Azure] Implement CreateDir() #38708

Merged
merged 7 commits into from
Nov 18, 2023
113 changes: 112 additions & 1 deletion cpp/src/arrow/filesystem/azurefs.cc
@@ -148,6 +148,19 @@ Status ValidateFilePath(const AzurePath& path) {
return Status::OK();
}

Status StatusFromErrorResponse(const std::string& url,
Azure::Core::Http::RawResponse* raw_response,
const std::string& context) {
const auto& body = raw_response->GetBody();
// There isn't an Azure specification that response body on error
// doesn't contain any binary data but we assume it. We hope that
// error response body has useful information for the error.
std::string_view body_text(reinterpret_cast<const char*>(body.data()), body.size());
return Status::IOError(context, ": ", url, ": ", raw_response->GetReasonPhrase(), " (",
static_cast<int>(raw_response->GetStatusCode()),
"): ", body_text);
}

template <typename ArrowType>
std::string FormatValue(typename TypeTraits<ArrowType>::CType value) {
struct StringAppender {
@@ -611,6 +624,99 @@ class AzureFileSystem::Impl {
RETURN_NOT_OK(ptr->Init());
return ptr;
}

Status CreateDir(const AzurePath& path) {
if (path.container.empty()) {
return Status::Invalid("Cannot create an empty container");
}

if (path.path_to_file.empty()) {
auto container_client =
blob_service_client_->GetBlobContainerClient(path.container);
try {
auto response = container_client.Create();
if (response.Value.Created) {
return Status::OK();
} else {
return StatusFromErrorResponse(
container_client.GetUrl(), response.RawResponse.get(),
"Failed to create a container: " + path.container);
}
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"Failed to create a container: " + path.container + ": " +
container_client.GetUrl(),
exception);
}
}

ARROW_ASSIGN_OR_RAISE(auto hierarchical_namespace_enabled,
hierarchical_namespace_.Enabled(path.container));
if (!hierarchical_namespace_enabled) {
// Without hierarchical namespace enabled Azure blob storage has no directories.
// Therefore we can't, and don't need to create one. Simply creating a blob with `/`
// in the name implies directories.
return Status::OK();
}

auto directory_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetDirectoryClient(path.path_to_file);

Contributor @felipecrv, Nov 15, 2023

I can see why path_to_file is actually the name of the directory in this context, but maybe a different name for this struct field would make things less confusing? This segment of filesystem paths is usually called "basename" [1].

[1] https://en.wikipedia.org/wiki/Basename

Contributor

I spent quite some time trying to figure out what `if (path.path_to_file.empty()) {` meant here.

`path.basename.empty()` would be clearer IMO.

cc @Tom-Newton

Member Author

I think that "basename" isn't suitable for this case.
I think that "d" is the basename of "a/b/c/d" but path_to_file is "b/c/d". ("a" is container.)

I think that path is suitable for "b/c/d" but AzurePath::path is strange... How about renaming AzurePath to AzureLocation and using container for a and path for b/c/d?

FYI:

Contributor

Oh. I misinterpreted the meaning of path_to_file. I think path would be OK. And Azure{Path->Location} is also a good rename.

Contributor

👍 for @kou's suggested rename.

Member Author

I've opened a new issue for this: #38758
I'll do it after we merge this.
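
To make the naming discussion above concrete, here is a minimal sketch of how "a/b/c/d" splits into a container and a path, and why "basename" names a different thing. `AzureLocation`, `SplitLocation`, and `Basename` are hypothetical names used only for illustration; they are not the types or helpers in this PR, and the real parsing in Arrow may validate its input differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the proposed rename; not Arrow's actual type.
struct AzureLocation {
  std::string container;  // first segment, e.g. "a" in "a/b/c/d"
  std::string path;       // everything after the container, e.g. "b/c/d"
};

// Splits "container/rest/of/path" at the first '/'.
// An input without '/' is treated as a container-only location.
AzureLocation SplitLocation(const std::string& full_path) {
  AzureLocation location;
  const auto pos = full_path.find('/');
  if (pos == std::string::npos) {
    location.container = full_path;
  } else {
    location.container = full_path.substr(0, pos);
    location.path = full_path.substr(pos + 1);
  }
  // The container is a single segment, so it never contains a separator.
  assert(location.container.find('/') == std::string::npos);
  return location;
}

// Returns the last segment of a path, e.g. "d" for "b/c/d".
std::string Basename(const std::string& path) {
  const auto pos = path.rfind('/');
  return pos == std::string::npos ? path : path.substr(pos + 1);
}
```

With this split, `SplitLocation("a/b/c/d")` yields container "a" and path "b/c/d", while `Basename("b/c/d")` is only "d". An empty `path` then corresponds to the container-only branch in `CreateDir()`.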

try {
auto response = directory_client.Create();
if (response.Value.Created) {
return Status::OK();
} else {
return StatusFromErrorResponse(
directory_client.GetUrl(), response.RawResponse.get(),
"Failed to create a directory: " + path.path_to_file);
}
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"Failed to create a directory: " + path.path_to_file + ": " +
directory_client.GetUrl(),
exception);
}
}

Status CreateDirRecursive(const AzurePath& path) {
if (path.container.empty()) {
return Status::Invalid("Cannot create an empty container");
}

auto container_client = blob_service_client_->GetBlobContainerClient(path.container);
try {
container_client.CreateIfNotExists();
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"Failed to create a container: " + path.container + " (" +
container_client.GetUrl() + ")",
exception);
}

ARROW_ASSIGN_OR_RAISE(auto hierarchical_namespace_enabled,
hierarchical_namespace_.Enabled(path.container));
if (!hierarchical_namespace_enabled) {
// We can't create a directory without hierarchical namespace
// support. There is only "virtual directory" without
// hierarchical namespace support. And a "virtual directory" is
// (virtually) created a blob with ".../.../blob" blob name
// automatically.
return Status::OK();
}

auto directory_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetDirectoryClient(path.path_to_file);
try {
directory_client.CreateIfNotExists();
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"Failed to create a directory: " + path.path_to_file + " (" +
directory_client.GetUrl() + ")",
exception);
}

return Status::OK();
}
};

const AzureOptions& AzureFileSystem::options() const { return impl_->options(); }
@@ -636,7 +742,12 @@ Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select)
}

Status AzureFileSystem::CreateDir(const std::string& path, bool recursive) {
- return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+ ARROW_ASSIGN_OR_RAISE(auto p, AzurePath::FromString(path));
+ if (recursive) {
+ return impl_->CreateDirRecursive(p);
+ } else {
+ return impl_->CreateDir(p);
+ }
}

Status AzureFileSystem::DeleteDir(const std::string& path) {
113 changes: 112 additions & 1 deletion cpp/src/arrow/filesystem/azurefs_test.cc
@@ -49,6 +49,7 @@
#include <azure/storage/common/storage_credential.hpp>
#include <azure/storage/files/datalake.hpp>

#include "arrow/filesystem/path_util.h"
#include "arrow/filesystem/test_util.h"
#include "arrow/result.h"
#include "arrow/testing/gtest_util.h"
@@ -225,6 +226,10 @@ class AzureFileSystemTest : public ::testing::Test {
return s;
}

std::string RandomContainerName() { return RandomChars(32); }

std::string RandomDirectoryName() { return RandomChars(32); }

void UploadLines(const std::vector<std::string>& lines, const char* path_to_file,
int total_size) {
// TODO(GH-38333): Switch to using Azure filesystem to write once its implemented.
@@ -267,6 +272,22 @@ class AzureFlatNamespaceFileSystemTest : public AzureFileSystemTest {
}
};

// How to enable this test:
//
// You need an Azure account. You should be able to create a free
// account at https://azure.microsoft.com/en-gb/free/ . You should be
// able to create a storage account through the portal Web UI.
//
// See also the official document how to create a storage account:
// https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account
//
// A few suggestions on configuration:
//
// * Use Standard general-purpose v2 not premium
// * Use LRS redundancy
// * Obviously you need to enable hierarchical namespace.
// * Set the default access tier to hot
// * SFTP, NFS and file shares are not required.
class AzureHierarchicalNamespaceFileSystemTest : public AzureFileSystemTest {
Result<AzureOptions> MakeOptions() override {
AzureOptions options;
@@ -396,6 +417,96 @@ TEST_F(AzureHierarchicalNamespaceFileSystemTest, GetFileInfoObject) {
RunGetFileInfoObjectTest();
}

TEST_F(AzuriteFileSystemTest, CreateDirFailureNoContainer) {
ASSERT_RAISES(Invalid, fs_->CreateDir("", false));
}

TEST_F(AzuriteFileSystemTest, CreateDirSuccessContainerOnly) {
auto container_name = RandomContainerName();
ASSERT_OK(fs_->CreateDir(container_name, false));
arrow::fs::AssertFileInfo(fs_.get(), container_name, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirSuccessContainerAndDirectory) {
const auto path = PreexistingContainerPath() + RandomDirectoryName();
ASSERT_OK(fs_->CreateDir(path, false));
// There is only virtual directory without hierarchical namespace
// support. So the CreateDir() does nothing.
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound);
}

TEST_F(AzureHierarchicalNamespaceFileSystemTest, CreateDirSuccessContainerAndDirectory) {
const auto path = PreexistingContainerPath() + RandomDirectoryName();
ASSERT_OK(fs_->CreateDir(path, false));
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirFailureDirectoryWithMissingContainer) {
const auto path = std::string("not-a-container/new-directory");
ASSERT_RAISES(IOError, fs_->CreateDir(path, false));
}

TEST_F(AzuriteFileSystemTest, CreateDirRecursiveFailureNoContainer) {
ASSERT_RAISES(Invalid, fs_->CreateDir("", true));
}

TEST_F(AzureHierarchicalNamespaceFileSystemTest, CreateDirRecursiveSuccessContainerOnly) {
auto container_name = RandomContainerName();
ASSERT_OK(fs_->CreateDir(container_name, true));
arrow::fs::AssertFileInfo(fs_.get(), container_name, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirRecursiveSuccessContainerOnly) {
auto container_name = RandomContainerName();
ASSERT_OK(fs_->CreateDir(container_name, true));
arrow::fs::AssertFileInfo(fs_.get(), container_name, FileType::Directory);
}

TEST_F(AzureHierarchicalNamespaceFileSystemTest, CreateDirRecursiveSuccessDirectoryOnly) {
const auto parent = PreexistingContainerPath() + RandomDirectoryName();
const auto path = internal::ConcatAbstractPath(parent, "new-sub");
ASSERT_OK(fs_->CreateDir(path, true));
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::Directory);
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirRecursiveSuccessDirectoryOnly) {
const auto parent = PreexistingContainerPath() + RandomDirectoryName();
const auto path = internal::ConcatAbstractPath(parent, "new-sub");
ASSERT_OK(fs_->CreateDir(path, true));
// There is only virtual directory without hierarchical namespace
// support. So the CreateDir() does nothing.
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound);
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::NotFound);

Contributor

If we can disambiguate "file exists?" queries in the filesystem API, we should probably always reply true when the caller is asking if a directory exists. If creating a directory is a no-op that succeeds, the post-condition of CreateDir -- the directory now exists -- should be true.

There might be bad consequences of this, so this is more of an idea than a suggestion.

Contributor @Tom-Newton, Nov 17, 2023

It's an interesting point. I don't feel strongly, but personally I think the current behaviour is the best option.

GetFileInfo will return that a directory is present if at least one "file" has been created in that "directory". I think this behaviour is consistent with the GCS filesystem.

Member Author

I agree with the opinion of Tom-Newton but let's discuss this further on #38772.
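
The behaviour described in this thread can be sketched with a small in-memory model of a flat-namespace container, where a "directory" is reported only when some blob name has it as a prefix. This is an illustration under assumptions, not the Arrow implementation: `EntryType` and `Lookup` are hypothetical names, and a `std::set` of blob names stands in for the container.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class EntryType { NotFound, File, Directory };

// Hypothetical stand-in for a lookup against a flat-namespace container.
// `blobs` holds full blob names such as "dir/sub/file.txt".
EntryType Lookup(const std::set<std::string>& blobs, const std::string& path) {
  assert(!path.empty());
  // An exact match is a blob, i.e. a file.
  if (blobs.count(path) > 0) {
    return EntryType::File;
  }
  // A virtual directory exists only if some blob name starts with "path/".
  const std::string prefix = path + "/";
  const auto it = blobs.lower_bound(prefix);
  if (it != blobs.end() && it->compare(0, prefix.size(), prefix) == 0) {
    return EntryType::Directory;
  }
  return EntryType::NotFound;
}
```

In this model, creating a directory is a no-op because there is nothing to store, so a lookup right after it still reports `NotFound`. Once "dir/sub/file.txt" is inserted, both "dir" and "dir/sub" are reported as directories.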

}

TEST_F(AzureHierarchicalNamespaceFileSystemTest,
CreateDirRecursiveSuccessContainerAndDirectory) {
auto container_name = RandomContainerName();
const auto parent = internal::ConcatAbstractPath(container_name, RandomDirectoryName());
const auto path = internal::ConcatAbstractPath(parent, "new-sub");
ASSERT_OK(fs_->CreateDir(path, true));
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::Directory);
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::Directory);
arrow::fs::AssertFileInfo(fs_.get(), container_name, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirRecursiveSuccessContainerAndDirectory) {
auto container_name = RandomContainerName();
const auto parent = internal::ConcatAbstractPath(container_name, RandomDirectoryName());
const auto path = internal::ConcatAbstractPath(parent, "new-sub");
ASSERT_OK(fs_->CreateDir(path, true));
// There is only virtual directory without hierarchical namespace
// support. So the CreateDir() does nothing.
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound);
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::NotFound);
arrow::fs::AssertFileInfo(fs_.get(), container_name, FileType::Directory);
}

TEST_F(AzuriteFileSystemTest, CreateDirUri) {
ASSERT_RAISES(Invalid, fs_->CreateDir("abfs://" + RandomContainerName(), true));
}

TEST_F(AzuriteFileSystemTest, OpenInputStreamString) {
std::shared_ptr<io::InputStream> stream;
ASSERT_OK_AND_ASSIGN(stream, fs_->OpenInputStream(PreexistingObjectPath()));
@@ -455,7 +566,7 @@ TEST_F(AzuriteFileSystemTest, OpenInputStreamInfoInvalid) {
}

TEST_F(AzuriteFileSystemTest, OpenInputStreamUri) {
- ASSERT_RAISES(Invalid, fs_->OpenInputStream("abfss://" + PreexistingObjectPath()));
+ ASSERT_RAISES(Invalid, fs_->OpenInputStream("abfs://" + PreexistingObjectPath()));
}

TEST_F(AzuriteFileSystemTest, OpenInputStreamTrailingSlash) {