
GH-38335: [C++] Implement GetFileInfo for a single file in Azure filesystem #38505

Merged: 45 commits, Nov 9, 2023
abf6412 Paste in s3 GetFileInfo tests (Tom-Newton, Oct 28, 2023)
80b12ee Tests build correctly (Tom-Newton, Oct 28, 2023)
16b6b9a Add test for root of storage account (Tom-Newton, Oct 28, 2023)
fa869ea Partially complete GetFileInfo (Tom-Newton, Oct 28, 2023)
b3f7157 Avoid relying on filesystem writes in tests (Tom-Newton, Oct 29, 2023)
7ccafbf Implement get info for objects, avoiding HNS specific APIs (Tom-Newton, Oct 29, 2023)
fe87c17 Create a datalake service client and start logic for detecting HNS (Tom-Newton, Oct 29, 2023)
5067a6e Mostly correct directory detection (Tom-Newton, Oct 29, 2023)
1416e57 Working directory detection for flat namespace accounts (Tom-Newton, Oct 29, 2023)
3459836 Tidy up (Tom-Newton, Oct 31, 2023)
d51b4e6 Add first test against real blob storage for HNS tests (Tom-Newton, Oct 31, 2023)
4adc5e7 Main directory detection test for HNS accounts (Tom-Newton, Nov 1, 2023)
8de8ddf Fix handling of file with trailing slash on HNS account (Tom-Newton, Nov 1, 2023)
655ac3b Test empty directory on HNS account (Tom-Newton, Nov 1, 2023)
49a7c69 Working handling of HNS (Tom-Newton, Nov 2, 2023)
cc6f156 Naive move hierarchical namespace detection (Tom-Newton, Nov 2, 2023)
b3ea915 Avoid copying the datalake client (Tom-Newton, Nov 2, 2023)
379090a Tests for HierachicalNamespaceDetecter (Tom-Newton, Nov 2, 2023)
2f59183 Working HNS detection vs flat with soft delete (Tom-Newton, Nov 2, 2023)
5906a6a HierachicalNamespaceDetecter.Enable returns Status, supports azurite … (Tom-Newton, Nov 2, 2023)
a030a99 Addresss TODO(GH-38335)s for using GetFileInfo in tests (Tom-Newton, Nov 2, 2023)
147b9a0 Avoid holding reference to service client in HierachicalNamespaceDete… (Tom-Newton, Nov 3, 2023)
d94f961 Tidy (Tom-Newton, Nov 5, 2023)
b9e8991 Move some logic to azurefs_internal (Tom-Newton, Nov 5, 2023)
4c89729 Avoid unnecessary creation of filesystem clients for hierachical name… (Tom-Newton, Nov 5, 2023)
218312b Rename test fixtures (Tom-Newton, Nov 5, 2023)
22d64d3 Skip tests against real Azure blob storage if credentials are not pro… (Tom-Newton, Nov 5, 2023)
80d0a55 Spelling (Tom-Newton, Nov 5, 2023)
d0f3311 Tidy (Tom-Newton, Nov 5, 2023)
8912dbf Update error messages (Tom-Newton, Nov 5, 2023)
3b920f1 Add missing include (Tom-Newton, Nov 5, 2023)
6ab232a Lint (Tom-Newton, Nov 5, 2023)
89320c0 Adjust some more error messges (Tom-Newton, Nov 5, 2023)
e868125 Reduce duplication in tests (Tom-Newton, Nov 5, 2023)
2cd51c1 Lint (Tom-Newton, Nov 5, 2023)
6ece0c9 Fix clang linker error (Tom-Newton, Nov 5, 2023)
c562c7e PR comments 1 (Tom-Newton, Nov 6, 2023)
2486c86 PR comments 2 (Tom-Newton, Nov 6, 2023)
fb03dd8 PR comments: shared_ptr -> unique_ptr (Tom-Newton, Nov 6, 2023)
c8370b3 Adjust handling of missing credentials (Tom-Newton, Nov 6, 2023)
25a0c76 Fix namespacing issue after rebase (Tom-Newton, Nov 7, 2023)
5540c70 Auto-formatting (Tom-Newton, Nov 7, 2023)
3fd35c6 PR comments (Tom-Newton, Nov 8, 2023)
44f6aa5 Use unique_ptr for datalake_service_client (Tom-Newton, Nov 8, 2023)
0659a39 Use auto (kou, Nov 9, 2023)
4 changes: 2 additions & 2 deletions cpp/src/arrow/CMakeLists.txt
@@ -502,8 +502,8 @@ if(ARROW_FILESYSTEM)
filesystem/util_internal.cc)

if(ARROW_AZURE)
list(APPEND ARROW_SRCS filesystem/azurefs.cc)
set_source_files_properties(filesystem/azurefs.cc
list(APPEND ARROW_SRCS filesystem/azurefs.cc filesystem/azurefs_internal.cc)
set_source_files_properties(filesystem/azurefs.cc filesystem/azurefs_internal.cc
PROPERTIES SKIP_PRECOMPILE_HEADERS ON
SKIP_UNITY_BUILD_INCLUSION ON)
endif()
158 changes: 134 additions & 24 deletions cpp/src/arrow/filesystem/azurefs.cc
@@ -16,8 +16,10 @@
// under the License.

#include "arrow/filesystem/azurefs.h"
#include "arrow/filesystem/azurefs_internal.h"

#include <azure/storage/blobs.hpp>
#include <azure/storage/files/datalake.hpp>

#include "arrow/buffer.h"
#include "arrow/filesystem/path_util.h"
@@ -59,6 +61,7 @@ Status AzureOptions::ConfigureAccountKeyCredentials(const std::string& account_n
credentials_kind = AzureCredentialsKind::StorageCredentials;
return Status::OK();
}

namespace {

// An AzureFileSystem represents a single Azure storage account. AzurePath describes a
@@ -79,18 +82,17 @@ struct AzurePath {
"Expected an Azure object path of the form 'container/path...', got a URI: '",
s, "'");
}
const auto src = internal::RemoveTrailingSlash(s);
Comment from the contributor author (Tom-Newton): This was preventing GetFileInfo from working on directories. The other filesystems did not have this.

auto first_sep = src.find_first_of(internal::kSep);
auto first_sep = s.find_first_of(internal::kSep);
if (first_sep == 0) {
return Status::Invalid("Path cannot start with a separator ('", s, "')");
}
if (first_sep == std::string::npos) {
return AzurePath{std::string(src), std::string(src), "", {}};
return AzurePath{std::string(s), std::string(s), "", {}};
}
AzurePath path;
path.full_path = std::string(src);
path.container = std::string(src.substr(0, first_sep));
path.path_to_file = std::string(src.substr(first_sep + 1));
path.full_path = std::string(s);
path.container = std::string(s.substr(0, first_sep));
path.path_to_file = std::string(s.substr(first_sep + 1));
path.path_to_file_parts = internal::SplitAbstractPath(path.path_to_file);
RETURN_NOT_OK(Validate(path));
return path;
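The split performed by AzurePath::FromString above can be sketched independently of the Azure SDK. This is a simplified, hypothetical stand-in (it mirrors only the first-separator split shown in the diff; the real code also rejects URIs, leading separators, and validates the parts):

```cpp
#include <cassert>
#include <string>

// Simplified sketch of AzurePath::FromString's split logic: everything
// before the first '/' is the container, the remainder is the blob path.
struct AzurePathSketch {
  std::string full_path;
  std::string container;
  std::string path_to_file;
};

AzurePathSketch ParseAzurePath(const std::string& s) {
  AzurePathSketch path;
  path.full_path = s;
  auto first_sep = s.find_first_of('/');
  if (first_sep == std::string::npos) {
    // A bare name refers to a container with no blob path.
    path.container = s;
    return path;
  }
  path.container = s.substr(0, first_sep);
  path.path_to_file = s.substr(first_sep + 1);
  return path;
}
```

Note that, unlike the earlier revision, the trailing slash is deliberately preserved in `path_to_file`; the trailing-slash handling happens later, inside GetFileInfo.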
@@ -146,11 +148,6 @@ Status ValidateFilePath(const AzurePath& path) {
return Status::OK();
}

Status ErrorToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception) {
return Status::IOError(prefix, " Azure Error: ", exception.what());
}

template <typename ArrowType>
std::string FormatValue(typename TypeTraits<ArrowType>::CType value) {
struct StringAppender {
@@ -316,11 +313,13 @@ class ObjectInputFile final : public io::RandomAccessFile {
return Status::OK();
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
// Could be either container or blob not found.
return PathNotFound(path_);
}
return ErrorToStatus(
"When fetching properties for '" + blob_client_->GetUrl() + "': ", exception);
return internal::ExceptionToStatus(
"GetProperties failed for '" + blob_client_->GetUrl() +
"' with an unexpected Azure error. Can not initialise an ObjectInputFile "
"without knowing the file size.",
exception);
}
}

@@ -397,10 +396,12 @@ class ObjectInputFile final : public io::RandomAccessFile {
->DownloadTo(reinterpret_cast<uint8_t*>(out), nbytes, download_options)
.Value.ContentRange.Length.Value();
} catch (const Azure::Storage::StorageException& exception) {
return ErrorToStatus("When reading from '" + blob_client_->GetUrl() +
"' at position " + std::to_string(position) + " for " +
std::to_string(nbytes) + " bytes: ",
exception);
return internal::ExceptionToStatus("DownloadTo from '" + blob_client_->GetUrl() +
"' at position " + std::to_string(position) +
" for " + std::to_string(nbytes) +
" bytes failed with an Azure error. ReadAt "
"failed to read the required byte range.",
exception);
}
}
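The ReadAt contract exercised by the DownloadTo call above (read up to `nbytes` from `position`, returning how many bytes were actually copied) can be sketched over an in-memory buffer. This is an illustrative stand-in only; the real code issues a ranged DownloadTo against the blob instead:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Sketch of the ReadAt semantics: reads past the end are clamped, and a
// position at or beyond the object size reads zero bytes. `blob` stands in
// for the remote object's contents.
int64_t ReadAtSketch(const std::string& blob, int64_t position, int64_t nbytes,
                     char* out) {
  if (position >= static_cast<int64_t>(blob.size())) return 0;
  const int64_t available = static_cast<int64_t>(blob.size()) - position;
  const int64_t n = std::min(nbytes, available);
  std::memcpy(out, blob.data() + position, static_cast<size_t>(n));
  return n;
}
```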

@@ -444,7 +445,6 @@ class ObjectInputFile final : public io::RandomAccessFile {
int64_t content_length_ = kNoSize;
std::shared_ptr<const KeyValueMetadata> metadata_;
};

} // namespace

// -----------------------------------------------------------------------
@@ -453,27 +453,136 @@ class ObjectInputFile final : public io::RandomAccessFile {
class AzureFileSystem::Impl {
public:
io::IOContext io_context_;
std::shared_ptr<Azure::Storage::Blobs::BlobServiceClient> service_client_;
std::unique_ptr<Azure::Storage::Files::DataLake::DataLakeServiceClient>
datalake_service_client_;
std::unique_ptr<Azure::Storage::Blobs::BlobServiceClient> blob_service_client_;
AzureOptions options_;
internal::HierarchicalNamespaceDetector hierarchical_namespace_;
Review thread:

Member (kou): It seems that HierarchicalNamespaceDetector is simple enough to move into Impl (HierarchicalNamespaceDetector::Enabled() is the only important method in the class). How about moving HierarchicalNamespaceDetector::Enabled() to Impl::IsHierarchicalNamespaceEnabled() and removing HierarchicalNamespaceDetector? If we do, we can make datalake_service_client_ a std::unique_ptr.

Contributor Author (Tom-Newton, Nov 8, 2023): I made it separate because I wanted to keep the cached value enabled_ private from the rest of Impl. I was a bit concerned that people might try to directly access the cached state without realising that everything should use the Enabled() function. Additionally, making it a separate class made it easier to test.

I think one possibility is to use a non-smart pointer in HierarchicalNamespaceDetector, because HierarchicalNamespaceDetector will always be destructed at the same time as Impl (https://stackoverflow.com/questions/7657718/when-to-use-shared-ptr-and-when-to-use-raw-pointers). That should allow us to use a unique_ptr for datalake_service_client_. I think this would be my preferred solution. What do you think?

Contributor Author (Tom-Newton, Nov 8, 2023): I decided to just make my preferred change. If you think it's a bad idea I'm happy to change it again to something else.

Member (kou): OK. Let's use that approach.
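The ownership pattern settled on in this thread (Impl owns the client through a std::unique_ptr while the detector keeps a non-owning raw pointer, which is safe because both members share Impl's lifetime) can be sketched with illustrative stand-in types:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative stand-in for the DataLake service client.
struct FakeServiceClient {
  std::string url;
};

// The detector observes the client through a raw pointer; it never owns it.
class DetectorSketch {
 public:
  void Init(FakeServiceClient* client) { client_ = client; }
  std::string ClientUrl() const { return client_->url; }

 private:
  FakeServiceClient* client_ = nullptr;  // non-owning
};

// Impl owns the client. The detector is destroyed together with Impl, so
// the raw pointer cannot dangle during the detector's lifetime.
struct ImplSketch {
  std::unique_ptr<FakeServiceClient> client = std::make_unique<FakeServiceClient>(
      FakeServiceClient{"https://account.dfs.example"});
  DetectorSketch detector;
  ImplSketch() { detector.Init(client.get()); }
};
```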


explicit Impl(AzureOptions options, io::IOContext io_context)
: io_context_(io_context), options_(std::move(options)) {}

Status Init() {
service_client_ = std::make_shared<Azure::Storage::Blobs::BlobServiceClient>(
blob_service_client_ = std::make_unique<Azure::Storage::Blobs::BlobServiceClient>(
options_.account_blob_url, options_.storage_credentials_provider);
datalake_service_client_ =
std::make_unique<Azure::Storage::Files::DataLake::DataLakeServiceClient>(
options_.account_dfs_url, options_.storage_credentials_provider);
RETURN_NOT_OK(hierarchical_namespace_.Init(datalake_service_client_.get()));
return Status::OK();
}

const AzureOptions& options() const { return options_; }

public:
Result<FileInfo> GetFileInfo(const AzurePath& path) {
FileInfo info;
info.set_path(path.full_path);

if (path.container.empty()) {
DCHECK(path.path_to_file.empty()); // The path is invalid if the container is empty
// but not path_to_file.
// path must refer to the root of the Azure storage account. This is a directory,
// and there isn't any extra metadata to fetch.
info.set_type(FileType::Directory);
return info;
}
if (path.path_to_file.empty()) {
// path refers to a container. This is a directory if it exists.
auto container_client =
blob_service_client_->GetBlobContainerClient(path.container);
try {
auto properties = container_client.GetProperties();
info.set_type(FileType::Directory);
info.set_mtime(
std::chrono::system_clock::time_point(properties.Value.LastModified));
return info;
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
info.set_type(FileType::NotFound);
return info;
}
return internal::ExceptionToStatus(
"GetProperties for '" + container_client.GetUrl() +
"' failed with an unexpected Azure error. GetFileInfo is unable to "
"determine whether the container exists.",
exception);
}
}
auto file_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetFileClient(path.path_to_file);
try {
auto properties = file_client.GetProperties();
if (properties.Value.IsDirectory) {
info.set_type(FileType::Directory);
} else if (internal::HasTrailingSlash(path.path_to_file)) {
// For a path with a trailing slash a hierarchical namespace may return a blob
// with that trailing slash removed. For consistency with flat namespace and
// other filesystems we chose to return NotFound.
info.set_type(FileType::NotFound);
return info;
} else {
info.set_type(FileType::File);
info.set_size(properties.Value.FileSize);
}
info.set_mtime(
std::chrono::system_clock::time_point(properties.Value.LastModified));
return info;
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
ARROW_ASSIGN_OR_RAISE(auto hierarchical_namespace_enabled,
hierarchical_namespace_.Enabled(path.container));
if (hierarchical_namespace_enabled) {
// If the hierarchical namespace is enabled, then the storage account will have
// explicit directories. Neither a file nor a directory was found.
info.set_type(FileType::NotFound);
return info;
}
// On flat namespace accounts there are no real directories. Directories are only
// implied by using `/` in the blob name.
Azure::Storage::Blobs::ListBlobsOptions list_blob_options;

// If listing the prefix `path.path_to_file` with trailing slash returns at least
// one result then `path` refers to an implied directory.
auto prefix = internal::EnsureTrailingSlash(path.path_to_file);
list_blob_options.Prefix = prefix;
// We only need to know if there is at least one result, so minimise page size
// for efficiency.
list_blob_options.PageSizeHint = 1;

try {
auto paged_list_result =
blob_service_client_->GetBlobContainerClient(path.container)
.ListBlobs(list_blob_options);
if (paged_list_result.Blobs.size() > 0) {
info.set_type(FileType::Directory);
} else {
info.set_type(FileType::NotFound);
}
return info;
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"ListBlobs for '" + prefix +
"' failed with an unexpected Azure error. GetFileInfo is unable to "
"determine whether the path should be considered an implied directory.",
exception);
}
}
return internal::ExceptionToStatus(
"GetProperties for '" + file_client.GetUrl() +
"' failed with an unexpected "
"Azure error. GetFileInfo is unable to determine whether the path exists.",
exception);
}
}
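The flat-namespace fallback above (no blob matched exactly, so the path is an implied directory iff at least one blob name starts with the path plus a trailing slash) can be sketched against an in-memory blob listing. This is a simplified stand-in: `blobs` replaces the ListBlobs call, and the exact-match/File branch compresses the GetProperties path shown in the diff:

```cpp
#include <cassert>
#include <set>
#include <string>

enum class FileTypeSketch { NotFound, File, Directory };

// Sketch of GetFileInfo's flat-namespace classification. The real code only
// needs to know whether any blob carries the prefix, hence PageSizeHint = 1.
FileTypeSketch ClassifyFlatNamespace(const std::set<std::string>& blobs,
                                     const std::string& path_to_file) {
  if (blobs.count(path_to_file) > 0) return FileTypeSketch::File;
  const std::string prefix = path_to_file + "/";
  // lower_bound finds the first blob name >= prefix; if it actually starts
  // with the prefix, the directory is implied by that blob's name.
  auto it = blobs.lower_bound(prefix);
  if (it != blobs.end() && it->compare(0, prefix.size(), prefix) == 0) {
    return FileTypeSketch::Directory;
  }
  return FileTypeSketch::NotFound;
}
```

On hierarchical namespace accounts this fallback never runs: directories exist explicitly, so a NotFound from GetProperties is final.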

Result<std::shared_ptr<ObjectInputFile>> OpenInputFile(const std::string& s,
AzureFileSystem* fs) {
ARROW_RETURN_NOT_OK(internal::AssertNoTrailingSlash(s));
ARROW_ASSIGN_OR_RAISE(auto path, AzurePath::FromString(s));
RETURN_NOT_OK(ValidateFilePath(path));
auto blob_client = std::make_shared<Azure::Storage::Blobs::BlobClient>(
service_client_->GetBlobContainerClient(path.container)
blob_service_client_->GetBlobContainerClient(path.container)
.GetBlobClient(path.path_to_file));

auto ptr =
@@ -494,7 +603,7 @@ class AzureFileSystem::Impl {
ARROW_ASSIGN_OR_RAISE(auto path, AzurePath::FromString(info.path()));
RETURN_NOT_OK(ValidateFilePath(path));
auto blob_client = std::make_shared<Azure::Storage::Blobs::BlobClient>(
service_client_->GetBlobContainerClient(path.container)
blob_service_client_->GetBlobContainerClient(path.container)
.GetBlobClient(path.path_to_file));

auto ptr = std::make_shared<ObjectInputFile>(blob_client, fs->io_context(),
@@ -518,7 +627,8 @@ bool AzureFileSystem::Equals(const FileSystem& other) const {
}

Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
return Status::NotImplemented("The Azure FileSystem is not fully implemented");
ARROW_ASSIGN_OR_RAISE(auto p, AzurePath::FromString(path));
return impl_->GetFileInfo(p);
}

Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
88 changes: 88 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_internal.cc
@@ -0,0 +1,88 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/filesystem/azurefs_internal.h"

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception) {
return Status::IOError(prefix, " Azure Error: ", exception.what());
}

Status HierarchicalNamespaceDetector::Init(
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client) {
datalake_service_client_ = datalake_service_client;
return Status::OK();
}

Result<bool> HierarchicalNamespaceDetector::Enabled(const std::string& container_name) {
// Hierarchical namespace can't easily be changed after the storage account is created
// and its common across all containers in the storage account. Do nothing until we've
// checked for a cached result.
if (enabled_.has_value()) {
return enabled_.value();
}

// This approach is inspired by hadoop-azure
// https://github.com/apache/hadoop/blob/7c6af6a5f626d18d68b656d085cc23e4c1f7a1ef/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356.
// Unfortunately `blob_service_client->GetAccountInfo()` requires significantly
// elevated permissions.
// https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization
auto filesystem_client = datalake_service_client_->GetFileSystemClient(container_name);
auto directory_client = filesystem_client.GetDirectoryClient("/");
try {
directory_client.GetAccessControlList();
enabled_ = true;
} catch (const Azure::Storage::StorageException& exception) {
// GetAccessControlList will fail on storage accounts without hierarchical
// namespace enabled.

if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::BadRequest ||
exception.StatusCode == Azure::Core::Http::HttpStatusCode::Conflict) {
// Flat namespace storage accounts with soft delete enabled return
// Conflict - This endpoint does not support BlobStorageEvents or SoftDelete
// otherwise it returns: BadRequest - This operation is only supported on a
// hierarchical namespace account.
enabled_ = false;
} else if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
// Azurite returns NotFound.
try {
filesystem_client.GetProperties();
enabled_ = false;
} catch (const Azure::Storage::StorageException& exception) {
return ExceptionToStatus("Failed to confirm '" + filesystem_client.GetUrl() +
"' is an accessible container. Therefore the "
"hierarchical namespace check was invalid.",
exception);
}
} else {
return ExceptionToStatus(
"GetAccessControlList for '" + directory_client.GetUrl() +
"' failed with an unexpected Azure error, while checking "
"whether the storage account has hierarchical namespace enabled.",
exception);
}
}
return enabled_.value();
}
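The status-code mapping and caching in Enabled() above can be sketched without the Azure SDK. In this hypothetical stand-in the GetAccessControlList round trip is modelled as a status code passed in by the caller, and the NotFound/Azurite fallback and error propagation are elided for brevity:

```cpp
#include <cassert>
#include <optional>

enum class HttpCode { Ok, BadRequest, Conflict, NotFound };

// Sketch of HierarchicalNamespaceDetector::Enabled(): the result is cached
// in a std::optional so only the first call pays for a network round trip.
class HnsDetectorSketch {
 public:
  bool Enabled(HttpCode acl_result) {
    if (enabled_.has_value()) return *enabled_;
    if (acl_result == HttpCode::Ok) {
      // GetAccessControlList only succeeds on hierarchical namespace accounts.
      enabled_ = true;
    } else {
      // BadRequest: flat namespace account. Conflict: flat namespace with
      // soft delete enabled. (The real code treats other codes as errors.)
      enabled_ = false;
    }
    return *enabled_;
  }

 private:
  std::optional<bool> enabled_;
};
```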

} // namespace arrow::fs::internal
42 changes: 42 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_internal.h
@@ -0,0 +1,42 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <optional>

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception);

class HierarchicalNamespaceDetector {
public:
Status Init(
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client);
Result<bool> Enabled(const std::string& container_name);

private:
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client_;
std::optional<bool> enabled_;
};

} // namespace arrow::fs::internal