GH-38335: [C++] Implement GetFileInfo for a single file in Azure filesystem #38505

Merged
merged 45 commits into from
Nov 9, 2023

Conversation

Tom-Newton
Contributor

@Tom-Newton Tom-Newton commented Oct 29, 2023

Rationale for this change

GetFileInfo is an important part of an Arrow filesystem implementation.

What changes are included in this PR?

  • Start azurefs_internal similar to GCS and S3 filesystems.
  • Implement HierarchicalNamespaceDetector.
    • This does not use the obvious and simple implementation. It uses a more complicated option inspired by hadoop-azure that avoids requiring the significantly elevated permissions needed for blob_service_client->GetAccountInfo().
    • This can't be detected at initialisation time of the filesystem because it requires a container_name. It's packed into its own class so that the result can be cached.
  • Implement GetFileInfo for single paths.
    • Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is noticeable just from running the GetFileInfoObjectWithNestedStructure test against real flat and hierarchical accounts: it takes about 3 seconds with a hierarchical namespace or 5 seconds with a flat namespace.
  • Update tests marked with TODO(GH-38335) to use this implementation of GetFileInfo, replacing the temporary direct Azure SDK usage.
  • Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided then the real blob storage tests will be skipped.
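
To illustrate the caching idea behind the detector, here is a minimal sketch. It is not the code from this PR: `CachedNamespaceDetector` and `NamespaceProbe` are hypothetical names, and the probe is an injected callback standing in for whatever request the real implementation sends through the Azure SDK.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the request that asks the storage service whether
// a container belongs to a hierarchical namespace account. A callback is used
// here (an assumption) so the sketch runs without network access.
using NamespaceProbe = std::function<bool(const std::string& container_name)>;

class CachedNamespaceDetector {
 public:
  explicit CachedNamespaceDetector(NamespaceProbe probe) : probe_(std::move(probe)) {
    assert(probe_ != nullptr);
  }

  // Runs the probe on the first call only and reuses the cached answer on
  // every later call, whichever container name is passed.
  bool Enabled(const std::string& container_name) {
    if (!enabled_.has_value()) {
      enabled_ = probe_(container_name);
    }
    return *enabled_;
  }

 private:
  NamespaceProbe probe_;
  std::optional<bool> enabled_;  // empty until the first probe has run
};
```

Keeping `enabled_` private means callers can only reach the cached value through `Enabled()`, which is the property the separate class is meant to protect.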

Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite.

There are also some tests that are designed to run against a real blob storage account. This is because Azurite cannot emulate a hierarchical namespace account (Azure/Azurite#553). Additionally, some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests are skipped automatically unless environment variables are provided with details for connecting to the relevant real storage accounts.
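
The skip-unless-configured behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the fixture from this PR: `RealAccountConfig`, `LoadRealAccountConfig`, and any environment variable names passed to it are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Connection details for a real storage account. The field names are
// hypothetical and chosen only for this sketch.
struct RealAccountConfig {
  std::string account_name;
  std::string account_key;
};

// Returns the configuration only when both environment variables are set and
// non-empty. A test fixture would skip itself when this returns std::nullopt.
std::optional<RealAccountConfig> LoadRealAccountConfig(const char* name_var,
                                                       const char* key_var) {
  assert(name_var != nullptr && key_var != nullptr);
  const char* name = std::getenv(name_var);
  const char* key = std::getenv(key_var);
  if (name == nullptr || key == nullptr || *name == '\0' || *key == '\0') {
    return std::nullopt;
  }
  return RealAccountConfig{name, key};
}
```

In a GoogleTest fixture, `SetUp()` could call such a helper and invoke `GTEST_SKIP()` when no configuration is found, so the real-account tests do not fail on machines without credentials.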

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate.

Are there any user-facing changes?

Yes. GetFileInfo is now supported on the Azure filesystem.

@github-actions

⚠️ GitHub issue #38335 has been automatically assigned in GitHub to PR creator.

@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_getfileinfo/GH-38335 branch from 544299f to 4142733 Compare November 1, 2023 22:21
@@ -78,18 +81,17 @@ struct AzurePath {
"Expected an Azure object path of the form 'container/path...', got a URI: '",
s, "'");
}
const auto src = internal::RemoveTrailingSlash(s);
Contributor Author

This was preventing GetFileInfo working on directories. The other filesystems did not have this.

FileType::NotFound);

AssertFileInfo(fs_.get(), PreexistingContainerPath() + "test-empty-object-dir",
FileType::Directory);
Contributor Author

Ideally I would have liked to add an assertion here which confirms that with the hierarchical namespace there are no calls to ListBlobs. That would require patching an Azure container client, which I didn't know how to do. If anyone has any suggestions, they would be appreciated.

Member

We can do it by adding an internal ListBlobs call counter and exporting it only for testing.
Or we may be able to provide AzureFileSystem::GetStatistics(), whose return value includes statistics such as the number of ListBlobs calls.

(I think that we don't need to test it. If we want to test it, we can open a new issue for it and defer it as a separate task so that we can merge this as soon as possible.)
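
A rough sketch of the call-counter idea, using only the standard library. Everything here is hypothetical: `CountingLister` is not part of Arrow or the Azure SDK, and its in-memory `ListBlobs` merely stands in for a real listing request.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper that counts how many listing operations are issued so
// that a test can assert on the number afterwards.
class CountingLister {
 public:
  explicit CountingLister(std::vector<std::string> blobs) : blobs_(std::move(blobs)) {}

  // Stand-in for a ListBlobs request: returns the names that start with the
  // given prefix and increments the counter once per call.
  std::vector<std::string> ListBlobs(const std::string& prefix) {
    list_blobs_calls_.fetch_add(1, std::memory_order_relaxed);
    std::vector<std::string> matches;
    for (const auto& name : blobs_) {
      if (name.compare(0, prefix.size(), prefix) == 0) {
        matches.push_back(name);
      }
    }
    return matches;
  }

  // Accessor intended for tests only.
  int64_t list_blobs_calls() const {
    return list_blobs_calls_.load(std::memory_order_relaxed);
  }

 private:
  std::vector<std::string> blobs_;
  std::atomic<int64_t> list_blobs_calls_{0};
};

// The kind of check a hierarchical namespace test could make after calling
// GetFileInfo: no listing request should have been issued.
inline void ExpectNoListBlobsCalls(const CountingLister& lister) {
  assert(lister.list_blobs_calls() == 0);
}
```

An atomic counter keeps the count correct even if listing happens from several threads; a `GetStatistics()` style API could expose the same number without a test-only accessor.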

Contributor Author

I think I'm happy to leave out such an assertion, at least initially. If it were Python I would have done it, but it seems like mocking in C++ would be more complicated even if I did understand the language 😅

@Tom-Newton Tom-Newton marked this pull request as ready for review November 5, 2023 18:38
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 5, 2023
@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_getfileinfo/GH-38335 branch from 1ab87dc to 28357b0 Compare November 5, 2023 20:33
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Nov 6, 2023
@Tom-Newton
Contributor Author

Thanks for reviewing, kou. I have addressed most of the comments and I should be able to address the remaining ones this evening.

@Tom-Newton Tom-Newton requested a review from kou November 6, 2023 22:41
@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_getfileinfo/GH-38335 branch from 42e3d31 to 7fe94f1 Compare November 7, 2023 09:04
Member

@kou kou left a comment

+1

AzureOptions options_;
internal::HierarchicalNamespaceDetector hierarchical_namespace_;
Member

It seems that HierarchicalNamespaceDetector is simple enough to move to Impl. (HierarchicalNamespaceDetector::Enabled() is the only important method in the class.)

How about moving HierarchicalNamespaceDetector::Enabled() to Impl::IsHierarchicalNamespaceEnabled() and removing HierarchicalNamespaceDetector (or something)?
If we do it, we can make datalake_service_client_ std::unique_ptr.

Contributor Author

@Tom-Newton Tom-Newton Nov 8, 2023

I made it separate because I wanted to keep the cached value enabled_ private from the rest of Impl. I was a bit concerned that people might try to directly access the cached state without realising that everything should use the Enabled() function. Additionally making it a separate class made it easier to test.

I think one possibility is to use a non-smart pointer in HierarchicalNamespaceDetector because HierarchicalNamespaceDetector will always be destructed at the same time as Impl. https://stackoverflow.com/questions/7657718/when-to-use-shared-ptr-and-when-to-use-raw-pointers. I think that should allow us to use a unique_ptr for datalake_service_client_. I think this would be my preferred solution. What do you think?
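
The ownership arrangement described above can be sketched like this. The types are hypothetical stand-ins (`FakeServiceClient` replaces the Data Lake service client from the Azure SDK), so this shows the pattern only, not the code that was merged.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the Data Lake service client.
struct FakeServiceClient {
  std::string account_url;
};

// Holds a non-owning raw pointer. This is safe only because the owner below
// guarantees that the client outlives the detector.
class Detector {
 public:
  explicit Detector(const FakeServiceClient* client) : client_(client) {
    assert(client_ != nullptr);
  }
  const std::string& account_url() const { return client_->account_url; }

 private:
  const FakeServiceClient* client_;  // not owned
};

// Members are initialised in declaration order and destroyed in reverse, so
// detector_ is destroyed before client_ and never sees a dangling pointer.
class Impl {
 public:
  explicit Impl(std::string account_url)
      : client_(std::make_unique<FakeServiceClient>(
            FakeServiceClient{std::move(account_url)})),
        detector_(client_.get()) {}

  const Detector& detector() const { return detector_; }

 private:
  std::unique_ptr<FakeServiceClient> client_;  // sole owner
  Detector detector_;
};
```

Because the client lives on the heap, the raw pointer stays valid even if `Impl` is moved; the trade-off is that the declaration order of the two members becomes load-bearing.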

Contributor Author

@Tom-Newton Tom-Newton Nov 8, 2023

I decided to just make my preferred change. If you think it's a bad idea I'm happy to change it again to something else.

Member

OK. Let's use that approach.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Nov 8, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Nov 9, 2023
@kou
Member

kou commented Nov 9, 2023

The lint failure was fixed by #38639.
I'll rebase on main before we merge this.

@kou kou force-pushed the tomnewton/azure_getfileinfo/GH-38335 branch from 737d926 to 0659a39 Compare November 9, 2023 02:16
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 9, 2023
@kou
Member

kou commented Nov 9, 2023

I'll merge this.

@kou kou merged commit 75a0403 into apache:main Nov 9, 2023
32 of 33 checks passed
@kou kou removed the awaiting change review Awaiting change review label Nov 9, 2023

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 75a0403.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ure filesystem (apache#38505)

* Closes: apache#38335

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ure filesystem (apache#38505)