-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem #39009
Conversation
TODO:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'll review GetFileInfoWithSelector()
and GetFileInfoWithSelectorFromContainer()
logic carefully later.)
cpp/src/arrow/filesystem/azurefs.cc
Outdated
static std::string_view BasenameView(std::string_view s) { | ||
auto offset = s.find_last_of(internal::kSep); | ||
auto tail = (offset == std::string_view::npos) ? s : s.substr(offset); | ||
return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this to path_util.{cc,h}
?
Because extracting basename from a path is a generic path operation.
We will be able to use GetAbstractPathParent()
to implement this like return GetAbstractPathParent(s).second;
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for factoring out generic path operations, and also unit-test them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with path_util.h
functions is that they don't follow any concept of path normalization for inputs and have to check the input for many variations, this BasenameView
works well in this context, but not in a general context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm changing it to this:
static std::string_view BasenameView(std::string_view s) {
DCHECK(!internal::HasTrailingSlash(s));
auto offset = s.find_last_of(internal::kSep);
auto result = (offset == std::string_view::npos) ? s : s.substr(offset);
DCHECK(!result.empty() && result.back() != internal::kSep);
return result;
}
Azure::Core::Context context; | ||
Azure::Nullable<int32_t> page_size_hint; // unspecified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't object them but why do you want to specify them explicitly?
Our other methods don't specify them explicitly. (They use the default value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want all the wiring to be in place if we need to specify the page_size_hint
which I'm pretty sure we will need to tweak when this is used in practice.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
} | ||
return Status::OK(); | ||
}; | ||
return ListContainers(context, std::move(on_container)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pass page_size_hint
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page_size_hint
only applies to the pagination of blobs. If I were to pass page size here, it would be a separate constant, but I thought this would be an overkill.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
} else if (cmp > 0) { | ||
RETURN_NOT_OK(process_prefix(prefix)); | ||
blob_prefix_index += 1; | ||
} else { // there is a blob (empty dir marker) and a prefix with the same name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really happen?
emptydir/
is a case of this in the test, right?
In this case, emptydir/
may exist in Blobs
(with options.Prefix = emptydir/
) or BlobPrefixes
(with no options.Prefix
) but doesn't exit in both of them. (I checked this with gdb
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the direction the whole implementation will move towards. CreateDir
should create a marker blob, so we can check that an empty directory exists after it's created or after its contents are deleted.
As @pitrou said in another comment, all the Filesystem implementations should behave the same way.
|
||
try { | ||
auto list_response = | ||
container_client.ListBlobsByHierarchy(/*delimiter=*/"/", options, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use ListBlobs()
instead when select.max_recursion == INT32_MAX
as an optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require that we build a trie of all the blob names to list them hierarchically. I'm open to doing it, but I think we can delay doing it until there is clear indication that it would help a concrete use-case.
I reviewed the tests, so I'm happy with the behaviour. I haven't had a chance to understand and review the implementation yet. |
options.Prefix = internal::EnsureTrailingSlash(base_location.path); | ||
} | ||
options.PageSizeHint = page_size_hint; | ||
options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question: was it considered to have a namespace alias at the top of this file such as namespace BlobsModels = Azure::Storage::Blobs::Models;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I prefer to do once all PRs land because it would be a noisy change.
9ddef84
to
bf66d06
Compare
I rebased to fix a conflict, but the new changes are on the last two commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Could you fix a lint failure?
I will rebase and force-push to see if all CI passes now. |
`length` could come from `FileSelector::max_recursion` which is default initialized to `INT32_MAX` and cause an overflow on the `offset+length` sum here.
These functions were added recently so no risk of ABI changes.
e237bb8
to
e248400
Compare
CI failure is unrelated so I'm merging this. |
@felipecrv Did you use our merge script? GH-38597's milestone isn't set. In general, it's set by the merge script. |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 476c78f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 32 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Oh no. Sorry. I had no idea. I will use it from now on. Anything we can do to make this right for this PR? |
I set GH-37597's milestone manually. :-) |
…ystem (apache#39009) ### Rationale for this change Part of Azure FS implementation. ### What changes are included in this PR? The version of `GetFileInfo` that takes a prefix and can optionally recurse into directories. ### Are these changes tested? By unit tests present in this PR. Separate from this PR, I'm thinking of way to fuzz-test the FS API. * Closes: apache#38597
…ystem (apache#39009) ### Rationale for this change Part of Azure FS implementation. ### What changes are included in this PR? The version of `GetFileInfo` that takes a prefix and can optionally recurse into directories. ### Are these changes tested? By unit tests present in this PR. Separate from this PR, I'm thinking of way to fuzz-test the FS API. * Closes: apache#38597
…ystem (apache#39009) ### Rationale for this change Part of Azure FS implementation. ### What changes are included in this PR? The version of `GetFileInfo` that takes a prefix and can optionally recurse into directories. ### Are these changes tested? By unit tests present in this PR. Separate from this PR, I'm thinking of way to fuzz-test the FS API. * Closes: apache#38597
Rationale for this change
Part of Azure FS implementation.
What changes are included in this PR?
The version of
GetFileInfo
that takes a prefix and can optionally recurse into directories.Are these changes tested?
By unit tests present in this PR. Separate from this PR, I'm thinking of way to fuzz-test the FS API.
GetFileInfo
selector for Azure filesystem #38597