-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38699: [C++][FS][Azure] Implement CreateDir()
#38708
Conversation
It seems that we can't create a directory explicitly without hierarchical namespace support. It seems that Azure Blob Storage supports only virtual directory. There is no directory. If a file (blob) name has "/", it's treated that the file (blob) exists under a virtual directory. It seems that Azure Data Lake Storage Gen2 supports a real directory. See also: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction This change chooses the following behavior: * Container can be created with/without hierarchical namespace support. * Directory can be created with hierarchical namespace support. * Directory can't be created without hierarchical namespace support. (`arrow::Status::NotImplemented` is returned for this case.) Azurite doesn't support hierarchical namespace yet. So I can't test the implementation for hierarchical namespace yet. Sorry.
@Tom-Newton What do you think about this behavior?
Do you have any (simple) document how to setup an account for |
The only part I'm questioning is whether to return I think the right choice will depend on how much |
You will need an azure account. You should be able to create a free account at https://azure.microsoft.com/en-gb/free/. You should the. Be able to create a storage account through the portal web UI. I can probably write a more specific doc if needed but this is Azure's doc https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account A few suggestions on configuration:
|
cpp/src/arrow/filesystem/azurefs.cc
Outdated
std::string_view body_text(reinterpret_cast<const char*>(body.data()), | ||
body.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to put this in a string? We are sure this is plain text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope so. I expect that the body includes a detailed error message and it's helpful for debug.
But it's not an Azure spec.
Should we avoid this for safety?
Or can we keep this as-is with a concern comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll add a comment.
"Cannot create a directory without hierarchical namespace: ", path.full_path); | ||
} | ||
auto directory_client = datalake_service_client_->GetFileSystemClient(path.container) | ||
.GetDirectoryClient(path.path_to_file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why path_to_file
is actually the name of the directory in this context, but maybe a different name for this struct field would make things less confusing? This segment of filesystem paths is usually called "basename" [1].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent quite some time trying to figure out what if (path.path_to_file.empty()) {
meant here.
path.basename.empty()
would be more clear IMO.
cc @Tom-Newton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that "basename" isn't suitable for this case.
I think that "d" is the basename of "a/b/c/d" but path_to_file
is "b/c/d". ("a" is container
.)
I think that path
is suitable for "b/c/d" but AzurePath::path
is strange... How about renaming AzurePath
to AzureLocation
and using container
for a
and path
for b/c/d
?
FYI:
- Azure Blob Storage calls "a" as "container name" and "b/c/d" as "blob name" in https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction .
- Azure Data Lake Storage Gen2 calls "a/b/c" as "path" and "d" as "file_name" in https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri .
- We don't support
abfs://...
URI yet. (FileSystemFromUri()
doesn't acceptabfs://...
yet.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I misinterpreted the meaning of path_to_file
. I think path
would be OK. And Azure{Path->Location}
is also a good rename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for @kou's the suggested re-name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened a new issue for this: #38758
I'll do it after we merge this.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
hierarchical_namespace_.Enabled(path.container)); | ||
if (!hierarchical_namespace_enabled) { | ||
return Status::NotImplemented( | ||
"Cannot create a directory without hierarchical namespace: ", path.full_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write "...without hierarchical namespace support" so the user is less likely to interpret this as some missing part of the path passed as argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Wording suggestion is always welcome for me because I'm not a native English speaker!
I'll add "support".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Wording suggestion is always welcome for me because I'm not a native English speaker!
No worries. This could happen to native English speakers as well. :)
Good point! I'll change the behavior to just return |
Thanks! I could create an account and confirm that the implementation passes the added tests. I'll add the provided information as a comment for other developers. |
I've updated. |
"Cannot create a directory without hierarchical namespace: ", path.full_path); | ||
} | ||
auto directory_client = datalake_service_client_->GetFileSystemClient(path.container) | ||
.GetDirectoryClient(path.path_to_file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for @kou's the suggested re-name.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
} | ||
|
||
std::string current_path; | ||
for (const auto& part : path.path_to_file_parts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did a quick test. It seems like directory_client.CreateIfNotExists()
actually creates the directories recursively by default, so probably we don't need this for loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Really!? I'll also check it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that we don't need this loop too.
I'll simplify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the AzureHierarchicalNamespaceFileSystemTest, CreateDirRecursiveSuccessContainerOnly
test is actually broken on main
. It seems like directory_client.CreateIfNotExists()
fails if the directory_client
is just referencing the root of the container. Probably we need some logic to skip the call to directory_client.CreateIfNotExists()
if location.path.empty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry.
I'll fix it: #38782
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR: #38783
Co-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
I've updated. |
// There is only virtual directory without hierarchical namespace | ||
// support. So the CreateDir() does nothing. | ||
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound); | ||
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::NotFound); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can disambiguate "file exists?" queries in the filesystem API, we should probably always reply true when the caller is asking if a directory exists. If creating a directory is a no-op that succeeds, the post-condition of CreateDir -- the directory now exists -- should be true.
There are might be bad consequences of this, so this is more of an idea than a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an interring point. I don't feel strongly but personally I think the current behaviour is the best option.
GetFileInfo
will return that a directory is present if at least one "file" has been created in that "directory". I think this behaviour is consistent with the GCS filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the opinion of Tom-Newton but let's discuss this further on #38772.
I'll merge this. |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a394a39. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 18 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change It seems that we can't create a directory explicitly without hierarchical namespace support. It seems that Azure Blob Storage supports only virtual directory. There is no directory. If a file (blob) name has "/", it's treated that the file (blob) exists under a virtual directory. It seems that Azure Data Lake Storage Gen2 supports a real directory. See also: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction ### What changes are included in this PR? This change chooses the following behavior: * Container can be created with/without hierarchical namespace support. * Directory can be created with hierarchical namespace support. * Directory can't be created without hierarchical namespace support. So do nothing without hierachical namespace support. (`arrow::Status::OK()` is just returned.) ### Are these changes tested? Azurite doesn't support hierarchical namespace yet. So I can't test the implementation for hierarchical namespace yet. Sorry. ### Are there any user-facing changes? Yes. * Closes: apache#38699 Lead-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Co-authored-by: Thomas Newton <thomas.w.newton@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
It seems that we can't create a directory explicitly without hierarchical namespace support.
It seems that Azure Blob Storage supports only virtual directory. There is no directory. If a file (blob) name has "/", it's treated that the file (blob) exists under a virtual directory.
It seems that Azure Data Lake Storage Gen2 supports a real directory.
See also:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction
What changes are included in this PR?
This change chooses the following behavior:
arrow::Status::OK()
is just returned.)Are these changes tested?
Azurite doesn't support hierarchical namespace yet. So I can't test the implementation for hierarchical namespace yet. Sorry.
Are there any user-facing changes?
Yes.
CreateDir()
#38699