Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38699: [C++][FS][Azure] Implement CreateDir() #38708

Merged
merged 7 commits into from
Nov 18, 2023

Conversation

kou
Copy link
Member

@kou kou commented Nov 14, 2023

Rationale for this change

It seems that we can't create a directory explicitly without hierarchical namespace support.

It seems that Azure Blob Storage supports only virtual directory. There is no directory. If a file (blob) name has "/", it's treated that the file (blob) exists under a virtual directory.

It seems that Azure Data Lake Storage Gen2 supports a real directory.

See also:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction

What changes are included in this PR?

This change chooses the following behavior:

  • Container can be created with/without hierarchical namespace support.
  • Directory can be created with hierarchical namespace support.
  • Directory can't be created without hierarchical namespace support. So do nothing without hierachical namespace support. (arrow::Status::OK() is just returned.)

Are these changes tested?

Azurite doesn't support hierarchical namespace yet. So I can't test the implementation for hierarchical namespace yet. Sorry.

Are there any user-facing changes?

Yes.

It seems that we can't create a directory explicitly without
hierarchical namespace support.

It seems that Azure Blob Storage supports only virtual directory.
There is no directory. If a file (blob) name has "/", it's treated
that the file (blob) exists under a virtual directory.

It seems that Azure Data Lake Storage Gen2 supports a real directory.

See also:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction

This change chooses the following behavior:

* Container can be created with/without hierarchical namespace support.
* Directory can be created with hierarchical namespace support.
* Directory can't be created without hierarchical namespace support.
  (`arrow::Status::NotImplemented` is returned for this case.)

Azurite doesn't support hierarchical namespace yet. So I can't test
the implementation for hierarchical namespace yet. Sorry.
@kou
Copy link
Member Author

kou commented Nov 14, 2023

@Tom-Newton What do you think about this behavior?

This change chooses the following behavior:

  • Container can be created with/without hierarchical namespace support.
  • Directory can be created with hierarchical namespace support.
  • Directory can't be created without hierarchical namespace support. (arrow::Status::NotImplemented is returned for this case.)

Do you have any (simple) document how to setup an account for AzureHierarchicalNamespaceFileSystemTest?

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Nov 14, 2023

What do you think about this behavior?

The only part I'm questioning is whether to return arrow::Status::NotImplemented on flat blob storage accounts. Possibly it would be better just to return a success status without doing anything.

I think the right choice will depend on how much CreateDir() is used. For example if CreateDir() is used every time arrow writes a partitioned parquet table, then returning an error status could be a bit of a problem.

@Tom-Newton
Copy link
Contributor

Do you have any (simple) document how to setup an account for AzureHierarchicalNamespaceFileSystemTest?

You will need an azure account. You should be able to create a free account at https://azure.microsoft.com/en-gb/free/. You should the. Be able to create a storage account through the portal web UI.

I can probably write a more specific doc if needed but this is Azure's doc https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account

A few suggestions on configuration:

  • Use Standard general-purpose v2 not premium
  • Use LRS redundancy
  • Obviously you will want to enable hierarchical namespace.
  • Set the default access tier to hot
  • SFTP, NFS and file shares are not required.

Comment on lines 629 to 630
std::string_view body_text(reinterpret_cast<const char*>(body.data()),
body.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to put this in a string? We are sure this is plain text?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope so. I expect that the body includes a detailed error message and it's helpful for debug.
But it's not an Azure spec.
Should we avoid this for safety?
Or can we keep this as-is with a concern comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to keep as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll add a comment.

"Cannot create a directory without hierarchical namespace: ", path.full_path);
}
auto directory_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetDirectoryClient(path.path_to_file);
Copy link
Contributor

@felipecrv felipecrv Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why path_to_file is actually the name of the directory in this context, but maybe a different name for this struct field would make things less confusing? This segment of filesystem paths is usually called "basename" [1].

[1] https://en.wikipedia.org/wiki/Basename

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent quite some time trying to figure out what if (path.path_to_file.empty()) { meant here.

path.basename.empty() would be more clear IMO.

cc @Tom-Newton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that "basename" isn't suitable for this case.
I think that "d" is the basename of "a/b/c/d" but path_to_file is "b/c/d". ("a" is container.)

I think that path is suitable for "b/c/d" but AzurePath::path is strange... How about renaming AzurePath to AzureLocation and using container for a and path for b/c/d?

FYI:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I misinterpreted the meaning of path_to_file. I think path would be OK. And Azure{Path->Location} is also a good rename.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for @kou's the suggested re-name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a new issue for this: #38758
I'll do it after we merge this.

hierarchical_namespace_.Enabled(path.container));
if (!hierarchical_namespace_enabled) {
return Status::NotImplemented(
"Cannot create a directory without hierarchical namespace: ", path.full_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write "...without hierarchical namespace support" so the user is less likely to interpret this as some missing part of the path passed as argument.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Wording suggestion is always welcome for me because I'm not a native English speaker!
I'll add "support".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Wording suggestion is always welcome for me because I'm not a native English speaker!

No worries. This could happen to native English speakers as well. :)

@kou
Copy link
Member Author

kou commented Nov 16, 2023

The only part I'm questioning is whether to return arrow::Status::NotImplemented on flat blob storage accounts. Possibly it would be better just to return a success status without doing anything.

I think the right choice will depend on how much CreateDir() is used. For example if CreateDir() is used every time arrow writes a partitioned parquet table, then returning an error status could be a bit of a problem.

Good point! I'll change the behavior to just return arrow::Status::OK().

@kou
Copy link
Member Author

kou commented Nov 16, 2023

Do you have any (simple) document how to setup an account for AzureHierarchicalNamespaceFileSystemTest?

You will need an azure account. You should be able to create a free account at https://azure.microsoft.com/en-gb/free/. You should the. Be able to create a storage account through the portal web UI.

I can probably write a more specific doc if needed but this is Azure's doc https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account

A few suggestions on configuration:

...

Thanks! I could create an account and confirm that the implementation passes the added tests.

I'll add the provided information as a comment for other developers.

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 16, 2023
@github-actions github-actions bot added awaiting review Awaiting review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Nov 16, 2023
@kou
Copy link
Member Author

kou commented Nov 16, 2023

I've updated.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 16, 2023
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
"Cannot create a directory without hierarchical namespace: ", path.full_path);
}
auto directory_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetDirectoryClient(path.path_to_file);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for @kou's the suggested re-name.

}

std::string current_path;
for (const auto& part : path.path_to_file_parts) {
Copy link
Contributor

@Tom-Newton Tom-Newton Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a quick test. It seems like directory_client.CreateIfNotExists() actually creates the directories recursively by default, so probably we don't need this for loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Really!? I'll also check it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that we don't need this loop too.
I'll simplify this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the AzureHierarchicalNamespaceFileSystemTest, CreateDirRecursiveSuccessContainerOnly test is actually broken on main. It seems like directory_client.CreateIfNotExists() fails if the directory_client is just referencing the root of the container. Probably we need some logic to skip the call to directory_client.CreateIfNotExists() if location.path.empty()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry.
I'll fix it: #38782

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR: #38783

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 17, 2023
Co-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Nov 17, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 17, 2023
@kou
Copy link
Member Author

kou commented Nov 17, 2023

I've updated.

// There is only virtual directory without hierarchical namespace
// support. So the CreateDir() does nothing.
arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound);
arrow::fs::AssertFileInfo(fs_.get(), parent, FileType::NotFound);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can disambiguate "file exists?" queries in the filesystem API, we should probably always reply true when the caller is asking if a directory exists. If creating a directory is a no-op that succeeds, the post-condition of CreateDir -- the directory now exists -- should be true.

There are might be bad consequences of this, so this is more of an idea than a suggestion.

Copy link
Contributor

@Tom-Newton Tom-Newton Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interring point. I don't feel strongly but personally I think the current behaviour is the best option.

GetFileInfo will return that a directory is present if at least one "file" has been created in that "directory". I think this behaviour is consistent with the GCS filesystem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the opinion of Tom-Newton but let's discuss this further on #38772.

@kou
Copy link
Member Author

kou commented Nov 18, 2023

I'll merge this.

@kou kou merged commit a394a39 into apache:main Nov 18, 2023
36 of 37 checks passed
@github-actions github-actions bot removed the awaiting change review Awaiting change review label Nov 18, 2023
@kou kou deleted the fs-azure-create-dir branch November 18, 2023 02:09
@github-actions github-actions bot added the awaiting changes Awaiting changes label Nov 18, 2023
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a394a39.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 18 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change

It seems that we can't create a directory explicitly without hierarchical namespace support.

It seems that Azure Blob Storage supports only virtual directory. There is no directory. If a file (blob) name has "/", it's treated that the file (blob) exists under a virtual directory.

It seems that Azure Data Lake Storage Gen2 supports a real directory.

See also:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction

### What changes are included in this PR?

This change chooses the following behavior:

* Container can be created with/without hierarchical namespace support.
* Directory can be created with hierarchical namespace support.
* Directory can't be created without hierarchical namespace support. So do nothing without hierachical namespace support. (`arrow::Status::OK()` is just returned.)

### Are these changes tested?

Azurite doesn't support hierarchical namespace yet. So I can't test the implementation for hierarchical namespace yet. Sorry.

### Are there any user-facing changes?

Yes.
* Closes: apache#38699

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Implement CreateDir()
3 participants