Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][FS][Azure] Consider how and if to fix disallowed characters in file metadata #40057

Closed
Tom-Newton opened this issue Feb 12, 2024 · 6 comments

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Feb 12, 2024

Describe the enhancement requested

Child of #18014

Manually modifying #40021 to run the Python tests against a real blob storage account caused a failure when attempting to write the metadata {'Content-Type': 'x-pyarrow/test'}. Looking at the Generic C++ tests it looks like we will run into the same issue there

auto metadata = KeyValueMetadata::Make({"Content-Type", "Content-Language"},
{"x-arrow/filesystem-test", "fr_FR"});
.

E   OSError: CommitBlockList failed for 'https://tomtestflat.blob.core.windows.net/63217166-c9f8-11ee-989a-71cec6336ac8/open-output-stream-metadata'. Committing is required to flush an output/append stream. Azure Error: [InvalidMetadata] 400 The metadata specified is invalid. It has characters that are not permitted.
E   The metadata specified is invalid. It has characters that are not permitted.
E   RequestId:ac61e6da-e01e-0009-2305-5e07de000000
E   Time:2024-02-12T22:45:15.5130135Z
E   Request ID: ac61e6da-e01e-0009-2305-5e07de000000

It turns out real Azure storage doesn't allow -s in the metadata keys. This behaviour is the same on flat and hierarchical namespace storage accounts but azurite accepts it without error.

Apparently the keys (names) of metadata must conform to https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#metadata-names the naming rules for C# identifiers.

We need to decide what we want to do here. I think either we have to accept these limitations or we will need to encode the metadata keys before writing them to azure then decode when reading back. A quick Google search came up with these options for potential encodings https://stackoverflow.com/questions/32037525/encode-to-alphanumeric-in-javascript#:~:text=To%20encode%20to%20an%20alphanumeric,in%20a%20shorter%20encoded%20string.

The downside of encoding the metadata would be that other Azure clients won't know to decode.

Component(s)

C++

@Tom-Newton Tom-Newton changed the title [C++][FS][Azure] Consider supporting more characters in file metadata [C++][FS][Azure] Consider how and if to fix disallowed characters in file metadata Feb 12, 2024
@Tom-Newton
Copy link
Contributor Author

I think probably this blocks #39069

@felipecrv
Copy link
Contributor

We should probably NOT USE an encoding that would make the keys look totally opaque (like base64).

A custom escaper that replaces - with _ and back would work well if we forbid _ in the metadata keys coming from the Arrow layer as every _ stored in Azure Blobs would become - when surfaced to the Arrow users.

@Tom-Newton
Copy link
Contributor Author

I like the idea of disalowing _. That makes handling - much simpler and its more interpretable.

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

So, the question is: which convention do other Azure clients use? What is the recommended or official way to set a file's content-type?

Edit: it looks like using the corresponding HTTP headers is recommended when possible. This probably requires a conversion layer of some kind, as done for S3.

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Mar 20, 2024

What is the recommended or official way to set a file's content-type?

This is a good point. If we only care about setting specific things like content-type then that simplifies the problem. Azure actually separates some values including content type from the user set metadata. I think #40671 (review) should resolve this.

@kou
Copy link
Member

kou commented Mar 22, 2024

Resolved by #40671.

@kou kou closed this as completed Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants