S3 metadata values no longer support Unicode #478

Closed · kengruven opened this issue Feb 5, 2016 · 5 comments
Labels: bug (This issue is a confirmed bug.)

Comments

@kengruven commented Feb 5, 2016
In the S3 documentation, I don't see anywhere that defines exactly what user-defined metadata are, but it says:

User-defined metadata is a set of key-value pairs. The size of user-defined metadata is measured by taking the sum of the number of bytes in the UTF-8 encoding of each key and value.

Only text has a UTF-8 encoding, and thus, I conclude that values are Unicode strings. That matches how I've been using them with Boto2 so far.

In Boto2, this was supported. I could do this:

import boto

# Boto2: attach a Unicode metadata value to a new key
s3 = boto.connect_s3(...)  # credentials elided
bucket = s3.get_bucket('my_bucket')
key = bucket.new_key('my_key')
key.metadata = {'foo': u'\U0001f4c8'}  # U+1F4C8, a non-ASCII code point
key.set_contents_from_file(f)  # f: an open file object

and in the S3 Management Console, it appears as key "x-amz-meta-foo", value "%F0%9F%93%88" (the URI encoding of U+1F4C8). It's a little funny that the S3 console is re-encoding this in a different way, but the S3 console is pretty bare-bones, and the re-encoding confirms that everything upstream recognizes that it's a Unicode string.

In Boto3, this doesn't work. When I try to do:

import boto3

# Boto3: the equivalent upload with the same Unicode metadata value
session = boto3.session.Session(...)  # credentials elided
s3 = session.resource('s3')
obj = s3.Object('my_bucket', 'my_key')
obj.put(Body=f, Metadata={'foo': u'\U0001f4c8'})  # fails at signing time

I get the remarkably unhelpful error message:

ClientError: An error occurred (SignatureDoesNotMatch) when calling the PutObject operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.

I looked around for anything that might suggest this changed from Boto2 to Boto3. The Boto3 documentation for put_object says it's of type:

Metadata={
    'string': 'string'
},

and makes no mention of Unicode/ASCII limitations. (Elsewhere in the same call, parameters typed as b'bytes' are called out, but Metadata isn't one of them.)

I tried calling .encode('utf-8') on my string before passing it to Metadata=, but this doesn't work, either. I get an exception that ends with:

  ...
  File "lib/python2.7/site-packages/botocore/signers.py", line 119, in sign
    signer.add_auth(request=request)
  File "lib/python2.7/site-packages/botocore/auth.py", line 627, in add_auth
    auth_path=request.auth_path)
  File "lib/python2.7/site-packages/botocore/auth.py", line 615, in get_signature
    auth_path=auth_path)
  File "lib/python2.7/site-packages/botocore/auth.py", line 601, in canonical_string
    custom_headers = self.canonical_custom_headers(headers)
  File "lib/python2.7/site-packages/botocore/auth.py", line 560, in canonical_custom_headers
    hoi.append("%s:%s" % (key, custom_headers[key]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 4: ordinal not in range(128)

As far as I can tell, this is a bug in Boto3. With Boto2, I was able to pass metadata {'foo': u'\U0001f4c8'} when putting an object in S3, and with Boto3, I'm not.

@kengruven (Author)

Interestingly, Boto2 allows uploading files with Unicode metadata, but there's a bug that breaks generate_url with such objects: boto/boto#2556

Boto3 fixes the download half, but breaks the upload half.

That is, I can upload an object with Unicode metadata with Boto2 but not Boto3. Once it's uploaded to S3, I can generate_url for it with Boto3 but not Boto2.
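For reference, the Boto3 counterpart of Boto2's generate_url is generate_presigned_url on the underlying client. A minimal sketch, reusing the s3 resource and bucket/key names from the example above:

# Generate a time-limited download URL for the object
url = s3.meta.client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my_bucket', 'Key': 'my_key'},
    ExpiresIn=3600,  # link valid for one hour
)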

@jamesls (Member) commented Feb 5, 2016

S3 metadata has to be ASCII. From http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html:

Amazon S3 stores user-defined metadata in lowercase. Each name, value pair must conform to US-ASCII when using REST and UTF-8 when using SOAP or browser-based uploads via POST.

If you try to send non-ASCII data, it'll get "double encoded," which is what you're seeing in the console. This will also break the Signature Version 4 signer, which is required for newer regions such as eu-central-1.

I think our best option here is to error/warn when we see user-defined metadata that contains non-ASCII characters.
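As an illustration of that proposal, a minimal sketch of such a client-side check (the helper name is hypothetical, not botocore's actual code):

def check_ascii_metadata(metadata):
    # Reject metadata keys/values that aren't pure ASCII, since the
    # REST API (and the SigV4 signer) can't handle anything else.
    for key, value in metadata.items():
        try:
            key.encode('ascii')
            value.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError('S3 user-defined metadata must be ASCII; '
                             'got non-ASCII data for key %r' % key)

check_ascii_metadata({'foo': u'\U0001f4c8'})  # raises ValueError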

@jamesls jamesls added the bug This issue is a confirmed bug. label Feb 5, 2016
@kengruven (Author)

That just raises more questions.

  1. Why do the docs say that it must be in UTF-8, then? (That's not even an encoding of ASCII.)
  2. What does "Amazon S3 stores user-defined metadata in lowercase" mean? Empirically, S3 doesn't do this. I've got keys like "x-amz-meta-name" = "John%20Doe", and I can see them this way in the S3 Management Console, so it's clearly storing the original case. Was this sentence intended to apply to key names only?
  3. Is Boto2 doing URL-encoding on the strings before sending them to S3, then? Would that be a safe thing to call myself before passing values to Boto3 Metadata? (See the sketch at the end of this comment.)

I think our best option here is to error/warn when we see user-defined metadata that contains non-ASCII characters.

That would certainly be a lot more helpful than what it does now.
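On question 3, a workaround sketch (an assumption, not an API guarantee: S3 stores whatever ASCII string you send, so you must undo the encoding yourself when reading the metadata back):

from urllib.parse import quote, unquote  # Python 3; Python 2 has urllib.quote/unquote

value = u'\U0001f4c8'
safe = quote(value.encode('utf-8'))      # '%F0%9F%93%88', pure ASCII
obj.put(Body=f, Metadata={'foo': safe})  # obj and f as in the Boto3 example above

round_tripped = unquote(safe)            # back to u'\U0001f4c8' on Python 3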

@jamesls (Member) commented Feb 6, 2016

To clarify your first question, the docs say they have to be ASCII when using REST, which is what all the AWS SDKs use now. The UTF-8 part is only possible when using SOAP, which we don't use.

jamesls added a commit to jamesls/botocore that referenced this issue Mar 31, 2016
yuvipanda added a commit to yuvipanda/notebooksharing.space that referenced this issue Dec 1, 2021
- Turns out S3 metadata values can only be ASCII, so using that
  to store the filename was problematic.
  boto/boto3#478
- All metadata values are thus percent encoded (strictly, without
  the + for space replacement) on read and write to S3.
  I'm going to run a manual migration to percent encode all the
  existing notebook filenames to prevent annoying long term
  inconsistencies.
- Content-Disposition also apparently can only be ASCII by
  default without some funky encoding fun. We do the funky
  encoding fun.
- Decided to just use quote instead of quote_plus everywhere.

Fixes #35
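For context on the quote-vs-quote_plus choice in that commit, both come from Python's standard urllib.parse; the difference is only in how spaces are treated:

from urllib.parse import quote, quote_plus, unquote

quote('John Doe')       # 'John%20Doe' -- space becomes %20
quote_plus('John Doe')  # 'John+Doe'   -- space becomes '+'
unquote('John+Doe')     # 'John+Doe'   -- a plain unquote leaves '+' alone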
@vlovich commented Dec 6, 2022

@jamesls why does https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html say that you can upload UTF-8 in REST and that Unicode values get RFC-2047 encoded/decoded?

When using non US-ASCII characters in your metadata values, the provided unicode string is examined for non US-ASCII characters. Values of such headers are character decoded as per RFC 2047 before storing and encoded as per RFC 2047 to make them mail-safe before returning. If the string contains only US-ASCII characters, it is presented as is.

The examples provided are using the REST API. Is this a Boto bug where it's forgetting to apply RFC-2047 transparently to custom metadata?
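For reference, RFC 2047 encoding of the example value from this thread can be produced with Python's standard email.header module (a sketch of the transformation the AWS docs describe, not of anything boto currently does):

from email.header import Header, decode_header

# Encode a non-ASCII value into a mail-safe ASCII form
encoded = Header(u'\U0001f4c8', 'utf-8').encode()
print(encoded)  # '=?utf-8?b?8J+TiA==?='

# Decode it back to the original character
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))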
