storage - Truncated upload due to length-calculation prior to utf-8 encoding #469
Comments
Ahh, awesome find! I would be more than happy to accept a PR which solves this.
I'm having this issue as well; what would a fix for this look like?
the fix, if anyone is wondering, is to encode the string to bytes before uploading.
@AdeelK93 do you happen to have sample code for how you've done this? I would love to get this patched.
sure, this is basically the fix in my code. for a utf8 instance of `str`:

```python
import aiohttp
from gcloud.aio.storage import Storage

# bucket, blob, and my_string are assumed to be defined already
async with aiohttp.ClientSession() as session:
    client = Storage(session=session)
    # encoding to bytes up front means the content length is computed on bytes
    await client.upload(bucket, blob, my_string.encode())
```
or in other words, don't ever upload a string, only upload bytes.
Huh, ok, then I'm a bit confused: we do indeed calculate the content length before encoding... but for multipart uploads we actually recompute that length after we do the bytes encode. Does anyone have a sample file which I could test with to understand this behaviour? I've just tried a couple of files without being able to reproduce it.
try an emoji for the simplest example, or for something more robust, try this test file. for the second file, you'll notice that the original file is 27kb, but the file that gets uploaded to GCS is 22kb. you're not going to get an error. you're going to get an incomplete upload. then try again with the string encoded to bytes first and the full file will upload.
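To see why a single emoji is enough to trigger this, compare the character count Python reports with the utf-8 byte count actually sent on the wire; this is a minimal illustrative sketch, not code from the thread:

```python
s = "🌍"
print(len(s))                   # 1 -- character count
print(len(s.encode("utf-8")))   # 4 -- byte count after utf-8 encoding

# If Content-Length is declared as len(s) == 1 but 4 bytes are written,
# the server stops reading early: no error, just a truncated object.
```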
Encoding to utf-8 can increase the byte length of an object beyond its character count. The content length is calculated prior to encoding, so if the string contains multi-byte characters the declared length is too small and the result is a truncated upload.
In upload(), the content length is calculated from the raw string, which happens before the utf-8 encode in _upload_multipart().
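For illustration, a hypothetical sketch of the ordering bug described above; the function names and the `send` helper are made up, not gcloud-aio's actual code. The fix is to measure the encoded bytes rather than the string:

```python
# Hypothetical sketch -- not the library's real implementation.
def upload_buggy(data: str) -> None:
    content_length = len(data)     # character count, computed too early
    body = data.encode("utf-8")    # may be longer than content_length
    send(body, content_length)     # receiver reads too few bytes: truncation

def upload_fixed(data: str) -> None:
    body = data.encode("utf-8")    # encode first...
    send(body, len(body))          # ...then compute the length from the bytes
```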