storage - Truncated upload due to length-calculation prior to utf-8 encoding #469
Comments
Ahh, awesome find! I would be more than happy to accept a PR which solves this.
I'm having this issue as well; what would a fix for this look like?
the fix, if anyone is wondering, is to encode the string to bytes before uploading.
@AdeelK93 do you happen to have sample code for how you've done this? I would love to get this patched.
sure, this is basically the fix in my code. for a utf8 instance of `str`:

```python
import aiohttp
from gcloud.aio.storage import Storage

# bucket, blob, and my_string are assumed to be defined already
async with aiohttp.ClientSession() as session:
    client = Storage(session=session)
    # encoding to bytes up front means the content length is computed on bytes
    await client.upload(bucket, blob, my_string.encode())
```
or in other words, don't ever upload a string, only upload bytes.
Huh, ok, then I'm a bit confused: we do indeed calculate the content length before encoding... but for multipart uploads we actually recompute that length after we do the bytes encode. Does anyone have a sample file which I could test with to understand this behaviour? I've just tried a couple of files without being able to reproduce it.
try an emoji for the simplest example, or for something more robust, try this test file. for the second file, you'll notice that the original file is 27kb, but the file that gets uploaded to GCS is 22kb. you're not going to get an error. you're going to get an incomplete upload. then try again with the string encoded to bytes first and the full file will upload.
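To see why a single emoji is enough to trigger this, compare the character count Python reports with the utf-8 byte count actually sent on the wire; this is a minimal illustrative sketch, not code from the thread:

```python
s = "🌍"
print(len(s))                   # 1 -- character count
print(len(s.encode("utf-8")))   # 4 -- byte count after utf-8 encoding

# If Content-Length is declared as len(s) == 1 but 4 bytes are written,
# the server stops reading early: no error, just a truncated object.
```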
Encoding to utf-8 can increase the byte length of an object beyond its character count. The content length is calculated prior to encoding, so if the string contains multi-byte characters the declared length is too small and the result is a truncated upload.
In upload(), the content length is calculated from the raw string, which happens before the utf-8 encode in _upload_multipart().
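For illustration, a hypothetical sketch of the ordering bug described above; the function names and the `send` helper are made up, not gcloud-aio's actual code. The fix is to measure the encoded bytes rather than the string:

```python
# Hypothetical sketch -- not the library's real implementation.
def upload_buggy(data: str) -> None:
    content_length = len(data)     # character count, computed too early
    body = data.encode("utf-8")    # may be longer than content_length
    send(body, content_length)     # receiver reads too few bytes: truncation

def upload_fixed(data: str) -> None:
    body = data.encode("utf-8")    # encode first...
    send(body, len(body))          # ...then compute the length from the bytes
```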