
Add header enabling gzip downloads #3380

Merged
merged 6 commits into from
Oct 13, 2017

Conversation

tartavull

Not really intended to be merged.
I'm just wondering if there is any other way of downloading gzip files that have the correct content-encoding.

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 6, 2017
@dhermes
Contributor

dhermes commented May 6, 2017

@tartavull Can you give a code snippet that doesn't work the way you expect (and explain what you do expect)?

@tartavull
Author

Uploading a compressed file:

    import gzip
    from StringIO import StringIO  # Python 2; use io.BytesIO on Python 3

    def _compress(content):
        stringio = StringIO()
        gzip_obj = gzip.GzipFile(mode='wb', fileobj=stringio)
        gzip_obj.write(content)
        gzip_obj.close()
        return stringio.getvalue()

    content = 'highly compressible string'
    blob.upload_from_string(_compress(content))
    blob.content_encoding = "gzip"
    blob.patch()

Downloading the file:

    assert content == blob.download_as_string()

I expect the HTTP request to include the header accept-encoding: gzip;
otherwise the server will decompress the blob and send something much larger.
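The size gap described here is easy to reproduce locally. A minimal sketch using only the stdlib gzip module (the variables are illustrative stand-ins for the stored blob, not part of the library):

```python
import gzip

# Stand-in for the blob: what the bucket stores when content_encoding is
# "gzip", vs. what a client receives if the server inflates it in transit.
content = b"highly compressible string" * 10_000
stored = gzip.compress(content)

ratio = len(content) / len(stored)
print(f"raw: {len(content)} bytes, stored: {len(stored)} bytes (~{ratio:.0f}x)")
```

Without Accept-Encoding: gzip on the GET, the full `content`-sized payload travels over the wire even though the bucket only holds the small `stored` payload.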

@lukesneeringer
Contributor

Sending Accept-Encoding: gzip does not seem like a hack to me. That is the entire point of the Accept-Encoding header. I would be willing to accept this, so long as it does not have problematic side effects (e.g. causing trouble for other compression formats or uncompressed files).

@tartavull
Author

I haven't seen problems when retrieving uncompressed files, but I haven't tested other compression formats.

@lukesneeringer
Contributor

@dhermes Any concerns about this PR? I have no problem adding an Accept-Encoding header.

@dhermes
Contributor

dhermes commented May 9, 2017

I'd like to see what @thobrla or someone else from the Storage team says. I'm just not sure if it makes sense.

@thobrla

thobrla commented May 9, 2017

It's reasonable (even preferable) to include this header, but it is a substantial semantic change. How does the library handle gzipped bytes in the response? Is it up to the caller to decompress them?

@tartavull
Author

https://github.com/GoogleCloudPlatform/google-auth-library-python-httplib2 takes care of decompression, no action required by the caller

@dhermes
Contributor

dhermes commented May 9, 2017

As does urllib3 / requests
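The transparent decoding those HTTP clients perform boils down to checking the response's Content-Encoding header before handing bytes to the caller. A toy version of that step (the `decode_body` helper is hypothetical, not an API of urllib3, requests, or httplib2):

```python
import gzip

def decode_body(headers, raw_body):
    # Toy version of what urllib3/requests and httplib2 do internally:
    # inflate the body when the response says Content-Encoding: gzip.
    if headers.get("content-encoding", "").lower() == "gzip":
        return gzip.decompress(raw_body)
    return raw_body

payload = b"highly compressible string" * 1_000
wire = gzip.compress(payload)

assert decode_body({"content-encoding": "gzip"}, wire) == payload
assert decode_body({}, payload) == payload
```

Either way, the caller ends up with the same uncompressed bytes; only the size of the bytes on the wire changes.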

@thobrla

thobrla commented May 9, 2017

Seems fine, then. The caller has always gotten uncompressed bytes and they'll continue to get uncompressed bytes.

Out of curiosity, how were mid-download connection breaks handled for content-encoding:gzip objects previously?

@tartavull
Author

tartavull commented May 10, 2017

I think you will now get an IOError: CRC check failed exception
from https://github.com/python-git/python/blob/master/Lib/gzip.py#L316
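(On Python 3 the same failure surfaces as gzip.BadGzipFile, a subclass of OSError.) A quick sketch that forces the CRC mismatch by corrupting the gzip trailer, simulating a damaged download:

```python
import gzip

data = gzip.compress(b"some payload" * 100)

# The last 8 bytes of a gzip stream are the CRC32 and uncompressed size;
# flipping bits in the CRC field simulates a corrupted download.
broken = data[:-8] + bytes([data[-8] ^ 0xFF]) + data[-7:]

error_message = ""
try:
    gzip.decompress(broken)
except OSError as exc:  # IOError on Python 2, gzip.BadGzipFile on Python 3
    error_message = str(exc)

print(error_message)
```

A connection break that truncates the stream instead raises a similar "file ended before the end-of-stream marker" error, so callers see an exception either way rather than silently corrupt bytes.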

@lukesneeringer
Contributor

@tartavull Can you update the unit tests that fail as a result of your change? Once that is done, we can accept this.

@tseaver tseaver added api: storage Issues related to the Cloud Storage API. in progress type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels May 15, 2017
@tartavull tartavull force-pushed the master branch 3 times, most recently from 8d02bcc to bc031ec May 17, 2017 17:14
@tartavull
Author

Do you happen to know whether this failure:

Warning, treated as error:
/var/code/gcp/docs/logging-usage.rst.rst:25:Over dedent has detected
nox > Command bash ./test_utils/scripts/update_docs.sh failed with exit code 1
nox > Session docs failed. :(
Exited with code 1

is related to my commit changes?

@dhermes
Contributor

dhermes commented May 17, 2017

@tartavull Sorry for being quiet here for way too long.

  • No, the update_docs.sh failure is not your fault; it's related to the 1.6.1 release of Sphinx. Just rebase your branch on master and everything will be OK.
  • I'd really love to discuss some snippets of code that don't work as you expect with the current implementation, and then contrast that with how the same code behaves with the change in this PR.

GET requests now contain the header accept-encoding:gzip.
This improves performance for compressible strings which
were uploaded as gzips.

The caller is not required to do any decompression
because decompression is handled by the library.

Confusing `IOError: CRC check failed` exceptions will be
raised in the case of mid-download connection breaks.
@tartavull
Author

@dhermes looking forward to discussing them.

@dhermes
Contributor

dhermes commented May 17, 2017

@tartavull Do you have some examples? Or did you mean something else?

@tartavull
Author

tartavull commented May 17, 2017

I misunderstood you. I don't have any snippets that this PR fixes; it is just a huge performance improvement, which is why I tagged the commit with "perf".
As shown in the snippet above, there are no changes in the output to the caller.

neuroglancer
To provide you with a real use case: we store large 3D stacks of images in small 3D chunks.
These chunks are sized 512x512x64 and are of type uint64. Each uncompressed chunk is 128 MB, but the voxels are mostly the same value, so each chunk compresses to around 500 KB.
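Those numbers check out back-of-the-envelope. A sketch using a uniform buffer as a stand-in for the mostly-constant image chunks (real image data won't compress quite this well):

```python
import gzip

voxels = 512 * 512 * 64
chunk = b"\x07" * (voxels * 8)  # uint64 -> 8 bytes per voxel

# 512 * 512 * 64 * 8 bytes = 128 MiB uncompressed, as stated above.
assert len(chunk) == 128 * 1024 * 1024

compressed = gzip.compress(chunk, compresslevel=1)
print(f"compressed to {len(compressed)} bytes "
      f"(~{len(chunk) // len(compressed)}x smaller)")
```

With ratios in this range, downloading the stored gzip bytes instead of server-inflated bytes is the difference between ~500 KB and 128 MB per chunk fetched.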

@dhermes
Contributor

dhermes commented May 17, 2017

Ah, I see. Let me play around with this a bit to try to "break things" and investigate the raw payloads.

In the meantime, you can check out the underlying library used for uploads (docs and source). You can pass custom headers to an upload, so this perf optimization would be usable for you immediately.

@tartavull
Author

Any luck breaking things?

@tartavull
Author

@dhermes Is there anything I can do to get this merged?

@dhermes
Contributor

dhermes commented Jun 27, 2017

Sorry @tartavull it fell off my plate of things to do! Really bad of me, eek.

I want to test this PR on real use cases before merging. In particular I'd like to test on two files:

  • File A: Plain text
  • File B: File A, but gzip-ed locally

and maybe some other cases I haven't thought of?

I just want to make sure this "does the right thing".

@tartavull
Author

@dhermes I understand. Let me know if you need any help from me.

@lukesneeringer
Contributor

Poke.

@tartavull
Author

This PR modifies a single line of code, and it reduces traffic to Google Cloud Storage for all of your users who download compressed files. For our particular case that's a 250x reduction in bytes downloaded.
But 90 days after it was created, it still hasn't been merged.

@lukesneeringer
Contributor

@dhermes At this point I want to just merge this. Giving you a chance to throw up a flag, but this should really get in.

@dhermes
Contributor

dhermes commented Aug 7, 2017

@lukesneeringer If you feel it's the right move, go for it.

I worry that it is partially incorrect, e.g. if the thing stored is already gzipped then it will accidentally do the wrong thing.

@lukesneeringer
Contributor

I am not particularly worried. This is a really common pattern.

Luke Sneeringer and others added 3 commits August 8, 2017 10:46
@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that they're okay with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this State. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@googlebot googlebot added cla: no This human has *not* signed the Contributor License Agreement. and removed cla: yes This human has signed the Contributor License Agreement. labels Oct 13, 2017
@tseaver tseaver self-requested a review October 13, 2017 14:13
@tseaver tseaver changed the title A hack for enabling gzip downloads Add header enabling gzip downloads Oct 13, 2017
@tseaver tseaver merged commit 8762c34 into googleapis:master Oct 13, 2017
@tartavull
Author

Thanks :)
