
Massively broken with gzip encoded streams #36

Open
Count-Count opened this issue Dec 15, 2019 · 8 comments

Count-Count (Contributor) commented Dec 15, 2019

Due to the changes made for the short-read functionality, using gzip-encoded streams is massively broken. Accessing the raw response content bypasses gzip decoding, so the event stream cannot be read.

This sometimes happens with Wikimedia event streams (gzip encoding is not used in all cases; I'm not sure when it is used).

See https://requests.readthedocs.io/en/latest/user/quickstart/#raw-response-content
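
A minimal sketch of the failure mode, assuming a plain requests call against the Wikimedia stream (variable names are illustrative):

import requests

url = "https://stream.wikimedia.org/v2/stream/recentchange"
resp = requests.get(url, stream=True,
                    headers={"Accept": "text/event-stream"})

# resp.raw is the underlying urllib3 response. Reading it directly
# yields the bytes as they arrived on the wire, i.e. still compressed
# whenever the server answered with Content-Encoding: gzip, so SSE
# parsing fails.
wire_bytes = resp.raw.read(1024)

# iter_content() goes through urllib3's decoding layer, so the chunks
# it yields are already decompressed.
for chunk in resp.iter_content(chunk_size=1024):
    pass  # chunk is plain text/event-stream data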

Count-Count (Author) commented Dec 15, 2019

@mutantmonkey FYI

Count-Count changed the title from "Massively broken with gzip encoding" to "Massively broken with gzip encoded streams" on Dec 15, 2019
mutantmonkey (Contributor) commented

Related bug: #27
I'm not really sure why it was closed, as it seems to be the exact same problem mentioned here.

mutantmonkey (Contributor) commented

There are a few different possible approaches to fix this:

  1. Disable short reads when gzip encoding is used, as you've done in #37 ("Don't use raw reads for gzipped or chunked encoding (fixes #28, #36)"). The downside of this is that #8 ("Default chunk size of 1024 is inappropriate (regression in 0.0.16)") and #9 ("0.0.18 lags") will resurface for users who are also using gzip encoding.
  2. Disable gzip encoding by overriding the Accept-Encoding header that requests sets automatically, as mentioned in #27 ("sseclient can't detect end of next event?"). The downside of this is that we won't get the benefit of gzip compression. (A sketch of this approach follows below.)
  3. Fix short reads so they also work with gzipped content.

I will see if I can come up with a pull request that takes the third approach. I was not aware that requests even supported gzip encoding.
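
For reference, a minimal sketch of approach 2, assuming a plain requests call (the URL is illustrative); it asks the server for an uncompressed response so that raw reads return usable bytes:

import requests

# Approach 2 sketch: override the Accept-Encoding header that requests
# adds by default, so the server sends the stream uncompressed.
resp = requests.get(
    "https://stream.wikimedia.org/v2/stream/recentchange",
    stream=True,
    headers={
        "Accept": "text/event-stream",
        "Accept-Encoding": "identity",  # request no compression
    },
)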

Count-Count (Author) commented Dec 18, 2019

Approach no. 3 sounds good. Shouldn't it just work(tm) if we use the high-level iter_content() with chunk_size=None? stream is already set to True.

def iter_content(self, chunk_size=1, decode_unicode=False):
    """Iterates over the response data.  When stream=True is set on the
    request, this avoids reading the content at once into memory for
    large responses.  The chunk size is the number of bytes it should
    read into memory.  This is not necessarily the length of each item
    returned as decoding can take place.

    chunk_size must be of type int or None. A value of None will
    function differently depending on the value of `stream`.
    stream=True will read data as it arrives in whatever size the
    chunks are received. If stream=False, data is returned as
    a single chunk.

    If decode_unicode is True, content will be decoded using the best
    available encoding based on the response.
    """

mutantmonkey (Contributor) commented

The documentation makes it sound like it should, but unfortunately that's not the case. If you trace things back through urllib3's underlying stream and read functions, through http.client.HTTPResponse.read, you ultimately end up at a call to io.BufferedReader.read, which per the Python docs will block until EOF, so setting chunk_size=None means you will receive no events until EOF.
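
Roughly, the call chain looks like this (a sketch of the layers involved, not exact source):

# With stream=True and chunk_size=None, each layer passes "read
# everything" down to the next one:
#
# requests.Response.iter_content(chunk_size=None)
#   -> urllib3.HTTPResponse.stream(amt=None)
#     -> urllib3.HTTPResponse.read(amt=None)
#       -> http.client.HTTPResponse.read()  # no amt: read until close
#         -> io.BufferedReader.read()       # size omitted: block until EOF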

Count-Count (Author) commented Dec 19, 2019

@mutantmonkey What was the problem with a chunk_size of 1?

mutantmonkey (Contributor) commented Dec 20, 2019

Using a chunk size of one causes unnecessarily high CPU usage because each time a byte is received, it has to be processed by the Python code instead of just being added to a buffer and processed all at once. This library used to do that, but 6820dc8 changed that behavior.

In any case, I believe I may have a fix that will work, but I need an endpoint with gzip enabled to test on. If you happen to have one handy, please share it, otherwise I can try to set something up but it will take another couple of days.
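
For anyone wanting to test locally, here is a minimal sketch of a gzip-encoded SSE endpoint; the handler class, port, and event payloads are invented for illustration, not part of this thread:

import time
import zlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class GzipSSEHandler(BaseHTTPRequestHandler):
    # Chunked transfer encoding requires HTTP/1.1.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        comp = zlib.compressobj(wbits=31)  # wbits=31 -> gzip container
        n = 0
        while True:
            event = ("data: event %d\n\n" % n).encode()
            # Z_SYNC_FLUSH makes the compressor emit bytes now instead of
            # buffering them, so each event reaches the client promptly.
            payload = comp.compress(event) + comp.flush(zlib.Z_SYNC_FLUSH)
            self._write_chunk(payload)
            n += 1
            time.sleep(1)

    def _write_chunk(self, data):
        # HTTP/1.1 chunked framing: hex length, CRLF, data, CRLF.
        self.wfile.write(("%X\r\n" % len(data)).encode())
        self.wfile.write(data + b"\r\n")
        self.wfile.flush()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), GzipSSEHandler).serve_forever()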

TheSandDoctor (Collaborator) commented

@Count-Count @mutantmonkey
