
Massively broken with gzip encoded streams #36

Open
Count-Count opened this issue Dec 15, 2019 · 8 comments

Count-Count (Contributor) commented Dec 15, 2019

Due to the changes made for the short-read functionality, using gzip-encoded streams is massively broken. Accessing the raw response content bypasses gzip decoding, so the event stream cannot be read.

This sometimes happens with Wikimedia event streams (gzip encoding is not used in all cases; I'm not sure when it is used).

See https://requests.readthedocs.io/en/latest/user/quickstart/#raw-response-content
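
A minimal sketch of the failure mode, assuming a plain requests call against the Wikimedia stream (variable names are illustrative):

import requests

url = "https://stream.wikimedia.org/v2/stream/recentchange"
resp = requests.get(url, stream=True,
                    headers={"Accept": "text/event-stream"})

# resp.raw is the underlying urllib3 response. Reading it directly
# yields the bytes as they arrived on the wire, i.e. still compressed
# whenever the server answered with Content-Encoding: gzip, so SSE
# parsing fails.
wire_bytes = resp.raw.read(1024)

# iter_content() goes through urllib3's decoding layer, so the chunks
# it yields are already decompressed.
for chunk in resp.iter_content(chunk_size=1024):
    pass  # chunk is plain text/event-stream data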

Count-Count (Author) commented Dec 15, 2019

@mutantmonkey FYI

Count-Count changed the title from "Massively broken with gzip encoding" to "Massively broken with gzip encoded streams" on Dec 15, 2019
mutantmonkey (Contributor) commented

Related bug: #27
I'm not really sure why it was closed, as it seems to be the exact same problem mentioned here.

mutantmonkey (Contributor) commented

There are a few different possible approaches to fix this:

  1. Disable short reads when gzip encoding is used, as you've done in #37 ("Don't use raw reads for gzipped or chunked encoding (fixes #28, #36)"). The downside of this is that #8 ("Default chunk size of 1024 is inappropriate (regression in 0.0.16)") and #9 ("0.0.18 lags") will resurface for users who are also using gzip encoding.
  2. Disable gzip encoding by overriding the Accept-Encoding header that requests sets automatically, as mentioned in #27 ("sseclient can't detect end of next event?"). The downside of this is that we won't get the benefit of gzip compression. (A sketch of this approach follows below.)
  3. Fix short reads so they also work with gzipped content.

I will see if I can come up with a pull request that takes the third approach. I was not aware that requests even supported gzip encoding.
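
For reference, a minimal sketch of approach 2, assuming a plain requests call (the URL is illustrative); it asks the server for an uncompressed response so that raw reads return usable bytes:

import requests

# Approach 2 sketch: override the Accept-Encoding header that requests
# adds by default, so the server sends the stream uncompressed.
resp = requests.get(
    "https://stream.wikimedia.org/v2/stream/recentchange",
    stream=True,
    headers={
        "Accept": "text/event-stream",
        "Accept-Encoding": "identity",  # request no compression
    },
)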

Count-Count (Author) commented Dec 18, 2019

Approach no. 3 sounds good. Shouldn't it just work(tm) if we use the high-level iter_content() with chunk_size=None? stream is already set to True.

def iter_content(self, chunk_size=1, decode_unicode=False):
    """Iterates over the response data.  When stream=True is set on the
    request, this avoids reading the content at once into memory for
    large responses.  The chunk size is the number of bytes it should
    read into memory.  This is not necessarily the length of each item
    returned as decoding can take place.

    chunk_size must be of type int or None. A value of None will
    function differently depending on the value of `stream`.
    stream=True will read data as it arrives in whatever size the
    chunks are received. If stream=False, data is returned as
    a single chunk.

    If decode_unicode is True, content will be decoded using the best
    available encoding based on the response.
    """

mutantmonkey (Contributor) commented

The documentation makes it sound like it should, but unfortunately that's not the case. If you trace things back through urllib3's underlying stream and read functions, through http.client.HTTPResponse.read, you ultimately end up at a call to io.BufferedReader.read, which per the Python docs will block until EOF, so setting chunk_size=None means you will receive no events until EOF.
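
Roughly, the call chain looks like this (a sketch of the layers involved, not exact source):

# With stream=True and chunk_size=None, each layer passes "read
# everything" down to the next one:
#
# requests.Response.iter_content(chunk_size=None)
#   -> urllib3.HTTPResponse.stream(amt=None)
#     -> urllib3.HTTPResponse.read(amt=None)
#       -> http.client.HTTPResponse.read()  # no amt: read until close
#         -> io.BufferedReader.read()       # size omitted: block until EOF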

Count-Count (Author) commented Dec 19, 2019

@mutantmonkey What was the problem with a chunk_size of 1?

mutantmonkey (Contributor) commented Dec 20, 2019

Using a chunk size of one causes unnecessarily high CPU usage because each time a byte is received, it has to be processed by the Python code instead of just being added to a buffer and processed all at once. This library used to do that, but 6820dc8 changed that behavior.

In any case, I believe I may have a fix that will work, but I need an endpoint with gzip enabled to test on. If you happen to have one handy, please share it, otherwise I can try to set something up but it will take another couple of days.
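
For anyone wanting to test locally, here is a minimal sketch of a gzip-encoded SSE endpoint; the handler class, port, and event payloads are invented for illustration, not part of this thread:

import time
import zlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class GzipSSEHandler(BaseHTTPRequestHandler):
    # Chunked transfer encoding requires HTTP/1.1.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        comp = zlib.compressobj(wbits=31)  # wbits=31 -> gzip container
        n = 0
        while True:
            event = ("data: event %d\n\n" % n).encode()
            # Z_SYNC_FLUSH makes the compressor emit bytes now instead of
            # buffering them, so each event reaches the client promptly.
            payload = comp.compress(event) + comp.flush(zlib.Z_SYNC_FLUSH)
            self._write_chunk(payload)
            n += 1
            time.sleep(1)

    def _write_chunk(self, data):
        # HTTP/1.1 chunked framing: hex length, CRLF, data, CRLF.
        self.wfile.write(("%X\r\n" % len(data)).encode())
        self.wfile.write(data + b"\r\n")
        self.wfile.flush()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), GzipSSEHandler).serve_forever()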

TheSandDoctor (Collaborator) commented

@Count-Count @mutantmonkey
