
[core] Support decoding multiple content encodings #7142

Merged

Conversation

coletdjnz
Member

@coletdjnz coletdjnz commented May 27, 2023

Authored by: coletdjnz


Description of your pull request and other information


Inspired by a comment by @dirkf noting that urllib does not support decoding multiple content encodings, as defined in RFC 9110. I decided to implement this as an exercise, as it was fairly simple and also let me fix the issue below.

This also fixes the errors that occur in the odd case where the web server returns a brotli-encoded response but the user does not have brotli support installed (#4748, #5855). Instead of erroring, it now returns the raw brotli-encoded content, in line with the behaviour for the other encodings.
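To make the RFC 9110 rule concrete: `Content-Encoding` lists codings in the order they were applied, so a decoder must apply them in reverse. A minimal sketch (this is an illustrative helper, not the PR's actual code, and it only wires up the stdlib codecs):

```python
import gzip
import zlib

def decode_content(data, content_encoding):
    # Codings are listed in application order, so decode right-to-left.
    for encoding in reversed([e.strip() for e in content_encoding.split(',')]):
        if encoding == 'gzip':
            data = gzip.decompress(data)
        elif encoding == 'deflate':
            data = zlib.decompress(data)
        elif encoding == 'identity':
            pass
        else:
            # Unsupported coding (e.g. br without the brotli module):
            # give up and return the raw bytes, mirroring the PR's fallback.
            break
    return data

payload = gzip.compress(zlib.compress(b'hello'))  # deflate applied first, then gzip
print(decode_content(payload, 'deflate, gzip'))  # b'hello'
```

The `break` means a response with a mix of supported and unsupported codings comes back partially decoded, which matches the "fail soft" behaviour described above.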

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at 3b62ee5

Summary

🗜️🧪🛠️

Add support for different content-encodings in HTTP responses. Improve the YoutubeDLHandler class to decompress gzip files with junk data and handle multiple and unsupported encodings. Update test_http.py to test the new functionality.

Oh we are the coders of the sea, and we work on YoutubeDL
We handle the responses that we get, whether they're compressed or not
We use zlib and brotli to decompress what we fetch
And we heave away on the count of three, to raise our code quality

Walkthrough

  • Decompress HTTP responses with different content-encodings using zlib and brotli modules (link, link)
  • Import zlib and brotli modules from yt_dlp.dependencies in test_http.py (link)
  • Add handler for /content-encoding path in HTTPTestRequestHandler class to mock web server responses with different content-encodings (link)
  • Add test cases to TestHTTP class to test YoutubeDL class's ability to handle different content-encodings (link)

@coletdjnz
Member Author

I will note that if brotli is not available and the server returns multiple content encodings (such as gzip, br), decoding will fail fatally. It's extremely unlikely we'll ever see this, so no need to handle it imo.

Resolved review threads on test/test_http.py and yt_dlp/utils/_utils.py
coletdjnz and others added 2 commits May 27, 2023 09:35
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Comment on lines 1369 to 1377
except OSError as original_ioerror:
    # There may be junk at the end of the file
    # See http://stackoverflow.com/q/4928560/35070 for details
    for i in range(1, 1024):
        try:
            gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb')
            return gz.read()
        except OSError:
            continue
Member

@pukkandan pukkandan May 27, 2023

Could you introspect the error object and see if we have access to the required offset? I don't like the idea of re-reading in a loop if it can be helped.

Member

Sadly that doesn't really work:

>>> import gzip
>>> data = gzip.compress(b"01234567890123456789asdfasdfasdf")
>>> data += b"11111"
>>> gzip.decompress(data)
gzip.BadGzipFile: Not a gzipped file (b'11')

Member Author

Hmm.

This raises BadGzipFile (a subclass of OSError since 3.8) with the message Not a gzipped file (b'tr'). No other useful attributes are set.

tr in this case is the start of the trailing garbage in the raw response content: b'\x1f\x8b\x08\x00u\xd1qd\x02\xff\xb3\xc9(\xc9\xcd\xb1\xb3)\xcbLI\xcdW(.J\xb6U\xd2\x07\xb2\xf5r\x0bL\x94\x14\xf4\xedl\xf4\xc1\xd2\x00R)\x10\x96%\x00\x00\x00trailing garbage'

Member

@Grub4K Grub4K May 27, 2023

Alternative idea: zlib.decompress(data, wbits=31), it seems to ignore trailing data?

>>> zlib.decompress(gzip.compress(b"0123456789") + b"1234123412342134", wbits=31)
b'0123456789'

Since the response should only contain one payload, the docs hint that calling zlib directly is faster, and since it does not try to decode the trailing bytes as another payload, it does not fail
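The claim above is easy to verify: `zlib.decompress` with `wbits=31` decodes exactly one gzip member and ignores whatever follows, while `gzip.decompress` chokes on the same input. A quick check:

```python
import gzip
import zlib

# zlib with wbits=31 (gzip framing): trailing junk of any shape is ignored.
for junk in (b'', b'x', b'trailing garbage', b'\x00' * 100):
    data = gzip.compress(b'0123456789') + junk
    assert zlib.decompress(data, wbits=31) == b'0123456789'

# gzip.decompress, by contrast, tries to parse the junk as another member.
try:
    gzip.decompress(gzip.compress(b'0123456789') + b'trailing garbage')
except gzip.BadGzipFile as e:
    print(e)  # Not a gzipped file (b'tr')
```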

Member Author

Alternative idea: zlib.decompress(data, wbits=31), it seems to ignore trailing data?

could be just a coincidence

Member

The original gzip version tries to decode the "trailing data" as another payload, as specified by the error message (Not a gzipped file). zlib instead only decodes one payload, which is why it shouldn't fail for trailing data. I've tried with various different gzipped payloads and various sizes of trailing data.

Member Author

@coletdjnz coletdjnz May 27, 2023

Ah just saw your edit. Hmm.

Having a look at what urllib3 does, they use zlib too, but with some more complex processing:

https://github.com/urllib3/urllib3/blob/main/src/urllib3/response.py#L119-L139
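For context, the "more complex processing" referred to is a streaming, multi-member approach. A loose sketch of that idea (this is not urllib3's actual code; the function name is made up): a `zlib.decompressobj` is re-created whenever one gzip member ends, so concatenated members are all decoded while trailing garbage is silently left behind.

```python
import gzip
import zlib

def gzip_decode_members(data):
    out = []
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + 15 selects gzip framing
    while data:
        try:
            out.append(decoder.decompress(data))
        except zlib.error:
            break  # remaining bytes are not a gzip member; treat them as junk
        data = decoder.unused_data  # bytes left over after the member ended
        if data:
            decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return b''.join(out)

print(gzip_decode_members(gzip.compress(b'a') + gzip.compress(b'b') + b'junk'))  # b'ab'
```

This handles both the multi-member case (which plain `zlib.decompress(data, 31)` silently drops) and the trailing-junk case (which `gzip.decompress` fatally rejects), at the cost of extra bookkeeping.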

Member Author

@coletdjnz coletdjnz May 27, 2023

Might be better to leave this to another PR/commit if it's something you want to look at, as it will likely require its own set of reviews and testing.

Member

Tests should maybe be expanded to cover more trailing data or something, but otherwise, wouldn't it just be

    @staticmethod
    def gz(data):
-        gz = gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb')
-        try:
-            return gz.read()
-        except OSError as original_oserror:
-            # There may be junk add the end of the file
-            # See http://stackoverflow.com/q/4928560/35070 for details
-            for i in range(1, 1024):
-                try:
-                    gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb')
-                    return gz.read()
-                except OSError:
-                    continue
-            else:
-                raise original_oserror
+        return zlib.decompress(data, 31)

We already have the whole data as bytes, so it not fitting into memory should not be a problem. If you want, I can open a new PR though
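The suggested replacement from the diff, shown standalone so it can be run directly (the method name `gz` is taken from the snippet; the enclosing class is omitted):

```python
import gzip
import zlib

def gz(data):
    # wbits=31 selects gzip framing: exactly one member is decoded and
    # any junk after it is ignored, which replaces the 1024-step retry loop.
    return zlib.decompress(data, 31)

print(gz(gzip.compress(b'payload') + b'\x00junk\x00'))  # b'payload'
```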

Member Author

Yeah, better as its own PR if you want to. I'll merge this one.

@coletdjnz coletdjnz merged commit daafbf4 into yt-dlp:master May 27, 2023
@coletdjnz coletdjnz deleted the nw-backport/urllib-multiple-encodings branch May 27, 2023 10:40
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024