-
-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Support decoding multiple content encodings #7142
[core] Support decoding multiple content encodings #7142
Conversation
Authored by: coletdjnz
I will note that if brotli is not available and server returns a multi content-encoding (such as |
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
yt_dlp/utils/_utils.py
Outdated
except OSError as original_ioerror: | ||
# There may be junk add the end of the file | ||
# See http://stackoverflow.com/q/4928560/35070 for details | ||
for i in range(1, 1024): | ||
try: | ||
gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb') | ||
return gz.read() | ||
except OSError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you introspect the error object and see if we have access to the required offset? I don't like the idea of re-reading in a loop if it can be helped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly that doesn't really work:
>>> import gzip
>>> data = gzip.compress(b"01234567890123456789asdfasdfasdf")
>>> data += b"11111"
>>> gzip.decompress(d)
gzip.BadGzipFile: Not a gzipped file (b'11')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.
This raises BadGzipFile (subclass of OSError in 3.8) with the message Not a gzipped file (b'tr')
. No other useful attributes are set.
tr
in this case is the start of the trailing garbage in the raw response content: b'\x1f\x8b\x08\x00u\xd1qd\x02\xff\xb3\xc9(\xc9\xcd\xb1\xb3)\xcbLI\xcdW(.J\xb6U\xd2\x07\xb2\xf5r\x0bL\x94\x14\xf4\xedl\xf4\xc1\xd2\x00R)\x10\x96%\x00\x00\x00trailing garbage'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative idea: zlib.decompress(data, wbits=31)
, it seems to ignore trailing data?
>>> zlib.decompress(gzip.compress(b"0123456789") + b"1234123412342134", wbits=31)
b'0123456789'
Since it should only contain one data, the docs hint at calling zlib
directly being faster, and since its not trying to decode trailing things as another payload, it does not fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative idea:
zlib.decompress(data, wbits=31)
, it seems to ignore trailing data?
could be just a coincidence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original gzip
version tries to decode the "trailing data" as another payload, as specified by the error message (Not a gzipped file). zlib
instead only decodes one payload, which is why it shouldn't fail for trailing data. I've tried with various different gzipped payloads and various sizes of trailing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah just saw your edit. Hmm.
Having a look at what urllib3
is doing, they are using zlib
too. But with some more complex processing:
https://github.com/urllib3/urllib3/blob/main/src/urllib3/response.py#L119-L139:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better to leave this to another PR/commit if its something you may want to look at.
as it will likely require its own set of reviews and testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests should maybe be expanded to have more trailing data or sth, but otherwise, wouldn't it be just
@staticmethod
def gz(data):
- gz = gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb')
- try:
- return gz.read()
- except OSError as original_oserror:
- # There may be junk add the end of the file
- # See http://stackoverflow.com/q/4928560/35070 for details
- for i in range(1, 1024):
- try:
- gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb')
- return gz.read()
- except OSError:
- continue
- else:
- raise original_oserror
+ return zlib.decompress(data, 31)
We already have the whole data as bytes
, so not fitting into memory should not be a problem. If you want I can open a new PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah better as its own PR if you want to. I'll merge this one
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Authored by: coletdjnz
Authored by: coletdjnz
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
ADD DESCRIPTION HERE
Inspired by a comment by @dirkf on that urllib does not have support for multi content decoding, as defined in RFC9110. Decided to implement this as an exercise as it was fairly simple while needing to fix the below.
This also fixes the errors that occur in the odd case the webserver returns a brotli encoded response when the user does not have brotli support installed (#4748, #5855). Instead, it will return the raw brotli encoded content in line with the behaviour of the others.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?
Copilot Summary
🤖 Generated by Copilot at 3b62ee5
Summary
🗜️🧪🛠️
Add support for different content-encodings in HTTP responses. Improve the
YoutubeDLHandler
class to decompress gzip files with junk data and handle multiple and unsupported encodings. Updatetest_http.py
to test the new functionality.Walkthrough
zlib
andbrotli
modules (link, link)zlib
andbrotli
modules fromyt_dlp.dependencies
intest_http.py
(link)/content-encoding
path inHTTPTestRequestHandler
class to mock web server responses with different content-encodings (link)TestHTTP
class to testYoutubeDL
class's ability to handle different content-encodings (link)