[core] Support decoding multiple content encodings #7142

coletdjnz · 2023-05-27T09:09:50Z

Authored by: coletdjnz

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

ADD DESCRIPTION HERE

Inspired by a comment by @dirkf on that urllib does not have support for multi content decoding, as defined in RFC9110. Decided to implement this as an exercise as it was fairly simple while needing to fix the below.

This also fixes the errors that occur in the odd case the webserver returns a brotli encoded response when the user does not have brotli support installed (#4748, #5855). Instead, it will return the raw brotli encoded content in line with the behaviour of the others.

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at 3b62ee5`

Summary

🗜️🧪🛠️

Add support for different content-encodings in HTTP responses. Improve the YoutubeDLHandler class to decompress gzip files with junk data and handle multiple and unsupported encodings. Update test_http.py to test the new functionality.

Oh we are the coders of the sea, and we work on YoutubeDL
We handle the responses that we get, whether they're compressed or not
We use zlib and brotli to decompress what we fetch
And we heave away on the count of three, to raise our code quality

Walkthrough

Decompress HTTP responses with different content-encodings using zlib and brotli modules (link, link)
Import zlib and brotli modules from yt_dlp.dependencies in test_http.py (link)
Add handler for /content-encoding path in HTTPTestRequestHandler class to mock web server responses with different content-encodings (link)
Add test cases to TestHTTP class to test YoutubeDL class's ability to handle different content-encodings (link)

Authored by: coletdjnz

coletdjnz · 2023-05-27T09:13:03Z

I will note that if brotli is not available and server returns a multi content-encoding (such as gzip,br), then decoding will fail fatally. Extremely unlikely we'll ever see this so no need to handle it imo.

test/test_http.py

yt_dlp/utils/_utils.py

test/test_http.py

Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

pukkandan · 2023-05-27T09:38:55Z

yt_dlp/utils/_utils.py

+        except OSError as original_ioerror:
+            # There may be junk add the end of the file
+            # See http://stackoverflow.com/q/4928560/35070 for details
+            for i in range(1, 1024):
+                try:
+                    gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb')
+                    return gz.read()
+                except OSError:
+                    continue


Could you introspect the error object and see if we have access to the required offset? I don't like the idea of re-reading in a loop if it can be helped.

Sadly that doesn't really work:

>>> import gzip >>> data = gzip.compress(b"01234567890123456789asdfasdfasdf") >>> data += b"11111" >>> gzip.decompress(d) gzip.BadGzipFile: Not a gzipped file (b'11')

Hmm.

This raises BadGzipFile (subclass of OSError in 3.8) with the message Not a gzipped file (b'tr'). No other useful attributes are set.

tr in this case is the start of the trailing garbage in the raw response content: b'\x1f\x8b\x08\x00u\xd1qd\x02\xff\xb3\xc9(\xc9\xcd\xb1\xb3)\xcbLI\xcdW(.J\xb6U\xd2\x07\xb2\xf5r\x0bL\x94\x14\xf4\xedl\xf4\xc1\xd2\x00R)\x10\x96%\x00\x00\x00trailing garbage'

Alternative idea: zlib.decompress(data, wbits=31), it seems to ignore trailing data?

>>> zlib.decompress(gzip.compress(b"0123456789") + b"1234123412342134", wbits=31) b'0123456789'

Since it should only contain one data, the docs hint at calling zlib directly being faster, and since its not trying to decode trailing things as another payload, it does not fail

Alternative idea: zlib.decompress(data, wbits=31), it seems to ignore trailing data?

could be just a coincidence

The original gzip version tries to decode the "trailing data" as another payload, as specified by the error message (Not a gzipped file). zlib instead only decodes one payload, which is why it shouldn't fail for trailing data. I've tried with various different gzipped payloads and various sizes of trailing data.

Ah just saw your edit. Hmm.

Having a look at what urllib3 is doing, they are using zlib too. But with some more complex processing:

https://github.com/urllib3/urllib3/blob/main/src/urllib3/response.py#L119-L139:

might be better to leave this to another PR/commit if its something you may want to look at.

as it will likely require its own set of reviews and testing

Tests should maybe be expanded to have more trailing data or sth, but otherwise, wouldn't it be just

@staticmethod def gz(data): - gz = gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') - try: - return gz.read() - except OSError as original_oserror: - # There may be junk add the end of the file - # See http://stackoverflow.com/q/4928560/35070 for details - for i in range(1, 1024): - try: - gz = gzip.GzipFile(fileobj=io.BytesIO(data[:-i]), mode='rb') - return gz.read() - except OSError: - continue - else: - raise original_oserror + return zlib.decompress(data, 31)

We already have the whole data as bytes, so not fitting into memory should not be a problem. If you want I can open a new PR though

yeah better as its own PR if you want to. I'll merge this one

Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

Authored by: coletdjnz

coletdjnz added 2 commits May 27, 2023 20:46

[core] Support multiple HTTP content encodings

7e3f3c5

Authored by: coletdjnz

cleanup

3b62ee5

fix resource warning?

e541c57

Grub4K approved these changes May 27, 2023

View reviewed changes

test/test_http.py Show resolved Hide resolved

yt_dlp/utils/_utils.py Outdated Show resolved Hide resolved

test/test_http.py Outdated Show resolved Hide resolved

coletdjnz and others added 2 commits May 27, 2023 09:35

Update test/test_http.py

bf3c28c

Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

fix

fc37d27

pukkandan reviewed May 27, 2023

View reviewed changes

pukkandan assigned coletdjnz May 27, 2023

coletdjnz and others added 4 commits May 27, 2023 21:53

minor fixes from review

844ba6a

Update test/test_http.py

dfbb45c

Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

use skipUnless decorator for brotli test

70edc26

strip whitespace

c4e6ef5

coletdjnz merged commit daafbf4 into yt-dlp:master May 27, 2023

coletdjnz deleted the nw-backport/urllib-multiple-encodings branch May 27, 2023 10:40

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[core] Support decoding multiple content encodings (yt-dlp#7142)

4d3fa62

Authored by: coletdjnz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] Support decoding multiple content encodings #7142

[core] Support decoding multiple content encodings #7142

coletdjnz commented May 27, 2023 •

edited by ghost

Loading

coletdjnz commented May 27, 2023

pukkandan May 27, 2023 •

edited

Loading

Grub4K May 27, 2023

coletdjnz May 27, 2023

Grub4K May 27, 2023 •

edited

Loading

coletdjnz May 27, 2023

Grub4K May 27, 2023

coletdjnz May 27, 2023 •

edited

Loading

coletdjnz May 27, 2023 •

edited

Loading

Grub4K May 27, 2023

coletdjnz May 27, 2023

[core] Support decoding multiple content encodings #7142

[core] Support decoding multiple content encodings #7142

Conversation

coletdjnz commented May 27, 2023 • edited by ghost Loading

Description of your pull request and other information

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

What is the purpose of your pull request?

🤖 Generated by Copilot at 3b62ee5

Summary

Walkthrough

coletdjnz commented May 27, 2023

pukkandan May 27, 2023 • edited Loading

Choose a reason for hiding this comment

Grub4K May 27, 2023

Choose a reason for hiding this comment

coletdjnz May 27, 2023

Choose a reason for hiding this comment

Grub4K May 27, 2023 • edited Loading

Choose a reason for hiding this comment

coletdjnz May 27, 2023

Choose a reason for hiding this comment

Grub4K May 27, 2023

Choose a reason for hiding this comment

coletdjnz May 27, 2023 • edited Loading

Choose a reason for hiding this comment

coletdjnz May 27, 2023 • edited Loading

Choose a reason for hiding this comment

Grub4K May 27, 2023

Choose a reason for hiding this comment

coletdjnz May 27, 2023

Choose a reason for hiding this comment

coletdjnz commented May 27, 2023 •

edited by ghost

Loading

`🤖 Generated by Copilot at 3b62ee5`

pukkandan May 27, 2023 •

edited

Loading

Grub4K May 27, 2023 •

edited

Loading

coletdjnz May 27, 2023 •

edited

Loading

coletdjnz May 27, 2023 •

edited

Loading