ngclient: Prevent automatic decoding of gzip files #2048
Conversation
…nks. This prevents requests/urllib3 from automatically decoding files with Content-Encoding gzip or deflate. Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
3ff17c3 to c6857e0 (Compare)
ReadTimeoutError is actually a urllib3 exception. We need it because requests Response.raw exposes the urllib3 Response, from which we call the read() method. The exception is imported from requests.adapters, instead of directly from urllib3.exceptions, to prevent an explicit dependency on urllib3. This is deliberate, as we only need the urllib3 exception because requests uses urllib3 (if the requests dependency would ever change, so should we). Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
The custom requests handler enables us to add headers to the response, by specifying them in the request. This can be used, for example, to get a response with a "Content-Encoding" header. Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
This uses RequestsFetcher to get a response with content-encoding gzip, and ensures that the fetched data is not decoded. Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
Pull Request Test Coverage Report for Build 2691750026

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details
💛 - Coveralls
Thanks, @dennisvang, for your efforts!
I really liked how you approached the problem: you gave a detailed description, suggested a solution with links to support it, and added a regression test to make sure it doesn't occur again!
Maybe it's obvious, but I want to say it out loud and confirm I understand it: you created the CustomHTTPRequestHandler helper class in utils.py in order to support adding the Content-Encoding header you need for the test.
I am not sure where exactly you invoke CustomHTTPRequestHandler. I am sure I am missing something and wanted to clarify it.
Also, there are linting errors shown in the CI run; I would appreciate it if you fixed them.
PS: Sorry that it took us so long to give you feedback on your PR.
I will not be coding or reviewing the upcoming week, so if you update the PR or comment on something I will see it after a week or more.
```python
# Globals
logger = logging.getLogger(__name__)
```
I am not sure we need that.
Sorry, I got a PEP8 warning there: "expecting 2 blank lines"
```python
# Imports
import requests
from requests.models import ReadTimeoutError  # todo: import directly from urllib3?
```
I agree with you that it makes sense to import it directly from urllib3, given that requests.models just re-exports it.
I'll fix that.
I prefer the original way (importing from requests): that way our requests_fetcher module only depends on the requests interface and it doesn't matter what underlying connectivity API requests uses. With the change to importing from urllib3 directly, we now have an explicit dependency on urllib3.
I'd like to hear from a more active maintainer than me on this, though – @lukpueh or @jku any thoughts?
@joshuagl That was my original feeling also, as mentioned in eeba7c7
On the other hand, my fix relies on requests.Response.raw, which is in fact a urllib3.response.HTTPResponse. So it actually relies on (part of) the urllib3 response interface as well, viz. the read() method.
I'd also import the error from requests for said reasons.
And when using requests.Response.raw, we still rely on a documented requests API, and we should expect that they will make an effort not to break it.
tests/test_fetcher_ng.py
Outdated
```diff
 import requests
-from requests.models import ReadTimeoutError  # actually from urllib3
+from requests.adapters import ReadTimeoutError  # this is a urllib3 exception
```
We already depend on urllib3:

python-tuf/requirements-pinned.txt, line 10 in c12be48:

```
urllib3==1.26.9  # via requests
```

so I don't think you should worry about this.
Appreciate the thoughtfulness. 👍
tests/utils.py
Outdated
```python
# May be used in a request to add headers to the response that is returned by
# the customized http request handler.
REQUEST_RESPONSE_HEADERS = "Request-Response-Headers"
```
I saw that later on, you set REQUEST_RESPONSE_HEADERS in your test in test_fetcher.py in order to test the specific situation.
Instead of setting a global constant during the test, could you create an __init__ for CustomHTTPRequestHandler and store the information you need there in an instance variable?
Then you could use that variable in CustomHTTPRequestHandler.end_headers() instead of the global REQUEST_RESPONSE_HEADERS.
Perhaps I'm misinterpreting your question, but I don't think passing the header into CustomHTTPRequestHandler.__init__ would be an option, as the CustomHTTPRequestHandler is instantiated by socketserver.TCPServer (see python source):

python-tuf/tests/simple_server.py, line 16 in 5a572d8:

```python
httpd = socketserver.TCPServer(("localhost", 0), CustomHTTPRequestHandler)
```

I could make REQUEST_RESPONSE_HEADERS a class attribute of CustomHTTPRequestHandler, instead of a global. Would that be preferable?
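A minimal sketch of the class-attribute approach (assumed names; this uses a bare BaseHTTPRequestHandler stand-in rather than the real SimpleHTTPRequestHandler-based handler, since TCPServer instantiates the handler class itself and leaves no room to pass constructor arguments):

```python
import json
import socketserver
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler


class CustomHTTPRequestHandler(BaseHTTPRequestHandler):
    # Class attribute instead of a module-level global: tests can override it,
    # and the handler can read it even though TCPServer instantiates the class.
    request_response_headers = "Request-Response-Headers"

    def do_GET(self):
        body = b"dummy"
        self.send_response(200)
        # Copy any headers requested by the client into the response.
        extra = self.headers.get(self.request_response_headers)
        if extra:
            for name, value in json.loads(extra).items():
                self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence request logging during tests


# The handler *class* (not an instance) is passed to TCPServer:
httpd = socketserver.TCPServer(("localhost", 0), CustomHTTPRequestHandler)
threading.Thread(target=httpd.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://localhost:{httpd.server_address[1]}/",
    headers={"Request-Response-Headers": json.dumps({"Content-Encoding": "gzip"})},
)
with urllib.request.urlopen(req) as resp:
    encoding = resp.headers["Content-Encoding"]
print(encoding)
httpd.shutdown()
```

urllib.request is used for the demo client because it does not transparently decode Content-Encoding, so the echoed header is observable without side effects.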
```python
# Serve dummy file with "Content-Encoding: gzip" header
content_encoding_header = json.dumps({"Content-Encoding": "gzip"})
headers = {utils.REQUEST_RESPONSE_HEADERS: content_encoding_header}
get_with_headers = partialmethod(requests.Session.get, headers=headers)
```
As I am new to functools and partialmethod: what is the benefit of creating a partialmethod object instead of just calling requests.Session.get directly in line 123 with the headers you want?
Calling requests.Session.get directly with the headers argument would certainly simplify things, but, in this case, I wanted to test RequestsFetcher.fetch. This implies that requests.Session.get is called indirectly, viz. through RequestsFetcher._fetch.
So, instead of calling requests.Session.get directly, I patch it to make sure it is called with the desired headers argument. Basically, I use functools.partialmethod to supply a default value for headers. Now, when RequestsFetcher._fetch calls get, the headers arg is automatically set.
I am aware patch also accepts additional kwargs, but I believe these are only applied if patch creates a Mock object.
If there is a simpler way to do this, I am all for it. :)
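The mechanism can be sketched with stdlib stand-ins (the Session class and fetch function below are hypothetical, no network involved): functools.partialmethod pre-fills the headers argument, and mock.patch.object swaps the patched method in so that an indirect caller picks up the default.

```python
import functools
from unittest import mock


class Session:
    """Stand-in for requests.Session, for illustration only."""

    def get(self, url, headers=None):
        return {"url": url, "headers": headers or {}}


def fetch(url):
    # Mimics RequestsFetcher._fetch: calls Session.get indirectly, without
    # exposing a headers argument to the caller.
    return Session().get(url)


custom_headers = {"Content-Encoding": "gzip"}
# partialmethod supplies a default for the headers parameter; because it is a
# descriptor, it still binds to the instance like a normal method.
get_with_headers = functools.partialmethod(Session.get, headers=custom_headers)

with mock.patch.object(Session, "get", get_with_headers):
    result = fetch("http://localhost/testfile")

print(result["headers"])  # {'Content-Encoding': 'gzip'}
```

The indirect call through fetch() now carries the custom header even though fetch() itself knows nothing about headers, which is exactly what the test needs.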
```python
# so we can expect an error to be raised if requests/urllib3 tries to
# decode the file (urllib3 uses zlib for this).
```
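The failure mode can be illustrated with the stdlib alone (assumed dummy payload): a gzip-mode zlib decompressor rejects bytes that are not actually gzipped, which is what happens when a server mislabels plain data with Content-Encoding: gzip.

```python
import zlib

# Plain bytes falsely labelled "Content-Encoding: gzip"; a gzip-mode zlib
# decompressor fails on the missing gzip magic bytes.
raw_bytes = b"just some uncompressed dummy data"
try:
    zlib.decompress(raw_bytes, 16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
    decode_failed = False
except zlib.error:
    decode_failed = True

print(decode_failed)
```
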
Have you tried running this script before replacing response.iter_content with response.raw?
Did requests raise an error with this test then?
Yes, I did try that, and I can confirm that the error was raised.
@MVrachev Thanks for the review. :)
Yes, that is correct. This seemed like the least intrusive way to get a response with the desired header. It does not affect the existing tests, as far as I know. A much simpler alternative would be to bypass the test server and just send a request to e.g. https://httpbin.org/gzip. However, that's an external dependency, and would require an actual internet connection. Another alternative, perhaps, would be to patch …
The handler class is passed into the TCP server (python-tuf/tests/simple_server.py, line 16 in 5a572d8), and this is used in the test setup (line 185 in c12be48).
5a572d8 to 520112b (Compare)
```diff
 try:
-    for data in response.iter_content(self.chunk_size):
+    while True:
```
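The new loop reads fixed-size chunks from the raw (urllib3) response with decode_content=False until EOF. A self-contained sketch of that pattern, with a hypothetical FakeRaw stand-in for the urllib3 response:

```python
import io


class FakeRaw:
    """Stand-in for urllib3's HTTPResponse, for illustration only."""

    def __init__(self, payload: bytes):
        self._buffer = io.BytesIO(payload)

    def read(self, amt: int, decode_content: bool = False) -> bytes:
        # The real urllib3 read() would decode gzip/deflate content unless
        # decode_content=False is passed; this fake just returns raw bytes.
        return self._buffer.read(amt)


def chunks(raw, chunk_size=4):
    # Pattern of the replacement while-loop in RequestsFetcher._chunks():
    # read until an empty bytes object signals EOF.
    while True:
        data = raw.read(chunk_size, decode_content=False)
        if not data:
            break
        yield data


result = list(chunks(FakeRaw(b"abcdefghij")))
print(result)  # [b'abcd', b'efgh', b'ij']
```
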
Alternatively, we could perhaps replace this while loop with the following:

```python
yield from response.raw.stream(amt=self.chunk_size, decode_content=False)
```

See urllib3 source: https://github.com/urllib3/urllib3/blob/1f2f6a5817fd420d65e280e671d2b579791aef09/src/urllib3/response.py#L748
@lukpueh You are right, the … However, I do wonder: have the changes introduced in b8759a9 ever been tested with actual compressed files? Issue #2047 appears to suggest otherwise. Or perhaps it was tested with metadata files but not with target files?
I wonder too. :) At least #1774 does not seem to add any tests with compressed target files. Let me look; I would think that @jku tested this.
Perhaps it was only intended for metadata, judging from @jku's #1251 (comment):
I suppose #2047 does not arise for metadata (although I haven't tested this), because the number of downloaded bytes is compared with the maximum allowed metadata file size (python-tuf/tuf/ngclient/updater.py, lines 312 to 314 in e07fa27), unless an optional file … (python-tuf/tuf/ngclient/updater.py, lines 389 to 394 in e07fa27).
Looks like @jku's "Minimal TUF Playground Client" (see jku/repository-playground#16) is able to successfully download a …

```shell
git clone git@github.com:jku/repository-playground.git -b tuf-minimal
cd repository-playground/playground/client/
pip install tuf click
python playground_client.py download tuf-spec/tarball
# Downloads https://jku.github.io/playground-tuf-minimal/repository/content/tuf-spec/tarball/1.0.30/648dc8c3be0a7de08d654e38556349580926b3e024e22b4f2828eab8895b71c5.specification-1.0.30.tar.gz
```
In the above example the server serves metadata as … See debugger output for details:

```
> /Users/lukp/.pyenv/versions/min-tuf/lib/python3.8/site-packages/tuf/ngclient/_internal/requests_fetcher.py(92)_fetch()
-> return self._chunks(response)
(Pdb) session.headers
{'User-Agent': 'tuf/1.1.0 python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
(Pdb) url
'https://jku.github.io/playground-tuf-minimal/repository//metadata/timestamp.json'
(Pdb) response.headers
{'Connection': 'keep-alive', 'Content-Length': '328', 'Server': 'GitHub.com', 'Content-Type': 'application/json; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Thu, 30 Jun 2022 06:16:55 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"62bd3fd7-1dc"', 'expires': 'Tue, 19 Jul 2022 12:51:55 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '3358:A2C4:8A7C6E:9658CB:62D6A693', 'Accept-Ranges': 'bytes', 'Date': 'Tue, 19 Jul 2022 14:01:26 GMT', 'Via': '1.1 varnish', 'Age': '373', 'X-Served-By': 'cache-muc13969-MUC', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1658239286.277420,VS0,VE0', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '70fb24cb5491e50d1eb5ff8f12a09d4468b7f150'}
(Pdb) c
> /Users/lukp/.pyenv/versions/min-tuf/lib/python3.8/site-packages/tuf/ngclient/_internal/requests_fetcher.py(92)_fetch()
-> return self._chunks(response)
(Pdb) response.headers
{'Connection': 'keep-alive', 'Content-Length': '49485', 'Server': 'GitHub.com', 'Content-Type': 'application/gzip', 'permissions-policy': 'interest-cohort=()', 'x-origin-cache': 'HIT', 'Last-Modified': 'Thu, 30 Jun 2022 06:16:55 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': '"62bd3fd7-c14d"', 'expires': 'Tue, 19 Jul 2022 12:52:17 GMT', 'Cache-Control': 'max-age=600', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'EC3A:237B:BA3A:C458:62D6A6A9', 'Accept-Ranges': 'bytes', 'Date': 'Tue, 19 Jul 2022 14:02:08 GMT', 'Via': '1.1 varnish', 'Age': '386', 'X-Served-By': 'cache-muc13969-MUC', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1658239329.841205,VS0,VE1', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '7dee8b09d77bb81445bc41c771514782e89960d8'}
```
@lukpueh You are right, that sounds like what is described in rfc7231 and here: …
I think it is. Your effort is still appreciated and I apologize for not having picked up your issue earlier! (also 💯 for verbose commit messages)
@lukpueh No problem. I should have done a much more thorough search of old issues. #1774 (comment) does indeed describe what is going wrong on my side. For what it's worth: #2047 occurred on my side when downloading a …
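As a stdlib-only illustration of why transparent decoding breaks the client's verification (assumed payload): TUF metadata records the hash and length of the compressed bytes stored on the repository, but a transparently gunzipping HTTP stack hands the client different bytes.

```python
import gzip
import hashlib

# What is stored (and hashed) on the repository: the compressed target file.
stored = gzip.compress(b"target file contents")
expected_sha256 = hashlib.sha256(stored).hexdigest()

# If the HTTP stack honours "Content-Encoding: gzip" and decompresses on the
# fly, the client observes different bytes, so hash (and length) checks fail.
received = gzip.decompress(stored)
observed_sha256 = hashlib.sha256(received).hexdigest()

print(observed_sha256 != expected_sha256)  # True: the hashes no longer match
```
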
Good to know. Django folks are pretty explicit about not wanting to improve their dev web server. But the AWS S3 bucket seems worth troubleshooting. Out of curiosity, do you have that setup open-sourced?
Either way, I'd say we close this PR...
TLDR: It looks like the issue is caused by …

Details

The test setup uses Django's … As it turns out, …

```python
_type, encoding = mimetypes.guess_type(name)
```

The resulting … The … And that's where #2047 arises.

Note

I also tried creating a fresh S3 bucket with default settings, and uploading a …
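The mimetypes behaviour can be reproduced directly (stdlib, hypothetical filename): guess_type() reports gzip as the *encoding* of a .tar.gz file, which a file server may then surface as a Content-Encoding response header.

```python
import mimetypes

# For a ".tar.gz" name, guess_type() splits content type and encoding:
# the type describes the inner file, the encoding the outer compression.
content_type, encoding = mimetypes.guess_type("example-target.tar.gz")
print(content_type, encoding)  # application/x-tar gzip
```

A server that copies this encoding into Content-Encoding effectively tells clients the compression is a transfer detail to be undone, rather than part of the target file itself.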
And, of course, it turns out there's already an open issue: jschneier/django-storages#917

UPDATE: This has been fixed in django-storages 1.14 (see jschneier/django-storages#1281 for details)
Fixes #2047

Description of the changes being introduced by the pull request:

- Replace response.iter_content by response.raw in RequestsFetcher._chunks(). This prevents requests/urllib3 from automatically decoding files with Content-Encoding headers (gzip or deflate).
- Add CustomHTTPRequestHandler to tests.utils, and use this in simple_server.py (instead of SimpleHTTPRequestHandler). The custom request handler adds the ability to specify additional headers in an HTTP request, which are then added to the corresponding HTTP response. This is used to test the "Content-Encoding" issue.

Some open questions:

- Should we only catch ReadTimeoutError in _chunks(), or should we catch e.g. IncompleteRead too?

Please verify and check that the pull request fulfills the following requirements: