Conversation

@dennisvang dennisvang commented Jul 5, 2022

Fixes #2047

Description of the changes being introduced by the pull request:

  • Replace response.iter_content by response.raw in RequestsFetcher._chunks() (see the sketch after this list). This prevents requests/urllib3 from automatically decoding files served with a Content-Encoding header (gzip or deflate).
  • Add a CustomHTTPRequestHandler to tests.utils, and use this in simple_server.py (instead of SimpleHTTPRequestHandler). The custom request handler adds the ability to specify additional headers in an HTTP request, which are then added to the corresponding HTTP response. This is used to test the "Content-Encoding" issue.
  • Add a regression test for issue 2047 which verifies that files served with a "Content-Encoding: gzip" header are not decoded.
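
For illustration, a minimal sketch of the reworked _chunks loop. The surrounding RequestsFetcher context is assumed; the exceptions.SlowRetrievalError mapping and the exception handling shown here are assumptions based on the open question below, not a verbatim copy of the diff.

# Sketch only: read the undecoded body via the underlying urllib3 response.
from requests.adapters import ReadTimeoutError  # a urllib3 exception re-exported by requests

def _chunks(self, response):
    """Yield raw, undecoded chunks of the response body."""
    try:
        while True:
            # response.raw is the urllib3 HTTPResponse; reading it directly
            # bypasses requests' automatic Content-Encoding decoding.
            data = response.raw.read(self.chunk_size)
            if not data:
                break
            yield data
    except ReadTimeoutError as e:
        raise exceptions.SlowRetrievalError from e  # assumed error mapping
    finally:
        response.close()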

Some open questions:

  • Is it sufficient to catch ReadTimeoutError in _chunks(), or should we catch e.g. IncompleteRead too?

Please verify and check that the pull request fulfills the following requirements:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

Replace response.iter_content by response.raw in RequestsFetcher._chunks.

This prevents requests/urllib3 from automatically decoding files with Content-Encoding gzip or deflate.

Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
ReadTimeoutError is actually a urllib3 exception.
We need it because requests' Response.raw exposes the urllib3 response, whose read() method we call.
The exception is imported from requests.adapters, instead of directly from urllib3.exceptions, to prevent an explicit dependency on urllib3.
This is deliberate: we only need the urllib3 exception because requests uses urllib3 (if the requests dependency ever changes, so should this import).

Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
The custom request handler enables us to add headers to the response by specifying them in the request.
This can be used, for example, to get a response with a "Content-Encoding" header.
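
A minimal sketch of the idea, assuming the REQUEST_RESPONSE_HEADERS constant name from tests/utils.py that appears later in this review; the exact implementation may differ.

import http.server
import json

# May be used in a request to add headers to the response.
REQUEST_RESPONSE_HEADERS = "Request-Response-Headers"

class CustomHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
    """SimpleHTTPRequestHandler that echoes requested headers into the response."""

    def end_headers(self):
        # If the request carries a JSON-encoded dict of extra response
        # headers, add them before the response headers are finalized.
        raw = self.headers.get(REQUEST_RESPONSE_HEADERS)
        if raw:
            for name, value in json.loads(raw).items():
                self.send_header(name, value)
        super().end_headers()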

Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
This uses RequestsFetcher to get a response with "Content-Encoding: gzip" and ensures that the fetched data is not decoded.
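
A sketch of the test's shape, assuming fixtures from test_fetcher.py; self.fetcher, self.url, and self.file_contents are assumed names, not taken from the actual diff.

import json
from functools import partialmethod
from unittest.mock import patch

import requests

from tests import utils  # assumed import path for REQUEST_RESPONSE_HEADERS

# Method of the fetcher TestCase (sketch):
def test_content_encoding_not_decoded(self):
    # Ask the test server to add "Content-Encoding: gzip" to its response.
    headers = {utils.REQUEST_RESPONSE_HEADERS: json.dumps({"Content-Encoding": "gzip"})}
    get_with_headers = partialmethod(requests.Session.get, headers=headers)
    # Route the fetcher's Session.get calls through a version that sends
    # the extra request header.
    with patch.object(requests.Session, "get", get_with_headers):
        data = b"".join(self.fetcher.fetch(self.url))
    # The served file is not actually gzipped, so any attempt by
    # requests/urllib3 to zlib-decode it would raise an error; the fetched
    # bytes must equal the raw file contents.
    self.assertEqual(data, self.file_contents)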

Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>
@dennisvang changed the title from "Replace response.iter_content by response.raw in RequestsFetcher._chunks" to "ngclient: Prevent automatic decoding of gzip files" Jul 7, 2022
coveralls commented Jul 13, 2022

Pull Request Test Coverage Report for Build 2691750026

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.004%) to 98.095%

Totals (Coverage Status)
  • Change from base Build 2654873123: +0.004%
  • Covered Lines: 1064
  • Relevant Lines: 1076

💛 - Coveralls

@MVrachev (Collaborator) left a comment

Thanks, @dennisvang, for your efforts!
I really liked how you approached the problem: you gave a detailed description, suggested a solution with links to support it, and added a regression test to make sure the bug doesn't occur again!

Maybe it's obvious, but I want to say it out loud and confirm I understand it: you created the CustomHTTPRequestHandler helper class in utils.py in order to support adding that Content-Encoding header you need for the test.
I am not sure where exactly you invoke CustomHTTPRequestHandler? I am sure I am missing something and wanted to clarify it.

Also, there are linting errors shown in the CI run; I would appreciate it if you fix them.

PS: Sorry that it took us so long to give you feedback on your PR.
I will not be coding or reviewing in the upcoming week, so if you update the PR or comment on something I will see it after a week or more.

# Globals
logger = logging.getLogger(__name__)


Collaborator

I am not sure we need that.

Author

Sorry, I got a PEP8 warning there: "expecting 2 blank lines"


# Imports
import requests
from requests.models import ReadTimeoutError # todo: import directly from urllib3?
Collaborator

I agree with you that it makes sense to directly import it from urllib3, given that requests.models just re-exports it.

Author

I'll fix that.

Member

I prefer it the original way (importing from requests), that way our requests_fetcher module only depends on the requests interface and it doesn't matter what underlying connectivity API requests uses. With the change to importing from urllib3 directly, we now have an explicit dependency on urllib3.

I'd like to hear from a more active maintainer than me on this, though – @lukpueh or @jku any thoughts?

Author

@joshuagl That was my original feeling also, as mentioned in eeba7c7

On the other hand, my fix relies on requests.Response.raw, which is in fact a urllib3.response.HTTPResponse. So it actually relies on (part of) the urllib3 response interface as well, viz. the read() method.

Member

I'd also import the error from requests for said reasons.

And when using requests.Response.raw, we do still rely on documented requests API, and we should expect that they will make an effort to not break it.

Author

@lukpueh and @joshuagl Alright, sounds good. I'll revert to from requests.adapters import ....


import requests
-from requests.models import ReadTimeoutError  # actually from urllib3
+from requests.adapters import ReadTimeoutError  # this is a urllib3 exception
Collaborator

We already depend on urllib3:

urllib3==1.26.9 # via requests

so I don't think you should worry about this.
Appreciate the thoughtfulness. 👍

tests/utils.py Outdated

# May be used in a request to add headers to the response that is returned by
# the customized http request handler.
REQUEST_RESPONSE_HEADERS = "Request-Response-Headers"
Collaborator

I saw that later on you set REQUEST_RESPONSE_HEADERS in your test in test_fetcher.py in order to test the specific situation.
Can you, instead of setting a global constant during the test, create an __init__ for CustomHTTPRequestHandler and store the information you need there in a variable?
Then you can use that variable in CustomHTTPRequestHandler.end_headers() instead of using the global REQUEST_RESPONSE_HEADERS.

Author

Perhaps I'm misinterpreting your question, but I don't think passing the header into CustomHTTPRequestHandler.__init__ would be an option, as the CustomHTTPRequestHandler is instantiated by socketserver.TCPServer (see python source):

httpd = socketserver.TCPServer(("localhost", 0), CustomHTTPRequestHandler)

I could make REQUEST_RESPONSE_HEADERS a class attribute of CustomHTTPRequestHandler, instead of a global. Would that be preferable?

# Serve dummy file with "Content-Encoding: gzip" header
content_encoding_header = json.dumps({"Content-Encoding": "gzip"})
headers = {utils.REQUEST_RESPONSE_HEADERS: content_encoding_header}
get_with_headers = partialmethod(requests.Session.get, headers=headers)
Collaborator

As I am new to functools and partialmethod: what is the bonus of creating a partialmethod object instead of just calling requests.Session.get directly in line 123 with the headers you want?

Author

Calling requests.Session.get directly with the headers argument would certainly simplify things, but, in this case, I wanted to test RequestsFetcher.fetch. This implies that requests.Session.get is called indirectly, viz. through RequestsFetcher._fetch.

So, instead of calling requests.Session.get directly, I patch it to make sure it is called with the desired headers argument. Basically, I use functools.partialmethod to supply a default value for headers. Now, when RequestsFetcher._fetch calls get, the headers arg is automatically set.

I am aware patch also accepts additional kwargs, but I believe these are only applied if patch creates a Mock object.

If there is a simpler way to do this, I am all for it. :)
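
To illustrate the mechanism in isolation: the test patches requests.Session.get itself, but this hypothetical subclass (class and header names are made up) shows the same partialmethod idea.

from functools import partialmethod

import requests

class SessionWithDefaultHeaders(requests.Session):
    # partialmethod binds like a regular method, but with the headers
    # keyword pre-filled, so every call to get() sends these headers.
    get = partialmethod(requests.Session.get, headers={"X-Demo": "1"})

session = SessionWithDefaultHeaders()
# session.get("http://localhost:8000/") now sends the X-Demo header even
# though the caller never passes headers explicitly.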

Comment on lines +119 to +120
# so we can expect an error to be raised if requests/urllib3 tries to
# decode the file (urllib3 uses zlib for this).
Collaborator

Have you tried running this script before replacing response.iter_content with response.raw?
Did requests raise an error with this test then?

@dennisvang (Author) commented Jul 14, 2022

Yes, I did try that, and I can confirm that the error was raised.

dennisvang commented Jul 14, 2022

@MVrachev Thanks for the review. :)

Maybe it's obvious, but I want to say it out loud and confirm I understand it: you created the CustomHTTPRequestHandler helper class in utils.py in order to support adding that Content-Encoding header you need for the test.

Yes, that is correct. This seemed like the least intrusive way to get a response with the desired header. It does not affect the existing tests, as far as I know.

A much simpler alternative would be to bypass the test-server and just send a request to e.g. https://httpbin.org/gzip. However, that's an external dependency, and would require an actual internet connection.

Another alternative, perhaps, would be to patch requests or urllib3 (in the test), in order to modify the response at some appropriate point. However, I have a feeling that could also get quite complex.

I am not sure where exactly you invoke CustomHTTPRequestHandler? I am sure I am missing something and wanted to clarify it.

The handler class is passed into the socketserver.TCPServer here:

httpd = socketserver.TCPServer(("localhost", 0), CustomHTTPRequestHandler)

And this is used in the TestServerProcess class:

server: str = os.path.join(TESTS_DIR, "simple_server.py"),

Signed-off-by: dennisvang <29799340+dennisvang@users.noreply.github.com>

try:
-    for data in response.iter_content(self.chunk_size):
+    while True:
Author

Alternatively, we could perhaps replace this while loop by the following:

yield from response.raw.stream(amt=self.chunk_size, decode_content=False)

See urllib3 source: https://github.com/urllib3/urllib3/blob/1f2f6a5817fd420d65e280e671d2b579791aef09/src/urllib3/response.py#L748

Author

@lukpueh @joshuagl @MVrachev What do you think about this?


lukpueh commented Jul 19, 2022

It looks like this PR reverts #1774. Did you happen to read the rationale behind switching from raw to iter_content presented in b8759a9?


dennisvang commented Jul 19, 2022

It looks like this PR reverts #1774. Did you happen to read the rationale behind switching from raw to iter_content presented in b8759a9?

@lukpueh No, I was not aware of those. Thanks for pointing them out. I'll have a look.


dennisvang commented Jul 19, 2022

@lukpueh You are right, the _chunks part does indeed look like a reversion of b8759a9, almost down to the smallest detail.

However, I do wonder: Have the changes introduced in b8759a9 ever been tested with actual compressed files? Issue #2047 appears to suggest otherwise. Or perhaps it was tested with metadata files but not with target files?


lukpueh commented Jul 19, 2022

However, I do wonder: Has the iter_content implementation introduced in b8759a9 ever been tested with actual compressed files? Issue #2047 appears to suggest otherwise. Or perhaps it was tested with metadata files but not with target files?

I wonder too. :) At least #1774 does not seem to add any tests with compressed target files. Let me look; I would think that @jku tested this.


dennisvang commented Jul 19, 2022

@lukpueh

Perhaps it was only intended for metadata, judging from @jku's #1251 (comment):

...
I lean to the same solution... but I'll still document the alternative though (sechkova already said this but just for emphasis): allow compression only for metadata. ...

I suppose #2047 does not arise for metadata (although I haven't tested this), because the number of downloaded bytes is compared with the maximum allowed metadata file size from config, instead of the actual expected file size, as can be seen e.g. here:

data = self._download_metadata(
Root.type,
self.config.root_max_length,

unless an optional file length is included in the snapshot and/or timestamp metadata, as can be seen here:

length = metainfo.length or self.config.targets_max_length
version = None
if self._trusted_set.root.signed.consistent_snapshot:
version = metainfo.version
data = self._download_metadata(role, length, version)


lukpueh commented Jul 19, 2022

Looks like @jku's "Minimal TUF Playground Client" (see jku/repository-playground#16) is able to successfully download a *.tar.gz target:

git clone git@github.com:jku/repository-playground.git -b tuf-minimal
cd repository-playground/playground/client/
pip install tuf click
python playground_client.py download tuf-spec/tarball
# Downloads https://jku.github.io/playground-tuf-minimal/repository/content/tuf-spec/tarball/1.0.30/648dc8c3be0a7de08d654e38556349580926b3e024e22b4f2828eab8895b71c5.specification-1.0.30.tar.gz


lukpueh commented Jul 19, 2022

In the above example, the server serves the metadata as 'application/json' and the gzipped target file as 'application/gzip', and it only sends the 'Content-Encoding: gzip' header for the metadata, not for the already-gzipped target file. This seems like the correct behavior for a server.

See debugger output for details:

> /Users/lukp/.pyenv/versions/min-tuf/lib/python3.8/site-packages/tuf/ngclient/_internal/requests_fetcher.py(92)_fetch()
-> return self._chunks(response)
(Pdb) session.headers
{'User-Agent': 'tuf/1.1.0 python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
(Pdb) url
'https://jku.github.io/playground-tuf-minimal/repository//metadata/timestamp.json'
(Pdb) response.headers
{'Connection': 'keep-alive', 'Content-Length': '328', 'Server': 'GitHub.com', 'Content-Type': 'application/json; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Thu, 30 Jun 2022 06:16:55 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"62bd3fd7-1dc"', 'expires': 'Tue, 19 Jul 2022 12:51:55 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '3358:A2C4:8A7C6E:9658CB:62D6A693', 'Accept-Ranges': 'bytes', 'Date': 'Tue, 19 Jul 2022 14:01:26 GMT', 'Via': '1.1 varnish', 'Age': '373', 'X-Served-By': 'cache-muc13969-MUC', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1658239286.277420,VS0,VE0', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '70fb24cb5491e50d1eb5ff8f12a09d4468b7f150'}
(Pdb) c
> /Users/lukp/.pyenv/versions/min-tuf/lib/python3.8/site-packages/tuf/ngclient/_internal/requests_fetcher.py(92)_fetch()
-> return self._chunks(response)
(Pdb) response.headers
{'Connection': 'keep-alive', 'Content-Length': '49485', 'Server': 'GitHub.com', 'Content-Type': 'application/gzip', 'permissions-policy': 'interest-cohort=()', 'x-origin-cache': 'HIT', 'Last-Modified': 'Thu, 30 Jun 2022 06:16:55 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': '"62bd3fd7-c14d"', 'expires': 'Tue, 19 Jul 2022 12:52:17 GMT', 'Cache-Control': 'max-age=600', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'EC3A:237B:BA3A:C458:62D6A6A9', 'Accept-Ranges': 'bytes', 'Date': 'Tue, 19 Jul 2022 14:02:08 GMT', 'Via': '1.1 varnish', 'Age': '386', 'X-Served-By': 'cache-muc13969-MUC', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1658239329.841205,VS0,VE1', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '7dee8b09d77bb81445bc41c771514782e89960d8'}

@dennisvang (Author)

In the above example, the server serves the metadata as 'application/json' and the gzipped target file as 'application/gzip', and it only sends the 'Content-Encoding: gzip' header for the metadata, not for the already-gzipped target file. This seems like the correct behavior for a server.

@lukpueh You are right, that sounds like what is described in RFC 7231 and here:

Note that the original media/content type is specified in the Content-Type header, and that the Content-Encoding applies to the representation, or "coded form", of the data. If the original media is encoded in some way (e.g. a zip file) then this information would not be included in the Content-Encoding header.

@dennisvang (Author)

@lukpueh So, that would mean the current implementation is correct, and #2047 is not a bug, but is probably caused by an incorrect server configuration. Which would imply I've been leading everyone on a wild goose chase...


lukpueh commented Jul 19, 2022

@lukpueh So, that would mean the current implementation is correct, and #2047 is not a bug, but is probably caused by an incorrect server configuration.

I think it is. Your effort is still appreciated and I apologize for not having picked up your issue earlier!

(also 💯 for verbose commit messages)


dennisvang commented Jul 19, 2022

@lukpueh No problem. I should have done a much more thorough search of old issues. #1774 (comment) does indeed describe what is going wrong on my side.

For what it's worth:

#2047 occurred on my side when downloading a .tar.gz file from two different servers:

  1. a local Django development server

    It turns out the Django development server guesses both the Content-Type and the Content-Encoding using mimetypes.guess_type, as follows (source):

    content_type, encoding = mimetypes.guess_type(str(fullpath))

    For a .tar.gz file, this results in a combination of Content-Type: application/x-tar and Content-Encoding: gzip, instead of the Content-Type: application/gzip (and no Content-Encoding) that would be expected according to RFC 7231 (see the mimetypes snippet after this list).

  2. direct download from an AWS S3 bucket

    I'm still looking into what is going wrong with the AWS S3 case, but I suspect it has to do with the arguments supplied to PutObject in order to upload the file to the S3 bucket.

    EDIT: see update below
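
For reference, this is the guess both the Django dev server and django-storages (discussed below) rely on; the filename is made up:

import mimetypes

# mimetypes reports a tar content type plus a gzip encoding for .tar.gz;
# Django's dev server maps this tuple directly onto the Content-Type and
# Content-Encoding response headers.
print(mimetypes.guess_type("example.tar.gz"))
# ('application/x-tar', 'gzip')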


lukpueh commented Jul 20, 2022

Good to know. Django folks are pretty explicit about not wanting to improve their dev web server. But the AWS S3 bucket case seems worth troubleshooting. Out of curiosity, do you have that setup open-sourced?


lukpueh commented Jul 20, 2022

Either way, I'd say we close this PR...

@lukpueh lukpueh closed this Jul 20, 2022

dennisvang commented Jul 20, 2022

@lukpueh

... But AWS S3 bucket seems worth troubleshooting. Out of curiosity do you have that setup open-sourced?

TLDR: It looks like the issue is caused by S3Boto3Storage from django-storages.

Details

The test setup uses Django's FileField with S3Boto3Storage from django-storages to upload .tar.gz files to an S3 bucket with default settings.

As it turns out, django-storages uses the same approach as the Django dev server to guess the content type and content encoding for a file, before uploading that file to an S3 bucket. This happens in S3Boto3Storage._get_write_parameters:

_type, encoding = mimetypes.guess_type(name)

The resulting Content-Type and Content-Encoding headers are finally passed along to the S3 API in a PutObject call in s3transfer.upload.PutObjectTask.

The Content-Encoding: gzip header then ends up in the S3 object metadata, so it is included in the response headers when the object is downloaded from S3.

And that's where #2047 arises.
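
A hypothetical reproduction of the problematic upload, using boto3 directly; the bucket and key names are made up:

import boto3

s3 = boto3.client("s3")
with open("bundle.tar.gz", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",  # hypothetical bucket
        Key="bundle.tar.gz",
        Body=f,
        ContentType="application/x-tar",
        # Stored as object metadata and replayed as a response header on
        # every download, which triggers the client-side decoding.
        ContentEncoding="gzip",
    )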

Note

I also tried creating a fresh S3 bucket with default settings, and uploading a .tar.gz file manually, using the AWS S3 web-console (so without django-storages). As expected, in this case the download response headers only contained the Content-Type: application/x-gzip header, without a Content-Encoding header.


dennisvang commented Jul 20, 2022

And, of course, it turns out there's already an open issue: jschneier/django-storages#917

UPDATE:

This has been fixed in django-storages 1.14 (see jschneier/django-storages#1281 for details)
