Overhaul of MediaDownloader: simpler, handles Content-Encoding: gzip #723

Merged
merged 2 commits into from
Apr 20, 2016
Conversation

mmdriley
Contributor

Today, MediaDownloader downloads content in chunks by sending one HTTP
request per chunk. Each request has an appropriate `Range` header set.
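As a hedged illustration (Python, not the library's C# code), the per-chunk approach amounts to computing one inclusive byte range per request:

```python
def range_header(offset, chunk_size):
    # HTTP byte ranges are inclusive on both ends:
    # "bytes=0-9" asks for the first 10 bytes.
    return "bytes={}-{}".format(offset, offset + chunk_size - 1)

# One header per chunk of a 25-byte resource with 10-byte chunks.
# Servers clamp the final range to the actual resource length.
headers = [range_header(offset, 10) for offset in range(0, 25, 10)]
```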

One consequence of this approach is that when users try to download
content served with `Content-Encoding: gzip` and that content happens
to be more than one chunk large, the download will fail with an error
like:

`System.IO.InvalidDataException: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream.`

`HttpClientHandler` is trying to do the decoding for us because we set
its `AutomaticDecompression` property, but the decoding fails because
we've requested a range of data in the middle of a GZip stream.
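The failure mode is easy to reproduce in miniature. This Python sketch (not the library's code) shows that a byte range taken from the middle of a gzip stream lacks the magic header and cannot be decoded on its own:

```python
import gzip

original = b"hello world" * 100
compressed = gzip.compress(original)

# A gzip stream begins with the two-byte magic number 0x1f 0x8b.
assert compressed[:2] == b"\x1f\x8b"

# Decompressing the whole stream works...
assert gzip.decompress(compressed) == original

# ...but a slice from the middle has no header, so decoding fails --
# analogous to automatically decompressing one mid-stream Range response.
try:
    gzip.decompress(compressed[10:])
    mid_chunk_ok = True
except OSError:
    mid_chunk_ok = False
```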

Users can encounter this when downloading files from GCS that were
uploaded with `gsutil -Z`.

I considered asking the server to send us already-decompressed content
by not setting the `Accept-Encoding` header in our requests. Aside from
wasting customers' bandwidth, this turns out not to work: GCS ignores
the header and returns gzipped data regardless.

I also briefly considered special case handling for GZip downloads,
e.g. setting a huge ChunkSize or buffering intermediate results.

Instead, I decided to make things simpler and more flexible by only
making one request per download. As far as I can tell, there wasn't any
benefit to the multiple-request approach. It probably hurt performance
because we interrupted established transfers.

To avoid any impact on clients, ChunkSize is retained. Now it indicates
the granularity at which content should be written to the output stream
and our callers will be notified of progress. I've verified that the
progress events the caller sees are the same for both implementations.
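A minimal sketch of the single-request shape (Python over a stream-like source, not the actual C# implementation): one read loop, with the chunk size now controlling only how often we write and notify:

```python
import io

def download(source, output, chunk_size, on_progress):
    """Copy source to output in one pass, notifying every chunk_size bytes."""
    total = 0
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        output.write(data)
        total += len(data)
        on_progress(total)

events = []
out = io.BytesIO()
download(io.BytesIO(b"x" * 25), out, 10, events.append)
# A 25-byte body with a 10-byte chunk size yields progress at 10, 20, 25.
```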

I could not find a way to test this in MediaDownloaderTest as it was.
We were mocking out the request side of HttpClient and not exercising
any of the code that is actually responsible for transport concerns
like compression. My solution was to rewrite MediaDownloaderTest to
use a local HTTP server that serves the responses we need to exercise
the behavior under test. A lot of code changed, but I really think the
resulting test is easier to read and less fragile than before.
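The same testing idea translates to any language. As a hedged sketch (Python here, not the actual NUnit test), a test can spin up a loopback server that serves gzip-encoded bytes and exercise the real transport path:

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"hello from the test server"

class GzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = gzip.compress(BODY)
        self.send_response(200)
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep test output quiet

server = HTTPServer(("127.0.0.1", 0), GzipHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/file".format(server.server_address[1])
with urllib.request.urlopen(url) as resp:
    # urllib does not transparently decode gzip, so decompress manually.
    data = gzip.decompress(resp.read())
server.shutdown()
```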

The old MediaDownloader implementation passes the new tests (except
the one that returns GZip content). Only minimal modifications to the
old tests are necessary for them to pass on the new MediaDownloader --
mostly removing those tests' baked-in assumptions about how many
requests would be made and for what ranges.

While I was there, I replaced some fiddly Query string manipulation
code in MediaDownloader that took apart the Query string only to
immediately reassemble it. That removed MediaDownloader's use of
RequestBuilder. Since the latter is what we trusted to bring in our
URI hacks, I made `PatchUriQuirks` public and added a class
initializer to MediaDownloader to use it.
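The fiddly version parsed the query only to rebuild it; the simpler approach just appends. A hypothetical sketch of the idea (`append_query` is an illustrative name, not a real library function):

```python
def append_query(url, key, value):
    # Append a parameter without reparsing the existing query string.
    separator = "&" if "?" in url else "?"
    return "{}{}{}={}".format(url, separator, key, value)

plain = append_query("http://example.com/file", "alt", "media")
existing = append_query("http://example.com/file?x=1", "alt", "media")
```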

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 12, 2016
@mmdriley
Contributor Author

@jskeet @peleyal

Apologies for the large PR. I found it's much easier to read as a "Split" diff.

I'm out for the rest of the week; I'll work on addressing feedback when I'm back on Monday.

@mmdriley
Contributor Author

FYI, when I say "minimal modifications to the old tests", here's what I had to change to get them to pass with the new implementation: master...mmdriley:old-test-new-code#diff-a4fca60bfee5ead670867ffff7e7d3ddL90

```csharp
if (string.IsNullOrEmpty(uri.Query))
// Add alt=media to the query parameters.
var uri = new UriBuilder(url);
if (uri.Query == null || uri.Query.Length <= 1)
```


@jskeet
Collaborator

jskeet commented Apr 13, 2016

Love the new testing approach. Much nicer.

@jskeet
Collaborator

jskeet commented Apr 13, 2016

(Oh, and LGTM in general.)

@mmdriley
Contributor Author

Many thanks! PTAL.

Adding the /Quit handler made me want async TearDownTestFixture, which led me to updating our tests to use NUnit 3. I expect to leave that as a separate commit in this PR.

Other changes are in the "PR feedback" commit, including:

  • Added the /Quit handler.
  • Fixed the redundant Downloading and Completed notification when the response size is divisible by ChunkSize. The CountedBuffer utility class is admittedly a bit verbose, though overall I found it made things more readable. Let me know if you don't find that the case.
  • Added an assertion that BytesDownloaded always increases. The old implementation passed this test; the new implementation does now.

I plan to squash the "PR feedback" commit before merging.

@jskeet
Collaborator

jskeet commented Apr 20, 2016

LGTM. I tried to think of a simpler way of handling it, but couldn't come up with anything. The data copying (in RemoveFromFront) bothered me a little until I realized it would usually only be copying a single byte anyway...
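For intuition only, a hypothetical Python reconstruction of a CountedBuffer-style helper (names and behavior assumed, not taken from the actual C# class):

```python
class CountedBuffer:
    """Hypothetical sketch: accumulate bytes, release runs from the front."""

    def __init__(self):
        self._data = bytearray()

    def append(self, data):
        self._data.extend(data)

    @property
    def count(self):
        return len(self._data)

    def remove_from_front(self, n):
        # Copies the remaining tail forward; cheap when the tail is short,
        # which is the common case discussed in review.
        front = bytes(self._data[:n])
        del self._data[:n]
        return front

buf = CountedBuffer()
buf.append(b"abcdef")
head = buf.remove_from_front(4)
```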

@mmdriley mmdriley merged commit d9d68ee into googleapis:master Apr 20, 2016