Overhaul of MediaDownloader: simpler, handles Content-Encoding: gzip #723

Merged
merged 2 commits into from
Apr 20, 2016
Conversation

mmdriley
Contributor

Today, MediaDownloader downloads content in chunks by sending one HTTP
request per chunk. Each request has an appropriate `Range` header set.
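As a hedged illustration (Python, not the library's C# code), the per-chunk approach amounts to computing one inclusive byte range per request:

```python
def range_header(offset, chunk_size):
    # HTTP byte ranges are inclusive on both ends:
    # "bytes=0-9" asks for the first 10 bytes.
    return "bytes={}-{}".format(offset, offset + chunk_size - 1)

# One header per chunk of a 25-byte resource with 10-byte chunks.
# Servers clamp the final range to the actual resource length.
headers = [range_header(offset, 10) for offset in range(0, 25, 10)]
```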

One consequence of this approach is that when users try to download
content served with `Content-Encoding: gzip` and that content happens
to be more than one chunk large, the download will fail with an error
like:

`System.IO.InvalidDataException: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream.`

`HttpClientHandler` is trying to do the decoding for us because we set
its `AutomaticDecompression` property, but the decoding fails because
we've requested a range of data in the middle of a GZip stream.
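The failure mode is easy to reproduce in miniature. This Python sketch (not the library's code) shows that a byte range taken from the middle of a gzip stream lacks the magic header and cannot be decoded on its own:

```python
import gzip

original = b"hello world" * 100
compressed = gzip.compress(original)

# A gzip stream begins with the two-byte magic number 0x1f 0x8b.
assert compressed[:2] == b"\x1f\x8b"

# Decompressing the whole stream works...
assert gzip.decompress(compressed) == original

# ...but a slice from the middle has no header, so decoding fails --
# analogous to automatically decompressing one mid-stream Range response.
try:
    gzip.decompress(compressed[10:])
    mid_chunk_ok = True
except OSError:
    mid_chunk_ok = False
```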

Users can encounter this when downloading files from GCS that were
uploaded with `gsutil -Z`.

I considered asking the server to send us already-decompressed content
by not setting the `Accept-Encoding` header in our requests. Aside from
wasting customers' bandwidth, this turns out not to work: GCS ignores
the header and returns gzipped data regardless.

I also briefly considered special case handling for GZip downloads,
e.g. setting a huge ChunkSize or buffering intermediate results.

Instead, I decided to make things simpler and more flexible by only
making one request per download. As far as I can tell, there wasn't any
benefit to the multiple-request approach. It probably hurt performance
because we interrupted established transfers.

To avoid any impact on clients, ChunkSize is retained. Now it indicates
the granularity at which content should be written to the output stream
and our callers will be notified of progress. I've verified that the
progress events the caller sees are the same for both implementations.
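A minimal sketch of the single-request shape (Python over a stream-like source, not the actual C# implementation): one read loop, with the chunk size now controlling only how often we write and notify:

```python
import io

def download(source, output, chunk_size, on_progress):
    """Copy source to output in one pass, notifying every chunk_size bytes."""
    total = 0
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        output.write(data)
        total += len(data)
        on_progress(total)

events = []
out = io.BytesIO()
download(io.BytesIO(b"x" * 25), out, 10, events.append)
# A 25-byte body with a 10-byte chunk size yields progress at 10, 20, 25.
```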

I could not find a way to test this in MediaDownloaderTest as it was.
We were mocking out the request side of HttpClient and not exercising
any of the code that is actually responsible for transport concerns
like compression. My solution was to rewrite MediaDownloaderTest to
use a local HTTP server that serves the responses we need to exercise
the behavior under test. A lot of code changed, but I really think the
resulting test is easier to read and less fragile than before.
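The same testing idea translates to any language. As a hedged sketch (Python here, not the actual NUnit test), a test can spin up a loopback server that serves gzip-encoded bytes and exercise the real transport path:

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"hello from the test server"

class GzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = gzip.compress(BODY)
        self.send_response(200)
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep test output quiet

server = HTTPServer(("127.0.0.1", 0), GzipHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/file".format(server.server_address[1])
with urllib.request.urlopen(url) as resp:
    # urllib does not transparently decode gzip, so decompress manually.
    data = gzip.decompress(resp.read())
server.shutdown()
```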

The old MediaDownloader implementation passes the new tests (except
the one that returns GZip content). Only minimal modifications to the
old tests are necessary for them to pass on the new MediaDownloader --
mostly removing those tests' baked-in assumptions about how many
requests would be made and for what ranges.

While I was there, I replaced some fiddly Query string manipulation
code in MediaDownloader that took apart the Query string only to
immediately reassemble it. That removed MediaDownloader's use of
RequestBuilder. Since the latter is what we trusted to bring in our
URI hacks, I made `PatchUriQuirks` public and added a class
initializer to MediaDownloader to use it.
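The fiddly version parsed the query only to rebuild it; the simpler approach just appends. A hypothetical sketch of the idea (`append_query` is an illustrative name, not a real library function):

```python
def append_query(url, key, value):
    # Append a parameter without reparsing the existing query string.
    separator = "&" if "?" in url else "?"
    return "{}{}{}={}".format(url, separator, key, value)

plain = append_query("http://example.com/file", "alt", "media")
existing = append_query("http://example.com/file?x=1", "alt", "media")
```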

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 12, 2016
@mmdriley
Contributor Author

@jskeet @peleyal

Apologies for the large PR. I found it's much easier to read as a "Split" diff.

I'm out for the rest of the week; I'll work on addressing feedback when I'm back on Monday.

@mmdriley
Contributor Author

FYI, when I say "minimal modifications to the old tests", here's what I had to change to get them to pass with the new implementation: master...mmdriley:old-test-new-code#diff-a4fca60bfee5ead670867ffff7e7d3ddL90

```csharp
if (string.IsNullOrEmpty(uri.Query))
// Add alt=media to the query parameters.
var uri = new UriBuilder(url);
if (uri.Query == null || uri.Query.Length <= 1)
```


@jskeet
Collaborator

jskeet commented Apr 13, 2016

Love the new testing approach. Much nicer.

@jskeet
Collaborator

jskeet commented Apr 13, 2016

(Oh, and LGTM in general.)

@mmdriley
Contributor Author

Many thanks! PTAL.

Adding the /Quit handler made me want async TearDownTestFixture, which led me to updating our tests to use NUnit 3. I expect to leave that as a separate commit in this PR.

Other changes are in the "PR feedback" commit, including:

  • Added the /Quit handler.
  • Fixed the redundant Downloading and Completed notification when the response size is divisible by ChunkSize. The CountedBuffer utility class is admittedly a bit verbose, though overall I found it made things more readable. Let me know if you don't find that the case.
  • Added an assertion that BytesDownloaded always increases. The old implementation passed this test; the new implementation does now.

I plan to squash the "PR feedback" commit before merging.

@jskeet
Collaborator

jskeet commented Apr 20, 2016

LGTM. I tried to think of a simpler way of handling it, but couldn't come up with anything. The data copying (in RemoveFromFront) bothered me a little until I realized it would usually only be copying a single byte anyway...
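For intuition only, a hypothetical Python reconstruction of a CountedBuffer-style helper (names and behavior assumed, not taken from the actual C# class):

```python
class CountedBuffer:
    """Hypothetical sketch: accumulate bytes, release runs from the front."""

    def __init__(self):
        self._data = bytearray()

    def append(self, data):
        self._data.extend(data)

    @property
    def count(self):
        return len(self._data)

    def remove_from_front(self, n):
        # Copies the remaining tail forward; cheap when the tail is short,
        # which is the common case discussed in review.
        front = bytes(self._data[:n])
        del self._data[:n]
        return front

buf = CountedBuffer()
buf.append(b"abcdef")
head = buf.remove_from_front(4)
```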

@mmdriley mmdriley merged commit d9d68ee into googleapis:master Apr 20, 2016