Overhaul of MediaDownloader: simpler, handles Content-Encoding: gzip
#723
Conversation
FYI, when I say "minimal modifications to the old tests", here's what I had to change to get them to pass with the new implementation: master...mmdriley:old-test-new-code#diff-a4fca60bfee5ead670867ffff7e7d3ddL90
```csharp
if (string.IsNullOrEmpty(uri.Query))
```

```csharp
// Add alt=media to the query parameters.
var uri = new UriBuilder(url);
if (uri.Query == null || uri.Query.Length <= 1)
```
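For context, here is a sketch (not the PR's exact code) of how that check fits into appending `alt=media`. It relies on documented `UriBuilder` behavior: the `Query` getter includes the leading `?` when a query is present, so a length of 0 or 1 means there is no real query yet.

```csharp
// Sketch: append alt=media to a URL's query string via UriBuilder.
var uri = new UriBuilder(url);
if (uri.Query == null || uri.Query.Length <= 1)
{
    // Empty query (or a bare "?"): start a fresh one. The setter adds the '?'.
    uri.Query = "alt=media";
}
else
{
    // Strip the leading '?' before appending; the setter prepends it again.
    uri.Query = uri.Query.Substring(1) + "&alt=media";
}
```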
Love the new testing approach. Much nicer.
(Oh, and LGTM in general.)
Many thanks! PTAL. Other changes are in the "PR feedback" commit.
I plan to squash the "PR feedback" commit before merging.
LGTM. I tried to think of a simpler way of handling it, but couldn't come up with anything. The data copying (in `RemoveFromFront`) bothered me a little until I realized it would usually only be copying a single byte anyway...
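For readers without the diff at hand, a helper like `RemoveFromFront` presumably shifts the unconsumed tail of a buffer back to index zero. A minimal sketch, with an assumed signature that may not match the PR's:

```csharp
// Hypothetical sketch of a RemoveFromFront-style helper: discard the first
// 'n' bytes of the buffered data by copying the remainder to the front.
static void RemoveFromFront(byte[] buffer, ref int count, int n)
{
    Array.Copy(buffer, n, buffer, 0, count - n);  // shift the tail left
    count -= n;                                   // fewer bytes remain buffered
}
```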
Today, MediaDownloader downloads content in chunks by sending one HTTP request per chunk. Each request has an appropriate `Range` header set.

One consequence of this approach is that when users try to download content served with `Content-Encoding: gzip` and that content happens to be more than one chunk large, the download will fail with an error like:

`System.IO.InvalidDataException: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream.`
`HttpClientHandler` is trying to do the decoding for us because we set its `AutomaticDecompression` property, but the decoding fails because we've requested a range of data in the middle of a GZip stream.
Users can encounter this when downloading files from GCS that were uploaded with `gsutil -Z`.

I considered asking the server to send us already-decompressed content by not setting the `Accept-Encoding` header in our requests. Aside from wasting customers' bandwidth, this turns out not to work: GCS ignores the header and returns gzipped data regardless.
I also briefly considered special-case handling for GZip downloads, e.g. setting a huge ChunkSize or buffering intermediate results.
Instead, I decided to make things simpler and more flexible by only
making one request per download. As far as I can tell, there wasn't any
benefit to the multiple-request approach. It probably hurt performance
because we interrupted established transfers.
To avoid any impact on clients, ChunkSize is retained. It now indicates the granularity at which content is written to the output stream and at which callers are notified of progress. I've verified that the progress events the caller sees are the same for both implementations.
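A sketch of the shape this gives the new download loop: one request, with output writes and progress notifications at ChunkSize granularity. Names like `output` and `RaiseProgress` are illustrative stand-ins, not the library's actual API.

```csharp
// One GET for the whole download; the response stream is drained in
// chunkSize-sized units so writes and progress events keep the same
// granularity the old chunked implementation had.
async Task DownloadAsync(HttpClient client, Uri uri, Stream output, int chunkSize)
{
    using (var response = await client.GetAsync(uri, HttpCompletionOption.ResponseHeadersRead))
    using (var stream = await response.Content.ReadAsStreamAsync())
    {
        var buffer = new byte[chunkSize];
        long totalWritten = 0;
        int filled = 0;
        while (true)
        {
            int read = await stream.ReadAsync(buffer, filled, buffer.Length - filled);
            filled += read;
            // Flush a full chunk, or the final partial chunk at end of stream.
            if (filled == buffer.Length || (read == 0 && filled > 0))
            {
                output.Write(buffer, 0, filled);
                totalWritten += filled;
                RaiseProgress(totalWritten);  // illustrative progress callback
                filled = 0;
            }
            if (read == 0) { break; }
        }
    }
}
```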
I could not find a way to test this in MediaDownloaderTest as it was.
We were mocking out the request side of HttpClient and not exercising
any of the code that is actually responsible for transport concerns
like compression. My solution was to rewrite MediaDownloaderTest to
use a local HTTP server that serves the responses we need to exercise
the behavior under test. A lot of code changed, but I really think the
resulting test is easier to read and less fragile than before.
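As an illustration of that approach (details assumed, not lifted from the PR): a throwaway `HttpListener` can serve a gzipped body, so the test exercises the real `HttpClient` transport stack, including automatic decompression. Requires `System.Net`, `System.IO`, `System.IO.Compression`, and `System.Threading.Tasks`.

```csharp
// Minimal local server for one request: serve 'content' gzip-encoded so the
// client-side decompression path actually runs during the test.
var listener = new HttpListener();
listener.Prefixes.Add("http://localhost:12345/");  // arbitrary test port
listener.Start();
var serving = Task.Run(async () =>
{
    var context = await listener.GetContextAsync();
    byte[] body;
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
        {
            gzip.Write(content, 0, content.Length);
        }
        body = ms.ToArray();
    }
    context.Response.AddHeader("Content-Encoding", "gzip");
    context.Response.ContentLength64 = body.Length;
    context.Response.OutputStream.Write(body, 0, body.Length);
    context.Response.Close();
});
```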
The old MediaDownloader implementation passes the new tests (except
the one that returns GZip content). Only minimal modifications to the
old tests are necessary for them to pass on the new MediaDownloader --
mostly removing those tests' baked-in assumptions about how many
requests would be made and for what ranges.
While I was there, I replaced some fiddly Query string manipulation code in MediaDownloader that took apart the Query string only to immediately reassemble it. That removed MediaDownloader's use of RequestBuilder. Since the latter is what we trusted to bring in our URI hacks, I made `PatchUriQuirks` public and added a class initializer to MediaDownloader to use it.
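The "class initializer" here would be a static constructor; a sketch of the shape, where `UriPatcher` is an assumed name for the type that hosts `PatchUriQuirks`:

```csharp
public class MediaDownloader
{
    // Runs once, before the first instance is created, ensuring the URI
    // quirks patch is applied even though RequestBuilder (which used to
    // trigger it) is no longer referenced.
    static MediaDownloader()
    {
        UriPatcher.PatchUriQuirks();  // assumed containing type name
    }
    // ...
}
```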