Retry on stalled upload #30776

Merged: 4 commits merged into master on Apr 4, 2018
Conversation

@butonic (Member) commented Mar 14, 2018

fixes #30770

How to test

Use this snippet to fake a bad connection:

```diff
diff --git a/apps/dav/lib/Connector/Sabre/Directory.php b/apps/dav/lib/Connector/Sabre/Directory.php
index d6650d575f..9dcccc3c27 100644
--- a/apps/dav/lib/Connector/Sabre/Directory.php
+++ b/apps/dav/lib/Connector/Sabre/Directory.php
@@ -488,6 +488,17 @@ class Directory extends Node implements ICollection, IQuota, IMoveTarget {
         * @param resource $data data
         */
        public function createFileDirectly($name, $data) {
-               $this->fileView->file_put_contents($this->getPath() . '/' .  $name, $data);
+               $tgt = $this->fileView->fopen($this->getPath() . '/' .  $name, 'wb');
+
+               while(!feof($data)) {
+                       if (mt_rand(0,99) <= 10) {
+                               sleep(10);
+                               exit();
+                       }
+                       fwrite($tgt, fread($data, 512));
+               }
+               fclose($data);
+               fclose($tgt);
+               //$this->fileView->file_put_contents($this->getPath() . '/' .  $name, $data);
        }
 }
```

Now set a lower max chunk size and upload stall timeout:

```
./occ config:app:set files max_chunk_size --value 1024
./occ config:app:set files upload_stall_timeout --value 3
```

You should see the uploads stall quite frequently, but now they get restarted and pick up where they left off.

  • requires doc changes:
    introduces two new app config keys to control stalled uploads:
    • upload_stall_timeout (default 60, in seconds): how long an upload may make no progress before it is considered stalled
    • upload_stall_retries (default 100): how many times to retry; the default is high because we really want the file to end up uploaded

@butonic butonic added the backport-request and p1-urgent (Critical issue, need to consider hotfix with just that issue) labels Mar 14, 2018
@butonic butonic added this to the development milestone Mar 14, 2018
@butonic butonic self-assigned this Mar 14, 2018
@butonic butonic requested a review from PVince81 March 14, 2018 13:20
@butonic butonic requested a review from tomneedham March 14, 2018 13:26
```js
retries = upload.data.retries || 0,
retry = function () {
    var uid = OC.getCurrentUser().uid;
    upload.uploader.davClient.getFolderContents(
```
Review comment (Contributor):

Please move this into a separate private function on the Uploader object if possible: getUploadedBytes(). Too much nesting makes this difficult to read.
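The suggested helper might look roughly like the sketch below. This is only a sketch under assumptions, not the PR's code: it assumes the uploader object is OC.Uploader, that upload.getId() returns the chunking transfer id, that the chunk folder lives at uploads/<uid>/<transfer-id>, and that getFolderContents() resolves with an array of entries exposing a size.

```js
// Minimal sketch of the suggested private getUploadedBytes() helper.
// Assumptions (not from the PR): fileUploadParam.maxChunkSize as the option
// location, upload.getId() as the transfer id accessor, and the response shape
// of getFolderContents().
OC.Uploader.prototype._getUploadedBytes = function (upload) {
	var maxChunkSize = this.fileUploadParam.maxChunkSize; // assumed option location
	var uid = OC.getCurrentUser().uid;
	var path = 'uploads/' + uid + '/' + upload.getId();   // hypothetical chunk folder path
	return this.davClient.getFolderContents(path).then(function (files) {
		var uploadedBytes = 0;
		files.forEach(function (file) {
			// only count full chunks; a trailing partial chunk will be re-sent anyway
			if (file.size === maxChunkSize) {
				uploadedBytes += file.size;
			}
		});
		return uploadedBytes;
	});
};
```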

```js
// only count full chunks
&& file.size === fu.options.maxChunkSize
) {
    data.uploadedBytes += file.size;
```
Review comment (Contributor):

Please add a comment to explain why you need to calculate the total size: is it to find out which chunk to resend?
In general there is already information in the JS class about the last chunk, so maybe you only need to PROPFIND said chunk to find out whether it was written at all?

```js
retries < fu.options.maxRetries) {
    retries += 1;
    upload.data.retries = retries;
    window.setTimeout(retry, retries * fu.options.retryTimeout);
```
Review comment (Contributor):

It would be nice if retry could also be a private method on the object instead of being inline.
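The inline retry scheduling could move into a private method along these lines. Again only a sketch: _scheduleRetry and _submitRetry are hypothetical names, and only the linear retries * retryTimeout back-off is taken from the quoted snippet.

```js
// Hypothetical private method replacing the inline retry closure.
// Only the `retries * retryTimeout` back-off comes from the quoted snippet.
OC.Uploader.prototype._scheduleRetry = function (upload, fu) {
	var self = this;
	var retries = upload.data.retries || 0;
	if (retries >= fu.options.maxRetries) {
		// give up and let the normal failure handling report the error
		return false;
	}
	upload.data.retries = retries + 1;
	// linear back-off: wait a bit longer after every failed attempt
	window.setTimeout(function () {
		self._submitRetry(upload); // hypothetical helper that re-submits the upload
	}, upload.data.retries * fu.options.retryTimeout);
	return true;
};
```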

@PVince81 (Contributor):

  • BUG: the progress state disappears while waiting for the retry, and it also seems to reset. It also sometimes says "Invalid Date"

@PVince81 (Contributor):

I wonder why you didn't simply use the "fileuploadchunkfail" event; you might be able to store the chunk id somewhere and reuse it for the retry instead of doing the size computation.
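blueimp's jquery.fileupload does fire a fileuploadchunkfail event for chunked uploads, so a handler along the lines of the sketch below could record where the failed chunk started. The selector and the property used to store the offset are assumptions, and this shows only the alternative being suggested here, not what the PR implements.

```js
// Illustrative only: remember where a failed chunk started so a retry
// knows what to resend. Selector and storage property are assumptions.
$('#file_upload_start').on('fileuploadchunkfail', function (e, data) {
	// data.uploadedBytes is maintained by jquery.fileupload during chunked uploads
	data.files[0].failedChunkOffset = data.uploadedBytes;
});
```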

@PVince81 (Contributor) commented Mar 27, 2018:

  • TODO: a PROPFIND on the upload folder to get oc:size might be better than summing up all chunks (the internal Sabre node logic already omits special files there), or a PROPFIND on the ".file" entry; see the sketch below.
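A sketch of what that TODO could look like, assuming the uploader's dav client exposes getFileInfo() (a PROPFIND underneath) and that the chunking upload directory lives at uploads/<uid>/<transfer-id>; the path, the accessor and the response shape are all assumptions here, not taken from the PR.

```js
// Sketch only: ask the server for the aggregated size of the upload folder
// instead of summing the chunk sizes client-side.
function getUploadedSize(uploader, upload) {
	var uid = OC.getCurrentUser().uid;
	var path = 'uploads/' + uid + '/' + upload.getId(); // hypothetical transfer id accessor
	return uploader.davClient.getFileInfo(path).then(function (status, fileInfo) {
		// oc:size of the upload folder reflects what has been written so far
		return fileInfo.size;
	});
}
```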

@PVince81 (Contributor):

OK, so from reading the code of jquery.fileupload.js it doesn't seem to have a way to resume an upload directly after a failure: https://github.com/owncloud/core/blob/v10.0.7/apps/files/js/jquery.fileupload.js#L802.

This is likely the reason why you chose to go the route of restarting the upload from scratch, but specifying an initial "uploadedBytes" value to make the upload start at this offset.
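That pattern matches the "automatic resume" recipe from the jQuery-File-Upload wiki (which a later comment links): before submitting, ask the server how many bytes it already has, set data.uploadedBytes, and delegate to the default add handler. A sketch adapted from that recipe; getUploadedBytes() is a placeholder for whatever request returns that number, and the selector and chunk size are illustrative.

```js
// Adapted from the jQuery-File-Upload wiki's "automatic resume" recipe.
// getUploadedBytes() is a placeholder for whatever request returns the
// number of bytes already stored on the server.
$('#file_upload_start').fileupload({
	maxChunkSize: 10 * 1024 * 1024,
	add: function (e, data) {
		var that = this;
		getUploadedBytes(data.files[0]).then(function (uploadedBytes) {
			// makes jquery.fileupload start sending at this offset
			data.uploadedBytes = uploadedBytes || 0;
			$.blueimp.fileupload.prototype.options.add.call(that, e, data);
		});
	}
});
```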

@PVince81 (Contributor):

Apparently most of the code was copied from https://github.com/blueimp/jQuery-File-Upload/wiki/Chunked-file-uploads#automatic-resume and adjusted.

I wonder why the jquery.fileupload devs didn't simply add a "retry" option and call upload() again on failure instead of working around it that way...

@PVince81 (Contributor):

Based on this information I will not try and find a better way since this is the recommended way from the jquery.fileupload wiki.

Now to find a way to make this a bit more readable...

@codecov bot commented Mar 27, 2018:

Codecov Report

Merging #30776 into master will decrease coverage by <.01%.
The diff coverage is 0%.


```
@@             Coverage Diff              @@
##             master   #30776      +/-   ##
============================================
- Coverage     62.31%    62.3%   -0.01%     
  Complexity    18208    18208              
============================================
  Files          1142     1142              
  Lines         68190    68204      +14     
  Branches       1232     1232              
============================================
+ Hits          42494    42496       +2     
- Misses        25335    25347      +12     
  Partials        361      361
```
| Flag | Coverage Δ | Complexity Δ |
|------|------------|--------------|
| #javascript | 52.07% <ø> (+0.02%) | 0 <ø> (ø) ⬇️ |
| #phpunit | 63.47% <0%> (-0.02%) | 18208 <0> (ø) |

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|------------|--------------|
| apps/files/lib/App.php | 25% <0%> (-12.5%) | 3 <0> (ø) |
| apps/dav/lib/CardDAV/AddressBookImpl.php | 72.61% <0%> (-1.78%) | 26% <0%> (ø) |
| core/ajax/share.php | 0% <0%> (ø) | 0% <0%> (ø) ⬇️ |
| core/js/js.js | 54.84% <0%> (+0.23%) | 0% <0%> (ø) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6e2b8aa...746adf7.

@PVince81 (Contributor):

The reason why the progress bar disappears is that before retrying we first need to abort the upload, and aborting removes the progress bar. Then there is the waiting delay of "retryTimeout", during which no progress bar is shown until an actual retry is attempted. I wonder whether we need that delay at all, because we already have our own stall detection.

@PVince81 (Contributor):

Apart from that it seems to recover properly and it's able to finish the upload. The checksum is the same despite retries. At least with Chromium. Need to try with IE11 and Edge.

We actually need to test not only the stalled mode but also the case of an actual chunk failure where the server returns an error code (like a timeout).

@PVince81 (Contributor) commented Mar 27, 2018:

  • TEST: IE11
  • TEST: Edge
  • TEST: Firefox
  • TEST: Chromium
  • TEST: stalled chunk is retried => upload finishes and checksum matches
  • TEST: erroring chunk upload is retried => upload finishes and checksum matches
  • TEST: multiple files upload with chunking (the abort/retry code might affect the other files that haven't been uploaded yet)

@PVince81 (Contributor) commented Mar 27, 2018:

  • TEST: can a chunk be partially written?

If yes, then the logic above cannot work, as the server would return the size of all chunks plus the partial one.
I do wonder why the wiki page favors this approach instead of just looking at the last chunk offset from the previous send.

@PVince81 (Contributor) commented Mar 27, 2018:

  • BUG: "Invalid date" appears in progress bar

@PVince81 (Contributor):

  • BUG: when stuck on a breakpoint for a long time and then resuming, the upload will eventually fail and say "chunks on server do not sum up" instead of properly re-uploading

@ownclouders ownclouders force-pushed the retry-on-stalled-upload branch from e103aa3 to 746adf7 on April 3, 2018 19:38
@ownclouders (Contributor):

Automated rebase with GitMate.io was successful! 🎉

@butonic (Member, Author) commented Apr 4, 2018:

> Possible improvement: instead of hijacking progress bar code to detect progress, actually use the "fileuploadprogress" event and its "uploadedBytes" value

This does not work because these events only fire when progress is happening: no bytes sent => no progress event.
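Since no event fires while an upload is stalled, stall detection has to come from an external timer that notices the absence of progress. A minimal sketch of such a watchdog follows; all names are hypothetical, and only the fileuploadprogress event and the idea of an upload_stall_timeout come from this PR.

```js
// Sketch of a stall watchdog: remember when progress last happened and
// flag the upload as stalled once the configured timeout passes without
// any change. Names other than fileuploadprogress are hypothetical.
function watchForStall(upload, stallTimeoutSec, onStalled) {
	var lastLoaded = 0;
	var lastChange = Date.now();

	$('#file_upload_start').on('fileuploadprogress', function (e, data) {
		if (data.loaded !== lastLoaded) {
			lastLoaded = data.loaded;
			lastChange = Date.now(); // bytes moved, reset the clock
		}
	});

	return window.setInterval(function () {
		if (Date.now() - lastChange > stallTimeoutSec * 1000) {
			onStalled(upload); // e.g. abort the request and schedule a retry
		}
	}, 1000);
}
```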

@PVince81 (Contributor) left a review comment:

As discussed with @butonic, the current solution is good enough to make it retry chunks 👍

@PVince81 PVince81 merged commit 1a8073b into master Apr 4, 2018
@PVince81 PVince81 deleted the retry-on-stalled-upload branch April 4, 2018 12:19
@PVince81 (Contributor) commented Apr 4, 2018:

stable10: #31005

Labels: blue-ticket, p2-high (Escalation, on top of current planning, release blocker), server, sev2-high, Type:Bug
Projects: none yet
Development: successfully merging this pull request may close these issues: Detect chunked upload timeouts and retry
3 participants