This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

restore: Make all download error as retryable #298

Merged
merged 3 commits into from
May 25, 2020

Conversation

kennytm
Collaborator

@kennytm kennytm commented May 18, 2020

What problem does this PR solve?

Work around tikv/tikv#7846 and the "Restore" part of tidb-challenge-program/bug-hunting-issue#72.

What is changed and how it works?

Allow ErrDownloadFailed to be retried. The transient 5xx errors from the cloud server should be retryable.

This is currently too relaxed: truly unrecoverable errors such as file-not-found will also be retried. The worst case, however, is just hitting the same error 8 times, so this seems to be an acceptable compromise.

We could tighten it back once we have more precise error handling from TiKV.

Additionally, changed WithRetry to return a multierr so no errors are lost.

Check List

Tests

  • Unit test

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch
    • Cherry-pick to 3.1.

Release Note

  • Improved robustness of restore, reducing the chance of failure due to transient network errors with cloud storage.

@codecov

codecov bot commented May 18, 2020

Codecov Report

Merging #298 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #298   +/-   ##
=======================================
  Coverage   71.80%   71.80%           
=======================================
  Files          48       48           
  Lines        5057     5057           
=======================================
  Hits         3631     3631           
  Misses        966      966           
  Partials      460      460           


@shuijing198799

Maybe backup also needs to retry on upload.

Member

@overvenus overvenus left a comment


LGTM

@kennytm kennytm added the status/LGT1 LGTM1 label May 20, 2020
@kennytm
Collaborator Author

kennytm commented May 20, 2020

(Accidentally verified on DBaaS that this works, after receiving a DNS error.)

tikv-0.log:[2020/05/20 06:06:08.328 +00:00] [ERROR] [sst_importer.rs:129] ["download failed"] [err="Cannot read gcs://<REDACTED>//66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst: request GCS access token failed: error sending request for url (https://oauth2.googleapis.com/token): error trying to connect: dns error: Device or resource busy (os error 16)"] [name=66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst] [meta="uuid: A25481D8C349472BACF2D3E4FD148F78 range { start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FA end: 7480000000000000FF5C5F72800000000DFF8737F60000000000FA } length: 1703709 cf_name: \"write\" region_id: 23460 region_epoch { conf_ver: 5 version: 335 }"]
tikv-0.log:[2020/05/20 06:06:26.655 +00:00] [INFO] [sst_importer.rs:125] [download] [range="Some(start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FAFFFFFFFFA1BC1AB4 end: 7480000000000000FF5C5F72800000000DFF8737F50000000000FAFFFFFFFFA1BC1AB4)"] [name=66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst] [meta="uuid: 8318DF5367634F43898F8CD36F0C52AC range { start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FA end: 7480000000000000FF5C5F72800000000DFF8737F60000000000FA } length: 1703709 cf_name: \"write\" region_id: 23460 region_epoch { conf_ver: 5 version: 335 }"]
tikv-1.log:[2020/05/20 06:06:27.385 +00:00] [INFO] [sst_importer.rs:125] [download] [range="Some(start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FAFFFFFFFFA1BC1AB4 end: 7480000000000000FF5C5F72800000000DFF8737F50000000000FAFFFFFFFFA1BC1AB4)"] [name=66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst] [meta="uuid: 8318DF5367634F43898F8CD36F0C52AC range { start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FA end: 7480000000000000FF5C5F72800000000DFF8737F60000000000FA } length: 1703709 cf_name: \"write\" region_id: 23460 region_epoch { conf_ver: 5 version: 335 }"]
tikv-2.log:[2020/05/20 06:06:28.089 +00:00] [INFO] [sst_importer.rs:125] [download] [range="Some(start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FAFFFFFFFFA1BC1AB4 end: 7480000000000000FF5C5F72800000000DFF8737F50000000000FAFFFFFFFFA1BC1AB4)"] [name=66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst] [meta="uuid: 8318DF5367634F43898F8CD36F0C52AC range { start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FA end: 7480000000000000FF5C5F72800000000DFF8737F60000000000FA } length: 1703709 cf_name: \"write\" region_id: 23460 region_epoch { conf_ver: 5 version: 335 }"]

@shuijing198799

Repeated testing with the 1.4 TB data set did not reproduce the issue:

I0520 04:32:05.724178       1 restore.go:86] [2020/05/20 04:32:05.724 +00:00] [INFO] [collector.go:195] ["Full restore Success summary: total restore files: 35252, total success: 35252, total failed: 0, total take(s): 6115.33, total kv: 10140081556, total size(MB): 1478230.72, avg speed(MB/s): 241.73"] ["split region"=1m46.683811285s] ["restore checksum"=11m11.709348139s] ["restore ranges"=19861]

The following TiKV log was captured:

[2020/05/20 06:06:08.328 +00:00] [ERROR] [sst_importer.rs:129] ["download failed"] [err="Cannot read gcs://dbaas-hibernatetest1T//66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst: request GCS access token failed: error sending request for url (https://oauth2.googleapis.com/token): error trying to connect: dns error: Device or resource busy (os error 16)"] [name=66_23424_314_57ed0c06e1488d74165852a6acc6309e4985479d550bb9c2059756b6216e1b24_write.sst] [meta="uuid: A25481D8C349472BACF2D3E4FD148F78 range { start: 7480000000000000FF5C5F72800000000CFF2C30ED0000000000FA end: 7480000000000000FF5C5F72800000000DFF8737F60000000000FA } length: 1703709 cf_name: \"write\" region_id: 23460 region_epoch { conf_ver: 5 version: 335 }"]

This seems to fix the issue tikv/tikv#7846. Thanks, @kennytm.

@shuijing198799

fix tikv/tikv#7846

Contributor

@5kbpers 5kbpers left a comment


LGTM

@kennytm kennytm added status/LGT2 LGTM2 and removed status/LGT1 LGTM1 labels May 25, 2020
@kennytm
Collaborator Author

kennytm commented May 25, 2020

/merge

@sre-bot
Contributor

sre-bot commented May 25, 2020

/run-all-tests

@sre-bot sre-bot merged commit 0b47b8f into pingcap:master May 25, 2020
@sre-bot
Contributor

sre-bot commented May 25, 2020

cherry pick to release-3.1 failed

@sre-bot
Contributor

sre-bot commented May 25, 2020

cherry pick to release-4.0 in PR #307

overvenus pushed a commit that referenced this pull request May 26, 2020
* utils: return a multierr in WithRetry rather than just the last error

* restore: treat ErrDownloadFailed as retryable

Co-authored-by: kennytm <kennytm@gmail.com>
kennytm added a commit to kennytm/br that referenced this pull request May 27, 2020
@kennytm kennytm deleted the retry-on-download branch June 2, 2020 09:50
5 participants