Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: refine the error handling and retry mechanism for stale read #24956

Merged
merged 18 commits into from
Jun 3, 2021

Conversation

JmPotato
Copy link
Member

@JmPotato JmPotato commented May 28, 2021

Signed-off-by: JmPotato ghzpotato@gmail.com

What problem does this PR solve?

Part of #21094 and #23271.

Refine the error handling and retry mechanism for stale read.

What is changed and how it works?

  • Update the excluded StoreID list during the retry of a stale read request to make sure TiDB will try on different peers.
  • Add RegionNotInitialized and DataIsNotReady error handling.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

  • No release note.

@ti-chi-bot ti-chi-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 28, 2021
@JmPotato JmPotato marked this pull request as ready for review May 31, 2021 03:59
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2021
@JmPotato
Copy link
Member Author

/run-all-tests

@zhouqiang-cl
Copy link
Contributor

/rebuild

@JmPotato JmPotato changed the title store: retry the leader or next peer on the Stale Read error store: refine the error handling and retry mechanism for stale read May 31, 2021
@JmPotato
Copy link
Member Author

JmPotato commented Jun 1, 2021

@xhebox @nolouch @Yisaer @djshow832 PTAL

store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_request.go Outdated Show resolved Hide resolved
store/tikv/region_request.go Outdated Show resolved Hide resolved
Copy link
Contributor

@xhebox xhebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

store/tikv/region_request.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_request.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
Comment on lines 659 to 671
// Stale Read request will retry the leader or next peer on error,
// so we will exclude the PeerID of the requested peer every time.
// If the new PeerID keeps being the same with the last one, we
// should not continue excluding it to make the opts become bigger.
if ctx.isStaleRead && ctx.lastPeerID != ctx.Peer.GetId() {
opts = append(opts, WithExcludedPeerIDs([]uint64{ctx.Peer.GetId()}))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback strategy is kind of different from what I think, I think if the tx_scope is local, we shall find other store which located in same dc. If the tx_scope is global, we shall directly send request to leader.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, it works like this:

  • If the tx_scope is local, it will find other stores located in the same DC according to matched labels and try on different peers one by one on the error. If all peers are tried, it will only retry on the leader until fails.
  • If the tx_scope is global, it will try on different peers one by one on the error. If all peers are tried, it will only retry on the leader until fails.

It seem that we only need to change the behaviour of the global? Once it fails, it will directly request to the leader rather than try all other peers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to send to leader directly when tx_scope is global

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Copy link
Contributor

@Yisaer Yisaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM except faillback strategy

store/tikv/region_request.go Outdated Show resolved Hide resolved
Signed-off-by: JmPotato <ghzpotato@gmail.com>
store/tikv/region_request.go Outdated Show resolved Hide resolved
@JmPotato JmPotato requested a review from a team as a code owner June 3, 2021 06:53
@JmPotato JmPotato requested review from wshwsh12 and removed request for a team June 3, 2021 06:53
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Copy link
Contributor

@xhebox xhebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

store/tikv/region_cache.go Outdated Show resolved Hide resolved
Signed-off-by: JmPotato <ghzpotato@gmail.com>
@JmPotato
Copy link
Member Author

JmPotato commented Jun 3, 2021

@xhebox @nolouch @Yisaer PTAL.

Copy link
Contributor

@xhebox xhebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 3, 2021
@JmPotato JmPotato requested review from nolouch and Yisaer June 3, 2021 08:10
@github-actions github-actions bot added the sig/execution SIG execution label Jun 3, 2021
Copy link
Contributor

@youjiali1995 youjiali1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add TestingT otherwise tests in store/tikv package won't run.

Copy link
Contributor

@Yisaer Yisaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

// Stale Read request will retry the leader or next peer on error,
// if txnScope is global, we will only retry the leader by using the WithLeaderOnly option,
// if txnScope is local, we will retry both other peers and the leader by the incresing seed.
if ctx.tryTimes < 1 && req != nil && req.TxnScope == oracle.GlobalTxnScope && req.GetStaleRead() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need tryTimes and tryTimesLimit?

Copy link
Member Author

@JmPotato JmPotato Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make WithLeaderOnly option only be appended once to prevent the unnecessary slice memory reallocation. And tryTimesLimit is used by the unit test to mock different retry times.

@JmPotato
Copy link
Member Author

JmPotato commented Jun 3, 2021

Please add TestingT otherwise tests in store/tikv package won't run.

I have added it in store/tikv/tikv_test.go.

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Yisaer
  • xhebox

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 3, 2021
@Yisaer
Copy link
Contributor

Yisaer commented Jun 3, 2021

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: c870d80

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 3, 2021
@ti-chi-bot
Copy link
Member

@JmPotato: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit 69274d8 into pingcap:master Jun 3, 2021
@JmPotato JmPotato deleted the stale_read_retry branch June 3, 2021 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/execution SIG execution size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants