tikvclient: refine region-cache #10256
Conversation
/run-all-tests
/run-all-tests
/rebuild
/rebuild
/run-all-tests
/rebuild
/rebuild
/run-unit-test
Codecov Report
@@ Coverage Diff @@
## master #10256 +/- ##
================================================
- Coverage 77.8178% 77.8144% -0.0034%
================================================
Files 410 410
Lines 84365 84438 +73
================================================
+ Hits 65651 65705 +54
- Misses 13813 13826 +13
- Partials 4901 4907 +6
Codecov Report
@@ Coverage Diff @@
## master #10256 +/- ##
================================================
- Coverage 77.2779% 77.2603% -0.0177%
================================================
Files 413 413
Lines 86986 87244 +258
================================================
+ Hits 67221 67405 +184
- Misses 14600 14647 +47
- Partials 5165 5192 +27
If we add an …, then we need to check the store state and switch to the backup stores in …
store/tikv/region_cache.go
Outdated
	id   uint64
	peer *metapb.Peer
)
for id, peer = range r.backupPeers {
I don't think it is good practice to depend on the fact that map iteration order is randomized... Maybe it is simpler to use a slice to store backupPeers instead...
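A minimal sketch of the slice-based alternative this comment suggests; the Region fields other than backupPeers and the pickBackupPeer helper are assumptions for illustration, not the PR's actual code:

```go
package tikv

import (
	"math/rand"

	"github.com/pingcap/kvproto/pkg/metapb"
)

// Region is a simplified stand-in for the cached region entry; only the
// fields visible in the snippet above are taken from it, the rest is
// illustrative.
type Region struct {
	meta        *metapb.Region
	peer        *metapb.Peer
	backupPeers []*metapb.Peer // a slice instead of map[uint64]*metapb.Peer
}

// pickBackupPeer chooses a backup peer with an explicit random index, so the
// randomness is no longer hidden inside map iteration order.
func (r *Region) pickBackupPeer() *metapb.Peer {
	if len(r.backupPeers) == 0 {
		return nil
	}
	return r.backupPeers[rand.Intn(len(r.backupPeers))]
}
```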
store/tikv/region_cache.go
Outdated
func (r *Region) tryRandPeer() {
	r.peer = r.meta.Peers[0]
	for i := 1; i < len(r.meta.Peers); i++ {
How about rand.Shuffle(r.meta.Peers)? Then you can try the peers one by one.
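For reference, math/rand's Shuffle takes a length and a swap function rather than a slice, so the call would look roughly like the sketch below; shuffling a copy of the peers is an assumption, not necessarily how the PR implements it:

```go
package tikv

import (
	"math/rand"

	"github.com/pingcap/kvproto/pkg/metapb"
)

// shuffledPeers returns the region's peers in random order so the caller can
// try them one by one. It shuffles a copy to leave meta.Peers untouched.
func shuffledPeers(meta *metapb.Region) []*metapb.Peer {
	peers := make([]*metapb.Peer, len(meta.Peers))
	copy(peers, meta.Peers)
	rand.Shuffle(len(peers), func(i, j int) {
		peers[i], peers[j] = peers[j], peers[i]
	})
	return peers
}
```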
TiKV returns NotLeader before EpochNotMatch.
LGTM
LGTM
/run-all-tests
1. Mark the store after each send
   - Mark the store as failed when sending data fails; TiDB then blacks out this store for a following period (based on the consecutive fail count and the last fail timestamp), as sketched below.
   - Mark the store as successful when sending data succeeds.
2. Invalidate the region cache
   - A cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
   - A cache item is made valid again so the old data keeps being used if the fetch fails because PD is down.
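A minimal sketch of the blackout bookkeeping in item 1, with hypothetical field names and an illustrative linear backoff; the actual fields and thresholds in this PR may differ:

```go
package tikv

import (
	"sync/atomic"
	"time"
)

// Store carries hypothetical failure-tracking fields: the consecutive send
// failures plus the last failure time decide how long the store is skipped.
type Store struct {
	failCount    uint32 // consecutive send failures
	lastFailUnix int64  // unix seconds of the last failure
}

// markFail records one more consecutive send failure.
func (s *Store) markFail() {
	atomic.AddUint32(&s.failCount, 1)
	atomic.StoreInt64(&s.lastFailUnix, time.Now().Unix())
}

// markSuccess clears the failure state once a send succeeds.
func (s *Store) markSuccess() {
	atomic.StoreUint32(&s.failCount, 0)
}

// inBlackout reports whether the store should still be skipped; the more
// consecutive failures, the longer the window (illustrative linear backoff).
func (s *Store) inBlackout() bool {
	fails := atomic.LoadUint32(&s.failCount)
	if fails == 0 {
		return false
	}
	window := time.Duration(fails) * 10 * time.Second
	last := time.Unix(atomic.LoadInt64(&s.lastFailUnix), 0)
	return time.Since(last) < window
}
```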
What problem does this PR solve?
Try to fix #10037: keep retrying the region's other stores.
What is changed and how it works?
What can the region cache hold?
What failures or outdated data will the region cache meet?
Normally this happens when a machine goes down, the network is partitioned between TiDB and TiKV, or a process crashes.
On the TiDB side, a send-data failure event identifies the store failure.
But sometimes the failure is caused by someone replacing a machine or changing a network interface; people don't do that often.
In that case the store is working well, but the cached info is mismatched.
On the TiDB side, sending data succeeds, but TiKV returns a failure response.
Normally it is the region info that changed; only seldom is it the store info.
How the current implementation works
On a send failure, it drops the region cache entry and removes the store info, which triggers a reload of the region and store info.
But this has two problems:
How this PR changes it
"Blackout the store on failure" keeps the chance that the failed peer can be used again, but avoids a retry flood.
A cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
The item is made valid again so the old data keeps being used if the fetch fails because PD is down (see the sketch after this list).
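A minimal sketch of the invalidate/revalidate idea above, with hypothetical names (cachedEntry and an invalid flag); it only illustrates the flag flipping, not the PR's actual region cache code:

```go
package tikv

import "sync/atomic"

// cachedEntry wraps a cached region with a validity flag (hypothetical
// names). Invalidation only flips the flag; the entry itself is never
// deleted from the cache.
type cachedEntry struct {
	regionID uint64 // stand-in for the cached region data
	invalid  int32
}

// invalidate asks for a re-fetch from PD on the next access.
func (c *cachedEntry) invalidate() { atomic.StoreInt32(&c.invalid, 1) }

// revalidate keeps serving the stale data when the PD fetch fails,
// for example because PD itself is down.
func (c *cachedEntry) revalidate() { atomic.StoreInt32(&c.invalid, 0) }

func (c *cachedEntry) isValid() bool { return atomic.LoadInt32(&c.invalid) == 0 }
```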
Reviewing this PR may require a look at #6880 and #2792, and also at the GetRPCContext method, which isn't modified but is vital to this logic.
Check List
Tests
Code changes
Side effects
Related changes
This change is