tikvclient: refine region-cache #10256
Conversation
/run-all-tests
/run-all-tests
/rebuild
/rebuild
/run-all-tests
/rebuild
/rebuild
/run-unit-test
Codecov Report
@@ Coverage Diff @@
## master #10256 +/- ##
================================================
- Coverage 77.8178% 77.8144% -0.0034%
================================================
Files 410 410
Lines 84365 84438 +73
================================================
+ Hits 65651 65705 +54
- Misses 13813 13826 +13
- Partials 4901 4907 +6
Codecov Report
@@ Coverage Diff @@
## master #10256 +/- ##
================================================
- Coverage 77.2779% 77.2603% -0.0177%
================================================
Files 413 413
Lines 86986 87244 +258
================================================
+ Hits 67221 67405 +184
- Misses 14600 14647 +47
- Partials 5165 5192 +27
If we add an …, then we need to check the store state and switch to the backup stores in …
store/tikv/region_cache.go
Outdated
	id   uint64
	peer *metapb.Peer
)
for id, peer = range r.backupPeers {
I don't think it is good practice to depend on the fact that map iteration order is randomized... Maybe it is simpler to use a slice to store backupPeers instead...
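A minimal sketch of the slice-based alternative this comment suggests; the Region fields other than backupPeers and the pickBackupPeer helper are assumptions for illustration, not the PR's actual code:

```go
package tikv

import (
	"math/rand"

	"github.com/pingcap/kvproto/pkg/metapb"
)

// Region is a simplified stand-in for the cached region entry; only the
// fields visible in the snippet above are taken from it, the rest is
// illustrative.
type Region struct {
	meta        *metapb.Region
	peer        *metapb.Peer
	backupPeers []*metapb.Peer // a slice instead of map[uint64]*metapb.Peer
}

// pickBackupPeer chooses a backup peer with an explicit random index, so the
// randomness is no longer hidden inside map iteration order.
func (r *Region) pickBackupPeer() *metapb.Peer {
	if len(r.backupPeers) == 0 {
		return nil
	}
	return r.backupPeers[rand.Intn(len(r.backupPeers))]
}
```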
store/tikv/region_cache.go
Outdated
func (r *Region) tryRandPeer() {
	r.peer = r.meta.Peers[0]
	for i := 1; i < len(r.meta.Peers); i++ {
How about rand.Shuffle(r.meta.Peers)? Then you can try the peers one by one.
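For reference, math/rand's Shuffle takes a length and a swap function rather than a slice, so the call would look roughly like the sketch below; shuffling a copy of the peers is an assumption, not necessarily how the PR implements it:

```go
package tikv

import (
	"math/rand"

	"github.com/pingcap/kvproto/pkg/metapb"
)

// shuffledPeers returns the region's peers in random order so the caller can
// try them one by one. It shuffles a copy to leave meta.Peers untouched.
func shuffledPeers(meta *metapb.Region) []*metapb.Peer {
	peers := make([]*metapb.Peer, len(meta.Peers))
	copy(peers, meta.Peers)
	rand.Shuffle(len(peers), func(i, j int) {
		peers[i], peers[j] = peers[j], peers[i]
	})
	return peers
}
```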
TiKV returns NotLeader before EpochNotMatch.
LGTM
LGTM
/run-all-tests
1. Mark the store after each send
   - Mark the store as failed when sending data fails; TiDB then blacks out this store for a following period (based on the consecutive fail count and the last fail timestamp), as sketched below.
   - Mark the store as successful when sending data succeeds.
2. Invalidate the region cache
   - A cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
   - A cache item is made valid again so the old data keeps being used if the fetch fails because PD is down.
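A minimal sketch of the blackout bookkeeping in item 1, with hypothetical field names and an illustrative linear backoff; the actual fields and thresholds in this PR may differ:

```go
package tikv

import (
	"sync/atomic"
	"time"
)

// Store carries hypothetical failure-tracking fields: the consecutive send
// failures plus the last failure time decide how long the store is skipped.
type Store struct {
	failCount    uint32 // consecutive send failures
	lastFailUnix int64  // unix seconds of the last failure
}

// markFail records one more consecutive send failure.
func (s *Store) markFail() {
	atomic.AddUint32(&s.failCount, 1)
	atomic.StoreInt64(&s.lastFailUnix, time.Now().Unix())
}

// markSuccess clears the failure state once a send succeeds.
func (s *Store) markSuccess() {
	atomic.StoreUint32(&s.failCount, 0)
}

// inBlackout reports whether the store should still be skipped; the more
// consecutive failures, the longer the window (illustrative linear backoff).
func (s *Store) inBlackout() bool {
	fails := atomic.LoadUint32(&s.failCount)
	if fails == 0 {
		return false
	}
	window := time.Duration(fails) * 10 * time.Second
	last := time.Unix(atomic.LoadInt64(&s.lastFailUnix), 0)
	return time.Since(last) < window
}
```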
What problem does this PR solve?
Try to fix #10037: keep retrying the region's other stores.
What is changed and how it works?
What can the region cache hold?
What failures or outdated data will the region cache meet?
Normally this happens when a machine goes down, the network is partitioned between TiDB and TiKV, or a process crashes.
On the TiDB side, a send-data failure event identifies the store failure.
But sometimes the failure is caused by someone replacing a machine or changing a network interface; people don't do that often.
In that case the store is working well, but the cached info is mismatched.
On the TiDB side, sending data succeeds, but TiKV returns a failure response.
Normally it is the region info that changed; only seldom is it the store info.
How the current implementation works
On a send failure, it drops the region cache entry and removes the store info, which triggers a reload of the region and store info.
But this has two problems:
How this PR changes it
"Blackout the store on failure" keeps the chance that the failed peer can be used again, but avoids a retry flood.
A cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
The item is made valid again so the old data keeps being used if the fetch fails because PD is down (see the sketch after this list).
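A minimal sketch of the invalidate/revalidate idea above, with hypothetical names (cachedEntry and an invalid flag); it only illustrates the flag flipping, not the PR's actual region cache code:

```go
package tikv

import "sync/atomic"

// cachedEntry wraps a cached region with a validity flag (hypothetical
// names). Invalidation only flips the flag; the entry itself is never
// deleted from the cache.
type cachedEntry struct {
	regionID uint64 // stand-in for the cached region data
	invalid  int32
}

// invalidate asks for a re-fetch from PD on the next access.
func (c *cachedEntry) invalidate() { atomic.StoreInt32(&c.invalid, 1) }

// revalidate keeps serving the stale data when the PD fetch fails,
// for example because PD itself is down.
func (c *cachedEntry) revalidate() { atomic.StoreInt32(&c.invalid, 0) }

func (c *cachedEntry) isValid() bool { return atomic.LoadInt32(&c.invalid) == 0 }
```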
Reviewing this PR may require a look at #6880 and #2792, and also at the GetRPCContext method, which isn't modified but is vital to this logic.
Check List
Tests
Code changes
Side effects
Related changes
This change is