
When the leader is selected as the target peer first, no other replicas can be retried for stale read #906

Closed
cfzjywxk opened this issue Jul 24, 2023 · 5 comments · Fixed by #942

Comments

@cfzjywxk
Contributor

When the leader peer is first picked as the target (matching the configured labels) and it is unavailable because of region errors, the replica selector keeps retrying the leader peer and never recovers, even though follower replicas may be able to serve the request successfully.

For example:

  1. Consider 3 AZs, each with 1 tidb-server and 1 tikv-server, and a region whose leader peer is in AZ a.
  2. A stale read request is triggered by the tidb-server in AZ a; the tikv-server in AZ a is selected as the target peer because of the label configuration.
  3. Meanwhile, a network partition happens between the tidb and tikv nodes in AZ a. When errors are encountered while processing stale read requests, the leader peer is unconditionally retried.
  4. The leader is unavailable, so tidb keeps raising pseudo "epoch not match" errors and retrying the same peer.
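The steps above can be sketched as a small simulation. This is a hypothetical illustration, not the actual client-go selector: the `replica` type and `pickPeer` function are made up to show why an unconditional fall-back-to-leader loops forever when the leader itself is unreachable, and how a further fallback to followers recovers.

```go
package main

import "fmt"

// replica is an illustrative stand-in for a region peer.
type replica struct {
	id        int
	isLeader  bool
	available bool
}

// pickPeer models the retry decision. Without fallbackToFollower it mirrors
// the buggy behaviour: only the leader is ever retried. With it, an
// unavailable leader is followed by a replica-read attempt on any follower.
func pickPeer(replicas []replica, fallbackToFollower bool) (int, bool) {
	// After the local stale-read attempt fails, the leader is tried first.
	for _, r := range replicas {
		if r.isLeader && r.available {
			return r.id, true
		}
	}
	if !fallbackToFollower {
		// Old behaviour: keep retrying the unavailable leader and never succeed.
		return -1, false
	}
	// Fixed behaviour: fall back to any available follower with replica-read.
	for _, r := range replicas {
		if !r.isLeader && r.available {
			return r.id, true
		}
	}
	return -1, false
}

func main() {
	// Leader in AZ a is partitioned away; followers in AZs b and c are fine.
	replicas := []replica{
		{id: 1, isLeader: true, available: false},
		{id: 2, available: true},
		{id: 3, available: true},
	}
	_, ok := pickPeer(replicas, false)
	fmt.Println("without fallback, request succeeds:", ok)
	id, ok := pickPeer(replicas, true)
	fmt.Println("with fallback, request succeeds:", ok, "peer:", id)
}
```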
@you06
Contributor

you06 commented Aug 4, 2023

#916 fixed this issue; it falls back to follower read when the leader is unavailable.

@cfzjywxk
Contributor Author

cfzjywxk commented Aug 4, 2023

> #916 fixed this issue; it falls back to follower read when the leader is unavailable.

@you06 #916 introduces the fallback to follower when ServerIsBusy is returned; are the unavailable and RPC error cases not handled yet?

@you06
Contributor

you06 commented Aug 4, 2023

@cfzjywxk I also made some changes in #910, which handle the unavailable and RPC error cases.

// In stale-read, the request will fallback to leader after the local follower failure.
// If the leader is also unavailable, we can fallback to the follower and use replica-read flag again,
// The remote follower not tried yet, and the local follower can retry without stale-read flag.
if state.isStaleRead {
	selector.state = &tryFollower{
		fallbackFromLeader: true,
		leaderIdx:          state.leaderIdx,
		lastIdx:            state.leaderIdx,
		labels:             state.option.labels,
	}
	if leaderEpochStale {
		selector.regionCache.scheduleReloadRegion(selector.region)
	}
	return nil, stateChanged{}
}

@cfzjywxk
Contributor Author

cfzjywxk commented Aug 4, 2023

@you06 OK, you could close this issue in #910 then.

@cfzjywxk
Contributor Author

@you06 Please close this issue once the PR to master is merged.
