PD may remove wrong pending peer if the peer count exceeds the limit #4045

andylokandy · 2021-08-27T10:42:18Z

Bug Report

What did you do?

The steps to reproduce:

Assume we have 3 stores: Store1, Store2, and Store3. Region1 is a two-replica region with one voter (Peer1) and one learner (Peer2):

| Store1 | Store2 | Store3 |
| --- | --- | --- |
| Peer1 (leader) | Peer2 (learner) | |

PD decides to move the learner to Store3, so it adds a new learner, Peer3, on Store3:

| Store1 | Store2 | Store3 |
| --- | --- | --- |
| Peer1 (leader) | Peer2 (learner) | Peer3 (learner, no data, pending) |

Store2 suffers from network jitter, which makes Peer2 fail to catch up with the leader:

| Store1 | Store2 | Store3 |
| --- | --- | --- |
| Peer1 (leader) | Peer2 (learner, pending) | Peer3 (learner, no data, pending) |

Store3's network becomes unstable, so Peer3 is unable to get a snapshot from the leader. PD cancels the add-peer operation on timeout. Region1 now has 3 peers, so PD deletes one of them, and it removes Peer2:

| Store1 | Store2 | Store3 |
| --- | --- | --- |
| Peer1 (leader) | | Peer3 (learner, no data, pending) |

Store1 then goes down, and no data is left for Region1 in the TiKV cluster:

| Store1 | Store2 | Store3 |
| --- | --- | --- |
| | | Peer3 (learner, no data, pending) |
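
The effect of the deletion choice in the steps above can be replayed with a small sketch. This is not PD code: the `Replica` type, its `HasData` field, and the `survivorsWithData` helper are hypothetical names introduced only for illustration, and `HasData` models information that PD itself does not have.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified model of one peer of Region1.
type Replica struct {
	Name    string
	Store   int
	HasData bool // whether the peer ever received region data (not visible to PD)
	Pending bool
}

// region1 returns the state just before PD deletes a peer:
// Peer2 is pending but holds stale data, Peer3 is pending and empty.
func region1() []Replica {
	return []Replica{
		{Name: "Peer1", Store: 1, HasData: true, Pending: false},
		{Name: "Peer2", Store: 2, HasData: true, Pending: true},
		{Name: "Peer3", Store: 3, HasData: false, Pending: true},
	}
}

// survivorsWithData returns the names of replicas that still hold some data
// after the peer named removed is deleted and the store downStore fails.
func survivorsWithData(replicas []Replica, removed string, downStore int) []string {
	var names []string
	for _, r := range replicas {
		if r.Name == removed || r.Store == downStore || !r.HasData {
			continue
		}
		names = append(names, r.Name)
	}
	return names
}

func main() {
	// Compare the two possible deletions, followed by the loss of Store1.
	for _, removed := range []string{"Peer2", "Peer3"} {
		left := survivorsWithData(region1(), removed, 1)
		fmt.Printf("remove %s, Store1 down -> replicas with data: %v\n", removed, left)
	}
}
```

In this model, removing Peer2 leaves no replica with data once Store1 is down, while removing Peer3 would leave Peer2 with its (possibly stale) copy.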

What did you expect to see?

PD should not delete extra peers if the number of normal peers does not exceed the limit.

What version of PD are you using (`pd-server -V`)?

master


zz-jason · 2021-08-29T05:25:08Z

will this bug exist if there are at least 3 voters?

zz-jason · 2021-08-29T05:28:53Z

> PD should not delete extra peers if the number of normal peers does not exceed the limit.

Sorry, I didn't get the point. Could you elaborate on:

- what are the "extra peers"?
- what are "normal peers"?
- what is the "limit"?

andylokandy · 2021-08-29T13:11:38Z

@zz-jason

> will this bug exist if there are at least 3 voters?

Yes, it will.

> what are the "extra peers"? what is the "limit"?

If a region is configured to have n replicas but actually has m replicas (m > n), then n is the limit and there are (m - n) extra peers.

> what are "normal peers"?

A normal peer is a peer that is not pending.

> PD should not delete extra peers if the number of normal peers does not exceed the limit.

In other words, when the total number of non-pending peers does not exceed the desired peer count, PD should not remove peers just to bring the total peer count down to the desired count.
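
A minimal sketch of that rule, taking the wording above literally. It is not PD's implementation: the `Peer` type and the functions `extraPeers`, `normalPeers`, and `mayRemoveExtraPeer` are hypothetical names used only to make the definitions concrete.

```go
package main

import "fmt"

// Peer is a hypothetical, simplified view of a region peer.
type Peer struct {
	ID      uint64
	Pending bool
}

// extraPeers returns m - n when the region has m peers and wants n (m > n),
// and 0 otherwise.
func extraPeers(peers []Peer, desired int) int {
	if len(peers) > desired {
		return len(peers) - desired
	}
	return 0
}

// normalPeers counts the peers that are not pending.
func normalPeers(peers []Peer) int {
	n := 0
	for _, p := range peers {
		if !p.Pending {
			n++
		}
	}
	return n
}

// mayRemoveExtraPeer allows a removal only when there is an extra peer and
// the number of non-pending peers exceeds the desired count.
func mayRemoveExtraPeer(peers []Peer, desired int) bool {
	return extraPeers(peers, desired) > 0 && normalPeers(peers) > desired
}

func main() {
	// Region1 from the report: one healthy voter and two pending learners,
	// with a desired replica count of 2.
	region1 := []Peer{{ID: 1}, {ID: 2, Pending: true}, {ID: 3, Pending: true}}
	fmt.Println(extraPeers(region1, 2), normalPeers(region1), mayRemoveExtraPeer(region1, 2))
}
```

For Region1 in the report there is one extra peer but only one normal peer, so under this reading no peer would be removed.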

tiancaiamao · 2021-08-30T10:04:46Z

The severity of this bug seems to be critical.

andylokandy · 2021-08-30T10:23:25Z

/cc @nolouch @disksing PTAL

nolouch · 2021-08-30T10:32:32Z

There is indeed one learner replica too many, so it is reasonable to delete one learner. Do you expect PD to delete the replica with less data instead of just deleting one at random? PD does not know the synchronization progress of each replica.

nolouch · 2021-08-30T10:39:37Z

> PD should not try to keep the number of peers to the desired peer count by removing some of the peers when the total number of non-pending peers does not exceed the desired count.

Got it. It's reasonable.

disksing · 2021-08-30T11:03:30Z

IMO, this example is beyond the scope of our design. We should use multiple voters instead of some learners to ensure data consistency.

From the perspective of the consistency model, a non-pending learner means that its data is likely to be up-to-date, and a pending learner means that its data is likely not to be up-to-date -- both are essentially inconsistent.

The reason for the data loss in this example is that there is only one voter, and that voter is down. Whether the PD's strategy is to not delete the extra learner, or to only delete the pending learner, or to only delete the learner with more logs, it will not change the occurrence of data loss.

nolouch · 2021-08-30T11:41:39Z

Is this issue mainly a concern in the recovery scenario? It should be an enhancement.

andylokandy · 2021-08-30T12:02:53Z

@disksing The problem is that PD cannot distinguish between a pending peer that has no data and a pending peer that has no recent data.

andylokandy added the type/bug The issue is confirmed as a bug. label Aug 27, 2021

andylokandy changed the title ~~PD may remove wrong pending peer if the peer count exceed the limit~~ PD may remove wrong pending peer if the peer count exceeds the limit Aug 27, 2021

andylokandy mentioned this issue Sep 2, 2021

placement: do not delete orphan peers if some peers selected by RuleFit is down or pending #4067

Merged

HunDunDM added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. labels Sep 3, 2021

ti-chi-bot closed this as completed in #4067 Sep 6, 2021

andylokandy mentioned this issue Sep 6, 2021

placement: do not delete orphan peers if some peers selected by RuleFit is down or pending (#4067) #4087

Closed

nolouch mentioned this issue Oct 21, 2021

Priority to fit health peer with placement rules #4233

Closed

HunDunDM mentioned this issue Dec 15, 2022

PD may repeatedly add learner to a region #5786

Closed

nolouch mentioned this issue Oct 26, 2023

checker: add disconnected check when fix orphan peers #7240

Merged
