PD may remove wrong pending peer if the peer count exceeds the limit #4045
Comments
Will this bug exist if there are at least 3 voters?
Sorry, I didn't get the point. Could you elaborate more on:
There will be.
If a region is desired to have n replicas, and we now have m replicas in fact (m > n), then there are (m - n) extra peers.
It means the peer is not pending.
In other words, PD should not remove peers to bring the peer count down to the desired count when the number of non-pending peers does not exceed the desired count.
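The rule proposed in this comment can be sketched as a small check. This is a hypothetical model, not PD's actual code: the `Peer` record, its `pending` flag, and the two helper functions are names invented here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    """Hypothetical peer record; not PD's real data structure."""
    peer_id: int
    store: str
    pending: bool  # True if the peer has not caught up with the leader


def extra_peer_count(peers, desired):
    """Return m - n: how many peers exceed the desired replica count."""
    return max(len(peers) - desired, 0)


def should_remove_extra_peer(peers, desired):
    """Allow removal only when non-pending peers alone exceed the desired count."""
    healthy = [p for p in peers if not p.pending]
    return len(healthy) > desired


# A region that wants n = 2 replicas but has m = 3 peers, two of them pending.
peers = [
    Peer(1, "Store1", pending=False),
    Peer(2, "Store2", pending=True),
    Peer(3, "Store3", pending=True),
]
print(extra_peer_count(peers, 2))          # 1
print(should_remove_extra_peer(peers, 2))  # False
```

With this guard, a surplus made up only of pending peers is left alone until enough peers have caught up.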
The severity of this bug seems to be critical.
There is indeed one extra learner replica, and it is reasonable to delete one learner. Do you expect PD to delete the replica with less data instead of deleting one at random? PD does not know the synchronization progress of each replica.
Got it. It's reasonable.
IMO, this example is beyond the scope of our design. We should use multiple voters instead of learners to ensure data consistency. From the perspective of the consistency model, a non-pending learner means that its data is likely to be up to date, and a pending learner means that its data is likely not to be up to date; both are essentially inconsistent. The reason for the data loss in this example is that there is only one voter, and that voter is down. Whether PD's strategy is to not delete the extra learner, to delete only the pending learner, or to delete only the learner with more logs, it will not prevent the data loss.
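The point about the single voter can be illustrated with the usual Raft majority rule, in which learners do not count toward the quorum. This is a minimal sketch; `has_quorum` is a hypothetical helper, not a function from PD or TiKV.

```python
def has_quorum(voters_up, voters_total):
    """Raft majority: more than half of the voters must be reachable."""
    return voters_up > voters_total // 2


# One voter plus any number of learners: losing the voter loses the quorum.
print(has_quorum(voters_up=0, voters_total=1))  # False
# Three voters tolerate one failure.
print(has_quorum(voters_up=2, voters_total=3))  # True
```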
Is this question mainly considered in the recovery scenario? It should be an enhancement.
@disksing The problem is that PD cannot distinguish between a pending peer that has no data and a pending peer that has no recent data.
Bug Report
What did you do?
The steps to reproduce:
1. There are three stores: Store1, Store2, and Store3. Region1 is a two-replica region and has 1 voter, Peer1, and 1 learner, Peer2.
2. [...] Store3, therefore, PD adds a new learner, Peer3, to Store3.
3. Store2 suffers from network jitter, which makes Peer2 fail to catch up with the leader.
4. Store3's network becomes unstable, so it is unable to get a snapshot from the leader. PD cancels the add peer operation on timeout. Three peers then exist for Region1, and PD is going to delete one of them.
5. Store1 goes down, and no data is left for Region1 in the TiKV cluster.
What did you expect to see?
PD should not delete extra peers if the number of normal peers does not exceed the limit.
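The reproduction steps and the expected behavior can be replayed in a toy model. Everything here is an assumption made for illustration: the dictionary layout, the `has_data` field (which, per the discussion, PD cannot actually observe), and the choice of the lowest-id learner as the removal victim. It is not how PD selects peers.

```python
def surviving_data(guard):
    """Replay the scenario; return ids of peers that still hold Region1 data.

    guard=False: remove a peer whenever the peer count exceeds the limit.
    guard=True:  remove only if non-pending peers alone exceed the limit.
    """
    desired = 2
    # State after the add peer operation timed out: Peer1 is the voter,
    # Peer2 lags behind the leader, Peer3 never received a snapshot.
    peers = {
        1: {"store": "Store1", "role": "voter", "pending": False, "has_data": True},
        2: {"store": "Store2", "role": "learner", "pending": True, "has_data": True},
        3: {"store": "Store3", "role": "learner", "pending": True, "has_data": False},
    }
    healthy = [pid for pid, p in peers.items() if not p["pending"]]
    if guard:
        allowed = len(healthy) > desired
    else:
        allowed = len(peers) > desired
    if allowed:
        # Assumption: the scheduler cannot see has_data, so it may pick
        # the learner that still holds (stale) data.
        victim = min(pid for pid, p in peers.items() if p["role"] == "learner")
        del peers[victim]
    # Finally, Store1 goes down; only peers on other stores remain reachable.
    return sorted(
        pid for pid, p in peers.items()
        if p["store"] != "Store1" and p["has_data"]
    )


print(surviving_data(guard=False))  # []
print(surviving_data(guard=True))   # [2]
```

Under the guarded policy, Peer2 survives with stale data, which is the outcome the report asks for; without the guard, nothing recoverable is left.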
What version of PD are you using (pd-server -V)?
master