
Lightning blocked at a not fully replicated region #6426

Closed
nolouch opened this issue May 9, 2023 · 1 comment · Fixed by #6427

Comments


nolouch commented May 9, 2023

Bug Report

Problem

During a concurrent import, a region may end up with a replica set that violates the configured replica constraints.

Lightning splits regions while scattering them, which can produce such abnormal regions. The PD logs show a region allocated with 6 replica peers:

/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.491 +00:00] [INFO] [cluster_worker.go:145] ["alloc ids for region split"] [region-id=1902122] [peer-ids="[1902123,1902124,1902125,1902126,1902127,1902128]"] 
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.502 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.514 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.536 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.579 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.661 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:54.822 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:55.144 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:55.785 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:57.067 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:41:59.069 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]
/var/lib/pd/log/pd-2023-05-09T06-58-33.433.log:[2023/05/09 05:42:01.071 +00:00] [WARN] [region_scatterer.go:285] ["region not replicated during scatter"] [region-id=1902122]

The metrics showed the same symptom.

That is, Lightning tries to scatter the region, but the region does not satisfy the replica count required by the rule constraint: it has more (or fewer) replicas than expected. PD is supposed to repair such a region, after which Lightning's retried scatter request would succeed. The logs show, however, that PD never attempts the repair.

Analysis

After investigating, we found that Lightning uses the new interface that pauses scheduling for a key range. However, the function introduced by #4649 stops all schedulers and operators for the key range once it is labeled schedule=deny. This also blocks the RuleChecker from fixing the region, which is unexpected during an import. Details:

func (c *Controller) CheckRegion(region *core.RegionInfo) []*operator.Operator {
	// If PD has restarted, it needs to check learners added before and promote them.
	// Don't check isRaftLearnerEnabled cause it maybe disable learner feature but there are still some learners to promote.
	opController := c.opController
	if op := c.jointStateChecker.Check(region); op != nil {
		return []*operator.Operator{op}
	}
	if cl, ok := c.cluster.(interface{ GetRegionLabeler() *labeler.RegionLabeler }); ok {
		l := cl.GetRegionLabeler()
		if l.ScheduleDisabled(region) {
			return nil
		}
	}
	// ...

Previously, Lightning's pause timed out after 3 minutes, the pause-scheduler label was then cleared, and PD could continue fixing the region. Lightning has since increased the timeout, so the blocking issue is now much easier to observe.

How to fix

Do not deny placement-rule operators during the import. We could either introduce a new label such as schedule=importing, or simply let schedule=deny still allow rule-constraint checks.

What version of PD are you using (pd-server -V)?

master


nolouch commented May 9, 2023

#6420 also adds a quick-fix commit for this issue.

ti-chi-bot bot added a commit that referenced this issue May 10, 2023
close #6426

allow the `schedule=deny` label can do rule constraints check

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 10, 2023
close tikv#6426

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot added a commit that referenced this issue May 11, 2023
close #6426, ref #6427

allow the `schedule=deny` label can do rule constraints check

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: nolouch <nolouch@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
nolouch added a commit to nolouch/pd that referenced this issue May 15, 2023
close tikv#6426

allow the `schedule=deny` label can do rule constraints check

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 25, 2023
close #6426, ref #6427

allow the `schedule=deny` label can do rule constraints check

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Feb 18, 2024
close tikv#6426

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this issue Feb 26, 2024
close #6426

allow the `schedule=deny` label can do rule constraints check

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>