Sometimes failed to create scatter operator after splitting #4525

YuJuncen · 2021-12-30T07:37:12Z

Bug Report

cc pingcap/tidb#31034

What did you do?

Execute ScatterRegions over some fresh regions created by BatchSplitRegion.

What did you expect to see?

The regions are scattered and balanced.

What did you see instead?

Some of the scatter operators failed to be created because the regions were reported as unhealthy, and the final region distribution was not balanced.

What version of PD are you using (`pd-server -V`)?

/var/lib/pd/log # /pd-server -V
Release Version: v5.4.0-nightly
Edition: Community
Git Commit Hash: ae23d409c528a836ae6d98cd174689c38ad19f2d
Git Branch: heads/refs/tags/v5.4.0-nightly
UTC Build Time:  2021-12-30 12:11:23

Note

(Details of the metrics and cluster info are TBD.)


YuJuncen · 2021-12-31T03:40:39Z

What happened:

While running a restore, we found that the regions were unbalanced.

We then checked the PD dashboard and found many scatter failures:

According to the DEBUG log, the main reason for failing to create the scatter operators is "unhealthy". But the cluster looks fine. (The empty regions are freshly split and will be filled with data soon.)

Note that this metric may not be accurate enough.

We also found that many add-rule-peer operators were created during scattering.

I'm wondering: what is the reason for that failure? How can I find out why those add-rule-peer operators were created, and are they related to the failure?

nolouch · 2022-01-11T03:09:28Z

> I'm wondering: what is the reason for that failure? How can I find out why those add-rule-peer operators were created, and are they related to the failure?

Good question. I think we need to do some work to improve the diagnosis, like #4418.

rleungx · 2022-01-14T10:27:42Z

The add-rule-peer problem is tracked by #4565 and has been fixed. The operator fails to be created because the region has pending peers. We can check for pending peers before starting to scatter the region. See pingcap/tidb#31691.

Here is the test result of the above PR:

YuJuncen added the type/bug label on Dec 30, 2021

nolouch added the type/enhancement and epic/stability labels and removed the type/bug label on Jan 11, 2022

YuJuncen mentioned this issue Jan 19, 2022

scatter: check pending peers before scattering pingcap/tidb#31691

Merged


This was referenced Feb 7, 2022

scatter: check pending peers before scattering (#31691) pingcap/tidb#32128

Merged

scatter: check pending peers before scattering (#31691) pingcap/tidb#32129

Merged
