Sometimes failed to create scatter operator after splitting #4525

Open
YuJuncen opened this issue Dec 30, 2021 · 3 comments
Labels
epic/stability, type/enhancement

Comments

@YuJuncen

Bug Report

cc pingcap/tidb#31034

What did you do?

Execute ScatterRegions over some fresh regions created by BatchSplitRegion.

What did you expect to see?

The regions are scattered and balanced.

What did you see instead?

Some of the scatter operators failed to be created; the reported reason was "unhealthy". As a result, the final region distribution isn't balanced.

What version of PD are you using (pd-server -V)?

/var/lib/pd/log # /pd-server -V
Release Version: v5.4.0-nightly
Edition: Community
Git Commit Hash: ae23d409c528a836ae6d98cd174689c38ad19f2d
Git Branch: heads/refs/tags/v5.4.0-nightly
UTC Build Time:  2021-12-30 12:11:23

Note

(Details of the metrics and cluster info are TBD.)

YuJuncen added the type/bug label Dec 30, 2021
@YuJuncen
Author

YuJuncen commented Dec 31, 2021

What happened:

  1. While running a restore, we found that the regions were unbalanced.

image

  2. We then checked the PD dashboard and found many scatter failures:

image

  3. According to the DEBUG log, the main reason the scatter operators failed to be created is "unhealthy". But the cluster looks fine. (Those empty regions are freshly split and will be filled with data soon.)

Note that this metric may not be accurate enough.

image

  4. We also found that many add-rule-peer operators were created during scattering.

image

I'm wondering: what is the reason for that failure? How can I find out why those add-rule-peer operators were created, and are they related to the failure?

@nolouch
Contributor

nolouch commented Jan 11, 2022

I'm wondering: what is the reason for that failure? How can I find out why those add-rule-peer operators were created, and are they related to the failure?

Good question. I think we need to do some work to improve the diagnosis, like #4418.

nolouch added the type/enhancement and epic/stability labels and removed the type/bug label Jan 11, 2022
@rleungx
Member

rleungx commented Jan 14, 2022

The add-rule-peer problem is tracked by #4565 and has been fixed. The scatter operator fails to be created because the region has pending peers. We can check for pending peers before starting to scatter the region. See pingcap/tidb#31691.

Here is the test result of the above PR:
[image: Screen Shot 2022-01-14 at 6 18 49 PM]
[image: origin_img_v2_7a2bb692-6f5e-4489-8abf-a5b94e80934g]
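
For readers following along, the pre-scatter check mentioned above could look roughly like the sketch below. This is not the code from pingcap/tidb#31691 and not the PD client API: `Peer`, `Region`, `ReadyToScatter`, and `PartitionForScatter` are hypothetical names, and the sketch assumes the caller has already fetched each region's pending-peer list from PD.

```go
package main

import "fmt"

// Peer and Region are hypothetical stand-ins for the region metadata
// a client would get back from PD; they are not the real PD types.
type Peer struct {
	ID      uint64
	StoreID uint64
}

type Region struct {
	ID           uint64
	Peers        []Peer
	PendingPeers []Peer // peers that have not caught up yet
}

// ReadyToScatter returns nil when the region has no pending peers.
// Otherwise it returns an error describing why scattering should wait.
func ReadyToScatter(r Region) error {
	if n := len(r.PendingPeers); n > 0 {
		return fmt.Errorf("region %d has %d pending peer(s)", r.ID, n)
	}
	return nil
}

// PartitionForScatter splits regions into those that can be scattered
// now and those that should be re-checked later (assumed retry policy).
func PartitionForScatter(regions []Region) (ready, retry []Region) {
	for _, r := range regions {
		if ReadyToScatter(r) == nil {
			ready = append(ready, r)
		} else {
			retry = append(retry, r)
		}
	}
	return ready, retry
}
```

A caller would pass the freshly split regions to `PartitionForScatter`, scatter the `ready` ones, and poll the `retry` ones again after a short wait. How long to wait and how often to retry is left open here.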
