Hot scheduler cannot handle some precise boundary conditions #5021

Closed
Tracked by #4949
lhy1024 opened this issue May 24, 2022 · 11 comments · Fixed by #5515
Labels
severity/moderate type/bug The issue is confirmed as a bug.

Comments

@lhy1024
Contributor

lhy1024 commented May 24, 2022

Bug Report

What did you do?

There is one store with two hot regions of nearly identical load, and another store without any hot region.

{
  "as_peer": {
    "1": {
      "store_bytes": 700192310.5,
      "store_keys": 8004996.1,
      "store_query": 160.1,
      "total_flow_bytes": 700192311,
      "total_flow_keys": 8004996,
      "total_flow_query": 160,
      "regions_count": 2,
      "statistics": [
        {
          "store_id": 1,
          "region_id": 4004,
          "hot_degree": 180,
          "flow_bytes": 349545961,
          "flow_keys": 3996039,
          "flow_query": 80,
          "anti_count": 12,
          "last_update_time": "2022-05-23T07:57:53.784776989Z"
        },
        {
          "store_id": 1,
          "region_id": 4001,
          "hot_degree": 182,
          "flow_bytes": 350646350,
          "flow_keys": 4008957,
          "flow_query": 80,
          "anti_count": 12,
          "last_update_time": "2022-05-23T07:57:53.784836509Z"
        }
      ]
    },
    "66": {
      "store_bytes": 0,
      "store_keys": 0,
      "store_query": 0,
      "total_flow_bytes": 0,
      "total_flow_keys": 0,
      "total_flow_query": 0,
      "regions_count": 0,
      "statistics": []
    }
  },
  "as_leader": {
    "1": {
      "store_bytes": 700192310.5,
      "store_keys": 8004996.1,
      "store_query": 160.1,
      "total_flow_bytes": 700192311,
      "total_flow_keys": 8004996,
      "total_flow_query": 160,
      "regions_count": 2,
      "statistics": [
        {
          "store_id": 1,
          "region_id": 4004,
          "hot_degree": 180,
          "flow_bytes": 349545961,
          "flow_keys": 3996039,
          "flow_query": 80,
          "anti_count": 12,
          "last_update_time": "2022-05-23T07:57:53.784776989Z"
        },
        {
          "store_id": 1,
          "region_id": 4001,
          "hot_degree": 182,
          "flow_bytes": 350646350,
          "flow_keys": 4008957,
          "flow_query": 80,
          "anti_count": 12,
          "last_update_time": "2022-05-23T07:57:53.784836509Z"
        }
      ]
    },
    "66": {
      "store_bytes": 0,
      "store_keys": 0,
      "store_query": 0,
      "total_flow_bytes": 0,
      "total_flow_keys": 0,
      "total_flow_query": 0,
      "regions_count": 0,
      "statistics": []
    }
  }
}

What did you expect to see?

The hot scheduler evens out the load between the two stores.

What did you see instead?

no change


What version of PD are you using (pd-server -V)?

v6.0

@lhy1024 lhy1024 added the type/bug The issue is confirmed as a bug. label May 24, 2022
@lhy1024
Contributor Author

lhy1024 commented May 24, 2022


Suppose store1 has two peers at 10 QPS each and store2 has no hot peer: decRatio = (0+10)/(20-10) = 1.0 > 0.99.

Suppose store1 has three peers at 10 QPS each and store2 has one peer at 10 QPS: decRatio = (10+10)/(30-10) = 1.0 > 0.99.

In both cases decRatio exceeds 0.99, so the scheduler declines the move.
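
The arithmetic above can be reproduced with a small Go sketch. It is not PD's actual code: `srcDecRate` is a hypothetical stand-in for `getSrcDecRate` (the guard against a non-positive result is an assumption), `rejectThreshold` is the 0.99 value quoted in this thread, and applying the same ratio to the `flow_bytes` figures from the report is also an assumption.

```go
package main

import "fmt"

// rejectThreshold is the 0.99 value quoted in this thread.
const rejectThreshold = 0.99

// srcDecRate is a hypothetical stand-in for getSrcDecRate: the source
// store's rate once the peer has left. The guard against a non-positive
// result is an assumption of this sketch.
func srcDecRate(srcRate, peerRate float64) float64 {
	if srcRate-peerRate <= 0 {
		return 1
	}
	return srcRate - peerRate
}

// decRatio divides the destination's rate after the move by the
// source's rate after the move.
func decRatio(srcRate, dstRate, peerRate float64) float64 {
	return (dstRate + peerRate) / srcDecRate(srcRate, peerRate)
}

// rejected reports whether the move is declined because the ratio
// exceeds the threshold.
func rejected(srcRate, dstRate, peerRate float64) bool {
	return decRatio(srcRate, dstRate, peerRate) > rejectThreshold
}

func main() {
	cases := []struct {
		name           string
		src, dst, peer float64
	}{
		{"two peers vs. empty store", 20, 0, 10},
		{"three peers vs. one peer", 30, 10, 10},
		// store_bytes of store 1 and flow_bytes of region 4004 from the report.
		{"reported bytes, region 4004", 700192310.5, 0, 349545961},
	}
	for _, c := range cases {
		fmt.Printf("%s: decRatio=%.4f rejected=%v\n",
			c.name, decRatio(c.src, c.dst, c.peer), rejected(c.src, c.dst, c.peer))
	}
}
```

Under these assumptions every case is rejected, even though each move would leave the two stores with equal or nearly equal load.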

@mayjiang0203

This is a corner case; marking it as moderate.
/severity moderate

@xuning97

/assign @xuning97

@nolouch
Contributor

nolouch commented Jun 10, 2022

Hi @xuning97, do you want to help resolve this issue? It requires some background knowledge, so it may be best to start with some simpler issues first.

@xuning97

Yes, I want to look into it. At a glance, it looks like a simple one: it seems to be a problem with the calculation formula, and with the correct formula I can get it solved.

So probably this is more complicated than I thought?

@nolouch
Contributor

nolouch commented Jun 13, 2022

@xuning97 We need to evaluate it. That requires some experience. There are some problems:

  • How to change the condition?
  • How to evaluate what is better scheduling?
  • If we change the condition, redundant scheduling may increase. How do we solve that?

@xuning97

Suppose there are two peers with 10 QPS each in store1 and no hot peer in store2: decRatio = (20-10)/(0+10) = 1.0 > 0.99

Suppose there are three peers with 10 QPS each in store1 and one peer with 10 QPS in store2: decRatio = (30-10)/(10+10) = 1.0 > 0.99

@lhy1024, I just want to confirm whether the numerator and denominator are swapped.

The function computes (dstRate + peerRate) / getSrcDecRate(srcRate, peerRate), so there is no subtraction in the numerator.

@lhy1024
Contributor Author

lhy1024 commented Jun 14, 2022

Suppose there are two peers with 10 QPS each in store1 and no hot peer in store2: decRatio = (20-10)/(0+10) = 1.0 > 0.99
Suppose there are three peers with 10 QPS each in store1 and one peer with 10 QPS in store2: decRatio = (30-10)/(10+10) = 1.0 > 0.99

@lhy1024, I just want to confirm whether the numerator and denominator are swapped.

The function computes (dstRate + peerRate) / getSrcDecRate(srcRate, peerRate), so there is no subtraction in the numerator.

Thanks, updated.

@xuning97

@lhy1024, regarding the second example:
Suppose there are three peers with 10 QPS each in store1 and one peer with 10 QPS in store2: decRatio = (10+10)/(30-10) = 1.0 > 0.99

Why do you think this move should not be declined?

@lhy1024
Contributor Author

lhy1024 commented Jun 14, 2022

@lhy1024, regarding the second example: suppose there are three peers with 10 QPS each in store1 and one peer with 10 QPS in store2: decRatio = (10+10)/(30-10) = 1.0 > 0.99

Why do you think this move should not be declined?

Because we expect the load between stores to be even, while reducing unnecessary scheduling.

@xuning97

@nolouch

@xuning97 We need to evaluate it. That requires some experience. There are some problems:

  • How to change the condition?
    Consider including the standard deviation in the logic: also compare how far each store's load deviates from the mean of the two loads.
    In example 1, the mean is (0+20)/2 = 10 and the sum of squared deviations is 100+100 = 200, which reflects an unbalanced state.
  • How to evaluate what is better scheduling?
    The stores are more balanced, and transfers happen no more frequently (or only slightly more frequently) than before.
  • If we change the condition, redundant scheduling may increase. How do we solve that?
    If the redundant scheduling is too pronounced, so that the result is not worth the benefit of fixing this corner case, the solution is not acceptable. In that case we can limit the fix to be as small as possible and solve only the specific corner case pointed out.
    We need to measure the increase in redundant scheduling with data. Is there an existing method or utility for this kind of check?
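
The deviation-based idea in the first answer can be sketched in Go as follows. This only illustrates the proposal and is not the fix that was merged: `squaredDeviation` and `improvesBalance` are hypothetical names, and accepting a move whenever the deviation strictly decreases is an assumption made for the sketch.

```go
package main

import "fmt"

// squaredDeviation returns the sum of squared deviations of two store
// loads from their mean.
func squaredDeviation(a, b float64) float64 {
	mean := (a + b) / 2
	return (a-mean)*(a-mean) + (b-mean)*(b-mean)
}

// improvesBalance reports whether moving a peer with rate peerRate from
// the source store to the destination store lowers the deviation.
// Hypothetical acceptance rule, used only for this sketch.
func improvesBalance(srcRate, dstRate, peerRate float64) bool {
	before := squaredDeviation(srcRate, dstRate)
	after := squaredDeviation(srcRate-peerRate, dstRate+peerRate)
	return after < before
}

func main() {
	// Example 1: two 10 QPS peers on store1, nothing on store2.
	fmt.Println(squaredDeviation(20, 0), improvesBalance(20, 0, 10))
	// Example 2: three 10 QPS peers on store1, one on store2.
	fmt.Println(squaredDeviation(30, 10), improvesBalance(30, 10, 10))
	// Already balanced stores: moving a peer would only make things worse.
	fmt.Println(squaredDeviation(10, 10), improvesBalance(10, 10, 10))
}
```

Both examples from the thread would be accepted under this rule, while a move between already balanced stores would not. A rule this strict could still accept moves with very small gains, so a real condition would probably need a margin to limit redundant scheduling.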
