Large Query Influence the Resource Group Stability #50831

Closed
nolouch opened this issue Jan 30, 2024 · 7 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@nolouch
Member

nolouch commented Jan 30, 2024

Background

A user is running a stable workload, and the RU limit works fine. But once a query's plan changes, or a new query arrives that consumes a lot of RUs, the limit can become unstable:

The query is running on tidb-0 and consumes a lot of RUs:

[monitoring screenshots omitted]

As the screenshots show, after the problem query finishes, normal queries can only recover as the refill rate pays back the accumulated debt, which increases latency for a long time: roughly, a debt of D RUs against a fill rate of R RU/s takes about D/R seconds to repay.

Issues

  • Strictly speaking, this behavior is by design, but it is hard to explain to users. Users may expect normal queries to recover quickly.
  • Different TiDB instances are affected to varying degrees. If the debt could be shared, the overall situation might be better.
  • Some queries also report Exceed the limit errors after the large query has executed.
@nolouch nolouch added the type/enhancement The issue or PR belongs to an enhancement. label Jan 30, 2024
@pregenRobot

pregenRobot commented May 23, 2024

Hello.

We encountered a similar issue in production the other day, and I did some research into the internals of the Resource Manager and Request Units, as well as reading up on some RFCs.

Please note I am still a beginner with the internals of the project, but here is what I found so far:

  • TiDB's Resource Manager uses the token bucket algorithm under the hood.
  • PD trickles and grants tokens to TiDB clients, each of which maintains its own local token bucket.
  • Read-based queries compute their RU cost ahead of time, before executing the query.
  • Write-based queries can estimate RU ahead of time to an extent, but the real RU consumed by the query is only computed at the end.
    • Suppose there is a small amount of RU left in the local TiDB node, just enough to start running a massive write query.
    • After executing it, a massive amount of RU has been consumed, which is deducted from the local TiDB token bucket.
    • This TiDB node cannot handle requests until the debt is repaid, which throttles QPS (a sketch of this follows the list).
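To make the failure mode concrete, here is a minimal, self-contained sketch (made-up numbers and types, not PD's actual token-bucket code) of a local bucket that admits a large write on a small positive balance and only settles the real cost afterwards:

```go
package main

import "fmt"

// localBucket is a toy model of a TiDB node's local token bucket.
type localBucket struct {
	tokens   float64 // current balance in RU; may go negative (debt)
	fillRate float64 // RU per second granted by PD (assumed constant here)
}

// canStart admits a query as long as the balance is still positive.
func (b *localBucket) canStart() bool { return b.tokens > 0 }

// settle deducts the real RU cost, which for writes is only known at the end.
func (b *localBucket) settle(realRU float64) { b.tokens -= realRU }

func main() {
	b := &localBucket{tokens: 1_000, fillRate: 2_000}

	// A tiny positive balance is enough to admit a massive write query.
	if b.canStart() {
		b.settle(500_000) // the real cost is only discovered afterwards
	}
	fmt.Printf("debt after the large write: %.0f RU\n", -b.tokens)

	// Normal queries are throttled until the refill rate repays the debt,
	// i.e. roughly debt / fillRate seconds (about 250s with these numbers).
	fmt.Printf("approximate recovery time: %.0fs\n", -b.tokens/b.fillRate)
}
```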

As you suggested, it would be nice if the debt could be shared between the TiDB nodes.

One possible solution might be to throttle the trickling of tokens to TiDB nodes with tokens > 0 and instead prioritize allocating more tokens to the TiDB nodes that are in debt.

One way this might work is (a rough sketch follows the list):

  • The server has an endpoint pay_back_debt().
  • The server keeps track of which TiDB nodes are in debt.
  • The server has an endpoint finished_paying_debt().
  • A client that is in debt calls pay_back_debt() on the server.
  • Given N total TiDB nodes and M TiDB nodes in debt, the server, upon receiving such requests, sends N/M times more tokens to the TiDB nodes in debt while giving 0 tokens to the TiDB nodes that are not in debt.
  • After the server receives finished_paying_debt() from a client, PD stops marking that TiDB client as in debt.
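To make the proposal concrete, here is a rough sketch of the server-side bookkeeping and allocation (pay_back_debt / finished_paying_debt are the hypothetical endpoints from the list above; this is not existing PD code):

```go
package main

import "fmt"

type nodeState struct {
	inDebt bool
}

type server struct {
	nodes map[string]*nodeState // all registered TiDB nodes
}

// payBackDebt models the pay_back_debt() endpoint: mark a client as in debt.
func (s *server) payBackDebt(nodeID string) { s.nodes[nodeID].inDebt = true }

// finishedPayingDebt models the finished_paying_debt() endpoint.
func (s *server) finishedPayingDebt(nodeID string) { s.nodes[nodeID].inDebt = false }

// allocate splits one grant of tokens across nodes. While any node is in debt,
// the in-debt nodes receive N/M times their fair share (i.e. the whole grant is
// split among the M debtors) and the others receive 0; otherwise the grant is
// split evenly.
func (s *server) allocate(tokens float64) map[string]float64 {
	grants := make(map[string]float64, len(s.nodes))
	var debtors []string
	for id, st := range s.nodes {
		grants[id] = 0
		if st.inDebt {
			debtors = append(debtors, id)
		}
	}
	n, m := float64(len(s.nodes)), float64(len(debtors))
	if m == 0 {
		for id := range grants {
			grants[id] = tokens / n
		}
		return grants
	}
	for _, id := range debtors {
		grants[id] = (tokens / n) * (n / m) // == tokens / m
	}
	return grants
}

func main() {
	s := &server{nodes: map[string]*nodeState{"tidb-0": {}, "tidb-1": {}, "tidb-2": {}}}
	s.payBackDebt("tidb-0")         // tidb-0 reports it is in debt
	fmt.Println(s.allocate(30_000)) // tidb-0 gets the whole grant, the others get 0
	s.finishedPayingDebt("tidb-0")
	fmt.Println(s.allocate(30_000)) // back to an even split of 10k each
}
```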

I still have to do more research on which existing functionality we can reuse to implement this in the PD server / PD client, but what do you think of this approach as a remedy for sharing the debt more equally?

@nolouch
Member Author

nolouch commented May 23, 2024

@pregenRobot thanks for your attention.

Yeah, I think your idea is a viable solution for paying back the debt more fairly.

There are two places to be optimized.

  1. Do not let a client take on too much debt, e.g. report the error earlier or limit the concurrency (a rough sketch follows this list). [The debtor should consider its own ability to bear responsibilities.]
  2. Pay back the debt more reasonably, such as sharing the debt among all clients more fairly or extending the repayment period. [Make the debtor's life good too :)]
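A minimal sketch of idea 1, i.e. rejecting a query up front instead of discovering the debt after execution (hypothetical names and thresholds, not the actual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

var errExceedDebtLimit = errors.New("estimated RU cost would exceed the allowed debt")

type bucket struct {
	tokens    float64 // current balance; may be negative (debt)
	debtLimit float64 // maximum debt we are willing to take on
}

// admit checks the estimated cost before the query runs and fails fast if it
// would push the bucket beyond the allowed debt.
func (b *bucket) admit(estimatedRU float64) error {
	if b.tokens-estimatedRU < -b.debtLimit {
		return errExceedDebtLimit
	}
	b.tokens -= estimatedRU
	return nil
}

func main() {
	b := &bucket{tokens: 1_000, debtLimit: 10_000}
	fmt.Println(b.admit(5_000))   // <nil>: a small overdraft is still allowed
	fmt.Println(b.admit(500_000)) // rejected early instead of creating a huge debt
}
```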

I am working on the first one. If you are interested, you can also try your own solution.

@pregenRobot

pregenRobot commented May 24, 2024

@nolouch

One feature that would lead to a better experience is increased observability around when TiDB nodes are in token-bucket debt.

Is there any way we can verify how much debt the nodes are in? Currently, the Grafana dashboard template does not have such a panel, and we are not even sure whether such metrics are collected.

If it's okay, I would be willing to work on collecting TiDB node-level RU debt to display on the dashboard, for example as sketched below. And yes, I am willing to explore the possibility of approach 2 you have laid out :)
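A rough sketch of what such a metric could look like (the metric name and the way the local balance is read are hypothetical; no such metric is confirmed to exist today):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ruDebt would expose the negative part of the local token-bucket balance.
var ruDebt = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "resource_manager",
		Name:      "local_ru_debt",
		Help:      "Local token-bucket debt in RU, per resource group (0 when not in debt).",
	},
	[]string{"resource_group"},
)

func main() {
	prometheus.MustRegister(ruDebt)

	// Wherever the client updates its local bucket, it could also report the debt.
	localTokens := -42_000.0 // hypothetical negative balance read from the local bucket
	debt := 0.0
	if localTokens < 0 {
		debt = -localTokens
	}
	ruDebt.WithLabelValues("default").Set(debt)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```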

@nolouch
Member Author

nolouch commented May 27, 2024

It may only be useful for debugging; normal users may not need it. It also adds a lot of metrics overhead if there are many resource groups and TiDB instances, so we should consider whether users really need it. I think the concept of debt doesn't actually need to be exposed to users.

@pregenRobot

pregenRobot commented Jul 28, 2024

@nolouch

Sorry for reaching out after a long time. I tried running some simulations to see when a client goes into debt. Here is the test code I wrote (TestGroupTokenBucketRequestDebtShare):

[screenshot of the test code omitted]

With 300k tokens and 3 clients each requesting 100k, I expected none of the clients' token slots to go into debt.

However, the last client's TokenSlot does go into debt, even though the overall TokenBucket is not in debt. Moreover, clients that have not requested tokens get allocated some as well (100k/3 ≈ 33k):

[screenshot of the test output omitted]
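For context, here is a tiny, self-contained model of the distinction I am drawing between per-client slots and the group total (made-up numbers; this is not PD's actual slot-balancing logic): an individual slot can be in debt while the group-level total stays positive, because each client only draws from its own pre-assigned share.

```go
package main

import "fmt"

func main() {
	groupTokens := 300_000.0
	clients := []string{"client-1", "client-2", "client-3"}

	// Assume the group tokens are split evenly into per-client slots.
	slots := make(map[string]float64, len(clients))
	for _, c := range clients {
		slots[c] = groupTokens / float64(len(clients)) // 100k each
	}

	// client-3 ends up consuming more than its slot holds
	// (e.g. a large write whose real cost is settled late).
	slots["client-3"] -= 150_000

	total := 0.0
	for _, t := range slots {
		total += t
	}
	fmt.Println(slots)                       // client-3's slot is in debt (-50k)
	fmt.Printf("group total: %.0f\n", total) // still positive (150k)
}
```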

Is this expected behaviour? Is this possibly the reason clients are heading into debt too fast?

I am quite new to the TiDB code base and am willing to file a bug report and investigate further if this is not the expected behaviour.

@nolouch
Member Author

nolouch commented Jul 31, 2024

@pregenRobot Slots and tokens are added or updated only upon request, so your test case may require multiple requests, similar to real-world scenarios. This is expected. Try step-by-step debugging to gain a deeper understanding.

@nolouch
Member Author

nolouch commented Oct 31, 2024

Closed by #55029.

@nolouch nolouch closed this as completed Oct 31, 2024