Large Query Influence the Resource Group Stability #50831

Closed
nolouch opened this issue Jan 30, 2024 · 7 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@nolouch
Member

nolouch commented Jan 30, 2024

Background

A user is running a stable workload, and the RU limit works fine. But once a query's plan changes, or a new query arrives that consumes a lot of RUs, the limit can become unstable:

The query is running on tidb-0 and consumes a lot of RUs:

[monitoring screenshots omitted]

As the screenshots show, after the problem query finishes, normal queries can only recover as the refill rate pays back the accumulated debt, which increases latency for a long time: roughly, a debt of D RUs against a fill rate of R RU/s takes about D/R seconds to repay.

Issues

  • Strictly speaking, this behavior is by design, but it is hard to explain to users. Users may expect normal queries to recover quickly.
  • Different TiDB instances are affected to varying degrees. If the debt could be shared, the overall situation might be better.
  • Some queries also report Exceed the limit errors after the large query has executed.
@nolouch nolouch added the type/enhancement The issue or PR belongs to an enhancement. label Jan 30, 2024
@pregenRobot

pregenRobot commented May 23, 2024

Hello.

We encountered a similar issue in production the other day, and I did some research into the internals of the Resource Manager and Request Units, as well as reading up on some RFCs.

Please note I am still a beginner with the internals of the project, but here is what I found so far:

  • TiDB's Resource Manager uses the token bucket algorithm under the hood.
  • PD trickles and grants tokens to TiDB clients, each of which maintains its own local token bucket.
  • Read-based queries compute their RU cost ahead of time, before executing the query.
  • Write-based queries can estimate RU ahead of time to an extent, but the real RU consumed by the query is only computed at the end.
    • Suppose there is a small amount of RU left in the local TiDB node, just enough to start running a massive write query.
    • After executing it, a massive amount of RU has been consumed, which is deducted from the local TiDB token bucket.
    • This TiDB node cannot handle requests until the debt is repaid, which throttles QPS (a sketch of this follows the list).
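To make the failure mode concrete, here is a minimal, self-contained sketch (made-up numbers and types, not PD's actual token-bucket code) of a local bucket that admits a large write on a small positive balance and only settles the real cost afterwards:

```go
package main

import "fmt"

// localBucket is a toy model of a TiDB node's local token bucket.
type localBucket struct {
	tokens   float64 // current balance in RU; may go negative (debt)
	fillRate float64 // RU per second granted by PD (assumed constant here)
}

// canStart admits a query as long as the balance is still positive.
func (b *localBucket) canStart() bool { return b.tokens > 0 }

// settle deducts the real RU cost, which for writes is only known at the end.
func (b *localBucket) settle(realRU float64) { b.tokens -= realRU }

func main() {
	b := &localBucket{tokens: 1_000, fillRate: 2_000}

	// A tiny positive balance is enough to admit a massive write query.
	if b.canStart() {
		b.settle(500_000) // the real cost is only discovered afterwards
	}
	fmt.Printf("debt after the large write: %.0f RU\n", -b.tokens)

	// Normal queries are throttled until the refill rate repays the debt,
	// i.e. roughly debt / fillRate seconds (about 250s with these numbers).
	fmt.Printf("approximate recovery time: %.0fs\n", -b.tokens/b.fillRate)
}
```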

As you suggested, it would be nice if the debt could be shared between the TiDB nodes.

One possible solution might be to throttle the trickling of tokens to TiDB nodes with tokens > 0 and instead prioritize allocating more tokens to the TiDB nodes that are in debt.

One way this might work is (a rough sketch follows the list):

  • The server has an endpoint pay_back_debt().
  • The server keeps track of which TiDB nodes are in debt.
  • The server has an endpoint finished_paying_debt().
  • A client that is in debt calls pay_back_debt() on the server.
  • Given N total TiDB nodes and M TiDB nodes in debt, the server, upon receiving such requests, sends N/M times more tokens to the TiDB nodes in debt while giving 0 tokens to the TiDB nodes that are not in debt.
  • After the server receives finished_paying_debt() from a client, PD stops marking that TiDB client as in debt.
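To make the proposal concrete, here is a rough sketch of the server-side bookkeeping and allocation (pay_back_debt / finished_paying_debt are the hypothetical endpoints from the list above; this is not existing PD code):

```go
package main

import "fmt"

type nodeState struct {
	inDebt bool
}

type server struct {
	nodes map[string]*nodeState // all registered TiDB nodes
}

// payBackDebt models the pay_back_debt() endpoint: mark a client as in debt.
func (s *server) payBackDebt(nodeID string) { s.nodes[nodeID].inDebt = true }

// finishedPayingDebt models the finished_paying_debt() endpoint.
func (s *server) finishedPayingDebt(nodeID string) { s.nodes[nodeID].inDebt = false }

// allocate splits one grant of tokens across nodes. While any node is in debt,
// the in-debt nodes receive N/M times their fair share (i.e. the whole grant is
// split among the M debtors) and the others receive 0; otherwise the grant is
// split evenly.
func (s *server) allocate(tokens float64) map[string]float64 {
	grants := make(map[string]float64, len(s.nodes))
	var debtors []string
	for id, st := range s.nodes {
		grants[id] = 0
		if st.inDebt {
			debtors = append(debtors, id)
		}
	}
	n, m := float64(len(s.nodes)), float64(len(debtors))
	if m == 0 {
		for id := range grants {
			grants[id] = tokens / n
		}
		return grants
	}
	for _, id := range debtors {
		grants[id] = (tokens / n) * (n / m) // == tokens / m
	}
	return grants
}

func main() {
	s := &server{nodes: map[string]*nodeState{"tidb-0": {}, "tidb-1": {}, "tidb-2": {}}}
	s.payBackDebt("tidb-0")         // tidb-0 reports it is in debt
	fmt.Println(s.allocate(30_000)) // tidb-0 gets the whole grant, the others get 0
	s.finishedPayingDebt("tidb-0")
	fmt.Println(s.allocate(30_000)) // back to an even split of 10k each
}
```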

I still have to do more research on which existing functionality we can reuse to implement this in the PD server / PD client, but what do you think of this approach as a remedy for sharing the debt more equally?

@nolouch
Member Author

nolouch commented May 23, 2024

@pregenRobot thanks for your attention.

Yeah, I think your idea is a viable solution for paying back the debt more fairly.

There are two places to be optimized.

  1. Do not let a client take on too much debt, e.g. report the error earlier or limit the concurrency (a rough sketch follows this list). [The debtor should consider its own ability to bear responsibilities.]
  2. Pay back the debt more reasonably, such as sharing the debt among all clients more fairly or extending the repayment period. [Make the debtor's life good too :)]
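A minimal sketch of idea 1, i.e. rejecting a query up front instead of discovering the debt after execution (hypothetical names and thresholds, not the actual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

var errExceedDebtLimit = errors.New("estimated RU cost would exceed the allowed debt")

type bucket struct {
	tokens    float64 // current balance; may be negative (debt)
	debtLimit float64 // maximum debt we are willing to take on
}

// admit checks the estimated cost before the query runs and fails fast if it
// would push the bucket beyond the allowed debt.
func (b *bucket) admit(estimatedRU float64) error {
	if b.tokens-estimatedRU < -b.debtLimit {
		return errExceedDebtLimit
	}
	b.tokens -= estimatedRU
	return nil
}

func main() {
	b := &bucket{tokens: 1_000, debtLimit: 10_000}
	fmt.Println(b.admit(5_000))   // <nil>: a small overdraft is still allowed
	fmt.Println(b.admit(500_000)) // rejected early instead of creating a huge debt
}
```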

I am working on the first one. If you are interested, you can also try your own solution.

@pregenRobot

pregenRobot commented May 24, 2024

@nolouch

One feature that would lead to a better experience is increased observability around when TiDB nodes are in token-bucket debt.

Is there any way we can verify how much debt the nodes are in? Currently, the Grafana dashboard template does not have such a panel, and we are not even sure whether such metrics are collected.

If it's okay, I would be willing to work on collecting TiDB node-level RU debt to display on the dashboard, for example as sketched below. And yes, I am willing to explore the possibility of approach 2 you have laid out :)
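A rough sketch of what such a metric could look like (the metric name and the way the local balance is read are hypothetical; no such metric is confirmed to exist today):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ruDebt would expose the negative part of the local token-bucket balance.
var ruDebt = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "resource_manager",
		Name:      "local_ru_debt",
		Help:      "Local token-bucket debt in RU, per resource group (0 when not in debt).",
	},
	[]string{"resource_group"},
)

func main() {
	prometheus.MustRegister(ruDebt)

	// Wherever the client updates its local bucket, it could also report the debt.
	localTokens := -42_000.0 // hypothetical negative balance read from the local bucket
	debt := 0.0
	if localTokens < 0 {
		debt = -localTokens
	}
	ruDebt.WithLabelValues("default").Set(debt)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```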

@nolouch
Member Author

nolouch commented May 27, 2024

It may only be useful for debugging; normal users may not need it. It also adds a lot of metrics overhead if there are many resource groups and TiDB instances, so we should consider whether users really need it. I think the concept of debt doesn't actually need to be exposed to users.

@pregenRobot

pregenRobot commented Jul 28, 2024

@nolouch

Sorry for reaching out after a long time. I tried running some simulations to see when a client goes into debt. Here is the test code I wrote (TestGroupTokenBucketRequestDebtShare):

[screenshot of the test code omitted]

With 300k tokens and 3 clients each requesting 100k, I expected none of the clients' token slots to go into debt.

However, the last client's TokenSlot does go into debt, even though the overall TokenBucket is not in debt. Moreover, clients that have not requested tokens get allocated some as well (100k/3 ≈ 33k):

[screenshot of the test output omitted]
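For context, here is a tiny, self-contained model of the distinction I am drawing between per-client slots and the group total (made-up numbers; this is not PD's actual slot-balancing logic): an individual slot can be in debt while the group-level total stays positive, because each client only draws from its own pre-assigned share.

```go
package main

import "fmt"

func main() {
	groupTokens := 300_000.0
	clients := []string{"client-1", "client-2", "client-3"}

	// Assume the group tokens are split evenly into per-client slots.
	slots := make(map[string]float64, len(clients))
	for _, c := range clients {
		slots[c] = groupTokens / float64(len(clients)) // 100k each
	}

	// client-3 ends up consuming more than its slot holds
	// (e.g. a large write whose real cost is settled late).
	slots["client-3"] -= 150_000

	total := 0.0
	for _, t := range slots {
		total += t
	}
	fmt.Println(slots)                       // client-3's slot is in debt (-50k)
	fmt.Printf("group total: %.0f\n", total) // still positive (150k)
}
```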

Is this expected behaviour? Is this possibly the reason clients are heading into debt too fast?

I am quite new to the TiDB code base and am willing to file a bug report and investigate further if this is not the expected behaviour.

@nolouch
Member Author

nolouch commented Jul 31, 2024

@pregenRobot Slots and tokens are added or updated only upon request, so your test case may require multiple requests, similar to real-world scenarios. This is expected. Try step-by-step debugging to gain a deeper understanding.

@nolouch
Member Author

nolouch commented Oct 31, 2024

Closed by #55029.

@nolouch nolouch closed this as completed Oct 31, 2024