Large query influences resource group stability #50831
Comments
Hello. We encountered a similar issue in production the other day, and I did some research into the internals of Resource Managers and Request Units, as well as reading up on some RFCs. Please note I am still a beginner with the internals of the project, but here is what I found so far:
As you suggested, it would be nice if the debt could be shared between the TiDB nodes. One possible solution might be to throttle the trickling of tokens to TiDB nodes that still have tokens > 0 and instead prioritize allocating tokens to TiDB nodes in debt. One way this might work is:
I have to do more research on which existing functionality we can reuse to implement this in the PD server / PD client, but what do you think of this approach as a remedy for sharing debt more equally?
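To make the idea concrete, here is a minimal sketch of what debt-prioritized distribution could look like during one refill period. The names (`clientState`, `distributeTokens`) and the structure are illustrative assumptions, not PD's actual API or data model:

```go
// Hypothetical sketch: when the group-level bucket refills, pay back
// clients that are in debt first, then split the remainder among the rest.
// Names and structure are illustrative, not PD's actual implementation.
package sketch

import "sort"

type clientState struct {
	id     string
	tokens float64 // a negative value means this TiDB node is in debt
}

// distributeTokens hands out `budget` tokens for one refill period.
// Clients in debt are repaid first (deepest debt first); any remainder
// is split evenly among all clients.
func distributeTokens(clients []*clientState, budget float64) {
	// Sort so that the deepest debt comes first.
	sort.Slice(clients, func(i, j int) bool { return clients[i].tokens < clients[j].tokens })
	for _, c := range clients {
		if budget <= 0 {
			return
		}
		if c.tokens < 0 {
			repay := minFloat(-c.tokens, budget)
			c.tokens += repay
			budget -= repay
		}
	}
	// Split whatever is left evenly among the clients.
	if budget > 0 && len(clients) > 0 {
		share := budget / float64(len(clients))
		for _, c := range clients {
			c.tokens += share
		}
	}
}

func minFloat(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}
```

The design choice here is simply that repaying debt takes precedence over topping up healthy nodes, so a single node's debt does not starve the whole group; how this interacts with PD's existing slot/burst logic would need further investigation.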
@pregenRobot thanks for your attention. Yeah, I think your idea is a viable solution for paying the debt back more fairly. There are two places to be optimized.
I am working on the first one. If you are interested, you can also try your own solution.
One feature that would lead to a better experience is increased observability around when TiDB nodes are in token bucket debt. Is there any way we can verify how much debt the nodes are in? Currently, the Grafana dashboard template does not have such a panel, and we are not even sure whether such metrics are collected. If it's okay, I would be willing to work on collecting TiDB node-level RU debt to display on the dashboard. And yes, I am willing to explore the possibility of approach 2 you have laid out :)
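As a rough illustration of what such a metric could look like, here is a sketch using the Prometheus Go client. The metric name, labels, and the `reportDebt` hook are assumptions made up for this example and do not exist in TiDB today:

```go
// Hypothetical sketch of exposing per-instance RU debt as a Prometheus
// gauge; the metric name and labels are illustrative only.
package sketch

import "github.com/prometheus/client_golang/prometheus"

var resourceGroupDebtGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "tidb",
		Subsystem: "resource_manager",
		Name:      "token_bucket_debt",
		Help:      "Current RU debt of this TiDB instance, per resource group.",
	},
	[]string{"resource_group"},
)

func init() {
	prometheus.MustRegister(resourceGroupDebtGauge)
}

// reportDebt would be called wherever the local token balance is updated.
func reportDebt(group string, tokens float64) {
	debt := 0.0
	if tokens < 0 {
		debt = -tokens
	}
	resourceGroupDebtGauge.WithLabelValues(group).Set(debt)
}
```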
It may only be useful for debugging; normal users may not need it. It also adds a lot of metrics overhead if there are many resource groups and TiDB instances, so we should consider whether users really need it. I think the concept of debt doesn't actually need to be exposed to users.
@pregenRobot Slots and tokens are added or updated only upon request, so your test case may require multiple requests, similar to real-world scenarios. This is expected. Try step-by-step debugging to gain a deeper understanding.
Closed by #55029.
Background
They're running a stable workload and the RU limit works fine, but once a query's plan changes, or a new query arrives that consumes lots of RUs, the limit may become unstable:
The query is running on tidb-0 and consumes a lot of RUs:
As the picture shows, after the problem query, the normal queries can only recover according to the refill rate while the debt is paid back, which increases latency for a long time:
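To illustrate the mechanism being described, here is a toy token bucket sketch, under the assumption that the balance is allowed to go negative (debt) and only recovers at the fill rate; the numbers and names are illustrative, not the resource control client's real code:

```go
// Toy token bucket: one large request pushes the balance deep into debt,
// and later requests are throttled until the refill rate pays it back.
package sketch

import "time"

type tokenBucket struct {
	tokens     float64 // may go negative, i.e. debt
	fillRate   float64 // RUs added per second (the group's RU_PER_SEC)
	lastRefill time.Time
}

func (b *tokenBucket) refill(now time.Time) {
	b.tokens += b.fillRate * now.Sub(b.lastRefill).Seconds()
	b.lastRefill = now
}

// consume deducts the request's RU cost and returns how long callers
// should wait before the balance becomes non-negative again.
func (b *tokenBucket) consume(now time.Time, ru float64) time.Duration {
	b.refill(now)
	b.tokens -= ru
	if b.tokens >= 0 {
		return 0
	}
	// In debt: subsequent requests are delayed until the refill rate has
	// paid it back, which is what stretches out normal-query latency.
	return time.Duration(-b.tokens / b.fillRate * float64(time.Second))
}
```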
Issues
The limit is exceeded after the large query is executed.