Freeze when using Arc<RateLimiter> from multiple await points on same task #35
Comments
Hi! If you can post a minimal isolated reproduction, that would be great. If you can add You should be able to use
Hi! Currently it has only happened in production, so it's kind of hard to know why (though it is in an isolated tokio task). It was working for quite some time (about a week) before getting stuck. I will try to reproduce it this week, but it might be a little hard.
Also, as a hunch, I'd say this is something fairly new, because we have been working with your project for a while and I have only seen this now. I am wondering if it might be caused by new versions of things (Rust, tokio, other deps).
I have been trying to reproduce it with simple code, but have not been able to...
It ran for about 4 days with no problem. Not sure how else to reproduce it.
How are you narrowing down where the freeze is happening? How do you know it's in leaky-bucket and not in the database connection?
We have a time limit on our insert, so we can see if it fails.
Assuming the timeout is on top of both layers (rate limiter + database), wouldn't the timeout also trigger if the stall is in the database layer, so that you can't distinguish the two? Because if the timeout is only on the database interaction, then the rate limiter must necessarily have let the request through for the timeout to start, no?
Yes, but the timeout is only on the insert, meaning it looks something like
We currently also work around this problem by adding a timeout on the acquire.
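A separate timeout per stage is what makes it possible to tell which step stalled. The sketch below is a std-only analogue, not the code from this thread: plain threads stand in for the tokio task, and `Stage`, `run_with_timeout`, and `guarded_insert` are hypothetical names introduced for illustration. In async code the same shape would wrap each future in `tokio::time::timeout`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Which stage exceeded its time limit (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Stage {
    Acquire,
    Insert,
}

/// Runs `step` on a helper thread and waits at most `limit` for its result.
/// Returns `None` if the step did not finish in time (or panicked).
fn run_with_timeout<T, F>(limit: Duration, step: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(step());
    });
    rx.recv_timeout(limit).ok()
}

/// Applies an independent timeout to each stage, so the error says
/// whether the rate limiter or the insert was the one that stalled.
fn guarded_insert<A, I>(acquire: A, insert: I, limit: Duration) -> Result<(), Stage>
where
    A: FnOnce() + Send + 'static,
    I: FnOnce() + Send + 'static,
{
    run_with_timeout(limit, acquire).ok_or(Stage::Acquire)?;
    run_with_timeout(limit, insert).ok_or(Stage::Insert)?;
    Ok(())
}
```

Note that a timed-out step keeps running on its helper thread in this sketch; it only reports which stage exceeded the limit, it does not cancel the work.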
I think this is what I'm not entirely getting: you said you diagnosed the problem initially by hitting the second timeout, but you can only hit the second timeout if you have passed the rate limiter. So from this, how did you know the rate limiter was blocked?
Since the rate limiter is expected to block, if we want to affirm there is a bug in this scenario when hitting a timeout, we'd also have to make sure the rate limiter isn't just throttling as expected, which it reasonably could do if it's trying to acquire enough tokens. In this case if it's empty, and you are acquiring
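As a rough way to tell expected throttling from a real hang, one can estimate how long a simple token bucket would legitimately make a caller wait. This is a minimal sketch under stated assumptions: `expected_wait` is a hypothetical helper, not part of leaky-bucket, and it models a single waiter on a bucket that gains `refill` tokens every `interval`, ignoring fairness, other waiters, and any cap on the balance.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of leaky-bucket): estimates the wait for
/// `requested` tokens when the bucket holds `balance` tokens and gains
/// `refill` tokens every `interval`.
/// Returns `None` if the request can never be satisfied or the result overflows.
fn expected_wait(
    balance: usize,
    requested: usize,
    refill: usize,
    interval: Duration,
) -> Option<Duration> {
    if requested <= balance {
        // Enough tokens are already available: no wait.
        return Some(Duration::ZERO);
    }
    if refill == 0 {
        // A bucket that never refills can never serve the request.
        return None;
    }
    let missing = requested - balance;
    // Number of refill ticks needed, rounded up.
    let ticks = missing / refill + usize::from(missing % refill != 0);
    let ticks = u32::try_from(ticks).ok()?;
    interval.checked_mul(ticks)
}
```

For example, with an empty bucket that refills 100 tokens every 100 ms, acquiring 1000 tokens is expected to take about one second. A timeout shorter than that estimate would fire even when the limiter is behaving correctly.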
I'm sorry, I think the problem is that I'm not being clear enough. I will try to explain again, this time from the beginning. This leaves us with the insertion task, which is stuck for some reason. This is the insertion loop:
Now, the only 2 places that can get stuck are the channel
Based on this we concluded (though never proved) that we are getting stuck on the
Hopefully that clears up any misunderstanding. I will try again, but this time running on the same hardware that production runs on instead of my Mac, and I will make sure I have exactly the same versions as in production. I'm also tracking production and waiting for it to happen again; maybe this time I can connect a debugger or dump the state. Sorry for the trouble.
All right, so what I suggest you do is to add more instrumentation. For each step you suspect there might be a freeze, add logs like this:

    use tracing::Instrument; // brings `.instrument(..)` into scope

    let id = random_transaction_id();
    // `info_span!` already implies the INFO level, so it takes no `Level` argument.
    let span = tracing::info_span!("transaction", ?id);
    let task = async {
        tracing::info!("start");
        if let Some(limiter) = limiter {
            tracing::info!(?size, "checking limiter");
            //
        }
        if let Some(metrics) = metrics {
            tracing::info!(?size, "recording metrics");
            metrics.inc_by(size as u64);
        }
        tracing::info!("inserting into database");
        //
        tracing::info!("done");
    };
    // Every event logged inside the task is tagged with the transaction span.
    task.instrument(span).await

You can then divert
Since I'm not sure what to do now, I'll mark this as
Hi! I wanted to update you that I think we found the culprit for the hang, and it's not
Hey, this might just be me misusing the feature, but I came across a weird problem.
I create a RateLimiter:
I give a clone of the Arc to each table.
I have a single task that tries to insert some number of rows into all tables, like so.
In the insert function we first try to acquire from the limiter the number of bytes we want to write.
At some point it looks like all acquires freeze and the task is no longer making progress.
I read the part about the implementation detail here, and I am wondering if the freeze has something to do with the core switching functionality.
Any help would be appreciated!
Thanks