Possible buffer leak in blocking stub DEADLINE_EXCEEDED race condition #7105
I found that since 1.27.0 tasks might be submitted to the app executor after the call is closed / status set. I guess that either needs to be avoided again or accepted/handled - I made a fix for the latter in #7106 and verified that my reproducer no longer produces the ByteBuf leak with it.
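A simplified, hypothetical illustration of that failure mode (this is not gRPC's actual ThreadlessExecutor, just the shape of the race being described): the queue backing a blocking call is only drained while the caller is still waiting, so a message task submitted after the close/status task has run is never executed and never releases its buffer.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not gRPC's real ThreadlessExecutor: the blocking caller
// drains this queue only until the call is done (onClose() has run). A message
// task that the transport submits after that point is never run, so any buffer
// it holds is never released -- which is what Netty's leak detector reports.
final class ThreadlessExecutorSketch implements Executor {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

  @Override
  public void execute(Runnable task) {
    queue.add(task); // may be called even after waitAndDrain() has returned
  }

  void waitAndDrain(BooleanSupplier callDone) throws InterruptedException {
    while (!callDone.getAsBoolean()) {
      queue.take().run();
    }
    // Tasks enqueued from the transport after this point are never drained.
  }
}
```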
So we think the response is arriving and racing with the deadline? I think part of the problem is we are calling executeCloseObserverInContext without the transport's involvement. I think we would want to go with the "avoid doing that" approach, but we'll have to think about how to do it (worst-case, we can have the transports handle the delay #6328 was trying to achieve; that may even be easier). We may want to revert #6328 while we do a fix. I don't see a quick-and-easy patch/workaround opportunity.
@ejona86 I'm actually not sure since I didn't debug fully, but that seems most likely given the observations.
This should avoid messages being leaked when a Listener throws an exception and the executor is shut down immediately after the call completes. This is related to grpc#7105 but a different scenario and we aren't aware of any user having observed the previous behavior. Note also this does _not_ fix the similar case of reordering caused by delayedCancelOnDeadlineExceeded().
I feel confident there were two bugs introduced with #6328:
I think a proper fix would be to jump onto the transport thread and schedule onClose() from there (which would mean adding runOnTransportThread to Stream). From that point forward the ClientStreamListener would need to release resources without calling into the application (so no more usage of ...).

Cc @ran-su, @zhangkun83
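A rough sketch of that direction, where every type and method name is a stand-in rather than the real io.grpc.internal API (runOnTransportThread is the method the comment proposes adding; it does not exist yet):

```java
import io.grpc.Status;

// Hypothetical sketch of the proposed fix direction: instead of delivering the
// deadline-exceeded close directly from the deadline timer (which can race with
// messages the transport is still delivering), hop onto the transport thread and
// let the transport serialize the close with any in-flight messages.
interface StreamSketch {
  void runOnTransportThread(Runnable task); // proposed addition, not a real API
  void cancel(Status status);
}

final class DeadlineCancellationSketch {
  void cancelOnDeadlineExceeded(StreamSketch stream) {
    stream.runOnTransportThread(
        () -> stream.cancel(Status.DEADLINE_EXCEEDED.withDescription("deadline exceeded")));
  }
}
```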
@ran-su, @zhangkun83, I'm seriously considering reverting #6328, since it has been 3.5 months and we still have the memory leak caused by it.
Talked with Kun, we can revert #6328 for now, and I will come back to this issue later (like next quarter).
@ran-su, will y'all handle the revert? I expect it'll need a large test run.
Sure, I'll run an internal test before sending out a PR.
Note: a user has noticed a bug where onMessage() could be called after onClose() on client-side (b/169277773). The problem was present starting in 1.27 and fixed in 1.33, so we strongly believe it was this bug. That means this was an even more serious bug than previously believed. I'm very surprised more didn't break. Maybe more broke, and people just couldn't debug it. I'm going to update the release notes to mention it can produce weird callback orders.
One of my unit tests started reporting a netty ByteBuf leak when upgrading from 1.26.x to 1.27.0, and it remains in subsequent versions up to 1.29.0. I'm not 100% sure but strongly suspect it's related to #6328, and in particular the interplay of that change with the ThreadlessExecutor used by blocking unary calls.

It's a somewhat extreme case; the test itself was there to verify a past, unrelated concurrency issue. Basically it makes many consecutive calls using a blocking stub with very short (1ms) deadlines. Though likely rare in the wild, I think it does expose a real bug.
Here is an example snippet which reproduces the leak error log(s) every time mid-test for me when pointed at a local etcd server:
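The original snippet isn't preserved in this copy; a minimal sketch of the scenario described above (a blocking stub making many consecutive calls with ~1ms deadlines against a local server) might look like the following, where the target address and the Greeter/HelloRequest stubs are placeholders rather than the actual test code:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeUnit;

// Not the original reproducer; a sketch of the described scenario. The target
// and the generated Greeter stubs are placeholders -- the real test pointed a
// blocking stub at a local etcd server.
public final class DeadlineLeakRepro {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget("localhost:2379").usePlaintext().build();
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    for (int i = 0; i < 50_000; i++) {
      try {
        // Very short deadline so many calls hit DEADLINE_EXCEEDED while a
        // response may already be in flight, exercising the race.
        stub.withDeadlineAfter(1, TimeUnit.MILLISECONDS)
            .sayHello(HelloRequest.newBuilder().setName("x").build());
      } catch (StatusRuntimeException e) {
        // DEADLINE_EXCEEDED (or similar) is expected here; keep looping.
      }
    }
    channel.shutdownNow().awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

Running with Netty's leak detection raised (e.g. io.netty.leakDetection.level=paranoid), the LEAK log lines should surface during the loop rather than only at shutdown.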
I can provide more of the leak access records if needed.