-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: prevent goroutine leak and CPU spinning at websocket transport #2209
Conversation
@moofMonkey hey. I'm sorry it's my bad, indeed I didn't close the channel after the subscription resolver's context was done.
I used the chat example to create demo reproduction, where it's also not handled. Closing the channel fixes the issue. That's a pity I missed that thing here and hence missed it in my production code
Yes I fixed the issue in my prod code by closing channels in the goroutine |
thanks for the great fix. i gonna test it right now in my production environment |
@StevenACoffman will merge and close this ASAP. Thank you @moofMonkey, really! |
@fletcherist @moofMonkey it looks like there are some test failures. Can you suggest a fix? |
@StevenACoffman I don't really know how to handle keepalives in this test, so I have just added support for keepalives in the websocket client subscriptions. I'm not sure if that's a proper fix, but at least it fixes the chat example. If gqlgen doesn't like this approach - I can revert keepalives support and simply reduce batch size and subscriptions count in the chat example test, however it'll still sometimes fail to pass if CPU is heavily limited (which is likely for CI). The other thing that might fail in this test is a random sleep before sending messages, and without this sleep receiving all the messages is not guaranteed, which, I guess, needs to be fixed. |
Signed-off-by: Steve Coffman <steve@khanacademy.org>
What should I do about the hung test? |
Looks like the keepalives support doesn't help but just hangs tests, the root cause of the keepalive error is in the subscription race condition. |
I'm ok with temporarily skipping a problematic test to get a new release out, but I would love to have another PR from the WebSocket-using community members like yourself or @RobinCPel to clean this area up better. |
I guess that system message should fix the race condition. Is it good enough? |
Thanks for sticking with this! If you see any other opportunities to improve the Websocket / Subscription code, I would love further contributions! |
Released in v0.17.9 |
Thanks for fast responses, merge and release :) The only thing that I have left unchecked is that the last commit has revealed a race condition in the chat frontend due to subscriptions data being received before the query is done. |
This fixes #2167 and removes incorrect code merged from PR #2168.
Explanation: while the mentioned PR stopped calling function
responses
after the context was done, it has introduced a new bug - it doesn't break out of the for loop in case ofresponse == nil
becausebreak
now breaks select, not the for loop, which causes goroutine leak and CPU spinning indefinitely.*I'm not sure yet how to exactly reproduce this behavior, since it doesn't even properly show up in the Golang profiler, but we could see that in production, and only with the mentioned PR. Closing the returned channel helps in the chat example, but not in the more complicated code.
Repro at fletcherist/gqlgen-memory-leak is fundamentally flawed because the subscription resolver there doesn't care about passed context - it doesn't close returned channel (which was required before this PR) and it doesn't end goroutine created there after context cancellation.
Proposed fix: revert mentioned PR and improve subscription channel usage so that the resolver won't have to close it, which is, for example, not done in the chat example.
It should still be backwards compatible unless the application depended on the fact that the channel will be read after the context cancellation, which is already a problem.
Test case: improved chat example with a check of the amount of goroutines left after many closed subscriptions.
Easiest way to test - cherry-pick commit with changes to the
chat_test.go
or apply all the changes and revert commit with changes to the example itself.I have: