
fix: prevent goroutine leak and CPU spinning at websocket transport #2209

Merged: 11 commits, May 26, 2022

Conversation

@moofMonkey (Contributor) commented May 26, 2022

This fixes #2167 and removes incorrect code merged from PR #2168.

Explanation: while the mentioned PR stopped calling the response function after the context was done, it introduced a new bug: in the case of response == nil the code no longer breaks out of the for loop, because the break now exits the select rather than the loop. The result is a goroutine leak and a loop that spins on the CPU indefinitely.
*I'm not sure yet how to reproduce this behavior exactly - it doesn't even show up properly in the Go profiler - but we could observe it in production, and only with the mentioned PR. Closing the returned channel helps in the chat example, but not in the more complicated code.
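
For illustration, here is a minimal Go sketch of the break-inside-select gotcha described above - not the actual gqlgen transport code, just the shape of the pattern - showing why a plain break keeps the loop spinning and how a labeled break ends it:

```go
// Minimal sketch of the gotcha, not the actual gqlgen transport code.
package sketch

import "context"

// pumpBuggy mirrors the broken pattern: `break` inside a `select` only exits
// the select, so when next() returns nil the for loop keeps spinning forever.
func pumpBuggy(ctx context.Context, next func() *string, send func(string)) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			response := next()
			if response == nil {
				break // BUG: exits only the select; the for loop spins on
			}
			send(*response)
		}
	}
}

// pumpFixed uses a labeled break so a nil response actually ends the loop.
func pumpFixed(ctx context.Context, next func() *string, send func(string)) {
loop:
	for {
		select {
		case <-ctx.Done():
			break loop
		default:
			response := next()
			if response == nil {
				break loop // exits the for loop as intended
			}
			send(*response)
		}
	}
}
```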

The repro at fletcherist/gqlgen-memory-leak is fundamentally flawed because the subscription resolver there ignores the passed context: it neither closes the returned channel (which was required before this PR) nor ends the goroutine it creates once the context is cancelled.
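
For contrast, here is a hypothetical sketch of what a well-behaved resolver looked like under the old contract - it watches the passed context, stops its goroutine on cancellation, and closes the channel it returned. The Message type, subscriptionResolver, and incoming channel below are made up for illustration, not taken from the repro or the chat example:

```go
// Hypothetical resolver sketch; the types here are illustrative stand-ins.
package sketch

import "context"

type Message struct{ Text string }

type subscriptionResolver struct {
	incoming chan *Message // feed of new messages; the source is not shown
}

// MessageAdded follows the old contract: the resolver itself watches ctx,
// ends its goroutine on cancellation, and closes the channel it returned.
func (r *subscriptionResolver) MessageAdded(ctx context.Context) (<-chan *Message, error) {
	ch := make(chan *Message, 1)
	go func() {
		defer close(ch) // the resolver, not the transport, closes the channel
		for {
			select {
			case <-ctx.Done():
				return // stop the goroutine once the subscription is cancelled
			case msg := <-r.incoming:
				select {
				case ch <- msg: // forward the message to the subscriber
				case <-ctx.Done():
					return
				}
			}
		}
	}()
	return ch, nil
}
```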

Proposed fix: revert the mentioned PR and improve the subscription channel usage so that the resolver no longer has to close the channel - something the chat example, for instance, does not do.
This should still be backwards compatible, unless the application depended on the channel being read after context cancellation, which is already a problem.

Test case: the chat example is extended with a check of the number of goroutines left after many closed subscriptions.
Easiest way to test: cherry-pick the commit with the changes to chat_test.go, or apply all the changes and revert the commit with the changes to the example itself.
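
The real test lives in chat_test.go; the sketch below only shows the general shape of such a leak check. openAndCloseSubscription, the iteration count, and the slack threshold are placeholders, not the helpers or values used in this PR:

```go
// Hypothetical shape of a goroutine-leak check; not the actual chat_test.go.
package sketch

import (
	"runtime"
	"testing"
	"time"
)

// openAndCloseSubscription would start a websocket subscription against the
// server under test and then cancel it; the body is omitted in this sketch.
func openAndCloseSubscription(t *testing.T) {}

func TestSubscriptionsDoNotLeakGoroutines(t *testing.T) {
	before := runtime.NumGoroutine()

	for i := 0; i < 100; i++ {
		openAndCloseSubscription(t)
	}

	// Give cancelled subscriptions a moment to wind down before counting.
	time.Sleep(100 * time.Millisecond)

	after := runtime.NumGoroutine()
	if after > before+10 { // small slack for unrelated runtime goroutines
		t.Fatalf("possible goroutine leak: %d goroutines before, %d after", before, after)
	}
}
```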

I have:

  • Added tests covering the bug / feature (see testing)
  • Updated any relevant documentation (see docs)

@frederikhors (Collaborator)

@fletcherist

@fletcherist (Contributor) commented May 26, 2022

@moofMonkey hey. I'm sorry, it's my bad - indeed, I didn't close the channel after the subscription resolver's context was done.

> The repro at fletcherist/gqlgen-memory-leak is fundamentally flawed because the subscription resolver there ignores the passed context: it neither closes the returned channel (which was required before this PR) nor ends the goroutine it creates once the context is cancelled.

I used the chat example to create the demo reproduction, where this is also not handled. Closing the channel fixes the issue. It's a pity I missed that here and hence missed it in my production code as well.

> Closing the returned channel helps in the chat example, but not in the more complicated code.

Yes, I fixed the issue in my prod code by closing the channels in the goroutine.
Thank you very much for catching this. My PR seems pointless now and should be reverted.

@fletcherist (Contributor) commented May 26, 2022

Thanks for the great fix. I'm going to test it right now in my production environment.

@frederikhors (Collaborator)

@StevenACoffman will merge and close this ASAP. Thank you @moofMonkey, really!

@coveralls commented May 26, 2022

Coverage Status

Coverage decreased (-0.07%) to 74.924% when pulling e8e4ec3 on MoofMonkey:master into 5f5bfcb on 99designs:master.

@StevenACoffman (Collaborator)

@fletcherist @moofMonkey it looks like there are some test failures. Can you suggest a fix?

@moofMonkey (Contributor, Author) commented May 26, 2022

@StevenACoffman I don't really know how to handle keepalives in this test, so I have simply added keepalive support to the websocket client's subscriptions. I'm not sure whether that's the proper fix, but it at least fixes the chat example.
The problem is that it might hang tests with subscriptions, and the websocket client has no context parameter to handle that properly.

If gqlgen doesn't like this approach, I can revert the keepalive support and simply reduce the batch size and subscription count in the chat example test; however, it will still occasionally fail when the CPU is heavily limited (which is likely in CI).

The other thing that might fail in this test is the random sleep before sending messages - without that sleep, receiving all the messages is not guaranteed, which, I guess, needs to be fixed.
It doesn't seem to be related to my changes to the example, since reverting them changes nothing - i.e. even the old test & example code fails with a keepalive error if no delay is specified.

moofMonkey and others added 2 commits May 26, 2022 20:09
@moofMonkey (Contributor, Author)

What should I do about the hung test?
The problem, as I wrote before, is that the sleep before posting messages is pretty much arbitrary: it doesn't prevent the bug, it only delays it through a race condition of some sort.
One option is to simply reduce the batch size and the sleep time, but that won't fix the problem itself.

@moofMonkey (Contributor, Author) commented May 26, 2022

It looks like the keepalive support doesn't help and just hangs the tests; the root cause of the keepalive error is a race condition in the subscription handling.
To test that, simply apply the latest revert commit and increase the time.Sleep from 100ms to 1000ms.
To get the keepalive error almost every time, remove the time.Sleep in the test.

@StevenACoffman (Collaborator) commented May 26, 2022

I'm OK with temporarily skipping a problematic test to get a new release out, but I would love another PR from WebSocket-using community members like yourself or @RobinCPel to clean this area up further.

@moofMonkey (Contributor, Author)

I guess that system message should fix the race condition - is it good enough?
I don't really see any other way to fix it, since the GraphQL server doesn't respond to the subscription by itself.

@StevenACoffman merged commit 6855b72 into 99designs:master on May 26, 2022
@StevenACoffman (Collaborator)

Thanks for sticking with this! If you see any other opportunities to improve the Websocket / Subscription code, I would love further contributions!

@StevenACoffman (Collaborator)

Released in v0.17.9

@moofMonkey (Contributor, Author)

Thanks for the fast responses, merge, and release :)

The only thing I have left unchecked is that the last commit revealed a race condition in the chat frontend, caused by subscription data arriving before the query completes.
I'll submit a pull request for that in a few minutes, since this one is already merged.

Linked issue: Memory Leak at Websocket Transport