
Properly close watches when a node is removed #5

Closed

Conversation

valerian-roche
Owner

We noticed in our environments that control planes with no activity other than serving xDS through delta ADS to a constant number of pods showed ever-growing memory consumption.
Investigating further, we noticed that:

  • this memory increase correlated with an increase in the number of goroutines (even though the number of clients is constant)
  • the increase occurs each time we restart those client pods (done on purpose to test some edge cases)
  • the increase is two goroutines per pod, which makes sense as those pods use LDS and CDS as dynamic resources and do not trigger RDS, EDS, or RTDS watches

Taking a goroutine dump, we identified that the leaked goroutines come from https://github.com/envoyproxy/go-control-plane/blob/main/pkg/server/delta/v3/server.go#L180
Those goroutines are only garbage-collected once the channel has either returned a response or been closed. This closure is expected to happen in watch.Cancel(), but when a node is removed we only call the cancel function that may be attached to the watch; we never close the channel.
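
A minimal sketch of the leak pattern and the intended fix, using hypothetical types (not the actual go-control-plane structs): a watch holds a response channel plus an optional cancel hook, and the forwarding goroutine only exits once that channel yields a response or is closed. If node removal calls the cancel hook but never closes the channel, the goroutine blocks forever.

```go
package main

import "fmt"

// response and watch are illustrative stand-ins, not go-control-plane types.
type response struct{ version string }

type watch struct {
	resp   chan response
	cancel func()
}

// Cancel invokes the upstream cancel hook and closes the response channel,
// so any goroutine blocked reading from resp is released.
func (w *watch) Cancel() {
	if w.cancel != nil {
		w.cancel()
	}
	if w.resp != nil {
		close(w.resp)
	}
}

func main() {
	w := &watch{
		resp:   make(chan response),
		cancel: func() { fmt.Println("upstream cancel called") },
	}

	done := make(chan struct{})
	go func() {
		// Mirrors the forwarding goroutine in the delta server: it only
		// returns when a response arrives or the channel is closed.
		for r := range w.resp {
			fmt.Println("got response", r.version)
		}
		close(done)
	}()

	// Node removal must go through Cancel (channel close included);
	// calling only w.cancel() would leave the goroutine above leaked.
	w.Cancel()
	<-done
	fmt.Println("forwarding goroutine exited")
}
```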

Disclaimer: we run a slightly different fork of the repository. While I do not believe it has an impact here, we are not using the vanilla control plane today.

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
valerian-roche added a commit that referenced this pull request Jan 17, 2024
[sotw][issue-540] Return full state when applicable for watches in linear cache