
Watcher resync causes memory usage burst #1209

Closed
nightkr opened this issue Apr 28, 2023 · 8 comments · Fixed by #1249
Labels
bug (Something isn't working), runtime (controller runtime related)

Comments

nightkr (Member) commented Apr 28, 2023

Would you like to work on this feature?

None

What problem are you trying to solve?

During a resync, all of the data must be held twice (in both JSON and k8s-openapi struct form), which for larger clusters causes the controller infrastructure to consume roughly twice as much memory as normal during the initial sync, and roughly 3x during later syncs, since the old state is also still kept around.

This is both a problem in itself (it can cause the controller to hit memory quotas during these bursts) and a problem for long-term memory usage (many memory allocators, including glibc's and jemalloc, rarely return memory to the OS once it has been allocated to the process).

Describe the solution you'd like

We can use paginated lists to minimize the overlap window. However, this must be controlled by the user, since it makes network traffic chattier, which may cause degradations on high-latency networks.
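
As a rough illustration of the pagination idea (a sketch only, not proposed API: the paged_list helper and the Pod type are placeholders), listing in pages via kube's ListParams limit/continue_token means only one page of JSON has to be held and parsed at a time, at the cost of extra round-trips to the apiserver:

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};

// Sketch of client-side pagination: fetch the list in fixed-size pages so only
// one page of JSON is buffered/parsed at a time instead of the whole list.
async fn paged_list(api: &Api<Pod>) -> kube::Result<Vec<Pod>> {
    let mut items = Vec::new();
    let mut continue_token: Option<String> = None;
    loop {
        let mut lp = ListParams::default().limit(500); // client-go's default page size
        if let Some(token) = &continue_token {
            lp = lp.continue_token(token);
        }
        let page = api.list(&lp).await?;
        continue_token = page.metadata.continue_.clone();
        items.extend(page.items);
        // An absent/empty continue token means this was the final page.
        if continue_token.as_deref().unwrap_or("").is_empty() {
            break;
        }
    }
    Ok(items)
}
```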

Describe alternatives you've considered

We can also use the new ?watch=true&sendInitialEvents=true API, which implements our list-then-watch pattern on the server side. I'd imagine that this would be ideal for us in most regards. However, this is still an alpha-level API, and would cause silent failures in older clusters.
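
For reference, a small sketch of the request shape this API uses, per the upstream streaming-lists documentation (this is not a kube call, just the query parameters involved):

```rust
// Illustrative only: the query parameters a server-side "streaming list" watch
// expects. Older clusters ignore the unknown sendInitialEvents parameter and
// serve a plain watch instead, which is the silent failure mentioned above.
fn streaming_list_path(namespace: &str) -> String {
    format!(
        "/api/v1/namespaces/{namespace}/pods\
         ?watch=true\
         &sendInitialEvents=true\
         &resourceVersionMatch=NotOlderThan\
         &allowWatchBookmarks=true"
    )
}
```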

Documentation, Adoption, Migration Strategy

We can implement chunking transparently in watcher, which would let us at least cut down the window of JSON that we need to parse at any given time. An "even better" solution here would be to implement a new watcher API that supports streaming resync. We should then be able to reimplement the current watcher API as a facade on top of that without major regressions compared to the current implementation. However, we would still need to consider whether that new streaming-resync watcher would be meaningfully helpful for Controller/reflector.

Target crate for feature

kube-runtime

nightkr added the bug (Something isn't working) and runtime (controller runtime related) labels Apr 28, 2023
clux (Member) commented Apr 28, 2023

Yeah, I imagine we would want to advocate for a sendInitialEvents-based solution long term, but it was only just released in 1.27. We could implement it before then, but we'd have to version/feature-flag it to avoid clashing with MSRV. Still, it's something that would be good to see.

The alternative of paginating internally in the watcher is also sensible; we could make a fixed number like 50 the default page size (and expose it in watcher::Config for people who need to tweak it), but this also breaks the atomicity of Restarted events. Then again, so does sendInitialEvents, so maybe the Restarted concept needs rethinking anyway?

If we just do sendInitialEvents, then that should allow us to kill flattening as a concept, since it's all single objects being passed around, afaict. Maybe it's worth starting to feature-flag a sendInitialEvents-based watcher now anyway, since this is a dev-level opt-in that may require changes to the watcher::Event/reflector interface.

NB: sendInitialEvents is on the main page now: https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists

nightkr (Member, Author) commented Apr 28, 2023

Even in 1.27 SIE is just an alpha-level feature, which IIRC means that the feature gate is disabled by default.

> The alternative of paginating internally in the watcher is also sensible; we could make a fixed number like 50 the default page size (and expose it in watcher::Config for people who need to tweak it), but this also breaks the atomicity of Restarted events.

It depends. We could change the interface to something like:

enum StreamingWatcherEvent<K> {
    /// A single object was added or updated.
    Added(K),
    /// A single object was removed.
    Removed(K),
    /// A restart (relist) has begun; pages will follow.
    RestartBegin,
    /// One page of the relist.
    RestartPage(Vec<K>),
    /// The relist has finished.
    RestartDone,
}

That would still let us keep a facade that buffers it down to the existing watcher::Event type. I don't think it's worth forcing this API change on all users since it's a decent chunk of extra complexity. It should also be able to cover both SIE and paginated lists.
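
A rough sketch of that facade (all types are declared locally here for illustration; Event mirrors the Applied/Deleted/Restarted shape watcher exposed at the time):

```rust
// Re-declared from the comment above so this sketch is self-contained.
enum StreamingWatcherEvent<K> {
    Added(K),
    Removed(K),
    RestartBegin,
    RestartPage(Vec<K>),
    RestartDone,
}

// Mirrors the existing watcher::Event shape at the time of writing.
enum Event<K> {
    Applied(K),
    Deleted(K),
    Restarted(Vec<K>),
}

/// Buffers streaming restart pages back into a single Restarted event,
/// preserving the existing atomic-relist interface for current consumers.
struct RestartBuffer<K> {
    pages: Vec<K>,
}

impl<K> RestartBuffer<K> {
    fn new() -> Self {
        Self { pages: Vec::new() }
    }

    /// Feed one streaming event in; pages are held back until RestartDone,
    /// at which point a single buffered Restarted event is emitted.
    fn fold(&mut self, ev: StreamingWatcherEvent<K>) -> Option<Event<K>> {
        match ev {
            StreamingWatcherEvent::Added(obj) => Some(Event::Applied(obj)),
            StreamingWatcherEvent::Removed(obj) => Some(Event::Deleted(obj)),
            StreamingWatcherEvent::RestartBegin => {
                self.pages.clear();
                None
            }
            StreamingWatcherEvent::RestartPage(page) => {
                self.pages.extend(page);
                None
            }
            StreamingWatcherEvent::RestartDone => {
                Some(Event::Restarted(std::mem::take(&mut self.pages)))
            }
        }
    }
}
```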

nightkr added a commit to nightkr/kube-rs that referenced this issue Apr 28, 2023
See kube-rs#1209, relatively untested prototype and doesn't include any proposed API changes
clux linked a pull request Apr 29, 2023 that will close this issue
clux linked a pull request Jul 13, 2023 that will close this issue
clux added a commit that referenced this issue Jul 13, 2023
* Paginate initial watch list - see #1209; relatively untested prototype and doesn't include any proposed API changes
* Remove managed_fields stripper - users can do this anyway using for_stream
* Make page size configurable
* initial clippy fixes
* more unrelated clippy fixes
* clippy fixes related to this pr
* align default pagination size with client-go (500)
* caveat the page example + rename to page_size - name matches what it's called upstream + it's shorter
* add a new mock test system for the runtime
* remove unnecessary boxing + slight rename

Signed-off-by: clux <sszynrae@gmail.com>
Co-authored-by: Natalie Klestrup Röijezon <nat@nullable.se>
clux reopened this Jul 14, 2023
clux (Member) commented Jul 14, 2023

The initial watcher pagination implementation is included in 0.84, and it should lessen the impact on upstream apiservers (and allow watchers on more extreme cluster layouts). That said, the increased memory use from having to cache a full list while receiving data (which is what this issue is about) will still happen, to a lesser extent.
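
A minimal usage sketch, assuming kube 0.84+ on the tokio runtime and that the watcher::Config builder method ended up named page_size (per the rename in the merged PR); exact names and defaults may differ:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{runtime::watcher, Api, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);
    // Page size is assumed to default to 500 (matching client-go); lowering it
    // trades a chattier initial LIST for a smaller peak of buffered JSON.
    let cfg = watcher::Config::default().page_size(50);
    watcher(pods, cfg)
        .try_for_each(|event| async move {
            println!("{event:?}");
            Ok(())
        })
        .await?;
    Ok(())
}
```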

It could still make sense to add a Page((Vec<K>, continue: bool)) (or Page(Vec<K>) + ListComplete) event to watcher::Event to at least allow streaming watchers of large resource lists without having to buffer all the pages.
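
Hypothetically (this is not the actual kube-runtime API, just the shape being discussed), that could look like:

```rust
// Hypothetical only: what the Page / ListComplete variants mentioned above
// might look like if they replaced the buffered Restarted(Vec<K>) variant.
enum Event<K> {
    Applied(K),
    Deleted(K),
    /// One page of an in-progress (re)list, forwarded as soon as it arrives.
    Page(Vec<K>),
    /// The (re)list finished; stores may now prune anything they did not re-see.
    ListComplete,
}
```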

But there are some downsides downstream to doing this (on top of the API change/complexity Nat mentioned):

  • the reflector store will still have to do the buffering to avoid Store blind spots (times where we only have parts of the data) - meaning that for atomic-style stores (like ours) you always have to do the buffering somewhere
  • doing slow work while handling pages could also lead to pages getting invalidated (causing us to have to re-list) - this is probably not a big deal in most cases, though

regardless, avoiding the internal buffering in the future would benefit things not using reflectors.

So, in a future world (say 1-2 years from now, where sendInitialEvents is available everywhere) there is potentially also no need for a list call. Having a ListComplete equivalent could be a good way to signal the SIE bookmark that indicates the end of the initial events, so that reflector stores can do their buffering, and maybe that could be a stepping-stone?

clux (Member) commented Sep 8, 2023

As of 0.86 we have support for both pagination and sendInitialEvents (when the feature is available and opted in to), so closing this.

(There might be smarter improvements around returning data to the user faster rather than buffering in watcher until a full list has completed, but such an upgrade would need to consider reflector usage, which needs to wait for its atomic store property - so leaving something like that to a potential future proposal.)

clux closed this as completed Sep 8, 2023
nightkr (Member, Author) commented Sep 9, 2023

> the reflector store will still have to do the buffering to avoid Store blind spots (times where we only have parts of the data) - meaning that for atomic-style stores (like ours) you always have to do the buffering somewhere

I don't think we strictly need this; we just need, at the end, a list of which objects to keep/delete. We never guarantee that the reflector is atomic.

clux (Member) commented Sep 9, 2023

are you suggesting that we could forward events from the initial-lists, and have the store keep track of elements that it got during the initial-list, so that when the initial-list is complete, it could delete the complement itself?

we do currently say that Restarted is a signal to do an atomic replace, and that watcher is an atomic interface for a state store

nightkr (Member, Author) commented Sep 9, 2023

> are you suggesting that we could forward events from the initial-lists, and have the store keep track of elements that it got during the initial-list, so that when the initial-list is complete, it could delete the complement itself?

Yes.

> we do currently say that Restarted is a signal to do an atomic replace

I think this is a point-of-view thing. watcher currently guarantees it, but reflector doesn't depend strictly on that guarantee.
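
A rough sketch of that approach (names are illustrative): the store applies relist pages as they arrive, tracks which keys it has re-seen, and prunes the complement once the relist completes, so only a set of keys is buffered rather than a second full copy of every object.

```rust
use std::collections::{HashMap, HashSet};

struct Store<K> {
    /// Live objects, keyed by e.g. "namespace/name".
    objects: HashMap<String, K>,
    /// Keys re-seen during an in-progress relist.
    seen_in_relist: HashSet<String>,
}

impl<K> Store<K> {
    fn new() -> Self {
        Self { objects: HashMap::new(), seen_in_relist: HashSet::new() }
    }

    /// A relist has started: forget what was seen in any previous relist.
    fn relist_begin(&mut self) {
        self.seen_in_relist.clear();
    }

    /// Apply one relist page immediately; updates become visible right away.
    fn relist_page(&mut self, page: Vec<(String, K)>) {
        for (key, obj) in page {
            self.seen_in_relist.insert(key.clone());
            self.objects.insert(key, obj);
        }
    }

    /// The relist finished: prune everything that was not re-seen.
    fn relist_complete(&mut self) {
        let seen = &self.seen_in_relist;
        self.objects.retain(|key, _| seen.contains(key));
        self.seen_in_relist.clear();
    }
}
```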

clux (Member) commented Sep 9, 2023

yeah, that makes sense. less double buffering, and better memory profile. 👍

probably needs some care for the store readiness check, and some care to work for both paged and streaming lists, but worth raising a help-wanted issue for (at the very least).
