
KEP-3157: allow informers for getting a stream of data instead of chunking #3142

Merged: 1 commit, Feb 3, 2022

Conversation

@p0lyn0mial (Contributor) commented Jan 14, 2022:

  • One-line PR description: informers can get a stream of data to prime the caches
  • Other comments: The kube-apiserver is vulnerable to memory explosion.
    The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.
    Uncontrolled and unbounded memory consumption of the server affects not only clusters that operate in
    HA mode but also other programs that share the same machine.
    In this KEP we propose a potential solution to this issue.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 14, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 14, 2022
@p0lyn0mial p0lyn0mial force-pushed the kep-watch-list branch 2 times, most recently from 119d33a to 2d778e2 on January 14, 2022 15:30
@p0lyn0mial (Contributor Author):

/assign @wojtek-t

@wojtek-t (Member) left a comment:

I did a first pass on this.


Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
The primary issue with LIST requests is unpredictable memory consumption.
Member:

It's worth mentioning that LIST can be served from two places:

  • from etcd (default)
  • from kube-apiserver cache (watchcache) - if explicitly requested by setting ResourceVersion param of the list (e.g. ResourceVersion="0").

FWIW, the second is what informers actually do by default now (they can fall back to the former, but it's only a fallback).

I think we should make this more explicit, as it's actually an important context for this KEP too.

Contributor Author:

updated the Appendix section and added a reference to it. PTAL

This might lead to thrashing, starving, and finally losing other processes running on the same node, including the kubelet.
Stopping the kubelet has serious consequences, as it leads to workload disruption and a much bigger blast radius.
Note that in that scenario even clusters in an HA setup are affected.
Recovery of large clusters, with their correspondingly many kubelets and hence informers for pods, secrets, and configmaps, can lead to such a storm of LISTs.
Member:

Those aren't really a problem if informers are not falling back to listing from etcd. If they are listing from the watchcache, then we actually have proper indices in the watchcache, and hence it all amortizes well.

The problem is when they fall back to listing from etcd: e.g. to list pods from a given node from etcd, we actually need to list all pods, unmarshal them, and only then filter out the unneeded ones. And we're falling back e.g. when the watchcache is not yet initialized, which happens often after a kube-apiserver restart.

It's another piece of context that can be useful to add.

Contributor Author:

good point, updated this bit and the Appendix section. PTAL

Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 2 MB in etcd plus a few additional allocations, that will be explained later in this document.
Member:

IIRC, the etcd (default) limit is 1.5MB.
That said - it's the serialized object - by default we store protobufs in etcd, so the object in memory can be much bigger (even an order of magnitude, potentially).

Contributor Author:

clarified

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Member:

Commenting here to prevent me from forgetting about it (it's only partially related to this point).

We should ensure that in case watchcache is not yet initialized, we won't be falling back to etcd:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L653-L657
but rather something like "wait with timeout and in case of timeout return an error of not-yet-initialized" or sth like that.

If we don't do that, we risk that in case of a storm of requests hitting a not-yet-initialized kube-apiserver, we will actually face the same failure mode as we experience now.

Contributor Author:

it makes sense, but we should ensure it won't deadlock - in case of a timeout the informers should be able to recover, i.e. fall back to LIST/WATCH semantics.
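A minimal sketch of the "wait with timeout instead of falling back to etcd" idea discussed above; the type, the readiness channel, and the timeout value are illustrative assumptions, not the actual cacher code:

```go
package cacherdemo

import (
	"context"
	"fmt"
	"time"
)

// Cacher stands in for the apiserver watch cache; readyCh is a hypothetical
// channel that is closed once the cache has been initialized from etcd.
type Cacher struct {
	readyCh chan struct{}
}

// waitUntilReady waits (bounded by a timeout) for the cache to become ready
// instead of silently delegating the request to etcd. On timeout it returns an
// error, so the informer can recover, e.g. by falling back to LIST/WATCH.
func (c *Cacher) waitUntilReady(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second) // timeout value is an assumption
	defer cancel()
	select {
	case <-c.readyCh:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("watch cache not yet initialized: %w", ctx.Err())
	}
}
```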

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
Member:

It's worth mentioning that RV in etcd is global (i.e. across all types).
Whereas watchcache is watching only for changes for objects of a given type.

So if you have a case that my watchcache of pods is synchronized to RV=100,
but the next operation was secret creation with RV=101, watchcache RV will not get updated.

This was to significant extent addressed by https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1904-efficient-watch-resumption and reusing progress-notify feature from etcd.
That said, the frequency of progress-notify events we're using now is 5s, which means that your LIST is expected to wait on average at least 2.5s before getting the bookmark mentioned below.

This sounds fine to me, but I think we should be very explicit about this.

Contributor Author:

clarified.

The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
Member:

Where we will effectively land is that if we aren't able to send the "initial state" before more events than the buffer can hold are delivered to the watch in the meantime, the watcher will get closed.

We may try to play a bit with the buffer size, to mitigate the impact of that.

Contributor Author:

I think we should be fine. The size of the buffer is different for different resources, but it is either 10 or 1000. In the worst-case scenario, we restart the WATCH request.
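For illustration, a rough sketch of the two-phase delivery described in the quoted text (a non-blocking send first, then a bounded second chance for blocked watchers); the types, buffer sizes, and grace period are assumptions, not the real cacher implementation:

```go
package dispatchdemo

import "time"

// event stands in for the cacher's watchCacheEvent.
type event struct{ resourceVersion uint64 }

// cacheWatcher stands in for the real type; incoming mirrors the buffered
// channel whose size is 10 or 1000 depending on the resource.
type cacheWatcher struct {
	incoming chan event
}

// nonblockingAdd is the first, non-blocking delivery attempt.
func (w *cacheWatcher) nonblockingAdd(ev event) bool {
	select {
	case w.incoming <- ev:
		return true
	default:
		return false
	}
}

// dispatch gives watchers with full buffers a second, time-bounded chance and
// returns the ones that still could not accept the event (these get closed).
func dispatch(watchers []*cacheWatcher, ev event) (toClose []*cacheWatcher) {
	var blocked []*cacheWatcher
	for _, w := range watchers {
		if !w.nonblockingAdd(ev) {
			blocked = append(blocked, w)
		}
	}
	timer := time.NewTimer(50 * time.Millisecond) // grace period is an assumption
	defer timer.Stop()
	timedOut := false
	for _, w := range blocked {
		if timedOut {
			toClose = append(toClose, w)
			continue
		}
		select {
		case w.incoming <- ev:
		case <-timer.C:
			timedOut = true
			toClose = append(toClose, w)
		}
	}
	return toClose
}
```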

First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously even without regular events happening.
There are two sources of resourceVersions.
The first one is regular events that contain RVs besides objects.
The second one is a special type of etcd event called progressNotification delivering the most up-to-date revision with the given interval.
Member:

those aren't delivered to the end client at all - they are consumed only by kube-apiserver and not forwarded further

Contributor Author:

clarified

At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event.
As of today, the interval is set to 1 second.
The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.
Member:

watcher?

Contributor Author:

updated

The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.

The expiry interval initially will be decreased to 1 second in this feature's code-path.
Member:

This won't scale. Delivering a bookmark every 1s to every watcher will blow us up (unless I misunderstood what you want to do here).

Contributor Author:

I think that the default delivery (expiry) time for bookmark events is 1 minute. That means an initial request (RV=0 and the resourceVersionMatch set) would have to wait up to 60s. I propose to decrease that time only for watchers in that special state.

That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.

Before sending objects any further the cacheWatcher does a DeepCopy of every object that has not been wrapped into the cachingObject.
Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.
With RemoveSelfLink [graduating](https://github.com/kubernetes/kubernetes/blob/956dfb5196c2884b61804e94810e29ca077301ee/staging/src/k8s.io/apiserver/pkg/features/kube_features.go#L138) to GA (and already disabled by default) we are able to safely avoid this copying.
Member:

This seems unrelated to this KEP (I mean, we should do that, but we should drive that from RemoveSelfLink) - do we have to mention it here?

Contributor Author:

removed

2. Reduce the number of allocations in the WatchServer

The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients.
It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image.
Two of them deserve special attention, namely allocations 1 and 3, because they won't reuse memory and rely on the GC for cleanup.
Member:

Did you try running your experiment with JSON as an output format (as opposed to protobuf)?

JSON is obviously more verbose, but I'm wondering if it is doing that many allocations too (from quick glance into the code it seems to behave better).

@p0lyn0mial (Contributor Author) commented Jan 18, 2022:

nope, but I optimized that part and re-ran the test with 2048 informers - all synced! (1 TB of data). Here is a graph of memory usage during that run:

[image: 2048-informers memory usage graph]

@p0lyn0mial (Contributor Author):

@wojtek-t many thanks for taking the time to read and review the KEP. PTAL

@p0lyn0mial p0lyn0mial force-pushed the kep-watch-list branch 3 times, most recently from 68e5d14 to 9ce605b on January 18, 2022 12:00
@p0lyn0mial p0lyn0mial changed the title KEP-NNNN: allow informers for getting a stream of data instead of chunking KEP-3157: allow informers for getting a stream of data instead of chunking Jan 18, 2022
@wojtek-t (Member) left a comment:

Sorry - just a quick next pass; I will look at it in more detail early next week.


Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
Member:

nit: "currently in etcd" is not necessarily true for lists served from the watchcache.
I would just remove this part of the sentence.

- reduce etcd load by serving from watch cache
- get a replacement for paginated lists from watch-cache, which is not feasible without major investment
- enforce consistency in the sense of freshness of the returned list
- be backward compatible with new client -> old server
Member:

I would like to discuss one more thing, i.e. overhead on the client side.
There is a significant overhead from relisting on the client side too.
In an ideal world, I would like to have this addressed too, but in either case I would like to see it added explicitly to either goals or non-goals.

Clarifying this one - sorry for mental shortcut.

The problem I have in mind is that in the happy case, when the component initializes its cache once (LIST at the beginning) and then keeps watching, we can predict its resource usage quite well. This is how people often set requests/limits for components.

However, when your component (using an informer/reflector/...) needs to relist, it effectively LISTs everything from kube-apiserver and for a moment has both the old copy of the state and the new one.
Which means that in many cases we're effectively doubling its footprint.

For the largest (and thus most problematic) clusters, generally the diff between the old and new isn't that huge, but we still have two copies of objects even for those that didn't change (one in the cache and the second that we got from the LIST).

However, if we're going to switch to watching, at least in theory we wouldn't have to keep the whole second copy of the state. What we could do is:

  • for objects that didn't change, just mark them as fresh in the cache
  • for objects that changed, store their new version in the buffer below
  • once we get the whole state (via this watch stream), then atomically:
    (a) sweep over the cache and remove objects not marked as fresh
    (b) put the object from that buffer into the cache

This would help a lot with the client-side memory spikes.

@sttts (Contributor) commented Jan 20, 2022:

So if I understand you correctly, you would – without me knowing exactly the factoring of the informers/reflectors right now – basically make use of the individual incoming events and compare those with the existing store of the informer. If they differ, you keep them for the later, atomic switch-over. If they don't differ, you throw away the events, but just take a reference on the existing object for your new snapshot.

In other words, streaming opens a door for optimizing away this double memory consumption of a giant unpaged watch-cache backed LIST request.

I.e. this is some additional value proposition of the enhancement, because we can do these kinds of optimizations later on.

Member:

Yes - this is what I meant.

And yes - it allows us to also optimize client side - I was asking if this is in-scope of the enhancement, or we should be treating it as "future work".
[I would really like this to happen too, but for the sake of making progress, I can live with "future work" too :) ]
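A minimal sketch of the client-side optimization discussed in this thread (mark unchanged objects as fresh, buffer only the deltas, and switch over atomically once the initial stream ends); the types and channels are hypothetical, not existing client-go code:

```go
package reflectordemo

// object is a stand-in for a cached API object keyed by namespace/name.
type object struct {
	key             string
	resourceVersion string
}

// resyncViaWatch consumes the initial watch stream: unchanged objects are only
// marked fresh (no second copy kept), changed objects are buffered, and the
// cache is updated atomically once the initial stream completes (e.g. on the
// bookmark event).
func resyncViaWatch(cache map[string]object, stream <-chan object, done <-chan struct{}) {
	fresh := make(map[string]bool, len(cache))
	buffer := make(map[string]object)
	for {
		select {
		case o := <-stream:
			if cur, ok := cache[o.key]; ok && cur.resourceVersion == o.resourceVersion {
				fresh[o.key] = true // unchanged: keep the existing copy
			} else {
				buffer[o.key] = o // changed or new: keep only the delta
			}
		case <-done: // end of the initial stream
			for key := range cache {
				if _, changed := buffer[key]; !fresh[key] && !changed {
					delete(cache, key) // not seen in the stream: deleted server-side
				}
			}
			for key, o := range buffer {
				cache[key] = o
			}
			return
		}
	}
}
```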


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Member:

What does "consistent streaming" really mean?

Contributor Author:

it means that the returned data is consistent - served from etcd via a quorum read.
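To illustrate such a quorum read: one cheap way to obtain the latest revision from etcd without transferring any objects is a count-only range request (etcd Get requests are linearizable, i.e. quorum reads, by default). The key prefix here is illustrative:

```go
package consistencydemo

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// latestRevision returns a consistent store revision without fetching any
// objects: a count-only range request still carries the current revision in
// its response header.
func latestRevision(cli *clientv3.Client) (int64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Get is linearizable by default, so the header reflects a quorum read.
	resp, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, fmt.Errorf("count request failed: %w", err)
	}
	return resp.Header.Revision, nil
}
```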


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
Member:

It's probably worth mentioning, that watching with RV=0 is already relatively close to what we need, and describe its current semantics.

Contributor Author:

clarified in the Note section (a few lines down)

Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude)
Member:

I think I don't understand the question - we don't have a strict limit for number of objects, so technically we're not bounded, right?

Contributor Author:

the memory usage is bounded by the max size of an object plus all the allocations that are needed to handle the requests.

Member:

This isn't strictly true.

Basically, to serve the current state (even via watch), you still need to copy (sure - not whole objects, but the pointers to them) [well - technically, we could be using a copy-on-write tree to store the current state, but that's not what we're doing]. So we're actually allocating a pointer per object to store that:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/watch_cache.go#L598

Also, it's not that for watch we wait with processing the next event until all the data from the previous one is sent and the memory buffer is released. So I think this sentence is a bit misleading.

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Member:

It seems my comment got lost, so copying here:

I think we should try to make the new semantics as similar to the current one as possible.
What I have in mind in particular is that currently, when the reflector has to re-list, it first sets the RV to the version of the last observed change.
This is extremely important to avoid going back in time.

I think we should maintain it - in the new API:

if you don't specify the RV, then indeed you contact etcd to get the current one
but if you specify one, you actually don't have to fall back to etcd, just use the provided one

Contributor Author:

I captured it in the Design Details section, see if you like it.

It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data.
There are ways to do that cheaply, e.g. we could issue a count request against the datastore.
Next, the cacher creates a new cacheWatcher (implements watch.Interface) passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache.
After sending initial data the cacheWatcher starts listening on an incoming channel for new events, including a bookmark event.
Member:

nit: technically it starts watching on the incoming channel immediately, it just buffers those events until the initial state is sent out
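A rough, simplified sketch of the flow described in the quoted text (send the initial state, forward buffered updates until the cache catches up to bookmarkAfterResourceVersion, then forward the bookmark); the names are illustrative, not the actual cacheWatcher code:

```go
package watchlistflow

// watchEvent stands in for watch.Event; bookmark marks the synthetic event
// carrying only a resourceVersion.
type watchEvent struct {
	resourceVersion uint64
	bookmark        bool
}

// serveWatchList streams the cache's current content first, then forwards
// buffered updates; once a bookmark at or past bookmarkAfterRV is forwarded,
// the informer knows it has received a consistent snapshot.
func serveWatchList(initial []watchEvent, incoming <-chan watchEvent, bookmarkAfterRV uint64, out chan<- watchEvent) {
	for _, ev := range initial {
		out <- ev
	}
	for ev := range incoming {
		out <- ev
		if ev.bookmark && ev.resourceVersion >= bookmarkAfterRV {
			break // the "list" part is complete
		}
	}
	// After the bookmark the stream simply continues as a regular WATCH (omitted).
}
```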

The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
Member:

This effectively opens us for a new failure mode.

The scenario that I have in mind is a resource type where we observe high churn (let's say 100 object changes per second) with a huge number of objects.

Currently, if we need to relist, we just:

  1. LIST (let's assume it can finish successfully within its 1m timeout)
  2. starts watching from that
    Given that the watchcache keeps at least 75s of history, watching should just work, and worst case, if due to a bug it lost its history, we at least made progress (by updating the cache in the component) via the LIST request.

In the new proposed design:

  1. We open a watch, and start streaming the whole state
  2. At the same time we accumulate in the incoming buffer the watch events happening in the meantime
  3. Soon after, the buffer fills up (before we even managed to stream the initial state)
  4. And we close the watch

On the client side we need to retry, but we didn't make any progress (because we didn't get the whole initial state even, so couldn't update the cache).

I think it's solvable (we just don't close the watch immediately, but rather wait until it streams the data and close it ~then), but it requires some changes, which we should definitely do as part of this work.

Contributor Author:

This is a good point. I think that simply delaying closure might break the consistency guarantees we are trying to provide here. I think that delaying closure only if the buffer contains data up to the desired RV (bookmarkAfterResourceVersion) would preserve consistency, assuming we would send that data right after the initial events. Thoughts?

Note: kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV="0" watches.
This mode is not used by informers today but is supported by every kube-apiserver for legacy compatibility reasons.

Note 2: informers need consistent lists to avoid time-travel when switching to another HA instance of kube-apiserver with outdated/lagging watch cache.
Member:

Copying:

I had it on my plate to reply there - this is addressed within a single incarnation (i.e. if a binary doesn't restart, the reflector will not time-travel).
This is only not addressed across component restarts (so basically only for the initial list - relists work fine; worst case they just fall back to etcd).

@p0lyn0mial replied:
should we put it as a goal of this KEP? (Ensure the watch cache is fresh when RV=0).
I think it should be easy once we have this KEP implemented.

Partially. I think that we shouldn't change the default behavior of RV=0 (it by definition means at least from RV=0, which is trivially always satisfied). And that's fine.
I think that we should address that by defining some new "ResourceVersionMatch" value that will be able to ensure that (or sth like that).

Contributor Author:

ic, we could use the same value used for the "new" WATCH request. I proposed MostRecent


## Design Details
Member:

What I would like to see in this section is the exact changes in the API that you're also proposing (the new ResourceVersionMatch option that you're proposing, the exact semantics, etc.).

Contributor Author:

updated, see if it makes more sense now.

Contributor Author:

done.

- protect kube-apiserver and its node against list-based OOM attacks
Contributor:

it doesn't sound like you can protect against attacks (someone intentionally listing). Your proposal appears to be about allowing well-behaved clients (informers) to change their client code to issue a watch instead.

Contributor:

To a degree also against attacks. If you know your system does not need many lists, you can heavily throttle (p&f). Today we cannot because everybody has to list.


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
Contributor:

Are you actually changing LIST or are you changing informers (client) to use a watch and then tweaking watches?

Contributor Author:

we are going to change informers to use a single WATCH request instead of a LIST followed by a WATCH request.

Member:

If the primary problem is that apiserver can run out of memory, fixing clients is not going to solve that problem. (I haven't read this yet)

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Contributor:

does our client actually care about getting the latest RV, or does it care about knowing which RV it has gotten?

Contributor:

It does, that's the whole point here and a long-known problem we have today of time-travel under certain conditions. Behaviour then is undefined. See @smarterclayton's kubernetes/kubernetes#59848.

- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
- step 2c: send a bookmark event to the informer with the given RV.
Contributor:

Are you trying to use this to indicate that the initial list is finished?

Contributor:

yes, per @wojtek-t's proposal the first bookmark event marks the beginning of normal, non-initial events.

Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests.
In short, it needs to bring data from the database, unmarshal it, do some conversions and prepare the final response for the client.
The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption.
Neither priority and fairness nor Golang garbage collection is able to protect the system from exhausting memory.
Member:

APF should be able to limit a collection to 1 concurrent LIST (i.e., serialize all list calls). You're saying there's a condition where apiserver cannot handle a single LIST call?

- get rid of list or list pagination
- rewrite the list storage stack to allow streaming, but rather use the existing streaming infrastructure (watches).

## Proposal
Member:

Here's an alternative:

a) Force everyone to use pagination (e.g. serialize non-paginated lists)
b) The server may decide to reduce the page size based on the size of the response from etcd.
c) The continue token is of course changed to point at the next unseen item.


- [ ] Metrics
- Metric name:
Contributor:

Given the intent of the KEP, it's appropriate to have metrics for the alpha release.

Contributor Author:

This KEP intends to lower the memory consumption needed to handle "list" requests. The best metric for measuring memory consumption of an API server I have found so far is container_memory_working_set_bytes (already exposed). The sig-scalability offered to help test memory consumption during some load tests (i.e. on 5K nodes). I think these tests along with the metric are a good starting point during the alpha phase.

Contributor Author:

for the issue with the full buffer we could have the following metrics

apiserver_cache_watcher_buffer_length (histogram, what was the buffer size)
apiserver_watch_cache_lag (histogram, for how far the cache is behind the expected RV)
apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
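For illustration, the metrics listed above could be declared roughly like this (metric names as proposed in the comment; the labels and bucket boundaries are assumptions, and the real apiserver uses its own metrics framework rather than raw Prometheus):

```go
package metricsdemo

import "github.com/prometheus/client_golang/prometheus"

var (
	// Buffer size observed when the initial state finished streaming.
	cacheWatcherBufferLength = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "apiserver_cache_watcher_buffer_length",
		Help:    "Size of the cacheWatcher buffer when the initial state finished streaming.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 11), // 1 .. 1024
	}, []string{"resource"})

	// How far behind the expected (etcd) resource version the watch cache is.
	watchCacheLag = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "apiserver_watch_cache_lag",
		Help:    "Difference between the expected resource version and the one the watch cache is at.",
		Buckets: prometheus.ExponentialBuckets(1, 4, 8),
	})

	// Watch requests closed because the buffer overflowed, per resource.
	terminatedWatchersTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "apiserver_terminated_watchers_total",
		Help: "Watch requests closed due to an overfull buffer in the new mode.",
	}, []string{"resource"})
)

func init() {
	prometheus.MustRegister(cacheWatcherBufferLength, watchCacheLag, terminatedWatchersTotal)
}
```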

Contributor Author:

done

@lavalamp (Member) commented Feb 2, 2022:

My thoughts:

I don't like that this solution forces clients to upgrade to fix the problem. I think API machinery has a bit of a justly-deserved reputation for doing this and this makes it worse. It will be especially hard for clients in other languages.

I don't like that this solution leaves us with two ways of getting a list-- and the easy and intuitive one will also break the server sometimes.

I do think that streaming a list is a superior way of getting it to the client. Have we brainstormed ways to fix LIST? (Imagine changing the server such that an initial unpaginated list does something different under the hood: call watch, use a serializer that omits the watch event envelope, and stops after it is caught up.)

Finally, since we're changing clients anyway, I don't like that the solution doesn't optimize things as much as possible. There's a class of design where the server only sends data the client doesn't have cached; one example is written up here: kubernetes/kubernetes#90339

Ultimately, although I'm conflicted, I don't really think these reasons are quite compelling enough to block this KEP, so I guess I won't.

@deads2k (Contributor) commented Feb 2, 2022:

This design provides the ability for well-behaved clients to reduce the load they place on the server and improve their own deserialization memory cost.

Providing this capability and using it in reflectors may make it possible to improve the list-via-watch-cache-pagination problem afterwards by reducing use.

@wojtek-t has implementation concerns that must be addressed before code in k/k is changed, but if he doesn't get back to reviewing this before KEP freeze, I think it is implementable.

@p0lyn0mial please update with alternatives suggested and considered and add a means of measuring the usage of this feature.

@p0lyn0mial (Contributor Author):

@deads2k I have updated the KEP with a potential solution. Although I don't understand why it should block alpha. After all, in the worst-case scenario informers will simply retry requests asking for the entire set again. It is not efficient but a similar situation can happen today with regular LIST requests. Currently, there is a timeout for LIST requests of 60 seconds. That means a slow reflector might fail synchronization as well and would have to re-establish the connection again and again.

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
We could try collecting some metrics for measuring how far the cache is behind the expected RV,
Contributor:

We will collect metrics

Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
Contributor:

we will delay

Contributor:

We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics ...

We could try collecting some metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we could explore this area further. One viable option worth considering would be,
@sttts (Contributor) commented Feb 3, 2022:

For a beta version, we have further options if they turn out to be necessary:

2. make the buffer dynamic - especially when the difference between RVs is > than 1000
3. inject new events directly into the initial list, i.e. have the initial-list loop consume the channel directly and avoid waiting for the whole initial list to be processed first.
4. cap the size (cannot allocate more than X MB of memory) of the buffer
5. maybe even apply some compression techniques to the buffer
Contributor:

like only storing a low-memory shallow reference and taking the actual objects for the event from the store.

@deads2k (Contributor) commented Feb 3, 2022:

@deads2k I have updated the KEP with a potential solution. Although I don't understand why it should block alpha.

ordinarily I'd wait for @wojtek-t, but that's not possible at this moment before freeze. He would have a chance to hold if he wished. If he chooses to exercise that option later, I think these circumstances are extenuating enough to accommodate that. I think this KEP is well described enough that I suspect he'd be biased toward merge, so I'm doing that. I also think this is a good approach, but @wojtek-t's knowledge of the watch cache mechanics exceeds mine.

/approve
/lgtm

PS, the PRR is fine too.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2022
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, p0lyn0mial

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit 328e235 into kubernetes:master Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 3, 2022


Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
Member:

You mean "all initial data", right?

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
Member:

I don't have data to back it up, but I'm afraid it may not be enough. Let me think about it more.

But (let me read the text below) assuming we have an option to improve for Beta, I would be fine with this for Alpha.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.
Member:

I think I'm not fully following what exactly we want to measure here, but we can probably clarify that during code review.

what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we have further options if they turn out to be necessary:
1. comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
Member:

This seems to be focusing on the case where the watchCache is lagging compared to etcd.
This isn't really the case I'm afraid of.

If we're lagging significantly, I think it's perfectly ok to just wait (without doing anything) until we catch up (with timeout) and if we don't, just fail the request. Then it was super cheap and it's fine (and not really a regression).

The case I'm worried about is:

  • say watchcache is perfectly keeping up with etcd
  • suddenly we're opening gazillions of watches to stream a huge amount of data (say e.g. all kube-proxies in a 5k-node cluster were restarted at exactly the same time)
  • we're starting streaming the data to all of them
  • in the meantime there is high churn of other events
  • all buffers are getting full and we're triggering closing all the watches without finishing streaming the initial state

If we really close this watch, then the watcher didn't make any progress. So we can go into this kind of crashloop.

But now let's say that, as suggested for Alpha, we will not close the watch immediately, but rather stream the initial state and close the watch only after that. This seems fine as long as it will not take ages.

But thinking about this more, the "watch initialization support in P&F" should be giving this to us. We should just double check if the support I added in kubernetes/kubernetes#103660 (+ the previous PR for it) is actually ok, or whether we should tweak it further.

So I think I'm fine with this.

What I think we should do though (maybe not for Alpha) is to not start by sending the "current watch state", but rather ensure that the watchcache is already up-to-date before we start sending anything.
I would like to avoid comparing things like "RV difference < 1000" - I think we should just do the pattern of
"waitUntilFresh + start sending"

The problem with this approach is that the RV returned from etcd is a global one, so in many cases we would be forced to wait for a "progress notify" event from etcd. Which isn't perfect because (with the 5s interval we're using by default) it would effectively add 2.5s latency on average.
But maybe what we can do then is to effectively force "RequestProgress()" calls: https://pkg.go.dev/go.etcd.io/etcd/client/v3#section-readme
with some exponential backoff - we would just have to coordinate it across different resources somehow, but that sounds doable.

I guess the bottom line is that I believe we have a reasonable path forward here - it's more a matter of carefully doing the work and the amount of that work.
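For reference, a minimal sketch of forcing a progress notification via the etcd client, as suggested above (the backoff and cross-resource coordination mentioned in the comment are omitted):

```go
package progressdemo

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// forceProgress asks etcd to emit a progress notification on the client's open
// watch channels, carrying the current global revision, instead of waiting up
// to the default 5s progress-notify interval.
func forceProgress(cli *clientv3.Client) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return cli.RequestProgress(ctx)
}
```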

Member:

@deads2k - FYI

// For watch calls, it begins with synthetic "Added" events of all resources up to the most recent ResourceVersion.
// It ends with a synthetic "Bookmark" event containing the most recent ResourceVersion.
// For list calls, it has the same semantics as leaving ResourceVersion and ResourceVersionMatch unset.
ResourceVersionMatchMostRecent ResourceVersionMatch = "MostRecent"
Member:

Hmm - I started wondering if we really need that.

"MostRecent" is effectively the semantic of resourceVersion="". We are using that semantic for List.

So instead of introducing yet another constant, I think we should just:

  • set RV="" for watch calls
  • handle this from watchcache (if it is enabled)

Member:

We need to be very careful to ensure that we will be backward compatible here, but I think it's doable.
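To make the discussion above concrete, a hedged sketch of how an informer-like client might issue the proposed streaming "list": a single WATCH with bookmarks enabled, treating the first bookmark as the end of the initial state. The ResourceVersionMatch value is the one proposed in this thread and does not exist in the API at this point; the whole snippet is illustrative only:

```go
package watchlistdemo

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// streamPods opens a single WATCH instead of LIST+WATCH and returns once the
// initial, consistent state has been fully received (first bookmark).
func streamPods(ctx context.Context, cs kubernetes.Interface) error {
	w, err := cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
		AllowWatchBookmarks:  true,
		ResourceVersion:      "",                                        // "" == most recent, as discussed above
		ResourceVersionMatch: metav1.ResourceVersionMatch("MostRecent"), // proposed value, illustrative only
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			break // initial state complete; a real informer would keep watching
		}
		// Handle the synthetic Added events that make up the initial state here.
	}
	return nil
}
```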

@nayihz commented Sep 26, 2023:

The kube-apiserver is vulnerable to memory explosion.
The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.

Hi @p0lyn0mial, could you give some references about this issue?

@p0lyn0mial (Contributor Author):

Hi @p0lyn0mial , could you give some references about this issue?

hey, please have a look at https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3157-watch-list#manual-testing-without-the-changes-in-place

In addition to that, you might be interested in the synthetic perf tests that have been prepared for this feature (i.e. http://perf-dash.k8s.io/#/?jobname=watch-list-off&metriccategoryname=E2E&metricname=LoadResources&PodName=kube-apiserver-bootstrap-e2e-master%2Fkube-apiserver&Resource=memory)
