
KEP-3157: allow informers for getting a stream of data instead of chunking #3142

Merged: 1 commit, Feb 3, 2022

Conversation

@p0lyn0mial (Contributor) commented Jan 14, 2022:

  • One-line PR description: informers can get a stream of data to prime the caches
  • Other comments: The kube-apiserver is vulnerable to memory explosion.
    The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.
    Uncontrolled and unbounded memory consumption of the server affects not only clusters that operate in
    HA mode but also other programs that share the same machine.
    In this KEP we propose a potential solution to this issue.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 14, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 14, 2022
@p0lyn0mial p0lyn0mial force-pushed the kep-watch-list branch 2 times, most recently from 119d33a to 2d778e2 on January 14, 2022 15:30
@p0lyn0mial (Contributor Author):

/assign @wojtek-t

@wojtek-t (Member) left a comment:

I did a first pass on this.


Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
The primary issue with LIST requests is unpredictable memory consumption.
Member:

It's worth mentioning that LIST can be served from two places:

  • from etcd (default)
  • from kube-apiserver cache (watchcache) - if explicitly requested by setting ResourceVersion param of the list (e.g. ResourceVersion="0").

FWIW, the second is what informers actually do by default now (they can fall back to the former, but it's only a fallback).

I think we should make this more explicit, as it's actually an important context for this KEP too.

Contributor Author:

updated the Appendix section and added a reference to it. PTAL

This might lead to thrashing, starving, and finally losing other processes running on the same node, including the kubelet.
Stopping the kubelet has serious consequences, as it leads to workload disruption and a much bigger blast radius.
Note that in that scenario even clusters in an HA setup are affected.
Recovery of large clusters, with their correspondingly many kubelets and hence informers for pods, secrets, and configmaps, can lead to such a storm of LISTs.
Member:

Those aren't really a problem if informers are not falling back to listing from etcd. If they are listing from the watchcache, then we actually have proper indices in the watchcache, and hence it all amortizes well.

The problem is when they fall back to listing from etcd: e.g. to list pods from a given node from etcd, we actually need to list all pods, unmarshal them, and only then filter out the unneeded ones. And we're falling back e.g. when the watchcache is not yet initialized, which happens often after a kube-apiserver restart.

It's another piece of context that can be useful to add.

Contributor Author:

good point, updated this bit and the Appendix section. PTAL

Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 2 MB in etcd plus a few additional allocations, that will be explained later in this document.
Member:

IIRC, the etcd (default) limit is 1.5MB.
That said - it's the serialized object - by default we store protobufs in etcd, so the object in memory can be much bigger (even an order of magnitude, potentially).

Contributor Author:

clarified

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Member:

Commenting here to prevent me from forgetting about it (it's only partially related to this point).

We should ensure that in case watchcache is not yet initialized, we won't be falling back to etcd:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L653-L657
but rather something like "wait with timeout and in case of timeout return an error of not-yet-initialized" or sth like that.

If we don't do that, we risk that in case of a storm of requests hitting a not-yet-initialized kube-apiserver, we will actually face the same failure mode as we experience now.

Contributor Author:

it makes sense, but we should ensure it won't deadlock - in case of a timeout the informers should be able to recover, i.e. fall back to LIST/WATCH semantics.
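A minimal sketch of the "wait with timeout instead of falling back to etcd" idea discussed above; the type, the readiness channel, and the timeout value are illustrative assumptions, not the actual cacher code:

```go
package cacherdemo

import (
	"context"
	"fmt"
	"time"
)

// Cacher stands in for the apiserver watch cache; readyCh is a hypothetical
// channel that is closed once the cache has been initialized from etcd.
type Cacher struct {
	readyCh chan struct{}
}

// waitUntilReady waits (bounded by a timeout) for the cache to become ready
// instead of silently delegating the request to etcd. On timeout it returns an
// error, so the informer can recover, e.g. by falling back to LIST/WATCH.
func (c *Cacher) waitUntilReady(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second) // timeout value is an assumption
	defer cancel()
	select {
	case <-c.readyCh:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("watch cache not yet initialized: %w", ctx.Err())
	}
}
```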

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
Member:

It's worth mentioning that RV in etcd is global (i.e. across all types).
Whereas watchcache is watching only for changes for objects of a given type.

So if you have a case that my watchcache of pods is synchronized to RV=100,
but the next operation was secret creation with RV=101, watchcache RV will not get updated.

This was to significant extent addressed by https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1904-efficient-watch-resumption and reusing progress-notify feature from etcd.
That said, the frequency of progress-notify events we're using now is 5s, which means that your LIST is expected to wait on average at least 2.5s before getting the bookmark mentioned below.

This sounds fine to me, but I think we should be very explicit about this.

Contributor Author:

clarified.

The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
Member:

Where we will effectively land is that if we aren't able to send the "initial state" before more events than the buffer can hold are delivered to the watch in the meantime, the watcher will get closed.

We may try to play a bit with the buffer size, to mitigate the impact of that.

Contributor Author:

I think we should be fine. The size of the buffer is different for different resources, but it is either 10 or 1000. In the worst-case scenario, we restart the WATCH request.
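For illustration, a rough sketch of the two-phase delivery described in the quoted text (a non-blocking send first, then a bounded second chance for blocked watchers); the types, buffer sizes, and grace period are assumptions, not the real cacher implementation:

```go
package dispatchdemo

import "time"

// event stands in for the cacher's watchCacheEvent.
type event struct{ resourceVersion uint64 }

// cacheWatcher stands in for the real type; incoming mirrors the buffered
// channel whose size is 10 or 1000 depending on the resource.
type cacheWatcher struct {
	incoming chan event
}

// nonblockingAdd is the first, non-blocking delivery attempt.
func (w *cacheWatcher) nonblockingAdd(ev event) bool {
	select {
	case w.incoming <- ev:
		return true
	default:
		return false
	}
}

// dispatch gives watchers with full buffers a second, time-bounded chance and
// returns the ones that still could not accept the event (these get closed).
func dispatch(watchers []*cacheWatcher, ev event) (toClose []*cacheWatcher) {
	var blocked []*cacheWatcher
	for _, w := range watchers {
		if !w.nonblockingAdd(ev) {
			blocked = append(blocked, w)
		}
	}
	timer := time.NewTimer(50 * time.Millisecond) // grace period is an assumption
	defer timer.Stop()
	timedOut := false
	for _, w := range blocked {
		if timedOut {
			toClose = append(toClose, w)
			continue
		}
		select {
		case w.incoming <- ev:
		case <-timer.C:
			timedOut = true
			toClose = append(toClose, w)
		}
	}
	return toClose
}
```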

First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously even without regular events happening.
There are two sources of resourceVersions.
The first one is regular events that contain RVs besides objects.
The second one is a special type of etcd event called progressNotification delivering the most up-to-date revision with the given interval.
Member:

those aren't delivered to the end client at all - they are consumed only by kube-apiserver and not forwarded further

Contributor Author:

clarified

At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event.
As of today, the interval is set to 1 second.
The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.
Member:

watcher?

Contributor Author:

updated

The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.

The expiry interval initially will be decreased to 1 second in this feature's code-path.
Member:

This won't scale. Delivering a bookmark every 1s to every watcher will blow us up (unless I misunderstood what you want to do here).

Contributor Author:

I think that the default delivery (expiry) time for bookmark events is 1 minute. That means an initial request (RV=0 and the resourceVersionMatch set) would have to wait up to 60s. I propose to decrease that time only for watchers in that special state.

That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.

Before sending objects any further the cacheWatcher does a DeepCopy of every object that has not been wrapped into the cachingObject.
Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.
With RemoveSelfLink [graduating](https://github.com/kubernetes/kubernetes/blob/956dfb5196c2884b61804e94810e29ca077301ee/staging/src/k8s.io/apiserver/pkg/features/kube_features.go#L138) to GA (and already disabled by default) we are able to safely avoid this copying.
Member:

This seems unrelated to this KEP (I mean, we should do that, but we should drive that from RemoveSelfLink) - do we have to mention it here?

Contributor Author:

removed

2. Reduce the number of allocations in the WatchServer

The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients.
It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image.
Two of them deserve special attention, namely allocations 1 and 3, because they won't reuse memory and rely on the GC for cleanup.
Member:

Did you try running your experiment with JSON as an output format (as opposed to protobuf)?

JSON is obviously more verbose, but I'm wondering if it is doing that many allocations too (from quick glance into the code it seems to behave better).

@p0lyn0mial (Contributor Author) commented Jan 18, 2022:

nope, but I optimized that part and re-ran the test with 2048 informers - all synced! (1 TB of data). Here is a graph of memory usage during that run:

[image: 2048-informers memory usage graph]

@p0lyn0mial (Contributor Author):

@wojtek-t many thanks for taking the time to read and review the KEP. PTAL

@p0lyn0mial p0lyn0mial force-pushed the kep-watch-list branch 3 times, most recently from 68e5d14 to 9ce605b on January 18, 2022 12:00
@p0lyn0mial p0lyn0mial changed the title KEP-NNNN: allow informers for getting a stream of data instead of chunking KEP-3157: allow informers for getting a stream of data instead of chunking Jan 18, 2022
@wojtek-t (Member) left a comment:

Sorry - just a quick next pass; I will look at it in more detail early next week.


Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
Member:

nit: "currently in etcd" is not necessarily true for lists served from the watchcache.
I would just remove this part of the sentence.

- reduce etcd load by serving from watch cache
- get a replacement for paginated lists from watch-cache, which is not feasible without major investment
- enforce consistency in the sense of freshness of the returned list
- be backward compatible with new client -> old server
Member:

I would like to discuss one more thing, i.e. overhead on the client side.
There is a significant overhead from relisting on the client side too.
In an ideal world, I would like to have this addressed too, but in either case I would like to see it added explicitly to either goals or non-goals.

Clarifying this one - sorry for mental shortcut.

The problem I have in mind is that in the happy case, when the component initializes its cache once (LIST at the beginning) and then keeps watching, we can predict its resource usage quite well. This is how people often set requests/limits for components.

However, when your component (using an informer/reflector/...) needs to relist, it effectively LISTs everything from kube-apiserver and for a moment has both the old copy of the state and the new one.
Which means that in many cases we're effectively doubling its footprint.

For the largest (and thus most problematic) clusters, generally the diff between the old and new isn't that huge, but we still have two copies of objects even for those that didn't change (one in the cache and the second that we got from the LIST).

However, if we're going to switch to watching, at least in theory we wouldn't have to keep the whole second copy of the state. What we could do is:

  • for objects that didn't change, just mark them as fresh in the cache
  • for objects that changed, store their new version in the buffer below
  • once we get the whole state (via this watch stream), then atomically:
    (a) sweep over the cache and remove objects not marked as fresh
    (b) put the object from that buffer into the cache

This would help a lot with the client-side memory spikes.

@sttts (Contributor) commented Jan 20, 2022:

So if I understand you correctly, you would – without me knowing exactly the factoring of the informers/reflectors right now – basically make use of the individual incoming events and compare those with the existing store of the informer. If they differ, you keep them for the later, atomic switch-over. If they don't differ, you throw away the events, but just take a reference on the existing object for your new snapshot.

In other words, streaming opens a door for optimizing away this double memory consumption of a giant unpaged watch-cache backed LIST request.

I.e. this is some additional value proposition of the enhancement, because we can do these kinds of optimizations later on.

Member:

Yes - this is what I meant.

And yes - it allows us to also optimize client side - I was asking if this is in-scope of the enhancement, or we should be treating it as "future work".
[I would really like this to happen too, but for the sake of making progress, I can live with "future work" too :) ]
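A minimal sketch of the client-side optimization discussed in this thread (mark unchanged objects as fresh, buffer only the deltas, and switch over atomically once the initial stream ends); the types and channels are hypothetical, not existing client-go code:

```go
package reflectordemo

// object is a stand-in for a cached API object keyed by namespace/name.
type object struct {
	key             string
	resourceVersion string
}

// resyncViaWatch consumes the initial watch stream: unchanged objects are only
// marked fresh (no second copy kept), changed objects are buffered, and the
// cache is updated atomically once the initial stream completes (e.g. on the
// bookmark event).
func resyncViaWatch(cache map[string]object, stream <-chan object, done <-chan struct{}) {
	fresh := make(map[string]bool, len(cache))
	buffer := make(map[string]object)
	for {
		select {
		case o := <-stream:
			if cur, ok := cache[o.key]; ok && cur.resourceVersion == o.resourceVersion {
				fresh[o.key] = true // unchanged: keep the existing copy
			} else {
				buffer[o.key] = o // changed or new: keep only the delta
			}
		case <-done: // end of the initial stream
			for key := range cache {
				if _, changed := buffer[key]; !fresh[key] && !changed {
					delete(cache, key) // not seen in the stream: deleted server-side
				}
			}
			for key, o := range buffer {
				cache[key] = o
			}
			return
		}
	}
}
```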


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Member:

What does "consistent streaming" really mean?

Contributor Author:

it means that the returned data is consistent - served from etcd via a quorum read.
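To illustrate such a quorum read: one cheap way to obtain the latest revision from etcd without transferring any objects is a count-only range request (etcd Get requests are linearizable, i.e. quorum reads, by default). The key prefix here is illustrative:

```go
package consistencydemo

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// latestRevision returns a consistent store revision without fetching any
// objects: a count-only range request still carries the current revision in
// its response header.
func latestRevision(cli *clientv3.Client) (int64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Get is linearizable by default, so the header reflects a quorum read.
	resp, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, fmt.Errorf("count request failed: %w", err)
	}
	return resp.Header.Revision, nil
}
```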


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
Member:

It's probably worth mentioning, that watching with RV=0 is already relatively close to what we need, and describe its current semantics.

Contributor Author:

clarified in the Note section (a few lines down)

Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude)
Member:

I think I don't understand the question - we don't have a strict limit for number of objects, so technically we're not bounded, right?

Contributor Author:

the memory usage is bounded by the max size of an object plus all the allocations that are needed to handle the requests.

Member:

This isn't strictly true.

Basically, to serve the current state (even via watch), you still need to copy (sure - not whole objects, but the pointers to them) [well - technically, we could be using a copy-on-write tree to store the current state, but that's not what we're doing]. So we're actually allocating a pointer per object to store that:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/watch_cache.go#L598

Also, it's not that for watch we wait with processing the next event until all the data from the previous one is sent and the memory buffer is released. So I think this sentence is a bit misleading.

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Member:

It seems my comment got lost, so copying here:

I think we should try to make the new semantics as similar to the current one as possible.
What I have in mind in particular is that currently, when the reflector has to re-list, it first sets the RV to the version of the last observed change.
This is extremely important to avoid going back in time.

I think we should maintain it - in the new API:

if you don't specify the RV, then indeed you contact etcd to get the current one
but if you specify one, you actually don't have to fall back to etcd, just use the provided one

Contributor Author:

I captured it in the Design Details section, see if you like it.

It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data.
There are ways to do that cheaply, e.g. we could issue a count request against the datastore.
Next, the cacher creates a new cacheWatcher (implements watch.Interface) passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache.
After sending initial data the cacheWatcher starts listening on an incoming channel for new events, including a bookmark event.
Member:

nit: technically it starts watching on the incoming channel immediately, it just buffers those events until the initial state is sent out
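A rough, simplified sketch of the flow described in the quoted text (send the initial state, forward buffered updates until the cache catches up to bookmarkAfterResourceVersion, then forward the bookmark); the names are illustrative, not the actual cacheWatcher code:

```go
package watchlistflow

// watchEvent stands in for watch.Event; bookmark marks the synthetic event
// carrying only a resourceVersion.
type watchEvent struct {
	resourceVersion uint64
	bookmark        bool
}

// serveWatchList streams the cache's current content first, then forwards
// buffered updates; once a bookmark at or past bookmarkAfterRV is forwarded,
// the informer knows it has received a consistent snapshot.
func serveWatchList(initial []watchEvent, incoming <-chan watchEvent, bookmarkAfterRV uint64, out chan<- watchEvent) {
	for _, ev := range initial {
		out <- ev
	}
	for ev := range incoming {
		out <- ev
		if ev.bookmark && ev.resourceVersion >= bookmarkAfterRV {
			break // the "list" part is complete
		}
	}
	// After the bookmark the stream simply continues as a regular WATCH (omitted).
}
```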

The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
Member:

This effectively opens us for a new failure mode.

The scenario that I have in mind is a resource type where we observe high churn (let's say 100 object changes per second) with a huge number of objects.

Currently, if we need to relist, we just:

  1. LIST (let's assume it can finish successfully within its 1m timeout)
  2. starts watching from that
    Given that the watchcache keeps at least 75s of history, watching should just work, and worst case, if due to a bug it lost its history, we at least made progress (by updating the cache in the component) via the LIST request.

In the new proposed design:

  1. We open a watch, and start streaming the whole state
  2. At the same time we accumulate in the incoming buffer the watch events happening in the meantime
  3. Soon after, the buffer fills up (before we even managed to stream the initial state)
  4. And we close the watch

On the client side we need to retry, but we didn't make any progress (because we didn't get the whole initial state even, so couldn't update the cache).

I think it's solvable (we just don't close the watch immediately, but rather wait until it streams the data and close it ~then), but it requires some changes, which we should definitely do as part of this work.

Contributor Author:

This is a good point. I think that simply delaying closure might break the consistency guarantees we are trying to provide here. I think that delaying closure only if the buffer contains data up to the desired RV (bookmarkAfterResourceVersion) would preserve consistency, assuming we would send that data right after the initial events. Thoughts?

Note: kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV="0" watches.
This mode is not used by informers today but is supported by every kube-apiserver for legacy compatibility reasons.

Note 2: informers need consistent lists to avoid time-travel when switching to another HA instance of kube-apiserver with outdated/lagging watch cache.
Member:

Copying:

I had it on my plate to reply there - this is addressed within a single incarnation (i.e. if a binary doesn't restart, the reflector will not time-travel).
This is only not addressed across component restarts (so basically only for the initial list - relists work fine; worst case they just fall back to etcd).

@p0lyn0mial replied:
should we put it as a goal of this KEP? (Ensure the watch cache is fresh when RV=0).
I think it should be easy once we have this KEP implemented.

Partially. I think that we shouldn't change the default behavior of RV=0 (it by definition means at least from RV=0, which is trivially always satisfied). And that's fine.
I think that we should address that by defining some new "ResourceVersionMatch" value that will be able to ensure that (or sth like that).

Contributor Author:

ic, we could use the same value used for the "new" WATCH request. I proposed MostRecent


## Design Details
Member:

What I would like to see in this section is the exact changes in the API that you're also proposing (the new ResourceVersionMatch option that you're proposing, the exact semantics, etc.).

Contributor Author:

updated, see if it makes more sense now.

Contributor Author:

done.

- protect kube-apiserver and its node against list-based OOM attacks
Contributor:

it doesn't sound like you can protect against attacks (someone intentionally listing). Your proposal appears to be about allowing well-behaved clients (informers) to change their client code to issue a watch instead.

Contributor:

To a degree also against attacks. If you know your system does not need many lists, you can heavily throttle (p&f). Today we cannot because everybody has to list.


In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
Contributor:

Are you actually changing LIST or are you changing informers (client) to use a watch and then tweaking watches?

Contributor Author:

we are going to change informers to use a single WATCH request instead of a LIST followed by a WATCH request.

Member:

If the primary problem is that apiserver can run out of memory, fixing clients is not going to solve that problem. (I haven't read this yet)

The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
Contributor:

does our client actually care about getting the latest RV, or does it care about knowing which RV it has gotten?

Contributor:

It does, that's the whole point here and a long-known problem we have today of time-travel under certain conditions. Behaviour then is undefined. See @smarterclayton's kubernetes/kubernetes#59848.

- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
- step 2c: send a bookmark event to the informer with the given RV.
Contributor:

Are you trying to use this to indicate that the initial list is finished?

Contributor:

yes, per @wojtek-t's proposal the first bookmark event marks the beginning of normal, non-initial events.

Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests.
In short, it needs to bring data from the database, unmarshal it, do some conversions and prepare the final response for the client.
The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption.
Neither priority and fairness nor Golang garbage collection is able to protect the system from exhausting memory.
Member:

APF should be able to limit a collection to 1 concurrent LIST (i.e., serialize all list calls). You're saying there's a condition where apiserver cannot handle a single LIST call?

- get rid of list or list pagination
- rewrite the list storage stack to allow streaming, but rather use the existing streaming infrastructure (watches).

## Proposal
Member:

Here's an alternative:

a) Force everyone to use pagination (e.g. serialize non-paginated lists)
b) The server may decide to reduce the page size based on the size of the response from etcd.
c) The continue token is of course changed to point at the next unseen item.


- [ ] Metrics
- Metric name:
Contributor:

Given the intent of the KEP, it's appropriate to have metrics for the alpha release.

Contributor Author:

This KEP intends to lower the memory consumption needed to handle "list" requests. The best metric for measuring memory consumption of an API server I have found so far is container_memory_working_set_bytes (already exposed). The sig-scalability offered to help test memory consumption during some load tests (i.e. on 5K nodes). I think these tests along with the metric are a good starting point during the alpha phase.

Contributor Author:

for the issue with the full buffer we could have the following metrics

apiserver_cache_watcher_buffer_length (histogram, what was the buffer size)
apiserver_watch_cache_lag (histogram, for how far the cache is behind the expected RV)
apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
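For illustration, the metrics listed above could be declared roughly like this (metric names as proposed in the comment; the labels and bucket boundaries are assumptions, and the real apiserver uses its own metrics framework rather than raw Prometheus):

```go
package metricsdemo

import "github.com/prometheus/client_golang/prometheus"

var (
	// Buffer size observed when the initial state finished streaming.
	cacheWatcherBufferLength = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "apiserver_cache_watcher_buffer_length",
		Help:    "Size of the cacheWatcher buffer when the initial state finished streaming.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 11), // 1 .. 1024
	}, []string{"resource"})

	// How far behind the expected (etcd) resource version the watch cache is.
	watchCacheLag = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "apiserver_watch_cache_lag",
		Help:    "Difference between the expected resource version and the one the watch cache is at.",
		Buckets: prometheus.ExponentialBuckets(1, 4, 8),
	})

	// Watch requests closed because the buffer overflowed, per resource.
	terminatedWatchersTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "apiserver_terminated_watchers_total",
		Help: "Watch requests closed due to an overfull buffer in the new mode.",
	}, []string{"resource"})
)

func init() {
	prometheus.MustRegister(cacheWatcherBufferLength, watchCacheLag, terminatedWatchersTotal)
}
```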

Contributor Author:

done

@lavalamp (Member) commented Feb 2, 2022:

My thoughts:

I don't like that this solution forces clients to upgrade to fix the problem. I think API machinery has a bit of a justly-deserved reputation for doing this and this makes it worse. It will be especially hard for clients in other languages.

I don't like that this solution leaves us with two ways of getting a list-- and the easy and intuitive one will also break the server sometimes.

I do think that streaming a list is a superior way of getting it to the client. Have we brainstormed ways to fix LIST? (Imagine changing the server such that an initial unpaginated list does something different under the hood: call watch, use a serializer that omits the watch event envelope, and stops after it is caught up.)

Finally, since we're changing clients anyway, I don't like that the solution doesn't optimize things as much as possible. There's a class of design where the server only sends data the client doesn't have cached; one example is written up here: kubernetes/kubernetes#90339

Ultimately, although I'm conflicted, I don't really think these reasons are quite compelling enough to block this KEP, so I guess I won't.

@deads2k (Contributor) commented Feb 2, 2022:

This design provides the ability for well-behaved clients to reduce the load they place on the server and improve their own deserialization memory cost.

Providing this capability and using it in reflectors may make it possible to improve the list-via-watch-cache-pagination problem afterwards by reducing use.

@wojtek-t has implementation concerns that must be addressed before code in k/k is changed, but if he doesn't get back to reviewing this before KEP freeze, I think it is implementable.

@p0lyn0mial please update with alternatives suggested and considered and add a means of measuring the usage of this feature.

@p0lyn0mial (Contributor Author):

@deads2k I have updated the KEP with a potential solution. Although I don't understand why it should block alpha. After all, in the worst-case scenario informers will simply retry requests asking for the entire set again. It is not efficient but a similar situation can happen today with regular LIST requests. Currently, there is a timeout for LIST requests of 60 seconds. That means a slow reflector might fail synchronization as well and would have to re-establish the connection again and again.

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
We could try collecting some metrics for measuring how far the cache is behind the expected RV,
Contributor:

We will collect metrics

Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
Contributor:

we will delay

Contributor:

We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics ...

We could try collecting some metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we could explore this area further. One viable option worth considering would be,
@sttts (Contributor) commented Feb 3, 2022:

For a beta version, we have further options if they turn out to be necessary:

2. make the buffer dynamic - especially when the difference between RVs is > than 1000
3. inject new events directly into the initial list, i.e. have the initial-list loop consume the channel directly and avoid waiting for the whole initial list to be processed first.
4. cap the size (cannot allocate more than X MB of memory) of the buffer
5. maybe even apply some compression techniques to the buffer
Contributor:

like only storing a low-memory shallow reference and taking the actual objects for the event from the store.

@deads2k (Contributor) commented Feb 3, 2022:

@deads2k I have updated the KEP with a potential solution. Although I don't understand why it should block alpha.

ordinarily I'd wait for @wojtek-t, but that's not possible at this moment before freeze. He would have a chance to hold if he wished. If he chooses to exercise that option later, I think these circumstances are extenuating enough to accommodate that. I think this KEP is well described enough that I suspect he'd be biased toward merge, so I'm doing that. I also think this is a good approach, but @wojtek-t's knowledge of the watch cache mechanics exceeds mine.

/approve
/lgtm

PS, the PRR is fine too.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2022
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, p0lyn0mial

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit 328e235 into kubernetes:master Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 3, 2022


Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
Member:

You mean "all initial data", right?

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
Member:

I don't have data to back it up, but I'm afraid it may not be enough. Let me think about it more.

But (let me read the text below) assuming we have an option to improve for Beta, I would be fine with this for Alpha.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.
Member:

I think I'm not fully following what exactly we want to measure here, but we can probably clarify that during code review.

what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we have further options if they turn out to be necessary:
1. comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
Member:

This seems to be focusing on the case where the watchCache is lagging compared to etcd.
This isn't really the case I'm afraid of.

If we're lagging significantly, I think it's perfectly ok to just wait (without doing anything) until we catch up (with timeout) and if we don't, just fail the request. Then it was super cheap and it's fine (and not really a regression).

The case I'm worried about is:

  • say watchcache is perfectly keeping up with etcd
  • suddenly we're opening gazillions of watches to stream a huge amount of data (say e.g. all kube-proxies in a 5k-node cluster were restarted at exactly the same time)
  • we're starting streaming the data to all of them
  • in the meantime there is high churn of other events
  • all buffers are getting full and we're triggering closing all the watches without finishing streaming the initial state

If we really close this watch, then the watcher didn't make any progress. So we can go into this kind of crashloop.

But now let's say that, as suggested for Alpha, we will not close the watch immediately, but rather stream the initial state and close the watch only after that. This seems fine as long as it will not take ages.

But thinking about this more, the "watch initialization support in P&F" should be giving this to us. We should just double check if the support I added in kubernetes/kubernetes#103660 (+ the previous PR for it) is actually ok, or whether we should tweak it further.

So I think I'm fine with this.

What I think we should do though (maybe not for Alpha) is to not start by sending the "current watch state", but rather ensure that the watchcache is already up-to-date before we start sending anything.
I would like to avoid comparing things like "RV difference < 1000" - I think we should just do the pattern of
"waitUntilFresh + start sending"

The problem with this approach is that the RV returned from etcd is a global one, so in many cases we would be forced to wait for a "progress notify" event from etcd. Which isn't perfect because (with the 5s interval we're using by default) it would effectively add 2.5s latency on average.
But maybe what we can do then is to effectively force "RequestProgress()" calls: https://pkg.go.dev/go.etcd.io/etcd/client/v3#section-readme
with some exponential backoff - we would just have to coordinate it across different resources somehow, but that sounds doable.

I guess the bottom line is that I believe we have a reasonable path forward here - it's more a matter of carefully doing the work and the amount of that work.
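For reference, a minimal sketch of forcing a progress notification via the etcd client, as suggested above (the backoff and cross-resource coordination mentioned in the comment are omitted):

```go
package progressdemo

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// forceProgress asks etcd to emit a progress notification on the client's open
// watch channels, carrying the current global revision, instead of waiting up
// to the default 5s progress-notify interval.
func forceProgress(cli *clientv3.Client) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return cli.RequestProgress(ctx)
}
```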

Member:

@deads2k - FYI

// For watch calls, it begins with synthetic "Added" events of all resources up to the most recent ResourceVersion.
// It ends with a synthetic "Bookmark" event containing the most recent ResourceVersion.
// For list calls, it has the same semantics as leaving ResourceVersion and ResourceVersionMatch unset.
ResourceVersionMatchMostRecent ResourceVersionMatch = "MostRecent"
Member:

Hmm - I started wondering if we really need that.

"MostRecent" is effectively the semantic of resourceVersion="". We are using that semantic for List.

So instead of introducing yet another constant, I think we should just:

  • set RV="" for watch calls
  • handle this from watchcache (if it is enabled)

Member:

We need to be very careful to ensure that we will be backward compatible here, but I think it's doable.
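To make the discussion above concrete, a hedged sketch of how an informer-like client might issue the proposed streaming "list": a single WATCH with bookmarks enabled, treating the first bookmark as the end of the initial state. The ResourceVersionMatch value is the one proposed in this thread and does not exist in the API at this point; the whole snippet is illustrative only:

```go
package watchlistdemo

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// streamPods opens a single WATCH instead of LIST+WATCH and returns once the
// initial, consistent state has been fully received (first bookmark).
func streamPods(ctx context.Context, cs kubernetes.Interface) error {
	w, err := cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
		AllowWatchBookmarks:  true,
		ResourceVersion:      "",                                        // "" == most recent, as discussed above
		ResourceVersionMatch: metav1.ResourceVersionMatch("MostRecent"), // proposed value, illustrative only
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			break // initial state complete; a real informer would keep watching
		}
		// Handle the synthetic Added events that make up the initial state here.
	}
	return nil
}
```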

@nayihz commented Sep 26, 2023:

The kube-apiserver is vulnerable to memory explosion.
The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.

Hi @p0lyn0mial, could you give some references about this issue?

@p0lyn0mial (Contributor Author):

Hi @p0lyn0mial , could you give some references about this issue?

hey, please have a look at https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3157-watch-list#manual-testing-without-the-changes-in-place

In addition to that, you might be interested in the synthetic perf tests that have been prepared for this feature (i.e. http://perf-dash.k8s.io/#/?jobname=watch-list-off&metriccategoryname=E2E&metricname=LoadResources&PodName=kube-apiserver-bootstrap-e2e-master%2Fkube-apiserver&Resource=memory)
