diff --git a/keps/sig-api-machinery/3157-watch-list/README.md b/keps/sig-api-machinery/3157-watch-list/README.md
index 5df72d3ecd8c..3e21d0d4571a 100644
--- a/keps/sig-api-machinery/3157-watch-list/README.md
+++ b/keps/sig-api-machinery/3157-watch-list/README.md
@@ -385,13 +385,37 @@ Since the cacheWatcher starts processing the cacheWatcher.incoming channel only
 In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
 All watchers that have failed to deliver the event will be closed.

+Closing the watchers would force the clients to retry their requests and download the entire dataset again, even though they might have received a complete list before.
+
+For the alpha version, we will delay closing the watch request until all data has been sent to the client.
+We expect this to behave well even in heavily loaded clusters.
+To increase confidence in the approach, we will collect metrics measuring how far the cache is behind the expected RV and the average buffer size, as well as a counter of watch requests closed due to an overfull buffer.
+
+For the beta version, we have further options if they turn out to be necessary:
+1. compare the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on,
+and wait until the difference between the RVs is < 1000 (the buffer size). We could do that even before sending the initial events.
+If the difference is greater than that, there is no point in continuing, since the buffer could fill up before we receive an event with the expected RV
+(and that is assuming every update was for the resource the watch request was opened for, which seems unlikely).
+If the watchCache is unable to catch up to the bookmarkAfterResourceVersion within some timeout, we hard-close the current connection (end it by tearing down the underlying TCP connection) so that the client re-connects to a different API server with a more up-to-date cache.
+Taking the baseline etcd performance numbers into account, waiting for 10 seconds would allow us to receive ~5K events, assuming ~500 QPS throughput (see https://etcd.io/docs/v3.4/op-guide/performance/).
+Once we are past this step (we know the difference is smaller) and the buffer fills up, we handle two cases (see the sketch after this list):
+   - case-1: we do not close the connection immediately if the bookmark event with the expected RV already exists in the buffer.
+     In that case, we deliver the initial events, any other buffered events whose RVs are <= bookmarkAfterResourceVersion, and finally the bookmark event, and only then do we soft-close (simply end the current connection without tearing down the TCP connection) the connection.
+     An informer will reconnect with the RV from the bookmark event.
+     Note that any new event received while the buffer was full is ignored.
+
+   - case-2: we soft-close the connection if the bookmark event with the expected RV for some reason doesn't exist in the buffer.
+     An informer will reconnect, arriving at the step that compares the RVs first.
+2. make the buffer dynamic, especially when the difference between the RVs is > 1000
+3. inject new events directly into the initial list, i.e. have the initial-list loop consume the channel directly instead of waiting for the whole initial list to be processed first
+4. cap the size of the buffer (e.g. do not allocate more than X MB of memory)
+5. maybe even apply some compression techniques to the buffer (for example, by storing only a low-memory shallow reference and taking the actual objects for the event from the store)
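+
+Below is a minimal, non-normative Go sketch of option 1. The names (`conn`, `watchEvent`, `waitForFreshCache`, `onBufferFull`, `bufferSize`) are hypothetical and do not refer to the actual cacher code; the sketch only illustrates the wait/hard-close step and the case-1/case-2 handling.
+
+```go
+// Non-normative sketch of the "beta option 1" flow described above.
+package sketch
+
+import "time"
+
+const bufferSize = 1000 // assumed cacheWatcher buffer size
+
+type watchEvent struct {
+	rv         uint64
+	isBookmark bool
+}
+
+// conn abstracts the two ways of ending a watch discussed above.
+type conn interface {
+	send(e watchEvent)
+	softClose() // end the watch only; the client re-connects with a newer RV
+	hardClose() // tear down the TCP connection; the client picks another API server
+}
+
+// waitForFreshCache polls the watchCache RV until it is within bufferSize of
+// bookmarkAfterRV, or hard-closes after the timeout (~10s buys ~5K events at
+// ~500 QPS per the etcd performance numbers cited above).
+func waitForFreshCache(c conn, currentRV func() uint64, bookmarkAfterRV uint64, timeout time.Duration) bool {
+	deadline := time.Now().Add(timeout)
+	for time.Now().Before(deadline) {
+		if bookmarkAfterRV < currentRV()+bufferSize {
+			return true // safe to start sending the initial events
+		}
+		time.Sleep(100 * time.Millisecond)
+	}
+	c.hardClose()
+	return false
+}
+
+// onBufferFull implements case-1/case-2: if a bookmark with the expected RV is
+// already buffered, deliver the buffered events up to bookmarkAfterRV and then
+// the bookmark before soft-closing; otherwise soft-close immediately.
+func onBufferFull(c conn, buffered []watchEvent, bookmarkAfterRV uint64) {
+	var bookmark *watchEvent
+	for i := range buffered {
+		if buffered[i].isBookmark && buffered[i].rv >= bookmarkAfterRV {
+			bookmark = &buffered[i]
+			break
+		}
+	}
+	if bookmark != nil {
+		for _, e := range buffered {
+			if !e.isBookmark && e.rv <= bookmarkAfterRV {
+				c.send(e)
+			}
+		}
+		c.send(*bookmark) // finally, the bookmark the informer resumes from
+	}
+	c.softClose()
+}
+```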
+
 Note: The RV is effectively a global counter that is incremented every time an object is updated.
 This imposes a global order of events.
 It is equivalent to a LIST followed by a WATCH request.

-Note 2: Given the above, there is a possibility of sending initial data taking too long and failing the given time budget.
-In that case, the cacheWatcher will be closed and the client (reflector) would have to reconnect.
-
-Note 3: Currently, there is a timeout for LIST requests of 60s.
+Note 2: Currently, there is a timeout for LIST requests of 60s.
 That means a slow reflector might fail synchronization as well and would have to re-establish the connection.

 Step 2c: How bookmarks are delivered to the cacheWatcher?
@@ -806,7 +830,9 @@ Pick one more of these and delete the rest.
 -->
 - [ ] Metrics
-  - Metric name:
+  - Metric name: apiserver_cache_watcher_buffer_length (histogram: the observed buffer size; see the sketch below)
+  - Metric name: apiserver_watch_cache_lag (histogram: how far the cache is behind the expected RV)
+  - Metric name: apiserver_terminated_watchers_total (counter: already defined; needs to be extended with an attribute so that we can count watch requests closed due to an overfull buffer in the new mode)
   - [Optional] Aggregation method:
   - Components exposing the metric:
 - [ ] Other (treat as last resort)
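+
+For illustration only, a rough sketch of how the first two metrics above could be declared with `k8s.io/component-base/metrics`; the variable names and bucket values are placeholders, not part of this proposal.
+
+```go
+// Non-normative sketch of the proposed histograms.
+package cachermetrics
+
+import (
+	basemetrics "k8s.io/component-base/metrics"
+	"k8s.io/component-base/metrics/legacyregistry"
+)
+
+var (
+	// Proposed apiserver_cache_watcher_buffer_length histogram.
+	cacheWatcherBufferLength = basemetrics.NewHistogram(
+		&basemetrics.HistogramOpts{
+			Namespace:      "apiserver",
+			Name:           "cache_watcher_buffer_length",
+			Help:           "Observed size of the cacheWatcher buffer.",
+			Buckets:        []float64{10, 100, 500, 1000},
+			StabilityLevel: basemetrics.ALPHA,
+		},
+	)
+
+	// Proposed apiserver_watch_cache_lag histogram.
+	watchCacheLag = basemetrics.NewHistogram(
+		&basemetrics.HistogramOpts{
+			Namespace:      "apiserver",
+			Name:           "watch_cache_lag",
+			Help:           "Difference between the expected and the current resource version of the watch cache.",
+			Buckets:        []float64{100, 500, 1000, 5000},
+			StabilityLevel: basemetrics.ALPHA,
+		},
+	)
+)
+
+func init() {
+	legacyregistry.MustRegister(cacheWatcherBufferLength, watchCacheLag)
+}
+```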