
KEP 4568: Resilient watchcache initialization #4557

Merged (1 commit, Apr 22, 2024)

Conversation

wojtek-t
Member

/sig api-machinery

@deads2k @jpbetz

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 21, 2024
@k8s-ci-robot k8s-ci-robot requested review from apelisse and sttts March 21, 2024 19:09
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 21, 2024

## Design Details


@wojtek-t wojtek-t assigned wojtek-t, jpbetz and deads2k and unassigned wojtek-t Mar 22, 2024
@wojtek-t wojtek-t force-pushed the robust_watchcache branch 2 times, most recently from e93a990 to 17dc74b Compare March 22, 2024 11:34
For watch requests, unless it's a consistent watch request (setting RV=""), which
isn't really used by any of our core components, the request is always handled by the
watchcache. For all such requests, the watch hangs until the watchcache finishes
initialiation. The problem here is that for clusters with a large amount of data
Member

Suggested change
initialiation. The problem here is that for clusters with a large amount of data
initialization. The problem here is that for clusters with a large amount of data

initialiation. The problem here is that for clusters with a large amount of data
of a given type, initialiation (being effectively a list from etcd) can take
Member

Suggested change
of a given type, initialiation (being effectively a list from etcd) can take
of a given type, initialization (being effectively a list from etcd) can take

requests are not sent to it until kube-apiserver becomes ready. The above will allow us
to delay this moment until watchcache is initialized.

Why PostStartHook instead of a /readyz check? The watchcache is a per-type layer and we want
Member

Technically a PostStartHook is a readyz check. It's just a readyz check that is gated on initialization parameters. But yeah, your solution makes intuitive sense to me, when I was reading your problem statement, I was just wondering why we don't just create a readiness check for this.

Member Author

I was just wondering why we don't just create a readiness check for this

You're asking why not a readyz check instead of a PostStartHook? That's exactly what I'm trying to explain in this paragraph.

Member

Yes, I realize that, I was (in a roundabout way) agreeing with your reasoning here.


Finally, we suggest starting just with builtin resources, as CRDs can be created at
any time, making it harder to implement with unclear gain. If it appears to not be
enough, this decision can easily be revisited in the future.
Contributor

I agree, I don't think we need to include CRDs. Being selective about which resources we require this for is probably a good idea in general. All builtins seem like a good starting point. If in the future we found that we only need a subset of builtins, would limiting which builtins we wait for be a problem?

Member Author

No - that seems easy to adjust later too. Added to the text.

no matter if they are served from etcd or from cache (the latency would be
different in those cases though, so the actual seat-seconds cost too) and
finally given we already did some work to process this request, we would
simply delegate those requests to etcd.
Contributor

I really like how this KEP, combined with list-watch and consistent reads from cache all seem to move the system in a direction where the actual cost of GET requests will have less expected variation than they do today.

However, that would also have negative consequences, e.g. by slowing down
kube-apiserver initialization (kube-apiserver on startup needs to initialize
its own informers via the loopback client, and by rejecting all lists until the
watchcache is initialized, we would effectively block that initialization).
Contributor

List-Watch would return 429? Do we benefit from getting list-watch enabled-by-default before enabling this enhancement by default?

Member Author

Yes - ListWatch is effectively a regular Watch call underneath (with some specific params, but handled by the Watch() method). So yes - it will also return 429.

I don't think enabling list-watch first effectively gives us anything. Given that we want the behavior to be the same for watches and lists, the switch to list-watch should be no-op from that perspective.

gate, to allow for disabling them if needed.

However, given we're not changing the API itself, we propose to go directly
to Beta, as Alpha will not give us any feedback.
Contributor

cc @deads2k

I haven't put much thought into what the criteria should be for skipping alpha. This would be on-by-default I'm assuming?

Clearly all Beta PRR criteria would need to be met.

Member Author

This would be on-by-default I'm assuming?

Yes

Clearly all Beta PRR criteria would need to be met.

of course - whole PRR still requires filling in

Regarding the main comment - for me the Alpha stage makes sense for things that change the API [e.g. to enable rollback]. For purely in-memory changes (like this one), alpha is a no-op and no one uses it, so we effectively get zero benefit from it.

Contributor

I support going straight to beta for this. I consider startup readiness something we should be able to refine w/o going through all the stability levels.

Member

I also support direct beta here, this is a reliability improvement and it's silly to gate this behind an alpha stage.

Comment on lines 191 to 192
While we don't may decide to adjust the approach here, we will start with
the approach that we will delegate to etcd the requests that (a) are not
Contributor

Suggested change
While we don't may decide to adjust the approach here, we will start with
the approach that we will delegate to etcd the requests that (a) are not
While we didn't decide to reject _all_ list requests with 429 here, we will start with
an approach where we only delegate to etcd the requests that (a) are not

(wording suggestion only.. please only accept if this matches what was intended)

Member Author

it matches - thanks for clarifying
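As a rough illustration of the behavior agreed on above, here is a sketch of the dispatch decision. All names are placeholders and the exact delegation conditions are truncated in this excerpt (the text stops after "(a) are not"), so `servableByEtcd` stands in for whatever the real predicate is; this is not the actual cacher code.

```go
package main

import "fmt"

// request is a toy stand-in for an apiserver request; servableByEtcd
// is a placeholder for the KEP's real delegation conditions, which
// are truncated in this excerpt.
type request struct {
	verb           string // "list" or "watch"
	servableByEtcd bool   // e.g. a list that etcd can answer directly
}

// dispatch sketches the proposal: while the watchcache is still
// initializing, delegate etcd-servable lists to etcd and reject the
// rest with 429 (+ Retry-After) so clients back off instead of
// hanging; once the cache is ready, serve everything as usual.
func dispatch(cacheReady bool, r request) string {
	if cacheReady {
		return "serve from watchcache"
	}
	if r.verb == "list" && r.servableByEtcd {
		return "delegate to etcd"
	}
	return "reject with 429 + Retry-After"
}

func main() {
	fmt.Println(dispatch(false, request{verb: "list", servableByEtcd: true}))
	fmt.Println(dispatch(false, request{verb: "watch"}))
	fmt.Println(dispatch(true, request{verb: "watch"}))
}
```

The key property is that no request hangs on an uninitialized cache: it either makes progress against etcd or fails fast with a retryable status.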


Why PostStartHook instead of a /readyz check? The watchcache is a per-type layer and we want
to avoid marking the whole kube-apiserver as not-ready if the watchcache for one of the
resources requires reinitialization. We will handle those cases differently.
Contributor

Just making sure I understand. Would the first initialization of each builtin type be /readyz gated but any re-initialization be post- /readyz (but still result in 429 responses)?

Member Author

The way a poststarthook works is effectively:

  1. it is started once:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L194-L206
  2. it is reported as healthy as soon as it succeeds for the first time:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L238-L245
    [and it never changes back to false later]

So what will happen is effectively:

  • the first initialization of each builtin type will prevent readyz from succeeding [initialization of kube-apiserver]
  • even if the watchcache needs to reinitialize later, it will not bring /readyz back to failing, but as you wrote, requests [exactly which ones is described below] will be returning 429

Tried to clarify in the text.
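The lifecycle described above (run once, healthy after the first success, never reverting) can be sketched as follows. This is a minimal illustration of the semantics, not the actual k8s.io/apiserver hooks code linked in the comment.

```go
package main

import (
	"fmt"
	"sync"
)

// postStartHook mimics the lifecycle described above: the hook body
// runs once at server startup, and the associated health check reports
// healthy as soon as the body succeeds for the first time. It never
// flips back to unhealthy afterwards.
type postStartHook struct {
	mu   sync.Mutex
	done bool
}

// Run executes the hook body once; on success the hook is marked done.
func (h *postStartHook) Run(body func() error) error {
	if err := body(); err != nil {
		return err
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.done = true
	return nil
}

// Healthy is the /readyz-style check backed by this hook.
func (h *postStartHook) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.done
}

func main() {
	h := &postStartHook{}
	fmt.Println(h.Healthy()) // false: watchcache not yet initialized
	_ = h.Run(func() error { return nil }) // e.g. wait for the initial cache sync
	fmt.Println(h.Healthy()) // true, and it stays true even if the cache later reinitializes
}
```

This matches the two bullets above: the first initialization gates readiness, while later reinitializations are invisible to /readyz and are surfaced to clients as 429s instead.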

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 17dc74b to 5912cf6 Compare April 4, 2024 11:42
@wojtek-t wojtek-t changed the title [WIP] KEP XXXX: Resilient watchcache initialization [WIP] KEP 4568: Resilient watchcache initialization Apr 4, 2024
@wojtek-t wojtek-t left a comment

@jpbetz - PTAL

[still requires filling in PRR, but the rest should be more-or-less ready]

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 5912cf6 to 9f6602d Compare April 4, 2024 12:02
@wojtek-t wojtek-t changed the title [WIP] KEP 4568: Resilient watchcache initialization KEP 4568: Resilient watchcache initialization Apr 4, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2024
@jpbetz
Contributor

jpbetz commented Apr 4, 2024

This LGTM

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 9f6602d to 3c09646 Compare April 4, 2024 14:49
@wojtek-t
Member Author

wojtek-t commented Apr 4, 2024

This LGTM

Thanks - I switched to implementable and added prod-readiness file.

@jpbetz jpbetz left a comment

PRR looks straightforward. One suggestion then LGTM.

Comment on lines 343 to 344
- kube-apiserver /readyz behaving visibly differently than before
- `apiserver_request_total` (especially watch for requests finished with 429 code)
Contributor

Be more prescriptive here to make it extra obvious to a cluster admin what to watch for? E.g. "If an apiserver continues to respond to /readyz requests with X significantly after startup should have completed ..."

Member Author

PTAL

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 3c09646 to 1d35f78 Compare April 5, 2024 09:37
its own informers by loop-client, and by rejecting all lists until watchcache
is initialized, we would effectively block that initialization).

While we didn't decide to reject _all_ list requests with 429, we will start with
Contributor

could this be a "stick" to have people start using streaming lists? Streaming lists always allowed, but regular lists rejected?

Member Author

No. Streaming lists will never be allowed here as they are served from watchcache - so if the watchcache is not initialized, we will not be able to serve them at all.

gate, to allow for disabling them if needed.

However, given we're not changing the API itself, we propose to go directly
to Beta, as Alpha will not give us any feedback.
Contributor

Alpha gives the interested parties a chance to test this without impacting default installs (still often used in production).

Member Author

Realistically, providers are not enabling Alpha. There are users who enable Alpha, because they want a given feature, but for changes like this (effectively not being a feature), I doubt anyone would enable it.

That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?

Contributor

That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?

I'm against going directly to beta. I think we should produce an alpha that we can use to see if the feature achieves its goals before we enable such a change on all current default clusters (commonly used in production).

Member Author

Updated the proposal based on yesterday's discussion on SIG arch:

  • two separate feature-gates
  • both going directly to beta
  • one of them disabled by default, second enabled by default
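A sketch of what that split might look like in feature-gate terms. The gate names below follow the KEP's theme but are illustrative; the excerpt does not spell out the final identifiers.

```go
package main

import "fmt"

// Two independent switches, both introduced directly at Beta as the
// comment above describes; the names are illustrative placeholders.
var defaultFeatureGates = map[string]bool{
	// guards the new post-start hook (riskier: a stuck hook could
	// block kube-apiserver readiness), so it starts disabled
	"WatchCacheInitializationPostStartHook": false,
	// guards the 429/delegate handling for an uninitialized
	// watchcache, enabled from the very beginning
	"ResilientWatchCacheInitialization": true,
}

func enabled(gate string) bool { return defaultFeatureGates[gate] }

func main() {
	fmt.Println(enabled("WatchCacheInitializationPostStartHook")) // false
	fmt.Println(enabled("ResilientWatchCacheInitialization"))     // true
}
```

Splitting the gates lets the lower-risk request-handling change ship enabled while the riskier readiness change can be disabled independently if it misbehaves.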

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 1d35f78 to 5f631d1 Compare April 11, 2024 08:22
@wojtek-t wojtek-t left a comment

@deads2k - thanks for comments, PTAL

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 5f631d1 to b153041 Compare April 19, 2024 11:09
@wojtek-t wojtek-t force-pushed the robust_watchcache branch from b153041 to 51c92a6 Compare April 19, 2024 11:11
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the proposal based on discussion yesterday, PTAL

Comment on lines +234 to +241
However, we admit that the risk in both cases is different:
- for the new post-start hook, there is a risk of kube-apiserver not initializing,
thus we will start with Beta disabled by-default
- for the changes to handling requests when watchcache is uninitialized, we
believe the risk is visibly lower (and 429 errors were already returned
before, even in small clusters where priority-levels are small enough
to often admit only 1-2 inflight requests at once anyway) and we will
enable it by-default from the very beginning
Contributor

lgtm, thanks.

@deads2k
Contributor

deads2k commented Apr 19, 2024

Thanks for the enablement updates. This lgtm from the sig perspective. I'll leave PRR and approve with @jpbetz who has also reviewed already.

/lgtm
/assign @jpbetz

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@jpbetz
Contributor

jpbetz commented Apr 22, 2024

/approve for PRR (feature gates and metrics are the key things I was looking for here. Using guidance from kubernetes/community#7828 for beta stability level)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit 15e9464 into kubernetes:master Apr 22, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 22, 2024