KEP 4568: Resilient watchcache initialization #4557
Conversation
> ## Design Details
The POC: kubernetes/kubernetes#124029
Force-pushed from e93a990 to 17dc74b
> For watch requests, unless it's a consistent watch request (setting RV="") - which
> isn't really used by none of our core components, it is always handled by the
> watchcache. For all such requests, watch is hanging until watchcache finishes
> initialiation. The problem here is that for clusters with a large amount of data
Suggested change: "initialiation" → "initialization".
> of a given type, initialiation (being effectively a list from etcd) can take
Suggested change: "initialiation" → "initialization".
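For context on the quoted KEP text: it is the watch request's ResourceVersion that decides whether the watchcache can serve it. Below is a minimal, illustrative client-go sketch (the namespace and the pattern of listing first are assumptions for the example, not something stated in the KEP):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// startPodWatch lists pods and then watches from the returned ResourceVersion.
// A watch started this way is the common case served by the apiserver's
// watchcache; a watch with an empty ResourceVersion (a "consistent" watch) is
// the rare case that bypasses the cache and, per the KEP text, is not really
// used by core components.
func startPodWatch(ctx context.Context, client kubernetes.Interface) (watch.Interface, error) {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return client.CoreV1().Pods("kube-system").Watch(ctx, metav1.ListOptions{
		ResourceVersion: pods.ResourceVersion, // resume from this snapshot
	})
}
```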
> requests are not sent to it until kube-apiserver becomes ready. The above will allow us
> to delay this moment until watchcache is initialized.
>
> Why PostStartHook instead of /readyz check? Watchcache is per-type layer and we want
Technically a PostStartHook is a readyz check. It's just a readyz check that is gated on initialization parameters. But yeah, your solution makes intuitive sense to me; when I was reading your problem statement, I was just wondering why we don't just create a readiness check for this.
> I was just wondering why we don't just create a readiness check for this

You're asking why not a readyz check, and a PostStartHook instead? That's exactly what I'm trying to explain in this paragraph.
Yes, I realize that; I was (in a roundabout way) agreeing with your reasoning here.
> Finally, we suggest starting just with builtin resources, as CRDs can be created at
> any time, making it harder to implement with unclear gain. If it appears to not be
> enough, this decision can easily be revisited in the future.
I agree, I don't think we need to include CRDs. Being selective about which resources we require this for is probably a good idea in general. All builtins seem like a good starting point. If in the future we find that we only need a subset of builtins, would limiting which builtins we wait for be a problem?
No - that seems easy to adjust later too. Added to the text.
> no matter if they are served from etcd or from cache (the latency would be
> different in those cases though, so the actual seat-seconds cost too) and
> finally given we already did some work to process this request, we would
> simply delegate those requests to etcd.
I really like how this KEP, combined with list-watch and consistent reads from cache, seems to move the system in a direction where the actual cost of GET requests will have less expected variation than it does today.
> However, that would also have a negative consequences, e.g. by slowing down
> kube-apiserver initialization (kube-apiserver on startup need to initialize
> its own informers by loop-client, and by rejecting all lists until watchcache
> is initialized, we would effectively block that initialization).
List-Watch would return 429? Do we benefit from getting list-watch enabled-by-default before enabling this enhancement by default?
Yes - ListWatch is effectively a regular Watch call underneath (with some specific params, but handled by the Watch() method). So yes - it will also return 429.

I don't think enabling list-watch first effectively gives us anything. Given that we want the behavior to be the same for watches and lists, the switch to list-watch should be a no-op from that perspective.
> gate, to allow for disabling them if needed.
>
> However, given we're not changing the API itself, we propose to go directly
> to Beta, as Alpha will not give us any feedback.
cc @deads2k
I haven't put much thought into what the criteria should be for skipping alpha. This would be on-by-default I'm assuming?
Clearly all Beta PRR criteria would need to be met.
> This would be on-by-default I'm assuming?

Yes

> Clearly all Beta PRR criteria would need to be met.

Of course - the whole PRR still requires filling in.

Regarding the main comment - for me the Alpha stage makes sense for things that change the API [e.g. to enable rollback]. For purely in-memory changes (like this one), alpha is a no-op and no one uses it, so we effectively have zero benefit from it.
I support going straight to beta for this. I consider startup readiness something we should be able to refine w/o going through all the stability levels.
I also support direct beta here, this is a reliability improvement and it's silly to gate this behind an alpha stage.
> While we don't may decide to adjust the approach here, we will start with
> the approach that we will delegate to etcd the requests that (a) are not
Suggested change:

While we didn't decide to reject _all_ list requests with 429 here, we will start with
an approach where we only delegate to etcd the requests that (a) are not

(wording suggestion only; please only accept if this matches what was intended)
it matches - thanks for clarifying
> Why PostStartHook instead of /readyz check? Watchcache is per-type layer and we want
> to avoid marking whole kube-apiserver as not-ready if watchcache for one of resources
> requires reinitialization. We will handle those cases differently.
Just making sure I understand. Would the first initialization of each builtin type be /readyz gated but any re-initialization be post- /readyz (but still result in 429 responses)?
The way a poststarthook works is effectively:
- it is started once:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L194-L206
- it's reported as healthy as soon as it succeeds for the first time:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L238-L245
[and it never changes back to false later]

So what will happen is effectively:
- the first initialization of each builtin type will prevent readyz from succeeding [initialization of kube-apiserver]
- even if watchcache needs to reinitialize later, it will not bring /readyz back to failing, but as you wrote, requests [exactly which ones is described below] will be returning 429

Tried to clarify in the text.
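To make that lifecycle concrete, here is a minimal sketch of what such a hook could look like. The hook name and the cachesSynced channel are assumptions for illustration, not the KEP's actual code; AddPostStartHook and PostStartHookContext are the existing k8s.io/apiserver APIs linked above.

```go
package main

import (
	genericapiserver "k8s.io/apiserver/pkg/server"
)

// registerWatchCacheReadinessHook wires a post-start hook that blocks until
// some "all watchcaches synced" signal fires. The hook runs once; the derived
// poststarthook/... readyz check turns healthy the first time the hook
// returns nil and never flips back, matching the behavior described above.
func registerWatchCacheReadinessHook(s *genericapiserver.GenericAPIServer, cachesSynced <-chan struct{}) error {
	return s.AddPostStartHook("watchcache-initialization", func(ctx genericapiserver.PostStartHookContext) error {
		select {
		case <-cachesSynced:
			// First (and only) success: /readyz stops blocking on this hook.
			return nil
		case <-ctx.StopCh:
			// Server is shutting down; give up without failing the hook.
			return nil
		}
	})
}
```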
Force-pushed from 17dc74b to 5912cf6
@jpbetz - PTAL
[still requires filling in PRR, but the rest should be more-or-less ready]
Force-pushed from 5912cf6 to 9f6602d
This LGTM
Force-pushed from 9f6602d to 3c09646
Thanks - I switched to implementable and added the prod-readiness file.
PRR looks straightforward. One suggestion then LGTM.
> - kubeapiserver /readyz behaving visibly different than before
> - `apiserver_request_total` (especially watch for requests finished with 429 code)
Be more prescriptive here to make it extra obvious to a cluster admin what to watch for? E.g. "If an apiserver continues to respond to /readyz requests with X significantly after startup should have completed ..."
PTAL
Force-pushed from 3c09646 to 1d35f78
> its own informers by loop-client, and by rejecting all lists until watchcache
> is initialized, we would effectively block that initialization).
>
> While we didn't decide to reject _all_ list requests with 429, we will start with
could this be a "stick" to have people start using streaming lists? Streaming lists always allowed, but regular lists rejected?
No. Streaming lists will never be allowed here as they are served from watchcache - so if the watchcache is not initialized, we will not be able to serve them at all.
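A rough sketch of the routing discussed in this thread while the watchcache is (re)initializing. The helper names and the exact split below are illustrative assumptions, not the real cacher code; the precise delegation criteria are the ones spelled out in the KEP text.

```go
package main

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// handleWhileInitializing rejects requests that only the cache can answer and
// delegates the rest to etcd, instead of hanging until initialization ends.
func handleWhileInitializing(cacheReady bool, opts metav1.ListOptions) error {
	if cacheReady {
		return serveFromCache(opts)
	}
	if opts.Watch {
		// Covers regular watches and streaming lists ("list-watch" sets
		// SendInitialEvents but is still handled by the Watch path): both
		// get a 429 with a Retry-After hint while the cache is not ready.
		return apierrors.NewTooManyRequests("storage is (re)initializing", 1)
	}
	// Plain lists can still be answered, just more expensively, from etcd.
	return delegateToEtcd(opts)
}

// Stubs so the sketch compiles; the real logic lives in the apiserver's
// cacher and etcd3 storage layers.
func serveFromCache(metav1.ListOptions) error { return nil }

func delegateToEtcd(metav1.ListOptions) error { return nil }
```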
> gate, to allow for disabling them if needed.
>
> However, given we're not changing the API itself, we propose to go directly
> to Beta, as Alpha will not give us any feedback.
Alpha gives the interested parties a chance to test this without impacting default installs (still often used in production).
Realistically, providers are not enabling Alpha. There are users who enable Alpha because they want a given feature, but for changes like this (effectively not being a feature), I doubt anyone would enable it.

That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?
> That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?

I'm against going directly to beta. I think we should produce an alpha that we can use to see if the feature achieves its goals before we enable such a change on all current default clusters (commonly used in production).
Updated the proposal based on yesterday's discussion at SIG arch:
- two separate feature-gates
- both going directly to beta
- one of them disabled by default, the second enabled by default
Force-pushed from 1d35f78 to 5f631d1
@deads2k - thanks for comments, PTAL
Force-pushed from 5f631d1 to b153041

Force-pushed from b153041 to 51c92a6
I updated the proposal based on discussion yesterday, PTAL
> However, we admit that the risk in both cases is different:
> - for the new post-start hook, there is a risk of kube-apiserver not initializing,
>   thus we will start with Beta disabled by-default
> - for the changes to handing requests when watchcache is uninitialized, we
>   believe the risk is visibly lower (and 429 errors were already returned
>   before, even in small clusters where priority-levels are small enough
>   to often admit only 1-2 inflight requests at once anyway) and we will
>   enable it by-default from the very beginning
lgtm, thanks.
/approve for PRR (feature gates and metrics are the key things I was looking for here. Using guidance from kubernetes/community#7828 for beta stability level)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, wojtek-t
/sig api-machinery
@deads2k @jpbetz