
KEP 4568: Resilient watchcache initialization #4557

Merged (1 commit, Apr 22, 2024)

Conversation

wojtek-t
Member

/sig api-machinery

@deads2k @jpbetz

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 21, 2024
@k8s-ci-robot k8s-ci-robot requested review from apelisse and sttts March 21, 2024 19:09
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 21, 2024

## Design Details


@wojtek-t wojtek-t assigned wojtek-t, jpbetz and deads2k and unassigned wojtek-t Mar 22, 2024
@wojtek-t wojtek-t force-pushed the robust_watchcache branch 2 times, most recently from e93a990 to 17dc74b Compare March 22, 2024 11:34
For watch requests, unless it's a consistent watch request (setting RV=""), which
isn't really used by any of our core components, the request is always handled by the
watchcache. For all such requests, the watch hangs until the watchcache finishes
initialiation. The problem here is that for clusters with a large amount of data
Member

Suggested change
initialiation. The problem here is that for clusters with a large amount of data
initialization. The problem here is that for clusters with a large amount of data

initialiation. The problem here is that for clusters with a large amount of data
of a given type, initialiation (being effectively a list from etcd) can take
Member

Suggested change
of a given type, initialiation (being effectively a list from etcd) can take
of a given type, initialization (being effectively a list from etcd) can take

requests are not sent to it until kube-apiserver becomes ready. The above will allow us
to delay this moment until watchcache is initialized.

Why PostStartHook instead of a /readyz check? The watchcache is a per-type layer and we want
Member

Technically a PostStartHook is a readyz check. It's just a readyz check that is gated on initialization parameters. But yeah, your solution makes intuitive sense to me, when I was reading your problem statement, I was just wondering why we don't just create a readiness check for this.

Member Author

I was just wondering why we don't just create a readiness check for this

You're asking why not a readyz check instead of a PostStartHook? That's exactly what I'm trying to explain in this paragraph.

Member

Yes, I realize that, I was (in a roundabout way) agreeing with your reasoning here.


Finally, we suggest starting just with builtin resources, as CRDs can be created at
any time, making it harder to implement with unclear gain. If it appears to not be
enough, this decision can easily be revisited in the future.
Contributor

I agree, I don't think we need to include CRDs. Being selective about which resources we require this for is probably a good idea in general. All builtins seem like a good starting point. If in the future we found that we only need a subset of builtins, would limiting which builtins we wait for be a problem?

Member Author

No - that seems easy to adjust later too. Added to the text.

no matter if they are served from etcd or from cache (the latency would be
different in those cases though, so the actual seat-seconds cost too) and
finally given we already did some work to process this request, we would
simply delegate those requests to etcd.
Contributor

I really like how this KEP, combined with list-watch and consistent reads from cache all seem to move the system in a direction where the actual cost of GET requests will have less expected variation than they do today.

However, that would also have negative consequences, e.g. by slowing down
kube-apiserver initialization (kube-apiserver on startup needs to initialize
its own informers via the loopback client, and by rejecting all lists until the
watchcache is initialized, we would effectively block that initialization).
Contributor

List-Watch would return 429? Do we benefit from getting list-watch enabled-by-default before enabling this enhancement by default?

Member Author

Yes - ListWatch is effectively a regular Watch call underneath (with some specific params, but handled by the Watch() method). So yes - it will also return 429.

I don't think enabling list-watch first effectively gives us anything. Given that we want the behavior to be the same for watches and lists, the switch to list-watch should be no-op from that perspective.

gate, to allow for disabling them if needed.

However, given we're not changing the API itself, we propose to go directly
to Beta, as Alpha will not give us any feedback.
Contributor

cc @deads2k

I haven't put much thought into what the criteria should be for skipping alpha. This would be on-by-default I'm assuming?

Clearly all Beta PRR criteria would need to be met.

Member Author

This would be on-by-default I'm assuming?

Yes

Clearly all Beta PRR criteria would need to be met.

of course - whole PRR still requires filling in

Regarding the main comment - for me the Alpha stage makes sense for things that change the API [e.g. to enable rollback]. For purely in-memory changes (like this one), alpha is a no-op and no one uses it, so we effectively get zero benefit from it.

Contributor

I support going straight to beta for this. I consider startup readiness something we should be able to refine w/o going through all the stability levels.

Member

I also support direct beta here, this is a reliability improvement and it's silly to gate this behind an alpha stage.

Comment on lines 191 to 192
While we don't may decide to adjust the approach here, we will start with
the approach that we will delegate to etcd the requests that (a) are not
Contributor

Suggested change
While we don't may decide to adjust the approach here, we will start with
the approach that we will delegate to etcd the requests that (a) are not
While we didn't decide to reject _all_ list requests with 429 here, we will start with
an approach where we only delegate to etcd the requests that (a) are not

(wording suggestion only.. please only accept if this matches what was intended)

Member Author

it matches - thanks for clarifying
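As a rough illustration of the behavior agreed on above, here is a sketch of the dispatch decision. All names are placeholders and the exact delegation conditions are truncated in this excerpt (the text stops after "(a) are not"), so `servableByEtcd` stands in for whatever the real predicate is; this is not the actual cacher code.

```go
package main

import "fmt"

// request is a toy stand-in for an apiserver request; servableByEtcd
// is a placeholder for the KEP's real delegation conditions, which
// are truncated in this excerpt.
type request struct {
	verb           string // "list" or "watch"
	servableByEtcd bool   // e.g. a list that etcd can answer directly
}

// dispatch sketches the proposal: while the watchcache is still
// initializing, delegate etcd-servable lists to etcd and reject the
// rest with 429 (+ Retry-After) so clients back off instead of
// hanging; once the cache is ready, serve everything as usual.
func dispatch(cacheReady bool, r request) string {
	if cacheReady {
		return "serve from watchcache"
	}
	if r.verb == "list" && r.servableByEtcd {
		return "delegate to etcd"
	}
	return "reject with 429 + Retry-After"
}

func main() {
	fmt.Println(dispatch(false, request{verb: "list", servableByEtcd: true}))
	fmt.Println(dispatch(false, request{verb: "watch"}))
	fmt.Println(dispatch(true, request{verb: "watch"}))
}
```

The key property is that no request hangs on an uninitialized cache: it either makes progress against etcd or fails fast with a retryable status.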


Why PostStartHook instead of a /readyz check? The watchcache is a per-type layer and we want
to avoid marking the whole kube-apiserver as not-ready if the watchcache for one of the
resources requires reinitialization. We will handle those cases differently.
Contributor

Just making sure I understand. Would the first initialization of each builtin type be /readyz gated but any re-initialization be post- /readyz (but still result in 429 responses)?

Member Author

The way a poststarthook works is effectively:

  1. it is started once:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L194-L206
  2. it is reported as healthy as soon as it succeeds for the first time:
    https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/hooks.go#L238-L245
    [and it never changes back to false later]

So what will happen is effectively:

  • the first initialization of each builtin type will prevent readyz from succeeding [initialization of kube-apiserver]
  • even if the watchcache needs to reinitialize later, it will not bring /readyz back to failing, but as you wrote, requests [exactly which ones is described below] will be returning 429

Tried to clarify in the text.
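The lifecycle described above (run once, healthy after the first success, never reverting) can be sketched as follows. This is a minimal illustration of the semantics, not the actual k8s.io/apiserver hooks code linked in the comment.

```go
package main

import (
	"fmt"
	"sync"
)

// postStartHook mimics the lifecycle described above: the hook body
// runs once at server startup, and the associated health check reports
// healthy as soon as the body succeeds for the first time. It never
// flips back to unhealthy afterwards.
type postStartHook struct {
	mu   sync.Mutex
	done bool
}

// Run executes the hook body once; on success the hook is marked done.
func (h *postStartHook) Run(body func() error) error {
	if err := body(); err != nil {
		return err
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.done = true
	return nil
}

// Healthy is the /readyz-style check backed by this hook.
func (h *postStartHook) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.done
}

func main() {
	h := &postStartHook{}
	fmt.Println(h.Healthy()) // false: watchcache not yet initialized
	_ = h.Run(func() error { return nil }) // e.g. wait for the initial cache sync
	fmt.Println(h.Healthy()) // true, and it stays true even if the cache later reinitializes
}
```

This matches the two bullets above: the first initialization gates readiness, while later reinitializations are invisible to /readyz and are surfaced to clients as 429s instead.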

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 17dc74b to 5912cf6 Compare April 4, 2024 11:42
@wojtek-t wojtek-t changed the title [WIP] KEP XXXX: Resilient watchcache initialization [WIP] KEP 4568: Resilient watchcache initialization Apr 4, 2024
@wojtek-t wojtek-t left a comment

@jpbetz - PTAL

[still requires filling in PRR, but the rest should be more-or-less ready]

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 5912cf6 to 9f6602d Compare April 4, 2024 12:02
@wojtek-t wojtek-t changed the title [WIP] KEP 4568: Resilient watchcache initialization KEP 4568: Resilient watchcache initialization Apr 4, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2024
@jpbetz
Contributor

jpbetz commented Apr 4, 2024

This LGTM

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 9f6602d to 3c09646 Compare April 4, 2024 14:49
@wojtek-t
Member Author

wojtek-t commented Apr 4, 2024

This LGTM

Thanks - I switched to implementable and added prod-readiness file.

@jpbetz jpbetz left a comment

PRR looks straightforward. One suggestion then LGTM.

Comment on lines 343 to 344
- kube-apiserver /readyz behaving visibly differently than before
- `apiserver_request_total` (especially watch for requests finished with 429 code)
Contributor

Be more prescriptive here to make it extra obvious to a cluster admin what to watch for? E.g. "If an apiserver continues to respond to /readyz requests with X significantly after startup should have completed ..."

Member Author

PTAL

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 3c09646 to 1d35f78 Compare April 5, 2024 09:37
its own informers by loop-client, and by rejecting all lists until watchcache
is initialized, we would effectively block that initialization).

While we didn't decide to reject _all_ list requests with 429, we will start with
Contributor

could this be a "stick" to have people start using streaming lists? Streaming lists always allowed, but regular lists rejected?

Member Author

No. Streaming lists will never be allowed here as they are served from watchcache - so if the watchcache is not initialized, we will not be able to serve them at all.

gate, to allow for disabling them if needed.

However, given we're not changing the API itself, we propose to go directly
to Beta, as Alpha will not give us any feedback.
Contributor

Alpha gives the interested parties a chance to test this without impacting default installs (still often used in production).

Member Author

Realistically, providers are not enabling Alpha. There are users who enable Alpha, because they want a given feature, but for changes like this (effectively not being a feature), I doubt anyone would enable it.

That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?

Contributor

That said - are you against the decision of going directly to beta, or was it more of a wording issue with the current text?

I'm against going directly to beta. I think we should produce an alpha that we can use to see if the feature achieves its goals before we enable such a change on all current default clusters (commonly used in production).

Member Author

Updated the proposal based on yesterday's discussion on SIG arch:

  • two separate feature-gates
  • both going directly to beta
  • one of them disabled by default, second enabled by default
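A sketch of what that split might look like in feature-gate terms. The gate names below follow the KEP's theme but are illustrative; the excerpt does not spell out the final identifiers.

```go
package main

import "fmt"

// Two independent switches, both introduced directly at Beta as the
// comment above describes; the names are illustrative placeholders.
var defaultFeatureGates = map[string]bool{
	// guards the new post-start hook (riskier: a stuck hook could
	// block kube-apiserver readiness), so it starts disabled
	"WatchCacheInitializationPostStartHook": false,
	// guards the 429/delegate handling for an uninitialized
	// watchcache, enabled from the very beginning
	"ResilientWatchCacheInitialization": true,
}

func enabled(gate string) bool { return defaultFeatureGates[gate] }

func main() {
	fmt.Println(enabled("WatchCacheInitializationPostStartHook")) // false
	fmt.Println(enabled("ResilientWatchCacheInitialization"))     // true
}
```

Splitting the gates lets the lower-risk request-handling change ship enabled while the riskier readiness change can be disabled independently if it misbehaves.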

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 1d35f78 to 5f631d1 Compare April 11, 2024 08:22
@wojtek-t wojtek-t left a comment

@deads2k - thanks for comments, PTAL

@wojtek-t wojtek-t force-pushed the robust_watchcache branch from 5f631d1 to b153041 Compare April 19, 2024 11:09
@wojtek-t wojtek-t force-pushed the robust_watchcache branch from b153041 to 51c92a6 Compare April 19, 2024 11:11
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the proposal based on discussion yesterday, PTAL

Comment on lines +234 to +241
However, we admit that the risk in both cases is different:
- for the new post-start hook, there is a risk of kube-apiserver not initializing,
thus we will start with Beta disabled by-default
- for the changes to handling requests when watchcache is uninitialized, we
believe the risk is visibly lower (and 429 errors were already returned
before, even in small clusters where priority-levels are small enough
to often admit only 1-2 inflight requests at once anyway) and we will
enable it by-default from the very beginning
Contributor

lgtm, thanks.

@deads2k
Contributor

deads2k commented Apr 19, 2024

Thanks for the enablement updates. This lgtm from the sig perspective. I'll leave PRR and approve with @jpbetz who has also reviewed already.

/lgtm
/assign @jpbetz

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@jpbetz
Contributor

jpbetz commented Apr 22, 2024

/approve for PRR (feature gates and metrics are the key things I was looking for here. Using guidance from kubernetes/community#7828 for beta stability level)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit 15e9464 into kubernetes:master Apr 22, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 22, 2024