add API support for controlling various timeouts #2463
Comments
@micahhausler you've mentioned that the kubelet bootstrap timeout is also something that you'd like to be configurable:
scoping this to local only seems correct, because this is waiting for local etcd members to join.
one problem here is that this function is based on retries * period: so a single timeout field requires internal assumptions and possibly deviating from the exact user-specified value. @micahhausler do you have someone that can take the action item to work on this for v1.22 and kubeadm v1beta3? cc @fabriziopandini |
Yep, I'll open a separate issue for that.
SGTM
I agree on not making the API too granular. Can we keep the existing retries/period values as defaults, but increase/decrease the retry count based on the supplied timeout? It also seems like
Yes, I'm good to take this one |
we could drop the max retries approach and just execute getClusterStatus() in a loop without delay between calls, counting the retries, until memberJoinTimeout passes or until getClusterStatus() returns success.
etcd.go needs cleanup / refactors, but getClusterStatus() is called in a few places, so i think we might want to avoid touching it for the time being and leave the changes on the side of the caller around this API change - in this case WaitForClusterAvailable(). |
we had a discussion with @fabriziopandini on the topic of timeouts in the kubeadm API today. instead we should have a "timeouts" sub-structure under InitConfiguration and JoinConfiguration that holds a number of different timeouts. so if we are going to make timeout related changes in v1beta3 we should make a broader change:
naming / field locations are just ideas at this point. open questions from me are:
|
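To make the nested-structure proposal concrete, a "timeouts" sub-structure under JoinConfiguration might look roughly like this sketch (field names and values here are illustrative placeholders taken from this discussion, not the final API):

```yaml
# Sketch only: names/values are placeholders, not the API that shipped.
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
timeouts:
  etcdMemberJoin: 40s
  kubeletHealthCheck: 4m0s
```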
/cc @wangyysde |
@micahhausler Are you start working on this? If not, I will to this. |
@wangyysde Feel free to take this one! |
@neolit123 Then, can I try to add timeouts.etcdMemberJoin to JoinConfiguration first? |
I think a single PR with multiple commits that covers all the changes is preferred.
|
/assign |
Hi @neolit123 @fabriziopandini I have created a PR (kubernetes/kubernetes#103226) to fix this issue. Could you review it?
|
i think it is more appropriate to have a single PR with multiple commits. |
OK, I will try to create a single PR with multiple commits for this issue. |
commented on the PR:
|
something else: i enumerated some timeouts that we might want above. for example, we might want to have etcd timeouts and a kubelet timeout during init too. |
Ok. I found there are some errors in the PR. So I think it has to be scheduled for v1beta4.
In fact, I agree with you. But I will try to add the etcd timeout and kubelet timeout to join only, as discussed above.
I want to prepare an API KEP for this. But I have never created an API KEP. So maybe I need some help. |
There is no need for a new KEP, i'd say. Just a HackMD or a Google doc where we can agree on the design. Once we agree, we can summarize the change in the kubeadm API KEP. |
I have prepared a document here: https://docs.google.com/document/d/1q0OLHSD6M0JPjN8PxgvpJX1726omQh_Qd3os4LQqrtI
@wangyysde i have added some comments in the doc. again, thanks a lot for looking into this. a couple of things to note about the design:
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/lifecycle frozen |
/unassign wangyysde |
Can I work on it? |
@chenk008 sure, you can ask @wangyysde what progress he made. |
I am trying to figure out who can own this for v1beta4; note that v1beta4 has been planned for the 1.29 release cycle. this does seem like something we can have for v1beta4, and there is a WIP PR for it: kubernetes/kubernetes#103226. I haven't seen any major problems with it, but we can leave some details for review. @chenk008 @wangyysde do you still want to work on it? If yes, pls assign it to yourself. Otherwise, we will find someone else to help (I am a candidate as well). |
@chendave i see you sent a new PR: having some sort of a PR for experiments is OK, but we are missing a status update on the latest proposal in doc form. there were some discussions around nested vs flat timeouts, but i have forgotten the details as this was from 2 years ago... this is arguably the biggest change for #2890 or API-next, and it needs an owner. |
I will refresh the doc (properly, in the coming week) so that we can continue the discussion there. |
/assign |
Refreshed the doc and shared it with "kubernetes-sig-cluster-lifecycle@googlegroups.com"; all members of the SIG should be able to access the doc. https://docs.google.com/document/d/1VNGpGWb-vqblfZQLp9DY1lx777rwhwIFAjDFQNIwWh0 @fabriziopandini I added your idea about the nested structure for each component, is that what you meant? @neolit123 @pacoxu @SataQiu pls review and comment, thanks! Hope we can finalize the approach before the opening of the 1.29 release cycle. |
i am assigning myself to this to try to get it done for 1.30. |
what remains here is support for upgrade timeouts. |
this was done as part of v1beta4. |
edit: neolit123
ticket repurposed for a general / broader timeout support.
design draft:
https://docs.google.com/document/d/1q0OLHSD6M0JPjN8PxgvpJX1726omQh_Qd3os4LQqrtI/edit
What keywords did you search in kubeadm issues before filing this one?
"etcd timeout" (5 open, 173 closed)
Similar questions, but not formal asks:
#1712 (comment)
#2092 (comment by @neolit123)
Is this a BUG REPORT or FEATURE REQUEST?
FEATURE REQUEST
/kind feature
/kind api-change
/area etcd
Detailed motivation is covered in #2457, but the short version is: AWS would like to add support to CAPI for Bottlerocket, and we have constraints that prevent us from just running `kubeadm join` without stopping part way along, running OS-specific configuration, and then resuming with `kubeadm join --skip-phases ...`.

When bringing up etcd on additional control plane nodes, in the subphase `control-plane-join/etcd`, kubeadm waits up to 40 seconds for etcd to be available before exiting. In our case, we expect kubeadm to fail, and we want to perform our own configuration when it does. We'd like to make this timeout configurable so that we can exit as soon as configuration and certificates are ready. Other users have asked to increase this timeout.

I think a new `joinTimeout` option under `ClusterConfiguration.etcd.local` makes the most sense, but feel free to point me to a better location. (Similar to the existing `ClusterConfiguration.apiServer.timeoutForControlPlane` option.)

Versions

kubeadm version (use `kubeadm version`): All

Environment:
- Kubernetes version (use `kubectl version`): All
- Kernel (e.g. `uname -a`):

cc @bcressey @jaxesn @vignesh-goutham
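For comparison, the field as originally proposed here would sit next to the existing apiServer timeout, roughly as below. This is a sketch of the proposal, not a merged API; the `4m0s` value is the documented v1beta2 default for `timeoutForControlPlane`:

```yaml
# Sketch of the proposed field (never merged in this exact shape).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s   # existing field
etcd:
  local:
    joinTimeout: 40s             # proposed field, matching today's 40s wait
```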