add API support for controlling various timeouts #2463
Comments
@micahhausler you've mentioned that the kubelet bootstrap timeout is also something that you'd like to be configurable:
scoping this to local only seems correct, because this is waiting for local etcd members to join.
one problem here is that this function is based on retries * period: so a single timeout field requires internal assumptions and possibly deviating from the exact user-specified value. @micahhausler do you have someone that can take the action item to work on this for v1.22 and kubeadm v1beta3? cc @fabriziopandini |
Yep, I'll open a separate issue for that.
SGTM
I agree on not making the API too granular. Can we keep the existing retries/period values as defaults, but increase/decrease the retry count based on the supplied timeout? It also seems like
Yes, I'm good to take this one |
we could drop the max retries approach and just execute getClusterStatus() in a loop without delay between calls, counting the retries, until memberJoinTimeout passes or until getClusterStatus() returns success.
etcd.go needs cleanup / refactors, but getClusterStatus() is called in a few places, so i think we might want to avoid touching it for the time being and leave the changes on the side of the caller around this API change - in this case WaitForClusterAvailable(). |
we had a discussion with @fabriziopandini on the topic of timeouts in the kubeadm API today. instead we should have a "timeouts" sub-structure under InitConfiguration and JoinConfiguration that holds a number of different timeouts. so if we are going to make timeout related changes in v1beta3 we should make a broader change:
naming / field locations are just ideas at this point. open questions from me are:
|
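To make the nested-structure proposal concrete, a "timeouts" sub-structure under JoinConfiguration might look roughly like this sketch (field names and values here are illustrative placeholders taken from this discussion, not the final API):

```yaml
# Sketch only: names/values are placeholders, not the API that shipped.
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
timeouts:
  etcdMemberJoin: 40s
  kubeletHealthCheck: 4m0s
```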
/cc @wangyysde |
@micahhausler Are you start working on this? If not, I will to this. |
@wangyysde Feel free to take this one! |
@neolit123 Then, can I try to add timeouts.etcdMemberJoin to JoinConfiguration first? |
I think a single PR with multiple commits that covers all the changes is preferred.
|
/assign |
Hi @neolit123 @fabriziopandini I have created a PR (kubernetes/kubernetes#103226) to fix this issue. Could you review it?
|
i think it is more appropriate to have a single PR with multiple commits. |
OK, I will try to create a single PR with multiple commits for this issue. |
commented on the PR:
|
something else: i enumerated some timeouts that we might want above. for example, we might want to have etcd timeouts and a kubelet timeout during init too. |
Ok. I found there are some errors in the PR. So I think it has to be scheduled for v1beta4.
In fact, I agree with you. But I will try to add the etcd timeout and kubelet timeout to join only, as discussed above.
I want to prepare an API KEP for this. But I have never created an API KEP. So maybe I need some help. |
There is no need for a new KEP, i'd say. Just a HackMD or a Google doc where we can agree on the design. Once we agree, we can summarize the change in the kubeadm API KEP. |
I have prepared a document here: https://docs.google.com/document/d/1q0OLHSD6M0JPjN8PxgvpJX1726omQh_Qd3os4LQqrtI
@wangyysde i have added some comments in the doc. again, thanks a lot for looking into this. a couple of things to note about the design:
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/lifecycle frozen |
/unassign wangyysde |
Can I work on it? |
@chenk008 sure, you can ask @wangyysde what progress he made. |
I am trying to figure out who can own this for v1beta4; note that v1beta4 has been planned for the 1.29 release cycle. this does seem like something we can have for v1beta4, and there is a WIP PR for it: kubernetes/kubernetes#103226. I haven't seen any major problems with it, but we can leave some details for review. @chenk008 @wangyysde do you still want to work on it? If yes, pls assign it to yourself. Otherwise, we will find someone else to help (I am a candidate as well). |
@chendave i see you sent a new PR: having some sort of a PR for experiments is OK, but we are missing a status update on the latest proposal in doc form. there were some discussions around nested vs flat timeouts, but i have forgotten the details as this was from 2 years ago... this is arguably the biggest change for #2890 or API-next, and it needs an owner. |
I will refresh the doc (properly, in the coming week) so that we can continue the discussion there. |
/assign |
Refreshed the doc and shared it with "kubernetes-sig-cluster-lifecycle@googlegroups.com"; all members of the SIG should be able to access the doc. https://docs.google.com/document/d/1VNGpGWb-vqblfZQLp9DY1lx777rwhwIFAjDFQNIwWh0 @fabriziopandini I added your idea about the nested structure for each component, is that what you meant? @neolit123 @pacoxu @SataQiu pls review and comment, thanks! Hope we can finalize the approach before the opening of the 1.29 release cycle. |
i am assigning myself to this to try to get it done for 1.30. |
what remains here is support for upgrade timeouts. |
this was done as part of v1beta4. |
edit: neolit123
ticket repurposed for a general / broader timeout support.
design draft:
https://docs.google.com/document/d/1q0OLHSD6M0JPjN8PxgvpJX1726omQh_Qd3os4LQqrtI/edit
What keywords did you search in kubeadm issues before filing this one?
"etcd timeout" (5 open, 173 closed)
Similar questions, but not formal asks:
#1712 (comment)
#2092 (comment by @neolit123)
Is this a BUG REPORT or FEATURE REQUEST?
FEATURE REQUEST
/kind feature
/kind api-change
/area etcd
Detailed motivation is covered in #2457, but the short version is: AWS would like to add support to CAPI for Bottlerocket, and we have constraints that prevent us from just running `kubeadm join` without stopping part way along, running OS-specific configuration, and then resuming with `kubeadm join --skip-phases ...`.

When bringing up etcd on additional control plane nodes, in the subphase `control-plane-join/etcd`, kubeadm waits up to 40 seconds for etcd to be available before exiting. In our case, we expect kubeadm to fail, and we want to perform our own configuration when it does. We'd like to make this timeout configurable so that we can exit as soon as configuration and certificates are ready. Other users have asked to increase this timeout.

I think a new `joinTimeout` option under `ClusterConfiguration.etcd.local` makes the most sense, but feel free to point me to a better location. (Similar to the existing `ClusterConfiguration.apiServer.timeoutForControlPlane` option.)

Versions

kubeadm version (use `kubeadm version`): All

Environment:
- Kubernetes version (use `kubectl version`): All
- Kernel (e.g. `uname -a`):

cc @bcressey @jaxesn @vignesh-goutham
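For comparison, the field as originally proposed here would sit next to the existing apiServer timeout, roughly as below. This is a sketch of the proposal, not a merged API; the `4m0s` value is the documented v1beta2 default for `timeoutForControlPlane`:

```yaml
# Sketch of the proposed field (never merged in this exact shape).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s   # existing field
etcd:
  local:
    joinTimeout: 40s             # proposed field, matching today's 40s wait
```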