Upgrading from 1.9.6 to 1.10.0 fails with timeout #740
Comments
Working on reproducing this bug locally.
After retrying this 10 times, it finally worked.
Here's my etcd manifest diff:
1.9.6 cluster on Ubuntu 17.10 Vagrant:
This is my repro environment: https://github.com/stealthybox/vagrant-kubeadm-testing. Change these lines, then download the 1.10 server binaries into the repo, and they will be available on the guest.
kubelet etcd-related logs:
The current workaround is to keep retrying the upgrade; at some point it will succeed.
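A minimal retry loop along those lines (a sketch only: it assumes the 1.10 kubeadm binary is on the PATH and that the -y flag is acceptable to skip the confirmation prompt):

```bash
# Workaround, not a fix: rerun the upgrade until it exits cleanly.
until kubeadm upgrade apply v1.10.0 -y; do
  echo "upgrade attempt failed, retrying in 30s..."
  sleep 30
done
```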
@stealthybox Do you happen to have logs out of docker for the etcd container? Also,
I just hit another weird edge case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete prior to the new etcd image being pulled and the new static pod being deployed. This causes the upgrade to time out at a later step and the upgrade rollback to fail. It also leaves the cluster in a broken state. Restoring the original etcd static pod manifest is needed to recover the cluster.
Oh yes, I'm also stuck there. My cluster is completely down. Can someone share some instructions on how to recover from this state?
Been there on my second attempt to upgrade, just like @detiber described; quite painful. 😢 Found some backed-up stuff in /etc/kubernetes/tmp and, suspecting that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, because I had completely lost control of the cluster. Then, I don't remember exactly, but I think I restarted the whole machine and later downgraded everything back to v1.9.6. Eventually I regained control of the cluster and then lost any motivation to mess with v1.10.0 again. It was not fun at all...
If you roll back the etcd static pod manifest from ^, you probably won't need to do this though, because I believe the etcd upgrade blocks the rest of the control plane upgrade.
It seems that only the etcd manifest isn't rolled back on a failed upgrade; everything else is fine. After moving the backup manifest back over and restarting the kubelet, everything comes back fine.
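A minimal recovery sketch based on the steps above (the backup directory name under /etc/kubernetes/tmp varies per run, so the path below is a placeholder; use whatever kubeadm left there):

```bash
# Restore the pre-upgrade etcd static pod manifest from kubeadm's backup copy.
cp /etc/kubernetes/tmp/<backup-dir>/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

# Restart the kubelet so it picks up the restored static pod manifest.
systemctl restart kubelet
```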
I faced the same timeout problem, and kubeadm rolled back the kube-apiserver manifest to 1.9.6 but left the etcd manifest as is (read: with TLS enabled), obviously leading the apiserver to fail miserably and effectively breaking my master node. A good candidate for a separate issue report, I suppose.
@dvdmuckle @codepainters, unfortunately whether the rollback is successful depends on which component hits the race condition (etcd or the API server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try to find a proper path forward for fixing the upgrade.
@codepainters I think it is the same issue. There are a few underlying problems causing this issue:
As a result, the upgrade currently only succeeds when there happens to be a pod status update for the etcd pod that causes the hash to change before the kubelet picks up the new static manifest for etcd. Additionally, the API server needs to remain available for the first part of the apiserver upgrade, while the upgrade tooling is querying the API prior to updating the apiserver manifest.
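To illustrate the hash-based wait (a rough illustration only, not kubeadm's actual code, and it assumes the node name matches the hostname): the upgrade tooling reads the etcd mirror pod from the API, hashes it, writes the new static manifest, and then polls until the hash changes. Because the hash covers the whole mirror pod object, an unrelated status-only update also changes it, which is what the upgrade currently depends on.

```bash
# Illustrative only: poll the etcd mirror pod and hash the whole object,
# roughly how the upgrade waiter decides the "new" etcd pod is running.
NODE=$(hostname)
OLD_HASH=$(kubectl -n kube-system get pod "etcd-${NODE}" -o yaml | sha256sum)

# ... the new /etc/kubernetes/manifests/etcd.yaml is written at this point ...

until [ "$(kubectl -n kube-system get pod "etcd-${NODE}" -o yaml | sha256sum)" != "$OLD_HASH" ]; do
  sleep 5  # any update to the mirror pod, even status-only, changes this hash
done
```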
@detiber and I got on a call to discuss changes we need to make to the upgrade process.
For 1.11 we want to:
alternative: Use the CRI to get pod info (demo'd viable using
TODO:
PR to address the static pod update race condition: kubernetes/kubernetes#61942
@detiber do you mind explaining what race condition we are talking about? I'm not that familiar with kubeadm internals, but it sounds interesting.
FYI - same problem/issue upgrading from 1.9.3
@stealthybox thanks, I didn't get it on first reading.
I am having the same issue: [ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods by bypassing the checks. Be sure to check your config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureGates from a string to a map
...
featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
Thanks @stealthybox
@stealthybox I'm not sure, but it seems that something is broken after these steps, because
When applying the update, it hung at
@kvaps @stealthybox this is most likely
Honestly, I can't understand why the same TCP port is used for both TLS and non-TLS traffic.
OH! Do this to finish the upgrade:
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config
(edited the above workaround to be correct)
@stealthybox the second kubeadm command doesn't work:
@renich just give it the filepath of your config. If you don't use any custom settings, you can pass it an empty file:
1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)
This should now be resolved with the merging of kubernetes/kubernetes#62655 and will be part of the v1.10.2 release.
I can confirm that the 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 was smooth, with no timeouts.
I still get a timeout on 1.10.0 -> 1.10.2, but a different one: I'm not sure what to do...
@denis111 check the API server logs while doing the upgrade using
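One way to do that (an assumption on my part, since the exact command got cut off above; at this point the control plane runs as static pods under Docker, and the node name is assumed to equal the hostname, so adjust for your runtime):

```bash
# Tail the kube-apiserver container logs directly from Docker during the upgrade.
docker ps | grep kube-apiserver
docker logs -f <kube-apiserver-container-id>

# Or, while the API server is still reachable (pod name is kube-apiserver-<nodeName>):
kubectl -n kube-system logs -f kube-apiserver-$(hostname)
```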
@dvdmuckle Well, I don't see any errors in those logs, only entries starting with I and a few with W.
I have an ARM64 cluster on 1.9.3 and successfully updated it to 1.9.7, but hit the same timeout problem upgrading from 1.9.7 to 1.10.2. I even tried editing and recompiling kubeadm to increase the timeouts (like these last commits: https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork), with the same results.
Upgrade v1.10.2 -> v1.10.2 (which may be nonsense, just testing...) on Ubuntu 16.04, and it fails with an error.
I wonder if this is still tracked in some issue... I couldn't find one.
I'm also seeing upgrades still failing with the
Edit: Moved discussion to a new ticket #850, please discuss there.
If anyone else has this problem with 1.9.x: if you are on AWS with custom hostnames, you need to edit the kubeadm-config ConfigMap and set nodeName to the AWS internal name (ip-xx-xx-xx-xx.$REGION.compute.internal), besides setting the etcd client to http. I'm not yet on later versions to see if they fixed that. This is because kubeadm tries to read this path in the API: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
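A sketch of what that looks like (the internal hostname below is a placeholder; the raw API path is the one quoted above):

```bash
# Check which pod name kubeadm will look up (the path quoted above):
kubectl get --raw "/api/v1/namespaces/kube-system/pods/kube-apiserver-ip-10-0-0-12.eu-west-1.compute.internal"

# Edit the kubeadm-config ConfigMap so nodeName matches the AWS internal hostname:
kubectl -n kube-system edit cm kubeadm-config
#   ...
#   nodeName: ip-10-0-0-12.eu-west-1.compute.internal
#   ...
```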
Since the timeout has been increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a couple of weeks ago. I'm planning to upgrade to 1.11.2 as soon as the .deb packages are ready, since the same changes are in that version. My cluster runs on-premises on ARM64 boards.
BUG REPORT
Versions
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Scaleway baremetal C2S
Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64 )
uname -a: Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
What happened?
Trying to upgrade from 1.9.6 to 1.10.0, I'm getting this error:
What you expected to happen?
Successful upgrade
How to reproduce it (as minimally and precisely as possible)?
Install 1.9.6 packages and init a 1.9.6 cluster:
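A sketch of that step (the package names are the official ones; the pinned version strings and the -00 Debian revision are assumptions, so adjust to what your repository offers):

```bash
# Install the 1.9.6 packages (pinned versions are an assumption)
apt-get update
apt-get install -y kubelet=1.9.6-00 kubeadm=1.9.6-00 kubectl=1.9.6-00

# Initialise a 1.9.6 control plane
kubeadm init --kubernetes-version v1.9.6
```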
Edit the kubeadm-config ConfigMap and change featureGates from a string to a map, as reported in kubernetes/kubernetes#61764.
Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
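A sketch of that last step (the download URL pattern is an assumption based on the official release bucket layout at the time):

```bash
# Fetch the standalone 1.10.0 kubeadm binary (URL pattern is an assumption)
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.10.0/bin/linux/amd64/kubeadm
chmod +x kubeadm

# Plan and apply the upgrade with the new binary
./kubeadm upgrade plan
./kubeadm upgrade apply v1.10.0
```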