
Upgrading from 1.9.6 to 1.10.0 fails with timeout #740

Closed
stefanprodan opened this issue Mar 28, 2018 · 42 comments
Labels: kind/bug, priority/critical-urgent, triaged
Milestone: v1.10

stefanprodan commented Mar 28, 2018

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

Scaleway baremetal C2S

  • OS (e.g. from /etc/os-release):

Ubuntu Xenial (16.04 LTS) (GNU/Linux 4.4.122-mainline-rev1 x86_64 )

  • Kernel (e.g. uname -a):

Linux amd64-master-1 4.4.122-mainline-rev1 #1 SMP Sun Mar 18 10:44:19 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What happened?

Trying to upgrade from 1.9.6 to 1.10.0, I'm getting this error:

kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests411909119/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [arm-master-1] and IPs [10.1.244.57]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests180476754/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

What did you expect to happen?

Successful upgrade

How to reproduce it (as minimally and precisely as possible)?

Install 1.9.6 packages and init a 1.9.6 cluster:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
apt-get update -qq
apt-get install -qy kubectl=1.9.6-00
apt-get install -qy kubelet=1.9.6-00
apt-get install -qy kubeadm=1.9.6-00

Edit the kubeadm-config ConfigMap and change featureGates from a string to a map, as reported in kubernetes/kubernetes#61764.

kubectl -n kube-system edit cm kubeadm-config

....
featureGates: {}
....

Download kubeadm 1.10.0 and run kubeadm upgrade plan and kubeadm upgrade apply v1.10.0.
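
For reference, a minimal sketch of that last step, assuming the standalone kubeadm binary is fetched from the official release bucket (the URL and arch are assumptions; adjust for your environment):

curl -fsSL -o /tmp/kubeadm-1.10.0 \
  https://storage.googleapis.com/kubernetes-release/release/v1.10.0/bin/linux/amd64/kubeadm   # assumed release URL
chmod +x /tmp/kubeadm-1.10.0
/tmp/kubeadm-1.10.0 upgrade plan
/tmp/kubeadm-1.10.0 upgrade apply v1.10.0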

@stealthybox (Member)

Working on reproducing this bug locally.

stefanprodan commented Mar 28, 2018

After retrying this about 10 times, it finally worked.

@stealthybox (Member)

Here's my etcd manifest diff:

16,17c16,17
<     - --listen-client-urls=https://127.0.0.1:2379
<     - --advertise-client-urls=https://127.0.0.1:2379
---
>     - --listen-client-urls=http://127.0.0.1:2379
>     - --advertise-client-urls=http://127.0.0.1:2379
19,27c19
<     - --key-file=/etc/kubernetes/pki/etcd/server.key
<     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
<     - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
<     - --client-cert-auth=true
<     - --peer-client-cert-auth=true
<     - --cert-file=/etc/kubernetes/pki/etcd/server.crt
<     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
<     image: gcr.io/google_containers/etcd-amd64:3.1.12
---
>     image: gcr.io/google_containers/etcd-amd64:3.1.11
29,35d20
<       exec:
<         command:
<         - /bin/sh
<         - -ec
<         - ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
<           --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
<           get foo
36a22,26
>       httpGet:
>         host: 127.0.0.1
>         path: /health
>         port: 2379
>         scheme: HTTP
43,45c33
<       name: etcd-data
<     - mountPath: /etc/kubernetes/pki/etcd
<       name: etcd-certs
---
>       name: etcd
51,55c39
<     name: etcd-data
<   - hostPath:
<       path: /etc/kubernetes/pki/etcd
<       type: DirectoryOrCreate
<     name: etcd-certs
---
>     name: etcd
root@vagrant:~# ls /etc/kubernetes/pki/etcd
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key

@stealthybox (Member)

1.9.6 cluster on Ubuntu 17.10 Vagrant:

root@vagrant:/vagrant# 1.10_kubernetes/server/bin/kubeadm upgrade apply v1.10.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.0"
[upgrade/versions] Cluster version: v1.9.6
[upgrade/versions] kubeadm version: v1.10.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.0"...
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests262738652/etcd.yaml"
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [localhost] and IPs [127.0.0.1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [vagrant] and IPs [10.0.2.15]
[certificates] Generated etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests858209931/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug=""]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: net/http: TLS handshake timeout]
[apiclient] Error getting Pods with label selector "component=etcd" [the server was unable to return a response in the time allotted, but may still be processing the request (get pods)]
[apiclient] Error getting Pods with label selector "component=etcd" [Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods?labelSelector=component%3Detcd: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""]
[upgrade/apply] FATAL: fatal error when trying to upgrade the etcd cluster: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition], rolled the state back to pre-upgrade state

stealthybox commented Mar 28, 2018

This is my repro environment: https://github.com/stealthybox/vagrant-kubeadm-testing

Change these lines to 1.9.6-00 for the bootstrap: https://github.com/stealthybox/vagrant-kubeadm-testing/blob/9d4493e990c9bd742107b317641267c3ef3640cd/Vagrantfile#L18-L20

Then download the 1.10 server binaries into the repo, and they will be available on the guest in /vagrant:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#server-binaries
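
A rough sketch of pulling those server binaries into the repo so they show up under /vagrant (the tarball URL is an assumption based on the CHANGELOG server-binaries table; the 1.10_kubernetes directory name matches the path used in the commands later in this thread):

# run from the root of the vagrant-kubeadm-testing checkout (synced to /vagrant in the guest)
curl -fsSL -O https://dl.k8s.io/v1.10.0/kubernetes-server-linux-amd64.tar.gz   # assumed download location
mkdir -p 1.10_kubernetes
tar -xzf kubernetes-server-linux-amd64.tar.gz --strip-components=1 -C 1.10_kubernetes
ls 1.10_kubernetes/server/bin/kubeadm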

@stealthybox (Member)

kubelet etcd-related logs:

root@vagrant:~# journalctl -xefu kubelet | grep -i etcd
Mar 28 16:32:07 vagrant kubelet[14676]: W0328 16:32:07.808776   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:32:07 vagrant kubelet[14676]: I0328 16:32:07.880412   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:34:27 vagrant kubelet[14676]: W0328 16:34:27.472534   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:57:33 vagrant kubelet[14676]: W0328 16:57:33.683648   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725564   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:33 vagrant kubelet[14676]: I0328 16:57:33.725637   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "etcd-vagrant" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 16:57:35 vagrant kubelet[14676]: E0328 16:57:35.484901   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)", container etcd: selfLink was empty, can't make reference
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889458   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.889595   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd" (OuterVolumeSpecName: "etcd") pod "7278f85057e8bf5cb81c9f96d3b25320" (UID: "7278f85057e8bf5cb81c9f96d3b25320"). InnerVolumeSpecName "etcd". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 16:57:35 vagrant kubelet[14676]: I0328 16:57:35.989892   14676 reconciler.go:297] Volume detached for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") on node "vagrant" DevicePath ""
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.688878   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Timeout: request did not complete within allowed duration
Mar 28 16:58:03 vagrant kubelet[14676]: E0328 16:58:03.841447   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfbc5", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-certs\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e59c5, ext:1534226953099, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:58:33 vagrant kubelet[14676]: E0328 16:58:33.844276   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff626cfb82", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"37936d2107e31b457cada6c2433469f1", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"SuccessfulMountVolume", Message:"MountVolume.SetUp succeeded for volume \"etcd-data\" ", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f713e5982, ext:1534226953033, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.692450   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Mar 28 16:59:03 vagrant kubelet[14676]: E0328 16:59:03.848007   14676 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-vagrant.152023ff641f915f", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"etcd-vagrant", UID:"7278f85057e8bf5cb81c9f96d3b25320", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{etcd}"}, Reason:"Killing", Message:"Killing container with id docker://etcd:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"vagrant"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbea7103f72f0ef5f, ext:1534255433999, loc:(*time.Location)(0x5859e60)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.472661   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:14 vagrant kubelet[14676]: W0328 16:59:14.473138   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473190   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:14 vagrant kubelet[14676]: E0328 16:59:14.473658   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: W0328 16:59:15.481336   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.483705   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 16:59:15 vagrant kubelet[14676]: E0328 16:59:15.497391   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:00:34 vagrant kubelet[14676]: W0328 17:00:34.475851   14676 kubelet.go:1597] Deleting mirror pod "etcd-vagrant_kube-system(122348c3-32a6-11e8-8dc5-080027d6be16)" because it is outdated
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.720076   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.720107   14676 mirror_client.go:88] Failed deleting a mirror pod "etcd-vagrant_kube-system": Delete https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: http2: server sent GOAWAY and closed the connection; LastStreamID=47, ErrCode=NO_ERROR, debug=""; some request body already written
Mar 28 17:01:07 vagrant kubelet[14676]: E0328 17:01:07.725335   14676 kubelet.go:1612] Failed creating a mirror pod for "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Post https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: I0328 17:01:07.728709   14676 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "etcd" (UniqueName: "kubernetes.io/host-path/7278f85057e8bf5cb81c9f96d3b25320-etcd") pod "etcd-vagrant" (UID: "7278f85057e8bf5cb81c9f96d3b25320")
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.734475   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:07 vagrant kubelet[14676]: W0328 17:01:07.740642   14676 status_manager.go:459] Failed to get status for pod "etcd-vagrant_kube-system(7278f85057e8bf5cb81c9f96d3b25320)": Get https://10.0.2.15:6443/api/v1/namespaces/kube-system/pods/etcd-vagrant: dial tcp 10.0.2.15:6443: getsockopt: connection refused
Mar 28 17:01:09 vagrant kubelet[14676]: E0328 17:01:09.484412   14676 kuberuntime_container.go:66] Can't make a ref to pod "etcd-vagrant_kube-system(37936d2107e31b457cada6c2433469f1)", container etcd: selfLink was empty, can't make reference
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.848794   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849282   14676 reconciler.go:191] operationExecutor.UnmountVolume started for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1")
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849571   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data" (OuterVolumeSpecName: "etcd-data") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-data". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.849503   14676 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs" (OuterVolumeSpecName: "etcd-certs") pod "37936d2107e31b457cada6c2433469f1" (UID: "37936d2107e31b457cada6c2433469f1"). InnerVolumeSpecName "etcd-certs". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949925   14676 reconciler.go:297] Volume detached for volume "etcd-certs" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-certs") on node "vagrant" DevicePath ""
Mar 28 17:01:09 vagrant kubelet[14676]: I0328 17:01:09.949975   14676 reconciler.go:297] Volume detached for volume "etcd-data" (UniqueName: "kubernetes.io/host-path/37936d2107e31b457cada6c2433469f1-etcd-data") on node "vagrant" DevicePath ""

@stefanprodan (Author)

The current workaround is to keep retrying the upgrade; at some point it will succeed.
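
If you don't want to babysit it, a simple retry loop like this works (just a sketch; the attempt cap is arbitrary, and -y assumes your kubeadm build supports skipping the confirmation prompt):

for i in $(seq 1 10); do
  kubeadm upgrade apply v1.10.0 -y && break   # stop at the first successful attempt
  echo "attempt $i failed, retrying..."
  sleep 10
done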

kad commented Mar 28, 2018

@stealthybox Do you happen to have logs out of docker for the etcd container? Also, grep -i etcd might be masking some of the kubelet output, e.g. some error messages that don't have the container name in them but are still relevant.

detiber commented Mar 28, 2018

I just hit another weird edge case related to this bug. The kubeadm upgrade marked the etcd upgrade as complete prior to the new etcd image being pulled and the new static pod being deployed. This causes the upgrade to time out at a later step and the upgrade rollback to fail. This also leaves the cluster in a broken state. Restoring the original etcd static pod manifest is needed to recover the cluster.

@kesavkolla

Oh yes, I'm also stuck there. My cluster is completely down. Can someone share some instructions on how to recover from this state?

@zcalusic

Been there on my 2nd attempt to upgrade, just like @detiber described. Quite painful. 😢

Found some backed-up stuff in /etc/kubernetes/tmp; suspecting that etcd might be the culprit, I copied its old manifest over the new one in the manifests folder. At that point I had nothing to lose, because I had completely lost control of the cluster. Then, I don't remember exactly, but I think I restarted the whole machine and later downgraded everything back to v1.9.6. Eventually I regained control of the cluster and then lost any motivation to mess with v1.10.0 again. It was not fun at all...

stealthybox commented Mar 28, 2018

If you roll back the etcd static pod manifest from /etc/kubernetes/tmp, it's important to also roll back the apiserver manifest to the 1.9 version, because of the new TLS configuration in 1.10.

^ You probably won't need to do this, though, because I believe the etcd upgrade blocks the rest of the control plane upgrade.

@dvdmuckle

It seems that only the etcd manifest isn't rolled back on a failed upgrade; everything else is fine. After moving the backup manifest over and restarting the kubelet, everything comes back up fine.
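
For anyone who ends up in that state, a rough recovery sketch (the tmp directory name has a random suffix, so use whatever path kubeadm printed as the backup location; if the apiserver manifest was also replaced, roll that one back too, per the earlier comments):

# find the backup kubeadm wrote during the failed upgrade
ls /etc/kubernetes/tmp/
# put the old etcd manifest back so the kubelet recreates the pre-upgrade pod
cp /etc/kubernetes/tmp/kubeadm-backup-manifests<suffix>/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
systemctl restart kubelet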

@codepainters

I faced the same timeout problem: kubeadm rolled back the kube-apiserver manifest to 1.9.6 but left the etcd manifest as is (read: with TLS enabled), obviously leading the apiserver to fail miserably and effectively breaking my master node. A good candidate for a separate issue report, I suppose.

detiber commented Mar 29, 2018

@dvdmuckle @codepainters, unfortunately, whether the rollback is successful depends on which component hits the race condition (etcd or the API server). I found a fix for the race condition, but it completely breaks the kubeadm upgrade. I'm working with @stealthybox to try to find a proper path forward for properly fixing the upgrade.

detiber commented Mar 29, 2018

@codepainters I think it is the same issue.

There are a few underlying problems causing this issue:

  • The upgrade is generating a hash of the mirror pod for each component from the result of querying the mirror pod from the API. The upgrade then tests to see if this hashed value changes to determine if the pod is updated from the static manifest change. The hashed value includes fields that can be mutated for reasons other than the static manifest change (such as pod status updates). If the pod status changes between the hash comparisons, then the upgrade will continue to the next component prematurely.
  • The upgrade performs the etcd static pod manifest update (including adding TLS security to etcd) and attempts to use the apiserver to verify that the pod has been updated; however, the apiserver manifest has not yet been updated at this point to use TLS for communicating with etcd.

As a result, the upgrade currently succeeds only when there happens to be a pod status update for the etcd pod that causes the hash to change before the kubelet picks up the new static manifest for etcd. Additionally, the API server needs to remain available during the first part of the apiserver upgrade, while the upgrade tooling is querying the API prior to updating the apiserver manifest.
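
One way to see this race from the outside is to watch the etcd mirror pod object that kubeadm is hashing: its resourceVersion moves on every status update, while the kubernetes.io/config.hash annotation only changes once the kubelet has actually picked up the new static manifest (a diagnostic sketch, not part of kubeadm; the etcd-$(hostname) pod name assumes your node name matches the hostname):

kubectl -n kube-system get pod etcd-$(hostname) -o yaml --watch \
  | grep -E 'resourceVersion|config.hash'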

@stealthybox (Member)

@detiber and I got on a call to discuss changes we need to make to the upgrade process.
We plan to implement 3 fixes for this bug in the 1.10.x patch releases:

  • Remove etcd TLS from upgrade.
    The current upgrade loop does batch modifications per component in a serial manner.
    Upgrading a component has no knowledge of dependent component configurations.
    Verifying an upgrade requires that the APIServer is available to check the pod status.
    Etcd TLS requires a coupled etcd+apiserver configuration change which breaks this contract.
    This is the minimum viable change to fix this issue, and leaves upgraded clusters with insecure etcd.

  • Fix the mirror-pod hash race condition on pod status change.
    https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/upgrade/staticpods.go#L189.
    Upgrades will now be correct assuming compatibility between etcd and apiserver flags.

  • Upgrade TLS specifically in a separate phase.
    Etcd and the APIServer need to be upgraded together.
    kubeadm alpha phase ensure-etcd-tls ?.
    This phase should be runnable independently of a cluster upgrade.
    During a cluster upgrade, this phase should run before updating all of the components.


For 1.11 we want to:

  • Use the kubelet API for runtime checks of upgraded static pods.
    It's undesirable to rely on the apiserver and etcd for monitoring local processes like we are currently doing.
    A local source of data about pods is superior to relying on higher-order distributed kubernetes components.
    This will replace the current pod runtime checks in the upgrade loop.
    This will allow us to add checks to the ensure-etcd-tls phase.

Alternative: use the CRI to get pod info (demo'd viable using crictl).
Caveat: CRI on dockershim, and possibly other container runtimes, does not currently support backward compatibility for CRI breaking changes.

@stealthybox (Member)

TODO:

  • open and link issues for these 4 changes.

detiber commented Mar 30, 2018

PR to address the static pod update race condition: kubernetes/kubernetes#61942
cherry-pick PR for release-1.10 branch: kubernetes/kubernetes#61954

@codepainters

@detiber do you mind explaining what race condition we are talking about? I'm not that familiar with kubeadm internals, yet it sounds interesting.

@timothysc timothysc self-assigned this Apr 3, 2018
@timothysc timothysc added this to the v1.10 milestone Apr 3, 2018
@timothysc timothysc added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triaged labels Apr 3, 2018
@stealthybox (Member)

@codepainters see #740 (comment)

kable-wilmoth commented Apr 9, 2018

FYI - same problem/issue upgrading from 1.9.3.
I tried the workaround of retrying a number of times; I finally hit the race condition with the API server and the upgrade couldn't roll back.

@codepainters

@stealthybox thx, I didn't get it at first reading.

S569 commented Apr 9, 2018

I am having the same issue while upgrading: [ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok" and [ERROR MasterNodesReady]: couldn't list masters in cluster: Get https....... Please help me with this. I am upgrading from 1.9.3 to 1.10.0. Initially, it was able to reach a certain point of "[upgrade/staticpods] Waiting for the kubelet to restart the component".

stealthybox commented Apr 16, 2018

A temporary workaround is to ensure the certs and upgrade the etcd and apiserver pods manually, bypassing the checks.

Be sure to check your Config and add any flags for your use case:

kubectl -n kube-system edit cm kubeadm-config  # change featureGates
...
  featureGates: {}
...
kubeadm alpha phase certs all
kubeadm alpha phase etcd local
kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

DylanGraham commented Apr 17, 2018

Thanks @stealthybox
For me, the upgrade apply process stalled on [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...; however, the cluster was successfully upgraded.

kvaps commented Apr 17, 2018

@stealthybox I'm not sure, but it seems that something is broken after these steps, because kubeadm upgrade plan hangs after that:

[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.10.1
[upgrade/versions] kubeadm version: v1.10.1
[upgrade/versions] Latest stable version: v1.10.1

When applying the update, it also hung at [upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...

codepainters commented Apr 17, 2018

@kvaps @stealthybox this is most likely an etcd issue (kubeadm speaks plain HTTP/2 to a TLS-enabled etcd); I hit it too. See this other issue: #755

Honestly, I can't understand why the same TCP port is used for both TLS and non-TLS etcd listeners; it only causes troubles like this one. Getting a plain old connection refused would give an immediate hint; here I had to resort to tcpdump to understand what's going on.
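
For anyone else debugging this, a quicker check than tcpdump for whether the local etcd listener is already TLS-only (a sketch; the cert paths are the ones kubeadm 1.10 generates under /etc/kubernetes/pki/etcd):

# plain HTTP probe: against a TLS-enabled etcd this fails instead of returning {"health": "true"}
curl -sf http://127.0.0.1:2379/health || echo "plain HTTP refused -> etcd is likely serving TLS"
# TLS probe using the kubeadm-generated healthcheck client cert
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
     https://127.0.0.1:2379/health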

stealthybox commented Apr 18, 2018

OH!
Shoot, you're right: that only works with my local TLS patch for the etcd status check.

Do this to finish the upgrade:

kubeadm alpha phase controlplane all
kubeadm alpha phase upload-config

edited the above workaround to be correct

renich commented Apr 24, 2018

@stealthybox the second kubeadm command doesn't work:

# kubeadm alpha phase upload-config
The --config flag is mandatory

@stealthybox (Member)

@renich just give it the filepath of your config

If you don't use any custom settings, you can pass it an empty file.
Here's a simple way to do that in bash:

1.10_kubernetes/server/bin/kubeadm alpha phase upload-config --config <(echo)

detiber commented Apr 26, 2018

This should now be resolved with the merging of kubernetes/kubernetes#62655 and will be part of the v1.10.2 release.

@detiber detiber closed this as completed Apr 26, 2018
vaizki commented Apr 29, 2018

I can confirm that the 1.10.0 -> 1.10.2 upgrade with kubeadm 1.10.2 was smooth, no timeouts.

denis111 commented May 4, 2018

I still get a timeout on 1.10.0 -> 1.10.2, but a different one:

[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: kube-apiserver-master hash: a273591d3207fcd9e6fd0c308cc68d64
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

I'm not sure what to do...

@dvdmuckle

@denis111 check the API server logs while doing the upgrade using docker ps. I feel like you might be running into an issue I'm also running into.
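
Something along these lines (assuming a Docker-based control plane; the k8s_ name prefix is the convention the kubelet uses for pod containers):

docker ps --filter name=k8s_kube-apiserver --format '{{.ID}} {{.Names}}'
docker logs -f $(docker ps -q --filter name=k8s_kube-apiserver | head -n 1)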

denis111 commented May 7, 2018

@dvdmuckle Well, I don't see any errors in those logs, only entries starting with I and a few W.
And I think the hash of kube-apiserver doesn't change during the upgrade.

@carlosedp

I have an ARM64 cluster on 1.9.3; I successfully updated to 1.9.7 but got the same timeout problem when upgrading from 1.9.7 to 1.10.2.

I even tried editing and recompiling kubeadm with increased timeouts (like these last commits: https://github.com/anguslees/kubernetes/commits/kubeadm-gusfork), with the same results.

$ sudo kubeadm upgrade apply  v1.10.2 --force
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.2"
[upgrade/versions] Cluster version: v1.9.7
[upgrade/versions] kubeadm version: v1.10.2-dirty
[upgrade/version] Found 1 potential version compatibility errors but skipping since the --force flag is set:

   - Specified version to upgrade to "v1.10.2" is higher than the kubeadm version "v1.10.2-dirty". Upgrade kubeadm first using the tool you used to install kubeadm
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.2"...
Static pod: kube-apiserver-kubemaster1 hash: ed7578d5bf9314188dca798386bcfb0e
Static pod: kube-controller-manager-kubemaster1 hash: e0c3f578f1c547dcf9996e1d3390c10c
Static pod: kube-scheduler-kubemaster1 hash: 52e767858f52ac4aba448b1a113884ee
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kubemaster1 hash: 413224efa82e36533ce93e30bd18e3a8
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests346927148/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests190581659/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

sdarwin commented May 18, 2018

Upgrade v1.10.2 -> v1.10.2 (which may be nonsense, just testing...)

Ubuntu 16.04.

And it fails with an error.

kubeadm upgrade apply v1.10.2

[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

@carlosedp

I wonder if this is still being tracked in some issue... I couldn't find one.

mattkelly commented May 22, 2018

I'm also seeing upgrades still failing with the timed out waiting for the condition error.

Edit: Moved discussion to a new ticket #850, please discuss there.

cristifalcas commented Aug 10, 2018

If anyone else has this problem with 1.9.x:

If you are on AWS with custom hostnames, you need to edit the kubeadm-config ConfigMap and set nodeName to the AWS internal name: ip-xx-xx-xx-xx.$REGION.compute.internal

kubectl -n kube-system edit cm kubeadm-config -oyaml

This is in addition to setting the etcd client to http. I'm not yet on later versions to see if they fixed that.

This is because kubeadm tries to read this path in the API: /api/v1/namespaces/kube-system/pods/kube-apiserver-$NodeName
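
You can check whether that lookup will succeed before retrying the upgrade (a sketch; replace the hypothetical <nodeName> placeholder with the nodeName value stored in the kubeadm-config ConfigMap):

# the node names the cluster knows about (on AWS these are the ip-xx-xx-xx-xx internal names)
kubectl get nodes -o name
# the mirror pod kubeadm looks up; a NotFound error here means nodeName in kubeadm-config doesn't match
kubectl -n kube-system get pod kube-apiserver-<nodeName>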

@carlosedp

Since the timeout was increased in 1.10.6, I successfully updated my 1.9.7 deployment to 1.10.6 a couple of weeks ago.

I'm planning to upgrade to 1.11.2 as soon as the .deb packages are ready, as the same changes are in that version.

My cluster runs on-premises on ARM64 boards.
