apiserver timeouts after rolling-update of etcd cluster #47131

Closed
javipolo opened this issue Jun 7, 2017 · 67 comments
Assignees
Labels
area/etcd kind/bug Categorizes issue or PR as related to a bug. milestone/removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@javipolo

javipolo commented Jun 7, 2017

Kubernetes version: 1.6.2
Etcd version: 3.1.8

Environment:

  • apiservers are Virtual Machines on KVM
  • Debian jessie
  • Kernel: 3.16.0-4-amd64
  • Others: both etcd and kube-apiserver run on the same machines

What happened:
We were upgrading the configuration of etcd (the election-timeout and heartbeat-interval flags).
We updated all of our etcd servers one at a time (we have 5).
We checked that the etcd cluster was healthy by issuing etcdctl cluster-health and etcdctl member list.
Then kube-apiserver started to behave erratically, timing out on almost every request.
In the apiserver logs we can see lots of lines like this:

E0607 17:45:11.447234     367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452760     367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452898     367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.453120     367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted

To get the apiservers behaving correctly again, we had to restart the kube-apiserver service (just that one service, on all of our apiservers).

We did this twice, and the same thing happened both times. The cluster is in production, so we cannot risk a third outage to reproduce it again; both times the behaviour was pretty consistent.

What you expected to happen:
That etcd would simply pick up its new configuration and the apiserver would never stop working.

How to reproduce it (as minimally and precisely as possible):

  1. Change the flags election-timeout and heartbeat-interval and then do a rolling restart of the etcd cluster (a rough sketch of this follows below)
  2. Query the apiserver ... and it should be failing
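
For reference, a minimal sketch of step 1, assuming systemd-managed etcd members and that the new --election-timeout / --heartbeat-interval values are already present in each member's unit or environment file (hostnames are placeholders):

for member in etcd-01 etcd-02 etcd-03 etcd-04 etcd-05; do
  ssh "$member" 'sudo systemctl restart etcd'
  # wait until the restarted member reports healthy before moving on
  ssh "$member" 'etcdctl cluster-health'
done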

Anything else we need to know:

Just ask :)

@k8s-github-robot

@javipolo There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 7, 2017
@javipolo
Author

javipolo commented Jun 7, 2017

@kubernetes/sig-api-machinery-misc I guess :)

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 7, 2017
@k8s-ci-robot
Contributor

@javipolo: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-misc.

In response to this:

@kubernetes/sig-api-machinery-misc I guess :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 7, 2017
@xiang90
Contributor

xiang90 commented Jun 7, 2017

@javipolo Can you provide more logging? Maybe a 10-minute window? The compaction error is expected, since the connection between etcd and the API server was cut off, but it should disappear once etcd is back and the resync has finished.

@javipolo
Author

javipolo commented Jun 7, 2017

Yep. Before, I provided only the ERROR part of the logs; now I'm including the INFO part as well. I'm leaving the logs from 16:50 until the restart. Around 17:45 was when we did the rolling update of etcd. If you need anything else just ask and I'll try to provide it :) (thanks!)

Log file created at: 2017/06/07 16:23:48
Running on machine: kubeapi-05
Binary: Built with gc go1.7.5 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
.
.
.
.
I0607 16:50:59.279231   19698 trace.go:61] Trace "List /api/v1/namespaces/search/events" (started 2017-06-07 16:50:58.733629042 +0200 CEST):
[12.69µs] [12.69µs] About to List from storage
[280.765661ms] [280.752971ms] Listing from storage done
[293.367569ms] [12.601908ms] Self-linking done
[545.53202ms] [252.164451ms] Writing http response done (4300 items)
"List /api/v1/namespaces/search/events" [545.534815ms] [2.795µs] END
E0607 17:45:11.393411   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.393624   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *extensions.Deployment ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.448063   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.448149   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.PersistentVolume ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.453552   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.453700   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.PodTemplate ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.476405   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.476536   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.PersistentVolumeClaim ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.476637   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.476735   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.ConfigMap ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.492509   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.492594   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.LimitRange ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.492611   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.492665   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.ResourceQuota ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.492806   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:11.492862   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.ServiceAccount ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:45:12.448461   19698 cacher.go:125] Terminating all watchers from cacher *api.PersistentVolume
W0607 17:45:12.476786   19698 cacher.go:125] Terminating all watchers from cacher *api.PersistentVolumeClaim
W0607 17:45:12.492813   19698 cacher.go:125] Terminating all watchers from cacher *api.ResourceQuota
W0607 17:45:12.492860   19698 cacher.go:125] Terminating all watchers from cacher *api.LimitRange
W0607 17:45:12.493052   19698 cacher.go:125] Terminating all watchers from cacher *api.ServiceAccount
W0607 17:45:12.499419   19698 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: watch of *api.ServiceAccount ended with: too old resource version: 8902015 (8925724)
W0607 17:45:12.499526   19698 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: watch of *api.ResourceQuota ended with: too old resource version: 8902015 (8925724)
W0607 17:45:12.499581   19698 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: watch of *api.LimitRange ended with: too old resource version: 8902015 (8925724)
E0607 17:45:52.439496   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:52.439634   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:52.440143   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:52.441236   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *batch.Job ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:45:52.441066   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *storage.StorageClass ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:45:52.441336   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.Secret ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:45:52.456094   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:45:52.457264   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.Service ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:45:53.441577   19698 cacher.go:125] Terminating all watchers from cacher *api.Secret
W0607 17:45:53.452317   19698 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: watch of *api.Secret ended with: too old resource version: 8902015 (8925793)
W0607 17:45:53.457554   19698 cacher.go:125] Terminating all watchers from cacher *api.Service
E0607 17:47:36.116540   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:47:36.116631   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.ReplicationController ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:47:36.141317   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:47:36.141439   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *extensions.Ingress ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:47:36.141463   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:47:36.141467   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:47:36.141521   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *apps.StatefulSet ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:47:36.141531   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *storage.StorageClass ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:47:37.116969   19698 cacher.go:125] Terminating all watchers from cacher *api.ReplicationController
W0607 17:47:37.141672   19698 cacher.go:125] Terminating all watchers from cacher *apps.StatefulSet
E0607 17:48:26.488736   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.489735   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.Namespace ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.501637   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.501656   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.501754   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *rbac.ClusterRoleBinding ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.501776   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *rbac.ClusterRole ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.502128   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.502178   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *rbac.RoleBinding ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.502311   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.502357   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *policy.PodDisruptionBudget ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.502602   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.502654   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *rbac.Role ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.560101   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.560191   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *autoscaling.HorizontalPodAutoscaler ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.560404   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.560563   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *api.ReplicationController ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.560592   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.560623   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *batch.CronJob ended with: etcdserver: mvcc: required revision has been compacted
E0607 17:48:26.560625   19698 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0607 17:48:26.560717   19698 reflector.go:323] k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *certificates.CertificateSigningRequest ended with: etcdserver: mvcc: required revision has been compacted
W0607 17:48:27.489933   19698 cacher.go:125] Terminating all watchers from cacher *api.Namespace
W0607 17:48:27.498083   19698 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: watch of *api.Namespace ended with: too old resource version: 8902015 (8926322)
W0607 17:49:32.064253   19698 storage_extensions.go:127] third party resource sync failed: the server cannot complete the requested operation at this time, try again later (get thirdpartyresources.extensions)
W0607 17:50:42.066257   19698 storage_extensions.go:127] third party resource sync failed: the server cannot complete the requested operation at this time, try again later (get thirdpartyresources.extensions)
I0607 17:51:14.583499   19698 trace.go:61] Trace "Create /apis/extensions/v1beta1/namespaces/search/daemonsets" (started 2017-06-07 17:50:44.581144781 +0200 CEST):
[1.332617ms] [1.332617ms] About to convert to expected version
[1.608209ms] [275.592µs] Conversion done
[1.644553ms] [36.344µs] About to store object in database
"Create /apis/extensions/v1beta1/namespaces/search/daemonsets" [30.00229701s] [30.000652457s] END
W0607 17:51:52.068431   19698 storage_extensions.go:127] third party resource sync failed: the server cannot complete the requested operation at this time, try again later (get thirdpartyresources.extensions)

@xiang90
Copy link
Contributor

xiang90 commented Jun 7, 2017

@javipolo

When did you restart the API server? Are there more logs like etcdserver: mvcc: required revision has been compacted after 17:51? I want to know whether the API server simply prints these errors forever, or whether they stop after you upgrade ALL of your etcd members.

@dims
Member

dims commented Jun 7, 2017

/cc @jbeda

@javipolo
Author

javipolo commented Jun 7, 2017

The logs I provided end when the apiserver was restarted, so no, there are no more of those errors after 17:51.

I can provide the restart times of the etcds and apiservers (the date/time is the same on every host):

etcd restart times

   Active: active (running) since Wed 2017-06-07 17:45:12 CEST; 1h 33min ago
   Active: active (running) since Wed 2017-06-07 17:48:27 CEST; 1h 30min ago
   Active: active (running) since Wed 2017-06-07 17:45:53 CEST; 1h 33min ago
   Active: active (running) since Wed 2017-06-07 17:46:43 CEST; 1h 32min ago
   Active: active (running) since Wed 2017-06-07 17:47:37 CEST; 1h 31min ago

apiserver restart times:

   Active: active (running) since Wed 2017-06-07 17:52:47 CEST; 1h 27min ago
   Active: active (running) since Wed 2017-06-07 17:52:47 CEST; 1h 27min ago
   Active: active (running) since Wed 2017-06-07 17:52:48 CEST; 1h 27min ago
   Active: active (running) since Wed 2017-06-07 17:52:48 CEST; 1h 27min ago
   Active: active (running) since Wed 2017-06-07 17:52:49 CEST; 1h 27min ago

@wangycc

wangycc commented Jun 8, 2017

I have encountered the same problem; it has been troubling me.

kubernetes version: v1.6.4
etcd version: 3.1.8

etcd status:

[root@ip-10-255-2-76 kubernetes]# etcdctl \
> --ca-file=/etc/kubernetes/ssl/ca.pem \
> --cert-file=/etc/kubernetes/ssl/kubernetes.pem \
> --key-file=/etc/kubernetes/ssl/kubernetes-key.pem \
> cluster-health
2017-06-08 15:09:07.272311 I | warning: ignoring ServerName for user-provided CA for 
backwards compatibility is deprecated
2017-06-08 15:09:07.273036 I | warning: ignoring ServerName for user-provided CA for    
backwards compatibility is deprecated 
member 1ccf784fffd9bd65 is healthy: got healthy result from https://10.255.2.174:2379
member 81eb13a8d04dcfee is healthy: got healthy result from https://10.255.2.76:2379
member a05de482e89e5998 is healthy: got healthy result from https://10.255.2.147:2379
cluster is healthy

kube-apiserver logs:

[root@ip-10-255-2-76 kubernetes]# tail -500f /data/logs/kubernetes/kube-apiserver.INFO
	Log file created at: 2017/06/08 14:32:51
	Running on machine: ip-10-255-2-76
	Binary: Built with gc go1.7.5 for linux/amd64
	Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
	I0608 14:32:51.608111    9597 aws.go:762] Building AWS cloudprovider
	I0608 14:32:51.608413    9597 aws.go:725] Zone not specified in configuration file; querying AWS metadata service
	I0608 14:32:51.776140    9597 tags.go:76] AWS cloud filtering on ClusterID: kubernetes
	E0608 14:32:52.320857    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.ClusterRole: Get https://10.255.2.76:6443/apis/rbac.authorization.k8s.io/v1beta1/clusterroles?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321118    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Secret: Get https://10.255.2.76:6443/api/v1/secrets?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321172    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *storage.StorageClass: Get https://10.255.2.76:6443/apis/storage.k8s.io/v1beta1/storageclasses?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321214    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ServiceAccount: Get https://10.255.2.76:6443/api/v1/serviceaccounts?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321249    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.LimitRange: Get https://10.255.2.76:6443/api/v1/limitranges?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321279    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Namespace: Get https://10.255.2.76:6443/api/v1/namespaces?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321317    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.RoleBinding: Get https://10.255.2.76:6443/apis/rbac.authorization.k8s.io/v1beta1/rolebindings?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321352    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.Role: Get https://10.255.2.76:6443/apis/rbac.authorization.k8s.io/v1beta1/roles?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.321384    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ResourceQuota: Get https://10.255.2.76:6443/api/v1/resourcequotas?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	E0608 14:32:52.367614    9597 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.ClusterRoleBinding: Get https://10.255.2.76:6443/apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings?resourceVersion=0: dial tcp 10.255.2.76:6443: getsockopt: connection refused
	I0608 14:32:52.650280    9597 serve.go:79] Serving securely on 10.255.2.76:6443
	I0608 14:32:52.650325    9597 serve.go:94] Serving insecurely on 10.255.2.76:8080
	I0608 14:32:53.671016    9597 trace.go:61] Trace "GuaranteedUpdate etcd3: *api.ConfigMap" (started 2017-06-08 14:32:53.098384831 +0800 CST):
	[32.359241ms] [32.359241ms] initial value restored
	[555.829144ms] [523.469903ms] Transaction prepared
	[572.583965ms] [16.754821ms] Transaction committed
	"GuaranteedUpdate etcd3: *api.ConfigMap" [572.599509ms] [15.544µs] END
	I0608 14:32:53.671102    9597 trace.go:61] Trace "Update /api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication" (started 2017-06-08 14:32:53.098317621 +0800 CST):
	[22.218µs] [22.218µs] About to convert to expected version
	[39.51µs] [17.292µs] Conversion done
	[42.677µs] [3.167µs] About to store object in database
	[572.712537ms] [572.66986ms] Object stored in database
	[572.716218ms] [3.681µs] Self-link added
	"Update /api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication" [572.763593ms] [47.375µs] END
	I0608 14:32:53.891199    9597 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-system
	I0608 14:32:53.925799    9597 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system
	I0608 14:32:53.973401    9597 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public
	I0608 14:32:54.011158    9597 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-system
	I0608 14:32:54.040216    9597 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system
	I0608 14:32:54.104971    9597 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public
	I0608 14:42:51.937522    9597 compact.go:159] etcd: compacted rev (404), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	I0608 14:47:51.945296    9597 compact.go:159] etcd: compacted rev (709), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	E0608 14:49:58.232233    9597 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
	I0608 14:52:51.953504    9597 compact.go:159] etcd: compacted rev (1012), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	I0608 14:57:51.962993    9597 compact.go:159] etcd: compacted rev (1311), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	E0608 14:58:13.284420    9597 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
	I0608 15:02:51.971159    9597 compact.go:159] etcd: compacted rev (1612), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	I0608 15:07:51.979360    9597 compact.go:159] etcd: compacted rev (1911), endpoints ([https://10.255.2.147:2379 https://10.255.2.76:2379 https://10.255.2.174:2379])
	E0608 15:07:58.354849    9597 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted  

@riclava

riclava commented Jul 4, 2017

+1

@Sartner

Sartner commented Jul 20, 2017

the same problem
etcd 3.1.7
k8s: 1.6.7

@edevil
Contributor

edevil commented Jul 26, 2017

I also see the same messages in the apiserver log; however, it's rare that I get timeouts on operations invoked against it.

K8s: 1.6.2
etcd: 3.1.10 && 3.0.17 (these messages appeared before I upgraded to 3.1.10 and are still there)

@timothysc
Member

/assign @hongchaodeng @xiang90

@timothysc timothysc added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jul 26, 2017
@timothysc timothysc added this to the v1.8 milestone Jul 26, 2017
@hongchaodeng
Contributor

OK. So this issue already exists. I can share my story as well.

Basically, we were doing some etcd scale-up and scale-down stress testing. We saw that the API server hung and some resources became unusable. I tried a dozen times and the results were erratic:

  • Mostly you could still read, but writes didn't work.
  • Not all resources hang, and not the same resources hang every time: sometimes it's endpoints, sometimes services, sometimes pods, third-party resources, etc.

I have tried to reproduce it with a scratch apiserver:

--etcd-servers=http://${lb_service_ip}:2379 --service-cluster-ip-range=10.3.0.0/24

I haven't reproduced it yet, so I suspect it needs some existing data.

Note that in my case it happened only when scaling down from 3 to 1. It didn't happen when scaling down from 2 to 1. I also tried scaling down slowly, waiting 60s before removing the second member, and it didn't happen then either.
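
For context, a hedged sketch of one manual way to perform such a scale-down with the etcd v3 API (the actual testing may have used different tooling; the endpoint and member IDs below are placeholders taken from member list):

ETCDCTL_API=3 etcdctl --endpoints=http://etcd-01:2379 member list
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-01:2379 member remove 8211f1d0f64f3269
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-01:2379 member remove 91bc3c398fb3c146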

@xigang
Contributor

xigang commented Aug 9, 2017

@xiang90 @hongchaodeng
the same problem.

etcd version:

etcd Version: 3.1.0
Git SHA: 8ba2897
Go Version: go1.7.4
Go OS/Arch: linux/amd64

kubernetes version:

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

show kube-apiserver logs:

Aug 09 10:50:56 kube-apiserver[62362]: E0809 10:50:56.548757   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:38 kube-apiserver[62362]: E0809 10:51:38.087021   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.890132   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.890135   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.890136   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.890191   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.915202   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.915298   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.915311   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Aug 09 10:51:55 kube-apiserver[62362]: E0809 10:51:55.915315   62362 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted

Will this issue affect the production environment?

thx

@zxpower

zxpower commented Aug 14, 2017

Similar problem (see log below).

In short: I'm installing a new K8s master node based on this guide: https://coreos.com/kubernetes/docs/latest/getting-started.html. When I try to start the kubelet service on it, everything starts up, but the apiserver keeps crashing (?). Because of that, the worker node can't register itself and the master node isn't fully working.

CoreOS Linux: 1492.4.0
etcd version: 3.2.5 (cluster from 4 nodes)
kubernetes version 1.6.1_coreos.0

E0814 14:32:14.665943       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ResourceQuota: Get https://localhost:443/api/v1/resourcequotas?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 14:32:14.665943       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Secret: Get https://localhost:443/api/v1/secrets?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 14:32:14.665943       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Namespace: Get https://localhost:443/api/v1/namespaces?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 14:32:14.665947       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.LimitRange: Get https://localhost:443/api/v1/limitranges?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 14:32:14.665950       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *storage.StorageClass: Get https://localhost:443/apis/storage.k8s.io/v1beta1/storageclasses?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 14:32:14.665950       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ServiceAccount: Get https://localhost:443/api/v1/serviceaccounts?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
[restful] 2017/08/14 14:32:14 log.go:30: [restful/swagger] listing is available at https://192.168.178.155:443/swaggerapi/
[restful] 2017/08/14 14:32:14 log.go:30: [restful/swagger] https://192.168.178.155:443/swaggerui/ is mapped to folder /swagger-ui/
I0814 14:32:14.740919       1 serve.go:79] Serving securely on 0.0.0.0:443
I0814 14:32:14.741036       1 serve.go:94] Serving insecurely on 127.0.0.1:8080
I0814 14:32:25.189208       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/pods" (started 2017-08-14 14:32:15.188789893 +0000 UTC):
[25.862µs] [25.862µs] About to convert to expected version
[138.138µs] [112.276µs] Conversion done
"Create /api/v1/namespaces/kube-system/pods" [10.00036266s] [10.000224522s] END
I0814 14:32:29.307733       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/pods" (started 2017-08-14 14:32:19.307339406 +0000 UTC):
[50.67µs] [50.67µs] About to convert to expected version
[178.618µs] [127.948µs] Conversion done
"Create /api/v1/namespaces/kube-system/pods" [10.000361s] [10.000182382s] END
I0814 14:32:43.639855       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/pods" (started 2017-08-14 14:32:33.639564997 +0000 UTC):
[27.653µs] [27.653µs] About to convert to expected version
[109.672µs] [82.019µs] Conversion done
"Create /api/v1/namespaces/kube-system/pods" [10.000262566s] [10.000152894s] END

Occasionally it runs further:

E0814 15:43:13.660251       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *storage.StorageClass: Get https://localhost:443/apis/storage.k8s.io/v1beta1/storageclasses?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 15:43:13.660252       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Secret: Get https://localhost:443/api/v1/secrets?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 15:43:13.660255       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ResourceQuota: Get https://localhost:443/api/v1/resourcequotas?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 15:43:13.660257       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.LimitRange: Get https://localhost:443/api/v1/limitranges?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 15:43:13.660260       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ServiceAccount: Get https://localhost:443/api/v1/serviceaccounts?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0814 15:43:13.660260       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Namespace: Get https://localhost:443/api/v1/namespaces?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
[restful] 2017/08/14 15:43:13 log.go:30: [restful/swagger] listing is available at https://192.168.178.155:443/swaggerapi/
[restful] 2017/08/14 15:43:13 log.go:30: [restful/swagger] https://192.168.178.155:443/swaggerui/ is mapped to folder /swagger-ui/
I0814 15:43:13.734803       1 serve.go:79] Serving securely on 0.0.0.0:443
I0814 15:43:13.734949       1 serve.go:94] Serving insecurely on 127.0.0.1:8080
I0814 15:43:24.910942       1 trace.go:61] Trace "Create /api/v1/namespaces/default/events" (started 2017-08-14 15:43:14.910564996 +0000 UTC):
[21.041µs] [21.041µs] About to convert to expected version
[61.814µs] [40.773µs] Conversion done
"Create /api/v1/namespaces/default/events" [10.000326549s] [10.000264735s] END
I0814 15:43:34.912092       1 trace.go:61] Trace "Create /api/v1/namespaces/default/events" (started 2017-08-14 15:43:24.911732735 +0000 UTC):
[19.587µs] [19.587µs] About to convert to expected version
[48.974µs] [29.387µs] Conversion done
"Create /api/v1/namespaces/default/events" [10.000322522s] [10.000273548s] END
I0814 15:43:43.762259       1 trace.go:61] Trace "Create /api/v1/namespaces" (started 2017-08-14 15:43:13.761803275 +0000 UTC):
[23.309µs] [23.309µs] About to convert to expected version
[86.721µs] [63.412µs] Conversion done
[98.387µs] [11.666µs] About to store object in database
"Create /api/v1/namespaces" [30.000423311s] [30.000324924s] END
E0814 15:43:43.762743       1 client_ca_hook.go:58] Timeout: request did not complete within allowed duration
I0814 15:43:44.913107       1 trace.go:61] Trace "Create /api/v1/namespaces/default/events" (started 2017-08-14 15:43:34.912799142 +0000 UTC):
[19.308µs] [19.308µs] About to convert to expected version
[50.372µs] [31.064µs] Conversion done
"Create /api/v1/namespaces/default/events" [10.000280805s] [10.000230433s] END
I0814 15:43:45.902091       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/pods" (started 2017-08-14 15:43:35.90159532 +0000 UTC):
[44.964µs] [44.964µs] About to convert to expected version
[179.268µs] [134.304µs] Conversion done
"Create /api/v1/namespaces/kube-system/pods" [10.000458778s] [10.00027951s] END
I0814 15:43:49.236026       1 trace.go:61] Trace "Create /api/v1/nodes" (started 2017-08-14 15:43:19.235552055 +0000 UTC):
[39.003µs] [39.003µs] About to convert to expected version
[118.204µs] [79.201µs] Conversion done
[129.321µs] [11.117µs] About to store object in database
"Create /api/v1/nodes" [30.000421849s] [30.000292528s] END
I0814 15:43:50.501101       1 trace.go:61] Trace "Create /api/v1/nodes" (started 2017-08-14 15:43:20.500588234 +0000 UTC):
[58.037µs] [58.037µs] About to convert to expected version
[176.001µs] [117.964µs] Conversion done
[181.311µs] [5.31µs] About to store object in database
"Create /api/v1/nodes" [30.000478595s] [30.000297284s] END
I0814 15:43:54.914119       1 trace.go:61] Trace "Create /api/v1/namespaces/default/events" (started 2017-08-14 15:43:44.913818764 +0000 UTC):
[18.28µs] [18.28µs] About to convert to expected version
[44.975µs] [26.695µs] Conversion done
"Create /api/v1/namespaces/default/events" [10.000262551s] [10.000217576s] END
I0814 15:44:02.669816       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/pods" (started 2017-08-14 15:43:52.669523306 +0000 UTC):
[25.546µs] [25.546µs] About to convert to expected version
[95.48µs] [69.934µs] Conversion done
"Create /api/v1/namespaces/kube-system/pods" [10.000261635s] [10.000166155s] END

@zxpower

zxpower commented Aug 16, 2017

Update on the situation: I solved my issue. It turns out that if you are running an etcd3 cluster with TLS, then in the kube-apiserver.yaml file you should add the cert, key and CA file locations; otherwise the API server isn't able to connect and store information in etcd. I only found that out by adding --v=4 for verbose logging in kube-apiserver.yaml, after which I discovered that all the URLs kube-apiserver tries to access return a 403 error code.
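
For anyone hitting the same thing, these are roughly the kube-apiserver flags involved (a sketch only; the endpoints and file paths are examples, adjust them to your own manifest):

kube-apiserver \
  --etcd-servers=https://etcd-01:2379,https://etcd-02:2379,https://etcd-03:2379 \
  --etcd-cafile=/etc/kubernetes/ssl/ca.pem \
  --etcd-certfile=/etc/kubernetes/ssl/kubernetes.pem \
  --etcd-keyfile=/etc/kubernetes/ssl/kubernetes-key.pem \
  ...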

@abel-von

abel-von commented Aug 23, 2017

Any progress on this issue? We have also encountered it, and even worse, some update operations failed. The log is:

E0812 19:02:48.471971      24 status.go:62] apiserver received an error that is not an metav1.Status: rpc error: code = 13 desc = transport is closing
E0812 19:02:48.536491      24 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
W0812 19:02:48.536547      24 reflector.go:323] vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: watch of *paas.Stack ended with: etcdserver: mvcc: required revision has been compacted

This has happened twice in our environment, and we didn't find the conditions to reproduce it.

k8s version: v1.6.0
etcd version: v3.1.9

@tbchj

tbchj commented Aug 29, 2017

I also have the same problem. Any solution? I checked the logs and found many lines like:
E0812 19:02:48.536491 24 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted

k8s version: 1.7.2
etcd version: 3.0.15

I also have a strange question.
I created some pods, but etcd records nothing except the network config.
Using "etcdctl ls /", I don't see any records for the pods I created.
My colleagues are also experiencing this problem and haven't resolved it.
But when I changed the k8s version to 1.3.10, the pod records do exist in etcd (using "etcdctl ls /"). Why?
Need help, waiting...

@liggitt
Member

liggitt commented Aug 29, 2017

etcdctl ls / uses the etcd v2 API, which is not what kubernetes 1.6+ uses by default

ETCDCTL_API=3 etcdctl get --prefix / uses the etcd v3 API, which will show you your kubernetes data
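
For example (a sketch; the TLS flags and file paths are only examples and only needed if your cluster is secured):

# v2 API: will not show data written by Kubernetes 1.6+
etcdctl ls /

# v3 API: shows the keys Kubernetes actually writes
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/kubernetes/ssl/kubernetes.pem \
  --key=/etc/kubernetes/ssl/kubernetes-key.pem \
  get --prefix / --keys-only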

@roberthbailey roberthbailey removed the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Aug 29, 2017
@tbchj

tbchj commented Aug 30, 2017

@liggitt thank you, it works.
This is the same question as #43893.

@luxas luxas added this to the v1.9 milestone Nov 17, 2017
@luxas luxas added status/approved-for-milestone and removed milestone/incomplete-labels kind/feature Categorizes issue or PR as related to a new feature. labels Nov 17, 2017
@luxas
Member

luxas commented Nov 17, 2017

cc @kubernetes/kubernetes-release-managers as this is most probably a show-stopper for v1.9; we will likely consider blocking the release on fixing this issue.

@jpbetz
Contributor

jpbetz commented Nov 17, 2017

@luxas That's the basic set of steps planned, yes. Note that we're planning to merge the work to the k8s 1.10 branch, and then cherry-pick it to a 1.9.x release once it proves stable; we're not targeting the initial 1.9 release.

I'll be picking up the engineering work on this. I'll link to an issue explaining it in more detail soon.

@enisoc
Member

enisoc commented Nov 17, 2017

@jpbetz Thanks for clarifying. Sounds like fun for @mbohlool when he gets back from vacation.

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Removed From Issue

@hongchaodeng @javipolo @xiang90 @kubernetes/sig-api-machinery-misc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-scalability-misc

Important: Code freeze is in effect and only issues with priority/critical-urgent may remain in the v1.9 milestone.

Help

@k8s-github-robot k8s-github-robot removed this from the v1.9 milestone Nov 22, 2017
@jianglingxia
Contributor

I also have the same problem. I'd like to know what the cause of the problem is. @xiang90
k8s version is 1.6.9, thanks.
The api-server log looks like this:
I1123 03:11:02.243886 1185 trace.go:61] Trace "Update /api/v1/namespaces/admin/pods/f1a32f2d-a43a-4a74-ac35-a4936fb070be-r2qsd/status" (started 2017-11-23 03:11:01.710184949 +0800 CST):
[69.492µs] [69.492µs] About to convert to expected version
[384.168µs] [314.676µs] Conversion done
[390.729µs] [6.561µs] About to store object in database
[533.50445ms] [533.113721ms] Object stored in database
[533.511425ms] [6.975µs] Self-link added
"Update /api/v1/namespaces/admin/pods/f1a32f2d-a43a-4a74-ac35-a4936fb070be-r2qsd/status" [533.65373ms] [142.305µs] END
E1123 03:11:14.142108 1185 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
I1123 03:13:58.495456 1185 trace.go:61] Trace "List /api/v1/events" (started 2017-11-23 03:13:57.985830247 +0800 CST):
[9.582µs] [9.582µs] About to List from storage
[190.568834ms] [190.559252ms] Listing from storage done
[276.924895ms] [86.356061ms] Self-linking done
[509.573188ms] [232.648293ms] Writing http response done (5721 items)
"List /api/v1/events" [509.574659ms] [1.471µs] END
I1123 03:14:09.477836 1185 compact.go:159] etcd: compacted rev (6740640), endpoints ([http://90.169.0.165:2379])
I1123 03:19:09.546562 1185 compact.go:159] etcd: compacted rev (6746029), endpoints ([http://90.169.0.165:2379])
I1123 03:24:09.615500 1185 compact.go:159] etcd: compacted rev (6751463), endpoints ([http://90.169.0.165:2379])
E1123 03:24:51.219757 1185 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
I1123 03:29:10.332547 1185 trace.go:61] Trace "GuaranteedUpdate etcd3: *extensions.DaemonSet" (started 2017-11-23 03:29:09.750180607 +0800 CST):
[143.417µs] [143.417µs] initial value restored
[494.675µs] [351.258µs] Transaction prepared
[582.174725ms] [581.68005ms] Transaction committed

@sgmiller

sgmiller commented Dec 8, 2017

Any update on this? We've had to implement a brute-force restart of kube-apiserver regularly, as we roll the etcd credentials frequently (every couple of days) in a deployment with short-lived TLS credentials from Vault. We weren't sure what was causing the apiserver to become unresponsive, but every symptom in this issue rings true. K8s 1.7.8, etcd 3.2.6.
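
A rough sketch of the kind of brute-force restart described, run after each credential roll (hostnames are placeholders; the actual automation may differ):

for master in master-1 master-2 master-3; do
  ssh "$master" 'sudo systemctl restart kube-apiserver'
done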

@jpbetz
Contributor

jpbetz commented Dec 8, 2017

@sgmiller etcd v3.2.11 has just been released and contains multiple fixes in the etcd client's use of gRPC. We plan to upgrade Kubernetes to use the new etcd client; just upgrading the etcd server(s) to 3.2.11 is not sufficient here, we need to upgrade the client code vendored into Kubernetes as well. See #56114 for details.

I should warn that the bulk of the reports found in this (lengthy) issue provide insufficient detail for us to say for certain whether they all have the same root cause, or whether the fixes in etcd 3.2.11 will resolve them. But they are closely enough related that we'd like to push out these changes and get feedback on what (if any) of these types of issues remain.

If you have any more details you can provide about the specific issues you've encountered, please let us know.

@smarterclayton
Contributor

Note that #57061 is very likely grpc/grpc-go#1005, which was fixed after v1.0.4. Users on Kube 1.6 or 1.7 would be broken against etcd 3.2 servers at very large loads.

@rayterrill

rayterrill commented Jan 5, 2018

Is it possible to run into this issue on a new Kubernetes install with 1.9.1?

Stuck here:

 sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.9.1
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
        [WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Starting the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [popkubem kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.31.29.60]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
[init] This might take a minute or longer if the control plane images have to be pulled.

It looks like the API server is failing to fully deploy, and I don't see an etcd container. Should I?

sudo docker logs 7ac8aa056296
I0105 22:38:06.981611       1 server.go:121] Version: v1.9.1
I0105 22:38:07.271031       1 feature_gate.go:184] feature gates: map[Initializers:true]
I0105 22:38:07.271056       1 initialization.go:90] enabled Initializers feature as part of admission plugin setup
I0105 22:38:07.283529       1 master.go:225] Using reconciler: master-count
W0105 22:38:07.322097       1 genericapiserver.go:342] Skipping API batch/v2alpha1 because it has no resources.
W0105 22:38:07.332187       1 genericapiserver.go:342] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
W0105 22:38:07.333085       1 genericapiserver.go:342] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
W0105 22:38:07.345359       1 genericapiserver.go:342] Skipping API admissionregistration.k8s.io/v1alpha1 because it has no resources.
[restful] 2018/01/05 22:38:07 log.go:33: [restful/swagger] listing is available at https://172.31.29.60:6443/swaggerapi
[restful] 2018/01/05 22:38:07 log.go:33: [restful/swagger] https://172.31.29.60:6443/swaggerui/ is mapped to folder /swagger-ui/
[restful] 2018/01/05 22:38:08 log.go:33: [restful/swagger] listing is available at https://172.31.29.60:6443/swaggerapi
[restful] 2018/01/05 22:38:08 log.go:33: [restful/swagger] https://172.31.29.60:6443/swaggerui/ is mapped to folder /swagger-ui/
I0105 22:38:10.984334       1 serve.go:89] Serving securely on [::]:6443
I0105 22:38:10.988383       1 apiservice_controller.go:112] Starting APIServiceRegistrationController
I0105 22:38:10.988395       1 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
I0105 22:38:10.988492       1 available_controller.go:262] Starting AvailableConditionController
I0105 22:38:10.988500       1 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
I0105 22:38:10.989071       1 crd_finalizer.go:242] Starting CRDFinalizer
I0105 22:38:10.989563       1 controller.go:84] Starting OpenAPI AggregationController
I0105 22:38:11.059043       1 naming_controller.go:274] Starting NamingConditionController
I0105 22:38:11.088902       1 crdregistration_controller.go:110] Starting crd-autoregister controller
I0105 22:38:11.088966       1 controller_utils.go:1019] Waiting for caches to sync for crd-autoregister controller
I0105 22:38:11.089024       1 customresource_discovery_controller.go:152] Starting DiscoveryController
I0105 22:38:25.045005       1 trace.go:76] Trace[723418120]: "Create /api/v1/namespaces/default/events" (started: 2018-01-05 22:38:11.044239793 +0000 UTC m=+4.207731430) (total time: 14.000729847s):
Trace[723418120]: [14.000729847s] [14.000687312s] END
I0105 22:38:35.047389       1 trace.go:76] Trace[1319983526]: "Create /api/v1/namespaces/default/events" (started: 2018-01-05 22:38:25.047018697 +0000 UTC m=+18.210510289) (total time: 10.000340856s):
Trace[1319983526]: [10.000340856s] [10.000275818s] END
I0105 22:38:45.044971       1 trace.go:76] Trace[200541705]: "Create /api/v1/nodes" (started: 2018-01-05 22:38:11.043822246 +0000 UTC m=+4.207313825) (total time: 34.001110155s):
Trace[200541705]: [4.000724928s] [4.000618881s] About to store object in database
Trace[200541705]: [34.001110155s] [30.000385227s] END
I0105 22:38:45.050026       1 trace.go:76] Trace[1283603501]: "Create /api/v1/namespaces/default/events" (started: 2018-01-05 22:38:35.049574339 +0000 UTC m=+28.213065973) (total time: 10.000418685s):
Trace[1283603501]: [10.000418685s] [10.000173571s] END
I0105 22:38:45.986463       1 trace.go:76] Trace[295087653]: "Create /api/v1/namespaces" (started: 2018-01-05 22:38:11.985283044 +0000 UTC m=+5.148774636) (total time: 34.001141453s):
Trace[295087653]: [4.000683598s] [4.000621727s] About to store object in database
Trace[295087653]: [34.001141453s] [30.000457855s] END
E0105 22:38:45.987123       1 client_ca_hook.go:78] Timeout: request did not complete within allowed duration

@notxcain

We see something similar when doing a hard power-off of a VM on which one of the etcd nodes is running. K8s version 1.9.1.

@xiang90
Contributor

xiang90 commented Feb 12, 2018

This problem should be solved, or at least mitigated, by #57160. The PR bumps both gRPC and the etcd client to fix the timeout problem caused by connection resets and balancing.

I am closing out this issue. If anyone still sees the timeout problem in a release (after 1.10) with #57160, please create a new issue with reproduction steps.

@xiang90 xiang90 closed this as completed Feb 12, 2018
@notxcain

@xiang90 any chance this gets backported to 1.9.x?

@xiang90
Contributor

xiang90 commented Feb 14, 2018

@notxcain

I do not manage the releases. @jpbetz might have an answer for you.

@jpbetz
Contributor

jpbetz commented Feb 14, 2018

I agree, #57160 should resolve this. Unfortunately, the PR introduces a lot of change. It bumps the versions of multiple major dependencies (etcd, docker, containerd, grpc, ...). I do not believe it is well suited for backport to stable kubernetes branches at this time. After 1.10 is released and this has had some time to bake in production, I'd be open to revisiting this, but for now let's limit it to k8s 1.10.

@xarses

xarses commented Jun 15, 2018

We've recently updated to etcd 3.1.13 and have kube on 1.6.13. Since moving onto the v3 etcd store we appear to have been hit hard by this; we are working our way towards 1.10.x but are a few months from getting there, especially since the api-server lockups are distracting from moving forward. I see that #56114 should improve the problem, and I was wondering if we would get most of the benefit if we introduced a gRPC proxy from 3.2.11 (or newer) onto the host with the api-server, proxying back to the etcd servers.

Digging through https://kubernetes.io/docs/imported/release/notes/ I see 'Upgrade to etcd client 3.2.13 and grpc 1.7.5 to improve HA etcd cluster stability. (#57480, ...)' and 'The supported etcd server version is 3.1.12, as compared to 3.0.17 in v1.9 (#60988)', so I see strong evidence that the client can talk to our etcd server.

It seems viable to switch to the gRPC proxy to avoid this. I'm looking for confirmation that the proxy would use the load balancer as updated in etcd-io/etcd#8840.
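
A rough sketch of the setup I have in mind (endpoints and addresses are placeholders; TLS between the proxy and the cluster would need the corresponding --cacert/--cert/--key flags):

# run a newer etcd gRPC proxy locally on each api-server host
etcd grpc-proxy start \
  --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379 \
  --listen-addr=127.0.0.1:23790

# then point kube-apiserver at the local proxy
kube-apiserver --etcd-servers=http://127.0.0.1:23790 ...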
