openshift-api issues breaking cluster #21612

Closed
kikisdeliveryservice opened this issue Dec 4, 2018 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kikisdeliveryservice

kikisdeliveryservice commented Dec 4, 2018

There are currently openshift-api issues that break the cluster and render it unusable.

Steps To Reproduce
  1. run the installer to get a cluster running
  2. see the failures
Current Result

openshift-apiserver is down with various connection refused errors:
E1204 19:21:10.940510 1 memcache.go:147] couldn't get resource list for samplesoperator.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/samplesoperator.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.948714 1 memcache.go:147] couldn't get resource list for servicecertsigner.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/servicecertsigner.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.953465 1 memcache.go:147] couldn't get resource list for tuned.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/tuned.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.805850 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.RoleBinding: Get https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818657 1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.LimitRange: Get https://172.30.0.1:443/api/v1/limitranges?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818742 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
Cannot SSH into the master; there are TCP/connection errors throughout the pods, and for a time oc and kubectl commands also do not work. Eventually most of the pods break with CrashLoop errors.
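
For anyone triaging the same symptom, a rough sketch of narrowing down whether the kubernetes service IP (172.30.0.1:443) is actually backed by a live apiserver, assuming oc still responds intermittently (the grep is just a convenience; nothing here is specific to this cluster):

oc get endpoints kubernetes -n default -o wide               # are there healthy endpoints behind 172.30.0.1:443?
oc get pods --all-namespaces -o wide | grep -i apiserver     # which apiserver pods exist and in what state?
oc get nodes -o wide                                         # are the masters Ready or flapping?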

Expected Result

I don't expect any errors.

Additional Information

[try to run $ oc adm diagnostics (or oadm diagnostics) command if possible]
[if you are reporting issue related to builds, provide build logs with BUILD_LOGLEVEL=5]
[consider attaching output of the $ oc get all -o json -n <namespace> command to the issue]
[visit https://docs.openshift.org/latest/welcome/index.html]
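
A concrete version of what the template asks for, in case it helps the next reporter (a sketch; the namespace below is only an example, substitute the affected one):

oc adm diagnostics                            # `oadm diagnostics` on older clients
oc get all -o json -n openshift-apiserver     # example namespace; attach the output to the issue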

@ashcrow
Member

ashcrow commented Dec 4, 2018

This issue came out of a conversation with @deads2k and @brancz.

@deads2k
Contributor

deads2k commented Dec 4, 2018

When your server comes back up (it will crash and try to recover), quickly run:

oc label ns openshift-monitoring 'openshift.io/run-level=1'
oc create quota -nopenshift-monitoring stoppods --hard=pods=0
oc -n openshift-monitoring delete pods --all

You should then end up with a stable cluster, just one without any monitoring.
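
A quick way to confirm the workaround took hold (a sketch; the label, quota, and namespace names come from the commands above):

oc get ns openshift-monitoring --show-labels            # openshift.io/run-level=1 should be present
oc get resourcequota stoppods -n openshift-monitoring   # hard limit of pods=0
oc get pods -n openshift-monitoring                     # should eventually drain to no pods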

@ashcrow ashcrow added the kind/bug Categorizes issue or PR as related to a bug. label Dec 4, 2018
@derekwaynecarr
Member

@wking
Member

wking commented Dec 5, 2018

we can bump domain memory for masters to 4Gi
https://github.com/openshift/installer/blob/ee73f72017bdfe681629e8d3f41cb5ae1d5b4775/pkg/asset/machines/libvirt/machines.go#L70

Currently only the installer provisions masters (because once the cluster is running you'd need manual intervention to attach new etcd nodes). And installer-launched masters got bumped to 4GB in openshift/installer#785 (just landed).
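
For libvirt clusters created before that bump, one way to check what a master domain actually has (a sketch; the domain name pattern is a guess and depends on your cluster name):

virsh list --all                          # find the master domains
virsh dominfo <cluster-name>-master-0     # placeholder name; "Max memory" should be at least 4 GiB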

@deads2k
Contributor

deads2k commented Dec 6, 2018

should be resolved now

@kikisdeliveryservice
Author

kikisdeliveryservice commented Dec 6, 2018

Reopening because I'm still seeing these errors and TCP timeouts.
Seeing the following on openshift-apiserver:

E1206 23:15:36.782303       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.783546       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.785915       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.786749       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.788326       1 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789024       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789890       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.791221       1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.792344       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.793448       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.794602       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.795432       1 memcache.go:147] couldn't get resource list for packages.apps.redhat.com/v1alpha1: the server could not find the requested resource
E1206 23:15:46.170716       1 watch.go:212] unable to encode watch object: expected pointer, but got invalid kind
E1206 23:15:46.840537       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.843581       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.849583       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.850103       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.851487       1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.852618       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.853553       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.854751       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.909115       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.921756       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.935902       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.959769       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
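
Those "unable to handle the request" and "could not find the requested resource" messages come from the aggregation layer, so one useful next step is to check which APIService registrations are unavailable (a sketch, assuming oc access; the openshift-apiserver namespace is an assumption and may differ by release):

oc get apiservices                              # look for entries with AVAILABLE=False
oc describe apiservice v1.apps.openshift.io     # conditions explain why it is unavailable
oc get pods -n openshift-apiserver -o wide      # are the backing apiserver pods running?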

@wking
Member

wking commented Dec 8, 2018

The current API instability may be a symptom of some underlying master instability. I don't know what's going on yet, but in a recent CI run, there was a running die-off of pods before the machine-config daemon pulled the plug and rebooted the node. Notes for the MCD part in openshift/machine-config-operator#224. Notes on etcd-member (the first pod to die) in openshift/installer#844. I don't know what's going on there, but I can certainly see occasional master reboots causing connectivity issues like these.

@wking
Member

wking commented Dec 14, 2018

I think this might have been resolved by openshift/machine-config-operator#225. Can anyone still reproduce? If not, can we close this?

@mslovy

mslovy commented Jan 4, 2019

I still see a similar error in origin-template-service-broker:

E0104 16:14:25.634855 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request

E0104 16:15:56.017707 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:16:26.196299 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:18:26.868674 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:19:11.850715 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:19:27.129940 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:20:57.681989 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:27.845002 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:57.932396 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:23:28.291818 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:25:28.747637 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:26:59.112399 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:27:59.312813 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:28:04.023096 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:31:30.049245 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:32:30.262111 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:33:00.432617 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:35:00.870254 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:36:01.147749 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request

On the command line, I get a 503 error:
[root@openshift-master-1 ~]# oc get clusterserviceclass -n kube-service-catalog --loglevel=8
I0105 00:39:26.688956 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.690015 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.699546 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.713030 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.714057 20712 round_trippers.go:383] GET https://openshift.sunhocapital.com:8443/apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses?limit=500
I0105 00:39:26.714277 20712 round_trippers.go:390] Request Headers:
I0105 00:39:26.714498 20712 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0105 00:39:26.714684 20712 round_trippers.go:393] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I0105 00:39:26.742742 20712 round_trippers.go:408] Response Status: 503 Service Unavailable in 27 milliseconds
I0105 00:39:26.743036 20712 round_trippers.go:411] Response Headers:
I0105 00:39:26.743222 20712 round_trippers.go:414] Cache-Control: no-store
I0105 00:39:26.743457 20712 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0105 00:39:26.743489 20712 round_trippers.go:414] X-Content-Type-Options: nosniff
I0105 00:39:26.743530 20712 round_trippers.go:414] Content-Length: 20
I0105 00:39:26.743544 20712 round_trippers.go:414] Date: Fri, 04 Jan 2019 16:39:26 GMT
I0105 00:39:26.743643 20712 request.go:897] Response Body: service unavailable
No resources found.
I0105 00:39:26.743843 20712 helpers.go:201] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "servicecatalog.k8s.io",
    "kind": "clusterserviceclasses",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "service unavailable"
      }
    ]
  },
  "code": 503
}]
F0105 00:39:26.744007 20712 helpers.go:119] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)
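
The 503 points at the aggregated servicecatalog API rather than the core apiserver, so checking its APIService registration and backing pods is a reasonable next step (a sketch; the namespace is taken from the command above, the pod name is a placeholder):

oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # status conditions show why it is unavailable
oc get pods -n kube-service-catalog -o wide               # is the catalog apiserver pod running and ready?
oc logs -n kube-service-catalog <apiserver-pod-name>      # <apiserver-pod-name> is a placeholder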

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2019
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 4, 2019
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dlakatos847

Issue still present with v3.11 on openSUSE Tumbleweed.
