openshift-api issues breaking cluster #21612

Closed
kikisdeliveryservice opened this issue Dec 4, 2018 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kikisdeliveryservice

kikisdeliveryservice commented Dec 4, 2018

There are currently openshift-api issues that break the cluster and render it unusable.

Steps To Reproduce
  1. run the installer to get a cluster running
  2. see the failures
Current Result

openshift-apiserver is down with various connection refused errors:
E1204 19:21:10.940510 1 memcache.go:147] couldn't get resource list for samplesoperator.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/samplesoperator.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.948714 1 memcache.go:147] couldn't get resource list for servicecertsigner.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/servicecertsigner.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.953465 1 memcache.go:147] couldn't get resource list for tuned.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/tuned.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.805850 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.RoleBinding: Get https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818657 1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.LimitRange: Get https://172.30.0.1:443/api/v1/limitranges?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818742 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
Cannot SSH into the master; there are TCP/connection errors throughout the pods, and for a time oc and kubectl commands also do not work. Eventually most of the pods break with CrashLoop errors.
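
For anyone triaging the same symptom, a rough sketch of narrowing down whether the kubernetes service IP (172.30.0.1:443) is actually backed by a live apiserver, assuming oc still responds intermittently (the grep is just a convenience; nothing here is specific to this cluster):

oc get endpoints kubernetes -n default -o wide               # are there healthy endpoints behind 172.30.0.1:443?
oc get pods --all-namespaces -o wide | grep -i apiserver     # which apiserver pods exist and in what state?
oc get nodes -o wide                                         # are the masters Ready or flapping?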

Expected Result

I don't expect any errors.

Additional Information

[try to run $ oc adm diagnostics (or oadm diagnostics) command if possible]
[if you are reporting issue related to builds, provide build logs with BUILD_LOGLEVEL=5]
[consider attaching output of the $ oc get all -o json -n <namespace> command to the issue]
[visit https://docs.openshift.org/latest/welcome/index.html]
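
A concrete version of what the template asks for, in case it helps the next reporter (a sketch; the namespace below is only an example, substitute the affected one):

oc adm diagnostics                            # `oadm diagnostics` on older clients
oc get all -o json -n openshift-apiserver     # example namespace; attach the output to the issue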

@ashcrow
Member

ashcrow commented Dec 4, 2018

This issue came out of a conversation with @deads2k and @brancz.

@deads2k
Contributor

deads2k commented Dec 4, 2018

When your server comes back up (it will crash and try to recover), quickly run:

oc label ns openshift-monitoring 'openshift.io/run-level=1'
oc create quota -nopenshift-monitoring stoppods --hard=pods=0
oc -n openshift-monitoring delete pods --all

You should then end up with a stable cluster, just one without any monitoring.
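
A quick way to confirm the workaround took hold (a sketch; the label, quota, and namespace names come from the commands above):

oc get ns openshift-monitoring --show-labels            # openshift.io/run-level=1 should be present
oc get resourcequota stoppods -n openshift-monitoring   # hard limit of pods=0
oc get pods -n openshift-monitoring                     # should eventually drain to no pods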

@ashcrow ashcrow added the kind/bug Categorizes issue or PR as related to a bug. label Dec 4, 2018
@derekwaynecarr
Member

@wking
Member

wking commented Dec 5, 2018

we can bump domain memory for masters to 4Gi
https://github.com/openshift/installer/blob/ee73f72017bdfe681629e8d3f41cb5ae1d5b4775/pkg/asset/machines/libvirt/machines.go#L70

Currently only the installer provisions masters (because once the cluster is running you'd need manual intervention to attach new etcd nodes). And installer-launched masters got bumped to 4GB in openshift/installer#785 (just landed).
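
For libvirt clusters created before that bump, one way to check what a master domain actually has (a sketch; the domain name pattern is a guess and depends on your cluster name):

virsh list --all                          # find the master domains
virsh dominfo <cluster-name>-master-0     # placeholder name; "Max memory" should be at least 4 GiB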

@deads2k
Contributor

deads2k commented Dec 6, 2018

should be resolved now

@kikisdeliveryservice
Author

kikisdeliveryservice commented Dec 6, 2018

Reopening because I'm still seeing these errors and TCP timeouts.
Seeing the following on openshift-apiserver:

E1206 23:15:36.782303       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.783546       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.785915       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.786749       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.788326       1 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789024       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789890       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.791221       1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.792344       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.793448       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.794602       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.795432       1 memcache.go:147] couldn't get resource list for packages.apps.redhat.com/v1alpha1: the server could not find the requested resource
E1206 23:15:46.170716       1 watch.go:212] unable to encode watch object: expected pointer, but got invalid kind
E1206 23:15:46.840537       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.843581       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.849583       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.850103       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.851487       1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.852618       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.853553       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.854751       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.909115       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.921756       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.935902       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.959769       1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
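
Those "unable to handle the request" and "could not find the requested resource" messages come from the aggregation layer, so one useful next step is to check which APIService registrations are unavailable (a sketch, assuming oc access; the openshift-apiserver namespace is an assumption and may differ by release):

oc get apiservices                              # look for entries with AVAILABLE=False
oc describe apiservice v1.apps.openshift.io     # conditions explain why it is unavailable
oc get pods -n openshift-apiserver -o wide      # are the backing apiserver pods running?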

@wking
Member

wking commented Dec 8, 2018

The current API instability may be a symptom of some underlying master instability. I don't know what's going on yet, but in a recent CI run, there was a running die-off of pods before the machine-config daemon pulled the plug and rebooted the node. Notes for the MCD part in openshift/machine-config-operator#224. Notes on etcd-member (the first pod to die) in openshift/installer#844. I don't know what's going on there, but I can certainly see occasional master reboots causing connectivity issues like these.

@wking
Member

wking commented Dec 14, 2018

I think this might have been resolved by openshift/machine-config-operator#225. Can anyone still reproduce? If not, can we close this?

@mslovy

mslovy commented Jan 4, 2019

I still see a similar error in origin-template-service-broker:

E0104 16:14:25.634855 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request

E0104 16:15:56.017707 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:16:26.196299 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:18:26.868674 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:19:11.850715 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:19:27.129940 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:20:57.681989 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:27.845002 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:57.932396 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:23:28.291818 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:25:28.747637 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:26:59.112399 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:27:59.312813 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:28:04.023096 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:31:30.049245 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:32:30.262111 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:33:00.432617 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:35:00.870254 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:36:01.147749 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request

On the command line, I get a 503 error:
[root@openshift-master-1 ~]# oc get clusterserviceclass -n kube-service-catalog --loglevel=8
I0105 00:39:26.688956 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.690015 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.699546 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.713030 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.714057 20712 round_trippers.go:383] GET https://openshift.sunhocapital.com:8443/apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses?limit=500
I0105 00:39:26.714277 20712 round_trippers.go:390] Request Headers:
I0105 00:39:26.714498 20712 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0105 00:39:26.714684 20712 round_trippers.go:393] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I0105 00:39:26.742742 20712 round_trippers.go:408] Response Status: 503 Service Unavailable in 27 milliseconds
I0105 00:39:26.743036 20712 round_trippers.go:411] Response Headers:
I0105 00:39:26.743222 20712 round_trippers.go:414] Cache-Control: no-store
I0105 00:39:26.743457 20712 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0105 00:39:26.743489 20712 round_trippers.go:414] X-Content-Type-Options: nosniff
I0105 00:39:26.743530 20712 round_trippers.go:414] Content-Length: 20
I0105 00:39:26.743544 20712 round_trippers.go:414] Date: Fri, 04 Jan 2019 16:39:26 GMT
I0105 00:39:26.743643 20712 request.go:897] Response Body: service unavailable
No resources found.
I0105 00:39:26.743843 20712 helpers.go:201] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "servicecatalog.k8s.io",
    "kind": "clusterserviceclasses",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "service unavailable"
      }
    ]
  },
  "code": 503
}]
F0105 00:39:26.744007 20712 helpers.go:119] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)
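
The 503 points at the aggregated servicecatalog API rather than the core apiserver, so checking its APIService registration and backing pods is a reasonable next step (a sketch; the namespace is taken from the command above, the pod name is a placeholder):

oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # status conditions show why it is unavailable
oc get pods -n kube-service-catalog -o wide               # is the catalog apiserver pod running and ready?
oc logs -n kube-service-catalog <apiserver-pod-name>      # <apiserver-pod-name> is a placeholder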

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2019
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 4, 2019
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dlakatos847

Issue still present with v3.11 on openSUSE Tumbleweed.
