This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

some k8s components report unhealthy status after cluster bootstrap #64

Closed
sym3tri opened this issue Jun 23, 2016 · 9 comments
Labels
dependency/external · kind/bug · lifecycle/rotten · priority/Pmaybe · reviewed/won't fix

Comments

@sym3tri commented Jun 23, 2016

curl 127.0.0.1:8080/api/v1/componentstatuses

{
  "kind": "ComponentStatusList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/componentstatuses"
  },
  "items": [
    {
      "metadata": {
        "name": "scheduler",
        "selfLink": "/api/v1/componentstatuses/scheduler",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "False",
          "message": "Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused"
        }
      ]
    },
    {
      "metadata": {
        "name": "controller-manager",
        "selfLink": "/api/v1/componentstatuses/controller-manager",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "False",
          "message": "Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused"
        }
      ]
    },
    {
      "metadata": {
        "name": "etcd-0",
        "selfLink": "/api/v1/componentstatuses/etcd-0",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "True",
          "message": "{\"health\": \"true\"}"
        }
      ]
    }
  ]
}
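(For reference, the same status list can be pulled with kubectl instead of curl against the insecure API port; a minimal sketch, assuming kubectl is already configured for this cluster:)

# equivalent check via kubectl ("kubectl get cs" is the short form)
kubectl get componentstatuses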
aaronlevy self-assigned this Jun 23, 2016

@aaronlevy (Contributor) commented

Looks like it's hard-coded to expect that the scheduler and controller-manager are on the same host as the api-server: https://github.com/kubernetes/kubernetes/blob/04ce042ff9cfb32b2c776f755cc7abc886b8a441/pkg/master/master.go#L620-L623

We do not adhere to this assumption because the scheduler and controller-manager are deployments, which could be scheduled on different hosts (and do not use host networking).

@sym3tri would you be able to inspect this information from another api-endpoint? Maybe inspecting pods in kube-system, or a specific set of pods via label query?
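For illustration, a direct query could look roughly like the following; the label selectors are assumptions about how the control-plane deployments are labeled, not something this issue confirms:

# list control-plane pods in kube-system by label (label values are assumed)
kubectl get pods -n kube-system -l k8s-app=kube-scheduler
kubectl get pods -n kube-system -l k8s-app=kube-controller-manager

# or hit a pod's own healthz port directly once its IP is known
# (10251/10252 are the scheduler/controller-manager ports from the errors above)
curl http://<scheduler-pod-ip>:10251/healthz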

It seems like this componentstatus endpoint is somewhat contentious as it stands:
kubernetes/kubernetes#18610
kubernetes/kubernetes#19570
kubernetes/kubernetes#13216

aaronlevy added the dependency/external, kind/bug, and priority/Pmaybe labels on Jun 23, 2016
@bgrant0607 commented

I have no love for the current componentstatuses endpoint.

I don't remember whether it was all captured in the proposal, but I think we iterated towards a consensus on Karl's component registration proposal, which you cited:
kubernetes/kubernetes#13216

Someone would need to work on it.

@aaronlevy (Contributor) commented

I skimmed the proposal, and I more or less agree that having a single /componentstatuses api-endpoint is not exactly a pressing issue.

I like the idea of fronting healthcheck endpoints with a service (e.g. "scheduler-health.kube-system.cluster.local"). Then if we wanted to drill down into how many of those pods are healthy, it's just a matter of querying the service itself.
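As a rough sketch of that idea (the deployment and Service names below are hypothetical, not something this repo ships):

# hypothetical: expose the scheduler deployment's healthz port as a Service
kubectl expose deployment kube-scheduler -n kube-system \
  --name=scheduler-health --port=10251 --target-port=10251

# callers can then check health through cluster DNS
# (the exact DNS suffix depends on the cluster's DNS setup)
curl http://scheduler-health.kube-system.svc.cluster.local:10251/healthz

# and drill down into how many backends are healthy via the Service's endpoints
kubectl get endpoints scheduler-health -n kube-system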

@sym3tri is this still blocking you for any reason? Would the health-check service endpoint be a reasonable end-goal? Or is directly querying the pods sufficient?

@sym3tri (Author) commented Jul 19, 2016

@aaronlevy Directly querying the pods puts a lot of burden on the caller. If we can have a fronting service, that would be ideal.

Directly querying the pods is an OK workaround for the time being, but not a good long-term solution. We'd just be shifting the hardcoded services into our code, and there is no other way to query etcd health via the API.

@aaronlevy (Contributor) commented

Opened #85 to track that feature specifically.

@fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 21, 2019
@fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 21, 2019
@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
