Multiple machine-config server restarts after 'http: TLS handshake error from 10.0.29.128:17205: EOF' #233

Closed
wking opened this issue Dec 14, 2018 · 5 comments

@wking (Member) commented Dec 14, 2018

In a recent CI run, I saw:

Dec 14 05:53:23.635: INFO: Pod status openshift-machine-config-operator/machine-config-server-c9dr5:
{
  "phase": "Running",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:41:23Z"
    },
    {
      "type": "Ready",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:47:34Z"
    },
    {
      "type": "ContainersReady",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": null
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:41:23Z"
    }
  ],
  "message": "container machine-config-server has restarted more than 5 times",
  "hostIP": "10.0.2.151",
  "podIP": "10.0.2.151",
  "startTime": "2018-12-14T05:41:23Z",
  "containerStatuses": [
    {
      "name": "machine-config-server",
      "state": {
        "running": {
          "startedAt": "2018-12-14T05:47:33Z"
        }
      },
      "lastState": {
        "terminated": {
          "exitCode": 1,
          "reason": "Error",
          "startedAt": "2018-12-14T05:44:50Z",
          "finishedAt": "2018-12-14T05:44:50Z",
          "containerID": "cri-o://35e3004e72b35a273ab4b0e2e75e082f0840464c55a13f5716d3b796be241e8a"
        }
      },
      "ready": true,
      "restartCount": 6,
      "image": "registry.svc.ci.openshift.org/ci-op-4xwzpczq/stable@sha256:7f2cd078c139f2ed319d16d68e7a5d05f9c60012fd4eeafddc66b1d24a78abf8",
      "imageID": "registry.svc.ci.openshift.org/ci-op-4xwzpczq/stable@sha256:7f2cd078c139f2ed319d16d68e7a5d05f9c60012fd4eeafddc66b1d24a78abf8",
      "containerID": "cri-o://c60a6df780ea4d8d9679309a9037c057002a96f7db1fb62772e2a0b5bb00eaa3"
    }
  ],
  "qosClass": "BestEffort"
}
Dec 14 05:53:23.639: INFO: Running AfterSuite actions on all node
Dec 14 05:53:23.639: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-machine-config-operator/machine-config-server-c9dr5 is not healthy: container machine-config-server has restarted more than 5 times",
    ]
to be empty
...
Dec 14 05:51:09.642 W ns=openshift-monitoring pod=prometheus-adapter-bdc5f58cb-5l4jt MountVolume.SetUp failed for volume "prometheus-adapter-tls" : secrets "prometheus-adapter-tls" not found
Dec 14 05:51:16.557 E kube-apiserver Kube API started failing: Get https://ci-op-4xwzpczq-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 14 05:51:16.557 I openshift-apiserver OpenShift API started failing: Get https://ci-op-4xwzpczq-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded
Dec 14 05:51:18.547 E kube-apiserver Kube API is not responding to GET requests
Dec 14 05:51:18.547 E openshift-apiserver OpenShift API is not responding to GET requests
Dec 14 05:51:20.645 I openshift-apiserver OpenShift API started responding to GET requests
Dec 14 05:51:20.742 I kube-apiserver Kube API started responding to GET requests
...
failed: (2m18s) 2018-12-14T05:53:23 "[Feature:Platform][Suite:openshift/smoke-4] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]"

From the logs for one of those server pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-server-77dc7_machine-config-server.log.gz | zcat
I1214 05:41:50.478893       1 start.go:37] Version: 3.11.0-352-g0cfc4183-dirty
I1214 05:41:50.480250       1 api.go:54] launching server
I1214 05:41:50.480380       1 api.go:54] launching server
2018/12/14 05:41:51 http: TLS handshake error from 10.0.29.128:17205: EOF
2018/12/14 05:41:52 http: TLS handshake error from 10.0.0.231:28579: EOF
2018/12/14 05:41:52 http: TLS handshake error from 10.0.72.138:31458: EOF
...
2018/12/14 06:10:01 http: TLS handshake error from 10.0.72.138:38099: EOF
2018/12/14 06:10:02 http: TLS handshake error from 10.0.29.128:59541: EOF
2018/12/14 06:10:02 http: TLS handshake error from 10.0.45.28:9790: EOF

This is possibly related to #199, which also had TLS handshake errors (although in that case they were bad-certificate errors). Do these errors mean something is attempting to connect to the MCS and then immediately hanging up? Who would do that? Is there information about the restart reason somewhere that I can dig up?
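For context, Go's net/http server prints exactly "http: TLS handshake error from <addr>: EOF" whenever a client opens a TCP connection to the TLS listener and hangs up before completing the handshake, which is what a plain TCP-level probe (e.g. a load-balancer health check) looks like to a TLS server. Here is a minimal, hypothetical Go sketch that reproduces the message; the 127.0.0.1:8443 address and the server.crt/server.key paths are placeholders for illustration, not the MCS's actual configuration:

// Sketch: reproduce "http: TLS handshake error from ...: EOF" by opening a
// bare TCP connection to an HTTPS server and closing it before speaking TLS.
// The port and the server.crt/server.key files are assumptions for this demo.
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{Addr: "127.0.0.1:8443"}
	go func() {
		log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
	}()
	time.Sleep(500 * time.Millisecond) // crude wait for the listener to come up

	// Act like a TCP-level health check: connect, then close without a ClientHello.
	conn, err := net.Dial("tcp", "127.0.0.1:8443")
	if err != nil {
		log.Fatal(err)
	}
	conn.Close()

	time.Sleep(500 * time.Millisecond)
	// The server's log now contains:
	//   http: TLS handshake error from 127.0.0.1:<ephemeral port>: EOF
}

If that is what is happening here, the EOF lines themselves would be harmless noise and would not explain the restarts on their own.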

Also, only one of the three machine-config-server containers seems to have had a restart issue:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods.json | jq '.items[] | .status.containerStatuses[] | select(.restartCount > 0) | {name, restartCount}'
{
  "name": "operator",
  "restartCount": 1
}
{
  "name": "operator",
  "restartCount": 1
}
{
  "name": "csi-operator",
  "restartCount": 1
}
{
  "name": "machine-config-server",
  "restartCount": 6
}
{
  "name": "prometheus",
  "restartCount": 1
}
{
  "name": "prometheus",
  "restartCount": 1
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods.json | jq '.items[] | .status.containerStatuses[] | select(.name == "machine-config-server") | {name, restartCount}'
{
  "name": "machine-config-server",
  "restartCount": 0
}
{
  "name": "machine-config-server",
  "restartCount": 6
}
{
  "name": "machine-config-server",
  "restartCount": 0
}
@abhinavdahiya (Contributor) commented Dec 15, 2018

Also saw this here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/711/pull-ci-openshift-installer-master-e2e-aws/2331

The reason might be:

I1215 01:17:04.446950       1 start.go:37] Version: 3.11.0-354-g542d610c-dirty
I1215 01:17:04.449017       1 api.go:54] launching server
I1215 01:17:04.449191       1 api.go:54] launching server
F1215 01:17:04.449246       1 api.go:58] Machine Config Server exited with error: listen tcp :49501: bind: address already in use

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/711/pull-ci-openshift-installer-master-e2e-aws/2331/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-server-sjlbx_machine-config-server.log.gz
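For what it's worth, that fatal line is just Go's error for binding a port that is already held; it doesn't tell us which process held :49501 in that CI run. A minimal sketch of the failure mode, with the port number copied from the log above purely for illustration:

// Sketch: a second listener on an already-bound port fails with
// "listen tcp :49501: bind: address already in use".
package main

import (
	"log"
	"net"
)

func main() {
	first, err := net.Listen("tcp", ":49501")
	if err != nil {
		log.Fatalf("first listen: %v", err)
	}
	defer first.Close()

	// Another process in the same host network namespace (or a restarted
	// server racing its not-yet-released socket) would hit this branch.
	if _, err := net.Listen("tcp", ":49501"); err != nil {
		log.Fatalf("second listen: %v", err) // listen tcp :49501: bind: address already in use
	}
}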

@kikisdeliveryservice (Contributor):

I experienced some TLS errors today; when I re-run, I will try to capture what happened in more detail.

@cgwalters (Member):

F1215 01:17:04.449246 1 api.go:58] Machine Config Server exited with error: listen tcp :49501: bind: address already in use

That'd make this a dup of #166, right?

@kikisdeliveryservice (Contributor) commented Jan 11, 2019

bin/openshift-install v0.9.1 running on aws

I'm seeing these same errors in the logs of my machine-config-servers:
$ oc logs -f -n openshift-machine-config-operator machine-config-server-42qmm

then, repeating endlessly:
2019/01/11 01:29:15 http: TLS handshake error from 10.0.8.225:40761: EOF

I haven't seen any performance problems in my mco-mcc-mcd work, but I wanted to add the data point because it's kind of disconcerting to see the endless error scroll in the logs.

For the record, the machine-config-servers are listing 0 restarts.

@wking (Member, Author) commented Jan 11, 2019

I'm going to mark these EOF errors as fixed by openshift/installer#924. If anyone can reproduce with a cluster built from an installer with that commit included (it will be in the next release), please comment and we can re-open.

@wking wking closed this as completed Jan 11, 2019