Kind e2e tests sometimes fail with the webhook pod not becoming ready. #4496

Closed
vaikas opened this issue Nov 10, 2020 · 8 comments
Labels
area/test-and-release Test infrastructure, tests or release kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@vaikas
Contributor

vaikas commented Nov 10, 2020

Describe the bug
The eventing webhook sometimes does not become ready. It looks like the specific pod that the wait loop is waiting for gets replaced (maybe because of chaos duck?) by another pod that does become ready.

From one example here:
https://github.com/knative/eventing/pull/4492/checks?check_run_id=1381350649

pod/sugar-controller-7f7c8ddfc4-8gbfn condition met
error: timed out waiting for the condition on pods/eventing-webhook-5c8b8865c7-wbzjd
pod/zipkin-8fdcfcddc-d9rbm condition met
Error: Process completed with exit code 1.

Then when the artifacts are dumped, note a different webhook pod comes up:

eventing-controller-64768b7fcc-mpd29    1/1     Running   0          57s
eventing-webhook-5c8b8865c7-x7d7w       1/1     Running   0          57s
imc-controller-6f6b794fd6-m79tv         1/1     Running   0          117s
imc-dispatcher-64d5f8445-vwscb          1/1     Running   0          117s
mt-broker-controller-75475bcbc7-tks6f   1/1     Running   0          57s
mt-broker-filter-6f4c99cddd-xfjzw       1/1     Running   0          118s
mt-broker-ingress-64f6f6cb9f-spccf      1/1     Running   0          118s
sugar-controller-7f7c8ddfc4-8gbfn       1/1     Running   0          57s
zipkin-8fdcfcddc-d9rbm                  1/1     Running   0          104s

Expected behavior
Tests should not fail due to test setup failures.

To Reproduce
Look at some of these failing tests here:
https://github.com/knative/eventing/actions?query=workflow%3A%22KinD+e2e+tests%22

Knative release version
head


@vaikas
Contributor Author

vaikas commented Nov 10, 2020

Looking...

Here's the step doing the wait:

    - name: Wait for things to be up
      run: |
        kubectl wait pod --for=condition=Ready -n ${SYSTEM_NAMESPACE} -l '!job-name'
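A wait like the one above appears to resolve the label selector to concrete pod names once, when it starts, so a pod that is replaced during the wait can never satisfy it. The following is a minimal sketch of a polling alternative that re-lists the pods on every pass. The function name `wait_for_pods_ready` and the default timeout and interval are assumptions made for illustration; this is not necessarily the change that was made in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until every non-Job pod in a namespace is Ready.
# Pods are re-listed on each pass, so a pod that is replaced mid-wait is
# picked up under its new name instead of being waited on by its old name.
wait_for_pods_ready() {
  local namespace="$1" timeout="${2:-300}" interval="${3:-5}"
  local deadline=$(( SECONDS + timeout ))
  local statuses=""
  while (( SECONDS < deadline )); do
    # One line per pod: "<name> <status of the Ready condition>".
    statuses="$(kubectl get pods -n "${namespace}" -l '!job-name' \
      -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}')"
    # Succeed only if at least one pod is listed and no line lacks " True".
    if [[ -n "${statuses}" ]] && ! grep -qv ' True$' <<<"${statuses}"; then
      return 0
    fi
    sleep "${interval}"
  done
  echo "timed out waiting for pods in ${namespace}:" >&2
  echo "${statuses}" >&2
  return 1
}
```

In the workflow step this could be called as `wait_for_pods_ready "${SYSTEM_NAMESPACE}" 300`. Waiting on the Deployments rather than on individual pods (`kubectl wait deployment --all --for=condition=Available`) is another option that does not depend on pod names.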

@pierDipi
Member

pierDipi commented Nov 12, 2020

@vaikas
Contributor Author

vaikas commented Nov 16, 2020

https://github.com/knative/eventing/runs/1406746500?check_suite_focus=true

I1116 14:19:57.809327   27973 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.19.3 (linux/amd64) kubernetes/1e11e4a" 'https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-4zv5s&resourceVersion=2808&watch=true'
I1116 14:19:57.809985   27973 round_trippers.go:443] GET https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-4zv5s&resourceVersion=2808&watch=true 200 OK in 0 milliseconds
I1116 14:19:57.810003   27973 round_trippers.go:449] Response Headers:
I1116 14:19:57.810008   27973 round_trippers.go:452]     Cache-Control: no-cache, private
I1116 14:19:57.810011   27973 round_trippers.go:452]     Content-Type: application/json
I1116 14:19:57.810014   27973 round_trippers.go:452]     Date: Mon, 16 Nov 2020 14:19:57 GMT
I1116 14:20:27.810546   27973 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.19.3 (linux/amd64) kubernetes/1e11e4a" 'https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-k6mr8'
I1116 14:20:27.813242   27973 round_trippers.go:443] GET https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-k6mr8 200 OK in 2 milliseconds
I1116 14:20:27.813259   27973 round_trippers.go:449] Response Headers:
pod/eventing-webhook-6bd5798587-k6mr8 condition met
I1116 14:20:27.813263   27973 round_trippers.go:452]     Cache-Control: no-cache, private
I1116 14:20:27.813267   27973 round_trippers.go:452]     Content-Type: application/json
I1116 14:20:27.813270   27973 round_trippers.go:452]     Date: Mon, 16 Nov 2020 14:20:27 GMT
I1116 14:20:27.813811   27973 request.go:1097] Response Body: {"kind":"PodList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/namespaces/knative-eventing/pods","resourceVersion":"3232"},"items":[{"metadata":{"name":"eventing-webhook-6bd5798587-k6mr8","generateName":"eventing-webhook-6bd5798587-","namespace":"knative-eventing","selfLink":"/api/v1/namespaces/knative-eventing/pods/eventing-webhook-6bd5798587-k6mr8","uid":"015fa6f7-88e9-45e5-95b2-91b2da58ee13","resourceVersion":"2579","creationTimestamp":"2020-11-16T14:19:36Z","labels":{"app":"eventing-webhook","pod-template-hash":"6bd5798587","role":"eventing-webhook"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"eventing-webhook-6bd5798587","uid":"59b46e0b-631b-4f3e-8dba-584e35bc6dea","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2020-11-16T14:19:36Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{},"f:labels":{".":{},"f:app":{},"f:pod-template-hash":{},"f:role":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"59b46e0b-631b-4f3e-8dba-584e35bc6dea\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{"f:affinity":{".":{},"f:podAntiAffinity":{".":{},"f:preferredDuringSchedulingIgnoredDuringExecution":{}}},"f:containers":{"k:{\"name\":\"eventing-webhook\"}":{".":{},"f:env":{".":{},"k:{\"name\":\"CONFIG_LOGGING_NAME\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"METRICS_DOMAIN\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"POD_NAME\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{".":{},"f:apiVersion":{},"f:fieldPath":{}}}},"k:{\"name\":\"SINK_BINDING_SELECTION_MODE\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"SYSTEM_NAMESPACE\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{".":{},"f:apiVersion":{},"f:fieldPath":{}}}},"k:{\"name\":\"WEBHOOK_NAME\"}":{".":{},"f:name":
{},"f:value":{}},"k:{\"name\":\"WEBHOOK_PORT\"}":{".":{},"f:name":{},"f:value":{}}},"f:image":{},"f:imagePullPolicy":{},"f:livenessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:httpHeaders":{},"f:path":{},"f:port":{},"f:scheme":{}},"f:initialDelaySeconds":{},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":8008,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}},"k:{\"containerPort\":8443,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}},"k:{\"containerPort\":9090,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:readinessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:httpHeaders":{},"f:path":{},"f:port":{},"f:scheme":{}},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:resources":{".":{},"f:limits":{".":{},"f:cpu":{},"f:memory":{}},"f:requests":{".":{},"f:cpu":{},"f:memory":{}}},"f:securityContext":{".":{},"f:allowPrivilegeEscalation":{}},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:serviceAccount":{},"f:serviceAccountName":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-11-16T14:19:42Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.244.1.19\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]},"spec":{"volumes":[{"n
ame":"eventing-webhook-token-bqv7v","secret":{"secretName":"eventing-webhook-token-bqv7v","defaultMode":420}}],"containers":[{"name":"eventing-webhook","image":"kind.local/knative.dev/eventing/cmd/webhook:1af4fd82f9a9ff68e3f5768dda777cabfe0e349429cf8289bdc3f32b533b60a4","ports":[{"name":"https-webhook","containerPort":8443,"protocol":"TCP"},{"name":"metrics","containerPort":9090,"protocol":"TCP"},{"name":"profiling","containerPort":8008,"protocol":"TCP"}],"env":[{"name":"SYSTEM_NAMESPACE","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}},{"name":"CONFIG_LOGGING_NAME","value":"config-logging"},{"name":"METRICS_DOMAIN","value":"knative.dev/eventing"},{"name":"WEBHOOK_NAME","value":"eventing-webhook"},{"name":"WEBHOOK_PORT","value":"8443"},{"name":"SINK_BINDING_SELECTION_MODE","value":"exclusion"},{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}}],"resources":{"limits":{"cpu":"200m","memory":"200Mi"},"requests":{"cpu":"20m","memory":"20Mi"}},"volumeMounts":[{"name":"eventing-webhook-token-bqv7v","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"livenessProbe":{"httpGet":{"path":"/","port":8443,"scheme":"HTTPS","httpHeaders":[{"name":"k-kubelet-probe","value":"webhook"}]},"initialDelaySeconds":20,"timeoutSeconds":1,"periodSeconds":1,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"/","port":8443,"scheme":"HTTPS","httpHeaders":[{"name":"k-kubelet-probe","value":"webhook"}]},"timeoutSeconds":1,"periodSeconds":1,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","imagePullPolicy":"IfNotPresent","securityContext":{"allowPrivilegeEscalation":false}}],"restartPolicy":"Always","terminationGracePeriodSeconds":300,"dnsPolicy":"ClusterFirst","serviceAccountName":"eventing-webhook","serviceAccount":"eventing-webhook","nodeName":"kind-worker","securityContext":{}
,"affinity":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"podAffinityTerm":{"labelSelector":{"matchLabels":{"app":"eventing-webhook"}},"topologyKey":"kubernetes.io/hostname"}}]}},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0,"enableServiceLinks":true},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:37Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:42Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:42Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:36Z"}],"hostIP":"172.18.0.3","podIP":"10.244.1.19","podIPs":[{"ip":"10.244.1.19"}],"startTime":"2020-11-16T14:19:37Z","containerStatuses":[{"name":"eventing-webhook","state":{"running":{"startedAt":"2020-11-16T14:19:41Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kind.local/knative.dev/eventing/cmd/webhook:1af4fd82f9a9ff68e3f5768dda777cabfe0e349429cf8289bdc3f32b533b60a4","imageID":"sha256:a5bffea29ff5b9b24ad286ce1981725ff772e6981a6e246583282cd96e094715","containerID":"containerd://3111ec724700c26c3191da72d020c217706ad6bea200668f0caac75882561733","started":true}],"qosClass":"Burstable"}}]

Yet the test failed with:

F1116 14:20:27.856761   27973 helpers.go:115] error: timed out waiting for the condition on pods/eventing-webhook-6bd5798587-4zv5s
goroutine 1 [running]:
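To see at a glance which pod names kubectl queried in a verbose log like the one above, the `fieldSelector` values can be pulled out of the request lines. This is only a sketch: the function name `queried_pods` is hypothetical, and it assumes the log contains request URLs in the same form as shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a verbose kubectl log on stdin and print the
# distinct pod names that were requested via fieldSelector=metadata.name.
queried_pods() {
  grep -o 'fieldSelector=metadata\.name%3D[A-Za-z0-9.-]*' \
    | sed 's/.*%3D//' \
    | sort -u
}
```

Run against the log above (for example `queried_pods < kubectl-wait.log`, where the file name is a placeholder), this would list both `eventing-webhook-6bd5798587-4zv5s` and `eventing-webhook-6bd5798587-k6mr8`, which makes it easy to compare them with the pod named in the timeout error.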

@grantr grantr added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/test-and-release Test infrastructure, tests or release labels Nov 16, 2020
@grantr grantr added this to the Backlog milestone Nov 16, 2020
@zhongduo
Contributor

Does this look like #3244?

In knative-gcp, we get a crashed webhook, but maybe Knative Eventing restarts it automatically?

@vaikas vaikas removed their assignment Nov 17, 2020
@vaikas
Contributor Author

vaikas commented Nov 17, 2020

@zhongduo I don't think so because the webhook becomes ready.

@zhongduo
Contributor

> @zhongduo I don't think so because the webhook becomes ready.

But as you said, it is already a different pod. So it may well be that we have some logic that detects the crash or unreadiness and restarts the pod, which accidentally solves the problem.

@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@vaikas
Contributor Author

vaikas commented Feb 16, 2021

This should've been fixed by:
#4741

Let's reopen if it comes back.

@vaikas vaikas closed this as completed Feb 16, 2021