
Pod argo-server still keeps restarting, even after #2249 #2355

Closed
dmegyesi opened this issue Mar 4, 2020 · 11 comments · Fixed by #2360


dmegyesi commented Mar 4, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML. (not applicable)
  • I've included the logs.

What happened:

#2249 still happens; it's not fixed.

After upgrading to Argo 2.6.0, the argo-server pod keeps restarting with exit code 0, with the following message:

msg="config map event, exiting gracefully" eventType=

What you expected to happen:

The underlying ConfigMap was never modified, so argo-server should not restart to reload a configuration that did not change.

I tried 2 scenarios:

  • empty ConfigMap (default from install.yaml manifest), no changes on it --> ❌ restart ❌
  • deleting the ConfigMap and starting argo-server while the CM does not exist at all: no error messages --> ❌ restart ❌ randomly after 30-60 minutes, due to detecting a "change" in the non-existent CM

How to reproduce it (as minimally and precisely as possible):

Install default manifest, wait up to 1 hour, see that the pod restarts.

Anything else we need to know?:

Platform: AWS EKS 1.14

Environment:

  • Argo version: 2.6.0

  • Kubernetes version:

serverVersion:
  buildDate: "2019-12-22T23:14:11Z"
  compiler: gc
  gitCommit: c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2
  gitTreeState: clean
  gitVersion: v1.14.9-eks-c0eccc
  goVersion: go1.12.12
  major: "1"
  minor: 14+
  platform: linux/amd64

Other debugging information (if applicable):

Here's a screenshot of the restart frequency:
[screenshot: restart frequency over time]

Logs

These are the latest logs, from when the referenced ConfigMap (workflow-controller-configmap) was deleted:

$ kubectl logs --previous -l app=argo-server
time="2020-03-04T14:16:47Z" level=info authMode=server baseHRef=/ managedNamespace= namespace=cre
time="2020-03-04T14:16:47Z" level=info msg="Argo Server started successfully on address :2746"
time="2020-03-04T14:17:22Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:22Z" grpc.time_ms=0.05 span.kind=server system=grpc
time="2020-03-04T14:17:22Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:22Z" grpc.time_ms=37.977 span.kind=server system=grpc
time="2020-03-04T14:17:23Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:23Z" grpc.time_ms=0.037 span.kind=server system=grpc
time="2020-03-04T14:17:28Z" level=info msg="Deleting pod: test-cron-wf-ftz99"
time="2020-03-04T14:17:28Z" level=error msg="finished unary call with code Unknown" error="pods \"test-cron-wf-ftz99\" is forbidden: User \"system:serviceaccount:cre:argo-server\" cannot delete resource \"pods\" in API group \"\" in the namespace \"unicron-kufar-data-dev\"" grpc.code=Unknown grpc.method=RetryWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:28Z" grpc.time_ms=5.364 span.kind=server system=grpc
time="2020-03-04T14:17:36Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:36Z" grpc.time_ms=0.036 span.kind=server system=grpc
time="2020-03-04T14:17:36Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:36Z" grpc.time_ms=32.941 span.kind=server system=grpc
time="2020-03-04T14:17:37Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:37Z" grpc.time_ms=0.05 span.kind=server system=grpc
time="2020-03-04T14:17:40Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ResubmitWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:40Z" grpc.time_ms=21.195 span.kind=server system=grpc
time="2020-03-04T14:17:40Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:37Z" grpc.time_ms=3520.25 span.kind=server system=grpc
time="2020-03-04T14:17:40Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:23Z" grpc.time_ms=17498.506 span.kind=server system=grpc
time="2020-03-04T14:17:43Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:43Z" grpc.time_ms=0.039 span.kind=server system=grpc
time="2020-03-04T14:17:43Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:43Z" grpc.time_ms=39.406 span.kind=server system=grpc
time="2020-03-04T14:17:44Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2020-03-04T14:17:44Z" grpc.time_ms=0.032 span.kind=server system=grpc
time="2020-03-04T14:18:00Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:44Z" grpc.time_ms=16001.755 span.kind=server system=grpc
time="2020-03-04T14:18:02Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:38Z" grpc.time_ms=24092.336 span.kind=server system=grpc
time="2020-03-04T14:18:02Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:26Z" grpc.time_ms=35679.344 span.kind=server system=grpc
time="2020-03-04T14:18:43Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-03-04T14:17:40Z" grpc.time_ms=62482.633 span.kind=server system=grpc
time="2020-03-04T14:21:47Z" level=info msg="Alloc=4838 TotalAlloc=37563 Sys=70592 NumGC=11 Goroutines=52"
time="2020-03-04T14:26:47Z" level=info msg="Alloc=5226 TotalAlloc=39223 Sys=70592 NumGC=13 Goroutines=53"
time="2020-03-04T14:31:47Z" level=info msg="Alloc=5526 TotalAlloc=40861 Sys=70592 NumGC=15 Goroutines=53"
time="2020-03-04T14:36:47Z" level=info msg="Alloc=5045 TotalAlloc=42540 Sys=70592 NumGC=18 Goroutines=53"
time="2020-03-04T14:41:47Z" level=info msg="Alloc=5414 TotalAlloc=44204 Sys=70592 NumGC=20 Goroutines=53"
time="2020-03-04T14:46:47Z" level=info msg="Alloc=4861 TotalAlloc=45774 Sys=70592 NumGC=23 Goroutines=52"
time="2020-03-04T14:51:47Z" level=info msg="Alloc=5245 TotalAlloc=47426 Sys=70592 NumGC=25 Goroutines=52"
time="2020-03-04T14:56:47Z" level=info msg="Alloc=4782 TotalAlloc=49077 Sys=70592 NumGC=28 Goroutines=52"
time="2020-03-04T15:01:47Z" level=info msg="Alloc=5173 TotalAlloc=50735 Sys=70592 NumGC=30 Goroutines=53"
time="2020-03-04T15:01:49Z" level=info msg="config map event, exiting gracefully" eventType=

Previous logs, from when the empty, default ConfigMap existed:

time="2020-03-04T08:24:09Z" level=info authMode=server baseHRef=/ managedNamespace= namespace=cre
time="2020-03-04T08:24:09Z" level=warning msg="ConfigMap 'workflow-controller-configmap' does not have key 'config'"
time="2020-03-04T08:24:09Z" level=info msg="Config Map: "
time="2020-03-04T08:24:09Z" level=info msg="Argo Server started successfully on address :2746"
time="2020-03-04T08:29:09Z" level=info msg="Alloc=4035 TotalAlloc=6847 Sys=70080 NumGC=3 Goroutines=52"
time="2020-03-04T08:34:09Z" level=info msg="Alloc=3720 TotalAlloc=8484 Sys=70080 NumGC=6 Goroutines=52"
time="2020-03-04T08:39:09Z" level=info msg="Alloc=3959 TotalAlloc=10056 Sys=70080 NumGC=8 Goroutines=52"
time="2020-03-04T08:44:09Z" level=info msg="Alloc=3568 TotalAlloc=11702 Sys=70080 NumGC=11 Goroutines=52"
time="2020-03-04T08:49:09Z" level=info msg="Alloc=3885 TotalAlloc=13343 Sys=70080 NumGC=13 Goroutines=52"
time="2020-03-04T08:54:09Z" level=info msg="Alloc=3839 TotalAlloc=14977 Sys=70080 NumGC=16 Goroutines=52"
time="2020-03-04T08:59:09Z" level=info msg="Alloc=3808 TotalAlloc=16534 Sys=70080 NumGC=18 Goroutines=52"
time="2020-03-04T09:04:09Z" level=info msg="Alloc=4116 TotalAlloc=18167 Sys=70080 NumGC=20 Goroutines=52"
time="2020-03-04T09:09:09Z" level=info msg="Alloc=3723 TotalAlloc=19793 Sys=70080 NumGC=23 Goroutines=52"
time="2020-03-04T09:14:09Z" level=info msg="Alloc=4040 TotalAlloc=21426 Sys=70080 NumGC=25 Goroutines=52"
time="2020-03-04T09:19:09Z" level=info msg="Alloc=3568 TotalAlloc=22991 Sys=70080 NumGC=28 Goroutines=52"
time="2020-03-04T09:23:56Z" level=info msg="config map event, exiting gracefully" eventType=

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.


simster7 commented Mar 4, 2020

@alexec Related to your previous PRs


dmegyesi commented Mar 4, 2020

Additional info: I tried to replicate the exact same watch behaviour with kubectl, in verbose mode:

I0304 17:29:46.346913   20665 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json" -H "User-Agent: kubectl/v1.17.3 (darwin/amd64) kubernetes/06ad960" 'https://XXXREDACTEDCLUSTERNAME.sk1.eu-west-1.eks.amazonaws.com/api/v1/namespaces/cre/configmaps?fieldSelector=metadata.name%3Dworkflow-controller-configmap&resourceVersion=38489345&watch=true'

I0304 17:29:46.394176   20665 round_trippers.go:443] GET https://XXXREDACTEDCLUSTERNAME.sk1.eu-west-1.eks.amazonaws.com/api/v1/namespaces/cre/configmaps?fieldSelector=metadata.name%3Dworkflow-controller-configmap&resourceVersion=38489345&watch=true 200 OK in 47 milliseconds

I0304 17:29:46.394230   20665 round_trippers.go:449] Response Headers:
I0304 17:29:46.394244   20665 round_trippers.go:452]     Content-Type: application/json
I0304 17:29:46.394252   20665 round_trippers.go:452]     Date: Wed, 04 Mar 2020 16:29:46 GMT
I0304 17:29:46.394259   20665 round_trippers.go:452]     Audit-Id: 1f77c968-17a3-431b-b7fd-3859e9318c3a

and then kubectl watch exited at around ~18:05, without providing any further output.

I will now re-run the command under `time` to figure out when this happens. Probably some kind of server-side timeout breaking the connection?

Update: yep, I confirm it's a server-side timeout that breaks the watch:

real	32m17.711s
user	0m0.818s
sys	0m0.467s


alexec commented Mar 4, 2020

@simster7 there is a bug in the code, but I do not have the bandwidth to fix it. My suggestion: either (1) put a flag around the restart code so people can disable it if it is a problem, or (2) find a proper fix.


alexec commented Mar 4, 2020

I think this is a duplicate of #2090

alexec closed this as completed Mar 4, 2020

alexec commented Mar 4, 2020

Closed as dup.


simster7 commented Mar 4, 2020

I'll see if I can find a fix


dmegyesi commented Mar 4, 2020

Can this be relevant? kubernetes/client-go#547

simster7 removed the duplicate label Mar 4, 2020

simster7 commented Mar 4, 2020

I think this should be the main issue, since it's relevant to the bug being described, as opposed to the feature implementation in general. Reopening.


simster7 commented Mar 4, 2020

The issue was that we were not recognizing when the channel was being closed and were instead treating it as a legitimate watch event. Thanks for the great suggestion @dmegyesi.

Fixed in #2360 by restarting the watch when it is closed, without restarting the server.


dmegyesi commented Mar 5, 2020

@simster7 thanks a lot for the quick fix! I can confirm that in the last ~12 hours there were no restarts since I deployed 2.6.1.


simster7 commented Mar 5, 2020

> @simster7 thanks a lot for the quick fix! I can confirm that in the last ~12 hours there were no restarts since I deployed 2.6.1.

Wooo!
