Workflow controller crash on nil pointer #11769

Closed
3 tasks done
astraw99 opened this issue Sep 7, 2023 · 1 comment · Fixed by #11770
Labels
area/controller Controller issues, panics type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

Contributor

astraw99 commented Sep 7, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Using the latest master branch code (94fbb3bf27f0746a0b84ea2f654568cc655bfbfb), the workflow-controller crashes when deployed with multiple replicas:

workflow-controller-85bfb69457-5fl5m   1/1     Running            0          12h
workflow-controller-85bfb69457-6pcbq   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-864r4   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-g8gtw   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-j22fr   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-kq8mg   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-mrq64   1/1     Running            124        12h
workflow-controller-85bfb69457-t2zmt   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-xnfng   0/1     CrashLoopBackOff   121        12h
workflow-controller-85bfb69457-zmwt8   0/1     CrashLoopBackOff   121        12h

The crash log is:

time="2023-09-07T04:17:54.311Z" level=info msg="Persistence configuration enabled"
time="2023-09-07T04:17:54.328Z" level=info msg="Persistence Session created successfully"
time="2023-09-07T04:17:54.328Z" level=info msg="Node status offloading is enabled"
time="2023-09-07T04:17:54.328Z" level=info msg="Workflow archiving is enabled"
time="2023-09-07T04:17:54.328Z" level=info executorImage="quay.io/argoproj/argoexec:v0.0.0" executorImagePullPolicy=Always managedNamespace=
I0907 04:17:54.329379       1 leaderelection.go:248] attempting to acquire leader lease argo/workflow-controller...
time="2023-09-07T04:17:54.329Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
time="2023-09-07T04:17:54.334Z" level=info msg="new leader" leader=workflow-controller-85bfb69457-5fl5m
2023/09/07 04:19:46 http: panic serving 10.4.0.129:53794: runtime error: invalid memory address or nil pointer dereference
goroutine 85 [running]:
net/http.(*conn).serve.func1()
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:1854 +0xbf
panic({0x21b0120, 0x3a8bb10})
        /Users/apple/.gvm/gos/go1.20/src/runtime/panic.go:890 +0x263
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Healthz.func2({0xc000567e00?, 0x25113ae?}, 0xc000537900)
        /Users/apple/work/code/WWW/argo-workflows/workflow/controller/healthz.go:36 +0x5d
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Healthz(0xc000537900, {0x287c740, 0xc00055e0e0}, 0x7fe2561911e8?)
        /Users/apple/work/code/WWW/argo-workflows/workflow/controller/healthz.go:47 +0xb5
net/http.HandlerFunc.ServeHTTP(0xc0000b4080?, {0x287c740?, 0xc00055e0e0?}, 0x40dc88?)
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x0?, {0x287c740, 0xc00055e0e0}, 0xc000565400)
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:2500 +0x149
net/http.serverHandler.ServeHTTP({0xc0007cdd40?}, {0x287c740, 0xc00055e0e0}, 0xc000565400)
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc000821560, {0x287ddc8, 0xc0001a6de0})
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
        /Users/apple/.gvm/gos/go1.20/src/net/http/server.go:3089 +0x5ed

Version

latest master: 94fbb3b

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Build the latest master branch and deploy workflow-controller with multiple replicas (leader election is on). The crash then reproduces.

Logs from the workflow controller

See above.

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
Contributor Author

astraw99 commented Sep 7, 2023

I checked the code; the root cause is PR #11375.
I am working on a fix.

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) area/controller Controller issues, panics labels Sep 7, 2023