Server CrashLoopBackOff while calling log API without container: panic: runtime error: invalid memory address or nil pointer dereference #13585

cdtzabra opened this issue Sep 10, 2024 · 9 comments · May be fixed by #13866
Labels: area/api (Argo Server API), area/server, P2 (important), type/bug

cdtzabra commented Sep 10, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

While trying to get a workflow's logs via the API endpoint /api/v1/workflows/{namespace}/{name}/log using curl, I received a 502 Bad Gateway reply.
I checked the pod status and found that the server pods were in CrashLoopBackOff, followed by restarts.

 curl -H "Authorization: $ARGO_TOKEN" -X GET http://workflows.example.com/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k/log
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
k get po

NAME                                                  READY   STATUS             RESTARTS     AGE
argo-workflows-server-587c6d6f58-f8w2j                0/1     CrashLoopBackOff   3 (7s ago)   8d
argo-workflows-server-587c6d6f58-nfjbf                0/1     CrashLoopBackOff   3 (8s ago)   8d
argo-workflows-workflow-controller-5fd7d444fc-cfxmz   1/1     Running            0            8d
workflow-controlm-3oj4k-build-4246951965              0/3     Completed          0            113m
workflow-controlm-3oj4k-create-job-1156902314         0/1     Completed          0            112m



k get po
NAME                                                  READY   STATUS      RESTARTS        AGE
argo-workflows-server-587c6d6f58-f8w2j                1/1     Running     4 (2m19s ago)   8d
argo-workflows-server-587c6d6f58-nfjbf                1/1     Running     4 (2m20s ago)   8d
argo-workflows-workflow-controller-5fd7d444fc-cfxmz   1/1     Running     0               8d
workflow-controlm-3oj4k-build-4246951965              0/3     Completed   0               115m
workflow-controlm-3oj4k-create-job-1156902314         0/1     Completed   0               114m

And looking at the previous logs, we can clearly see that the server panicked with panic: runtime error: invalid memory address or nil pointer dereference because the container parameter was missing from the request.

time="2024-09-10T13:12:34.206Z" level=info duration=13.270867ms method=GET path=/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k size=18650 status=0
time="2024-09-10T13:12:39.588Z" level=info duration="158.928µs" method=GET path=index.html size=487 status=0
time="2024-09-10T13:12:52.924Z" level=error msg="a container name must be specified for pod workflow-controlm-3oj4k-build-4246951965, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-3oj4k-build-4246951965 workflow=workflow-controlm-3oj4k
time="2024-09-10T13:12:52.925Z" level=error msg="a container name must be specified for pod workflow-controlm-3oj4k-notify-862466002, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-3oj4k-notify-862466002 workflow=workflow-controlm-3oj4k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

When I specify the container with ?logOptions.container=main, it works fine.
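
For reference, the same request with the container specified (assembled from the call above) is:

```shell
curl -H "Authorization: $ARGO_TOKEN" -X GET \
  "http://workflows.example.com/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k/log?logOptions.container=main"
```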

However, I think that the argo-workflows-server pod should not CrashLoopBackOff for this reason;
it would be very impactful in production if a simple bad API call could crash the whole argo-workflows server.

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: graph-
  labels:
    workflows.argoproj.io/test: "true"
  annotations:
    workflows.argoproj.io/description: |
      This workflow demonstrates running a graph of tasks within containers in a single pod.
    workflows.argoproj.io/version: ">= 3.1.0"
spec:
  entrypoint: main
  templates:
    - name: main
      containerSet:
        containers:
          - name: a
            image: argoproj/argosay:v2
          - name: b
            image: argoproj/argosay:v2
            dependencies: ["a"]
          - name: c
            image: argoproj/argosay:v2
            dependencies: ["a"]
          - name: main
            image: argoproj/argosay:v2
            dependencies: ["b", "c"]

Logs from the workflow controller

kubectl logs -n argo-workflows deploy/argo-workflows-workflow-controller | grep ${workflow}
Found 2 pods, using pod/argo-workflows-workflow-controller-5fd7d444fc-dk66f

kubectl logs -n argo-workflows argo-workflows-workflow-controller-xxx-cfxmz  | grep ${workflow}


kubectl logs -n argo-workflows argo-workflows-workflow-controller-xxx-dk66f  | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo-workflows -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
error: container wait is not valid for pod workflow-controlm-3oj4k-create-job-1156902314
@agilgur5 agilgur5 changed the title argo-workflows-server CrashLoopBackOff while calling the API with curl - panic: runtime error: invalid memory address or nil pointer dereference Server CrashLoopBackOff while calling the API with curl - panic: runtime error: invalid memory address or nil pointer dereference Sep 10, 2024
@agilgur5 agilgur5 added the area/api Argo Server API label Sep 10, 2024
agilgur5 (Contributor) commented:

> And looking at the previous logs, we can clearly see that the server panicked with panic: runtime error: invalid memory address or nil pointer dereference because the container parameter was missing from the request.

There should be a stack trace following that log pointing to the exact line of the nil pointer dereference.

> it would be very impactful in production if a simple bad API call could crash the whole argo-workflows server

It normally doesn't; I'm suspecting there's a missing recover statement somewhere.
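
A minimal sketch of why a missing recover would explain the crash (illustrative only; streamPodLogs below is a hypothetical stand-in, not Argo code): in Go, a panic in a goroutine can only be stopped by a recover deferred in that same goroutine; otherwise it terminates the whole process, which is exactly what a CrashLoopBackOff on a bad request looks like.

```go
package main

import (
	"fmt"
	"sync"
)

// streamPodLogs stands in for a hypothetical log-streaming worker; the
// nil dereference mimics the reported panic.
func streamPodLogs() {
	var stream *string // nil, as if an errored call returned no stream
	fmt.Println(*stream)
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Without this deferred recover, the panic below would take down
		// the whole process, not just this goroutine.
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r)
			}
		}()
		streamPodLogs()
	}()
	wg.Wait()
	fmt.Println("server keeps serving")
}
```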

@agilgur5 agilgur5 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important area/server labels Sep 10, 2024
@agilgur5 agilgur5 changed the title Server CrashLoopBackOff while calling the API with curl - panic: runtime error: invalid memory address or nil pointer dereference Server CrashLoopBackOff while calling log API without container: panic: runtime error: invalid memory address or nil pointer dereference Sep 10, 2024

cdtzabra commented Sep 11, 2024

> There should be a stack trace following that log pointing to the exact line of the nil pointer dereference.

Unfortunately, there were no further logs concerning the error.

isubasinghe (Member) commented:

Can you run https://pkg.go.dev/cmd/addr2line with the program counter as an argument along with the binary?
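
For reference, go tool addr2line reads program counters from standard input and prints the corresponding function and file:line. With the server binary extracted from the image and the pc value from the panic, a session would look roughly like this (the expected output is inferred from the stack trace posted later in this thread, not verified):

```shell
go tool addr2line ./argo
0x1e8c8a4
# expected:
# github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1.1
# /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:167
```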


djanjic commented Sep 18, 2024

Hello, can I work on this one?

@Joibel Joibel assigned Joibel and djanjic and unassigned Joibel Sep 18, 2024
@tooptoop4 tooptoop4 linked a pull request Nov 5, 2024 that will close this issue
tooptoop4 (Contributor) commented:

@cdtzabra can you run with the debug log level turned on? I also raised #13866 if you want to give that a whirl.


cdtzabra commented Nov 8, 2024

> @cdtzabra can you run with the debug log level turned on? I also raised #13866 if you want to give that a whirl.

Hi,

Yes, I can. Next week


cdtzabra commented Nov 18, 2024

@tooptoop4

I've just tested. I first tested in a lab environment and could not reproduce the problem despite several attempts.
I removed the resource requests/limits that had been added, and there was still no CrashLoopBackOff.

So I switched to the pre-production argo-workflows (where I'd encountered the problem), tested our real WorkflowTemplate, and the CrashLoopBackOff was there.

Re-running the tests with the basic WorkflowTemplates tested in the lab, there's no problem.

I put the preprod pods in debug mode and redid the crash test, but there's no more info in the log.

Note: for the basic templates, the msg="a container name must be specified for pod workflow xxx" is still present in the server log BUT there is no crash.

The server is running with these args:

    spec:
      containers:
      - args:
        - server
        - --configmap=argo-workflows-workflow-controller-configmap
        - --auth-mode=client
        - --auth-mode=sso
        - --secure=false
        - --namespaced
        - --loglevel
        - debug
        - --gloglevel
        - "0"
        - --log-format
        - text

workflow server pod1 log

time="2024-11-18T10:16:33.744Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflowTemplate grpc.service=workflowtemplate.WorkflowTemplateService grpc.start_time="2024-11-18T10:16:33Z" grpc.time_ms=13.259 span.kind=server system=grpc
time="2024-11-18T10:16:33.747Z" level=info duration=17.33487ms method=GET path=/api/v1/workflow-templates/argo-workflows/workflow-controlm size=6690 status=0
time="2024-11-18T10:16:48.263Z" level=info duration="112.732µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:08.263Z" level=info duration="181.525µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:28.263Z" level=info duration="146.051µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:48.264Z" level=info duration="139.866µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:08.264Z" level=info duration="347.373µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:28.263Z" level=info duration="315.198µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:41.558Z" level=debug msg="List options" namespace=argo-workflows options="{{ } workflows.argoproj.io/workflow=workflow-controlm-44vpo  false false   <nil> 0 }" workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.558Z" level=debug msg="Log options" namespace=argo-workflows options="&PodLogOptions{Container:,Follow:false,Previous:false,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,}" workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.560Z" level=debug msg="Listing workflow pods" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Not starting watches" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Waiting for work-group" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Sorting entries" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.575Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-notify-204188948, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.575Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.576Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-build-1554548131, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.576Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.594Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

goroutine 373 [running]:
github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1.1({0xc0007879e0, 0x2d})
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:167 +0x524
created by github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1 in goroutine 224
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:126 +0x5e5

workflow server pod2 log

time="2024-11-18T10:18:41.650Z" level=debug msg="Not starting watches" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Waiting for work-group" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Sorting entries" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.653Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-notify-204188948, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.653Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.654Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-build-1554548131, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.654Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.669Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.670Z" level=debug msg="Work-group done" namespace=argo-workflows workflow=workflow-controlm-44vpo
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

goroutine 287 [running]:
github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1.1({0xc00054a5d0, 0x2d})
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:167 +0x524
created by github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1 in goroutine 294
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:126 +0x5e5
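
The faulting offset (addr=0x48) and the goroutine frame suggest a method call on a nil value inside the log-streaming goroutine, right after the "a container name must be specified" error is logged. A classic shape for that kind of bug, shown here as an illustrative assumption only (this is not the actual workflow-logger.go code), is deferring Close on a stream before checking the error returned with it:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// fakeStream stands in for a hypothetical GetLogs(...).Stream(ctx) call:
// with no container name it returns an error and a nil ReadCloser.
func fakeStream(container string) (io.ReadCloser, error) {
	if container == "" {
		return nil, errors.New("a container name must be specified")
	}
	return io.NopCloser(strings.NewReader("log line\n")), nil
}

func main() {
	stream, err := fakeStream("") // no container -> error and nil stream
	// BUG: Close is deferred before err is checked; when main returns,
	// the deferred call runs on a nil interface and panics with
	// "invalid memory address or nil pointer dereference" -- after the
	// error line has already been printed, matching the order in the
	// server logs above.
	defer stream.Close()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_, _ = io.Copy(io.Discard, stream)
}
```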

So I wonder if the problem isn't related to the resource types in the template.
The WorkflowTemplate that causes the workflow-server to crash also uses "artifact" and "resource" templates (to create a Kubernetes job), among other things:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-controlm
  namespace: argo-workflows
spec:
  ttlStrategy:
    # keep completed workflows for 1d
    secondsAfterCompletion: 86400
  workflowMetadata:
    annotations:
      workflows.argoproj.io/title: "{{workflow.parameters.DISPLAY_NAME}}"
    labels:
      workflows.argoproj.io/title: "{{workflow.parameters.DISPLAY_NAME}}"
  entrypoint: main-template
  onExit: exit-handler
  serviceAccountName: workflow-pods-sa
  arguments:
    parameters:
      - name: PROJECT_URL
        value: ""
      - name: PROJECT_BRANCH
        value: ""
      - name: CONTROLM_JOB
        value: ""
      - name: CONTROLM_TASK
        value: ""
      - name: CONTROLM_CTMS
        value: ""
      - name: TEMPLATE_NAME
        value: ""
      - name: BUILD_URL
        value: ""
      - name: DISPLAY_NAME
        value: "to-overload"
  volumes:
    - name: workspace
      emptyDir: {}
  templates:
    - name: main-template
      steps:
        - - name: build
            template: build
            arguments:
              parameters:
                - name: PROJECT_URL
                  value: "{{workflow.parameters.PROJECT_URL}}"
                - name: PROJECT_BRANCH
                  value: "{{workflow.parameters.PROJECT_BRANCH}}"
                - name: CONTROLM_TASK
                  value: "{{workflow.parameters.CONTROLM_TASK}}"
                - name: TEMPLATE_NAME
                  value: "{{workflow.parameters.TEMPLATE_NAME}}"

        - - name: create-job
            template: create-job
            arguments:
              artifacts:
              - name: kubejob
                raw:
                  data: |
                    {{workflow.outputs.parameters.KUBEJOB}}

    # Steps template definition: Build
    - name: build
      inputs:
        parameters:
          - name: PROJECT_URL
          - name: PROJECT_BRANCH
          - name: CONTROLM_TASK
          - name: TEMPLATE_NAME
      outputs:
        parameters:
          - name: namespace
            valueFrom:
              path: /workspace/namespace.txt
            globalName: NAMESPACE
          - name: kube_job
            valueFrom:
              path: /workspace/kube-job.yaml
            globalName: KUBEJOB
          - name: backofflimit
            valueFrom:
              path: /workspace/backofflimit.txt
            globalName: backoffLimit           
      containerSet:
        volumeMounts:
          - name: workspace
            mountPath: /workspace
        containers:
          - name: prerequisistes
            image: myregistry/tools:latest
            envFrom:
              - secretRef:
                  name: gitlab-a1963-read
            command:
              - sh
              - '-c'
              - |
                set -o errexit
                cd /workspace

                gitUrl="{{inputs.parameters.PROJECT_URL}}"
                gitBranch="{{inputs.parameters.PROJECT_BRANCH}}"
                codeApp=$(basename $(dirname {{inputs.parameters.PROJECT_URL}}))
                echo "${codeApp}-${gitBranch}" > namespace.txt
                
                git clone -b ${gitBranch} ${gitUrl} chart

          - name: main
            dependencies:
              - prerequisistes
            image: myregistry/tools:latest
            command:
              - sh
              - '-c'
              - |
                set -o errexit
                cd /workspace/

                printf "\n### Templating the Helm Chart ###\n"
                taskPath="chart/{{inputs.parameters.CONTROLM_TASK}}"
                templateName="{{inputs.parameters.TEMPLATE_NAME}}"
                outputDir="./myfolder"

                helm template ${taskPath}/${templateName}/ -f ${taskPath}/app-values.yaml -f ${taskPath}/infra-values.yaml --output-dir ${outputDir} --dependency-update

                # Store data to export them as global variables
                find ${outputDir}  -type f -iname "job*.yaml" | xargs cat > kube-job.yaml
                yq .spec.backoffLimit kube-job.yaml > backofflimit.txt

                namespace=$(cat namespace.txt)
                get_ns=$(kubectl get -o name --ignore-not-found ns ${namespace})
                if [ -z $get_ns ]; then kubectl create ns ${namespace}; fi

                printf "\n### Creating configmaps - Secrets - AVP ###\n"
                find ${outputDir}  -type f -iname "configmap-*.yaml" -o -iname "secret*.yaml" | while read -r line; do 
                cat ${line} | kubectl -n ${namespace} apply -f -
                done

    # Steps template definition: Argo App
    - name: create-job
      inputs:
        artifacts:
        - name: kubejob
          path: /artifacts/kubejob
      resource:
        action: create
        successCondition: status.succeeded > 0
        failureCondition: status.failed > {{workflow.outputs.parameters.backoffLimit}}
        flags: ["--namespace={{workflow.outputs.parameters.NAMESPACE}}"]
        manifest: |
          {{workflow.outputs.parameters.KUBEJOB}}
      outputs:
        parameters:
        - name: job-status
          valueFrom:
            jsonPath: '{.status}'


    # Exit handler templates
    # After the completion of the entrypoint template
    - name: exit-handler
      steps:
      - - name: Success-Notifier
          template: notify
          arguments:
            parameters:
              - name: exitCode
                value: 0
          when: "{{workflow.status}} == Succeeded && {{workflow.parameters.PROJECT_BRANCH}} != dev"
        - name: Failure-Notifier
          template: notify
          arguments:
            parameters:
              - name: exitCode
                value: 1
          when: "{{workflow.status}} != Succeeded && {{workflow.parameters.PROJECT_BRANCH}} != dev"

    # notify template definition
    - name: notify
      inputs:
        parameters:
          - name: exitCode
      container:
        image: alpine/curl
        envFrom:
          - secretRef:
              name: controlm
        command:
          - sh
          - '-c'
          - |
            exitCode="{{inputs.parameters.exitCode}}"
            job="{{workflow.parameters.CONTROLM_JOB}}"
            task="{{workflow.parameters.CONTROLM_TASK}}"
            ctms="{{workflow.parameters.CONTROLM_CTMS}}"
            message="xxxx"
            # POST exitCode HERE

tooptoop4 (Contributor) commented:

@cdtzabra did you build the image locally with the #13866 fix in it?
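
For anyone who wants to try the same thing, a rough sketch of building a PR locally (the pull ref syntax is standard GitHub; the argocli Dockerfile stage name is an assumption about the repo layout):

```shell
git clone https://github.com/argoproj/argo-workflows.git && cd argo-workflows
# fetch the PR branch via GitHub's pull ref
git fetch origin pull/13866/head:pr-13866 && git checkout pr-13866
# build the server image (stage name is an assumption)
docker build . --target argocli -t argocli:pr-13866
```

Then point the argo-workflows-server Deployment at the resulting image.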

cdtzabra (Author) commented:

@tooptoop4 NO, but I will do it...
