Invalid node IDs in workflow.status #10107

mweibel · 2022-11-25T13:10:28Z

Pre-requisites

I have double-checked my configuration
I can confirm the issues exists when I tested with :latest
I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Since #8748 pod names use the V2 naming which contains the template name in the pod name. However the implementation did not update the Workflow.Status.Nodes map to contain the correct pod name anymore. There's a disconnect between NodeIDs and pod names which wasn't the case before. This makes it impossible to look at a argo workflow status, take the nodeID and use it to know which pod it belongs to.

This is a follow-up of #9906. I initially thought this would be the same case but apparently is not.

The below workflow triggers the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay
        echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"

The pod to run is named nodename-bvd45-blender-3928099255 but the NodeID in workflow.status.nodes is just nodename-bvd45-3928099255.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
    - inputs: {}
      metadata: {}
      name: render
      steps:
        - - arguments:
              parameters:
                - name: frames
                  value: '{{item.frames}}'
            name: run-blender
            template: blender
            withItems:
              - frames: 1
    - container:
        image: argoproj/argosay:v2
        command: ["/bin/sh", "-c"]
        args:
          - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
        name: ""
      inputs:
        parameters:
          - name: frames
      name: blender
      retryStrategy:
        limit: 2
        retryPolicy: Always

Logs from the workflow controller

irrelevant

Logs from in your workflow's wait container

irrelevant

The text was updated successfully, but these errors were encountered:

ognjen-it · 2022-11-28T07:31:07Z

I have the same error :)
Just finger up for the issue 👍🏻

alexec · 2022-11-28T08:25:23Z

Node IDs are opaque identifiers, not pod names. Not a bug.

ognjen-it · 2022-11-28T08:40:14Z

Hi @alexec, thank you for your message. I didn't read this issue properly. He wrote about nodeID.
However, there is a problem with podName.
The real pod name is ognjen-587tr-template-1243882462 but podName in workflow.status is ognjen-587tr-1243882462. It's 100% bug.
I don't know if I need to create a new issue.

alexec · 2022-11-28T08:43:28Z

Ah. That is a bug. Can you raise a separate issue?

ognjen-it · 2022-11-28T10:05:46Z

@alexec #10124 done :)

caelan-io · 2022-11-28T18:45:04Z

Closed per @alexec comment:

Node IDs are opaque identifiers, not pod names. Not a bug.

chensun · 2023-05-05T11:22:40Z

Node IDs are opaque identifiers, not pod names. Not a bug.

@alexec We have been relying on node id being equal to pod name for the past few years until we recently try to upgrade from v3.3.8 to v3.4.7 and hit this issue. So this is a breaking change at minimum. Is there any reason why we can't or shouldn't use pod name as node id?

Also, if I'm not mistaken, workflow.status would contain pod name only when there's a failure case. If node id is different from pod name, then how can we get the pod name for non-failure case?

Thanks in advance!

chensun · 2023-05-05T11:36:14Z

Since #8748 pod names use the V2 naming which contains the template name in the pod name. However the implementation did not update the Workflow.Status.Nodes map to contain the correct pod name anymore. There's a disconnect between NodeIDs and pod names which wasn't the case before. This makes it impossible to look at a argo workflow status, take the nodeID and use it to know which pod it belongs to.

+1 on what @mweibel said. This describes the issue precisely. This regression is blocking us from upgrading to v3.4.

terrytangyuan · 2023-05-05T18:42:13Z

Let's reopen this since this blocks upgrade for KFP.

terrytangyuan · 2023-05-09T12:52:31Z

@chensun To unblock your upgrade, you can construct/infer the podName from node.ID and node.templateName. For example:

nodeId: coinflip-tmrzc-3636679105
templateName: flip-coin
podName: coinflip-tmrzc-flip-coin-3636679105

This is basically how we get podName in the UI:

export const getPodName = (workflowName: string, nodeName: string, templateName: string, nodeID: string, version: string): string => {
    if (version === POD_NAME_V2 && templateName !== '') {
        if (workflowName === nodeName) {
            return workflowName;
        }

        const prefix = ensurePodNamePrefixLength(`${workflowName}-${templateName}`);

        const hash = createFNVHash(nodeName);
        return `${prefix}-${hash}`;
    }

    return nodeID;
};

and here's the podName implementation for failure case:

func GeneratePodName(workflowName, nodeName, templateName, nodeID string, version PodNameVersion) string {
	if version == PodNameV1 {
		return nodeID
	}

	if workflowName == nodeName {
		return workflowName
	}

	prefix := workflowName
	if !strings.Contains(nodeName, ".inline") {
		prefix = fmt.Sprintf("%s-%s", workflowName, templateName)
	}
	prefix = ensurePodNamePrefixLength(prefix)

	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))

	return fmt.Sprintf("%s-%v", prefix, h.Sum32())

}

terrytangyuan · 2023-05-09T13:01:58Z

I don't think you should rely on nodeId as Alex mentioned earlier:

Node IDs are opaque identifiers, not pod names. Not a bug.

Instead, it might be a good enhancement proposal to also add podName to node statuses for non-failure cases.

chensun · 2023-05-09T21:20:10Z

@terrytangyuan Thank your for your reply.

To unblock your upgrade, you can construct/infer the podName from node.ID and node.templateName

This is rather hacky, and subject to break with future upgrade. The way argo construct pod name is an implementation detail that no user should ever reimplement it themselves. In fact, even within Argo's code space, it should be a single SOT for pod name generation. There has been a chain of bugs following the v2 pod naming change. Please consider #10267.

This is basically how we get podName in the UI:

In fact, there's possibly an open bug with the code you pasted. See #11015 and the proposed fix. Again to the point of having a single SOT for pod name.

I don't think you should rely on nodeId as Alex mentioned earlier:

Node IDs are opaque identifiers, not pod names. Not a bug.

Instead, it might be a good enhancement proposal to also add podName to node statuses for non-failure cases.

Can anyone explain why node IDs cannot be the same as pod names (which was the case before v3.4)? Node IDs can an arbitrary UUIDs if you want, but being able to know the k8s pod name regardless failure or not is a super important feature. Missing pod name is a regression and an undocumented breaking change that should be avoided with minor version bumps.

terrytangyuan · 2023-05-10T00:06:21Z

Yes I agree it's hacky and could only be a temporary workaround to unblock your release.

cc @isubasinghe @JPZ13 @juliev0 who contributed the feature and relevant fixes. Any thoughts?

juliev0 · 2023-05-10T21:46:26Z

Yes I agree it's hacky and could only be a temporary workaround to unblock your release.

cc @isubasinghe @JPZ13 @juliev0 who contributed the feature and relevant fixes. Any thoughts?

Hey Terry. I'm not sure exactly the question, but I believe JP put in the original work for updating to POD NAMES V2 and has been looking at many of the bugs for that since then. (Although, the very first issue I worked on was switching to V2 as the default, and then also fixed an issue that was related to this afterward. I actually never really was told the motivation for the change or anything.)

terrytangyuan · 2023-05-11T13:12:49Z

I don't recall being involved in that conversation either. Perhaps @alexec or @sarabala1979 have more context around the motivation behind the change?

JPZ13 · 2023-05-11T18:05:29Z

Chiming in - the motivation for switching from node ids to pod names with the workflow template in the pod name was so that a user who is looking through their individual pods using kubectl get pods can quickly figure out which pod corresponds to which step in a workflow. Most end users and infrastructure users really like this feature per the comment on #7279:

It's just better, especially for tracing, debugging

I'll confirm in the codebase, but if you don't want the pod names to display with the workflowtemplate and you need the pod names to just be the node id, you can set the POD_NAMES environment variable to v1 and have it go back to node id

cc @terrytangyuan @chensun

JPZ13 · 2023-05-11T18:08:02Z

Following up, the code path for v1 pod names (node ids) still exists: https://github.com/argoproj/argo-workflows/blob/master/workflow/util/pod_name.go#L38

chensun · 2023-05-15T18:38:36Z

@JPZ13 Thanks for your reply. I can see the value of adding more info to pod name to make it more debug-friend. I wasn't questioning the change to pod name itself, but why node id cannot follow the same?

I can try the POD_NAMES env variable. But given the number of bugs and fixes around this topic, I'm not very confident that changing the environment variable won't break other cases. Please do consider #10267.

JPZ13 · 2023-05-15T19:44:03Z

I very much agree with #10267

aaron-arellano · 2023-08-21T19:48:10Z

We use Argo heavily at ZG and when upgrading to v3.4 of argo workflows we noticed breaking changes this causes outside of the nodeId in the workflow. This v2 naming convention also breaks upstream k8s HOSTNAME env variable for the pod. For instance, we get the HOSTNAME in the workflow pod and run the kubernetes api call to get_namespace_pod with that HOSTNAME the pod name returned from the kubernetes api server does not match the pod name in the actual pod metadata.name. Not sure if there is some weird concatenation going on that is not persisting to etcd but the downard API does not match what is in etcd. I reverted the env var POD_NAMES to v1 and everything works in v3.4. I feel with all the bugs this v2 pod name should be reverted becuase the scope of breaking changes persists beyond argo itself and into kubernetes.

related to #10267

henrywangx · 2024-05-28T05:30:57Z

@terrytangyuan Thank your for your reply.

To unblock your upgrade, you can construct/infer the podName from node.ID and node.templateName

This is rather hacky, and subject to break with future upgrade. The way argo construct pod name is an implementation detail that no user should ever reimplement it themselves. In fact, even within Argo's code space, it should be a single SOT for pod name generation. There has been a chain of bugs following the v2 pod naming change. Please consider #10267.

This is basically how we get podName in the UI:

In fact, there's possibly an open bug with the code you pasted. See #11015 and the proposed fix. Again to the point of having a single SOT for pod name.

I don't think you should rely on nodeId as Alex mentioned earlier:

Node IDs are opaque identifiers, not pod names. Not a bug.

Instead, it might be a good enhancement proposal to also add podName to node statuses for non-failure cases.

Can anyone explain why node IDs cannot be the same as pod names (which was the case before v3.4)? Node IDs can an arbitrary UUIDs if you want, but being able to know the k8s pod name regardless failure or not is a super important feature. Missing pod name is a regression and an undocumented breaking change that should be avoided with minor version bumps.

Agree with it. If we change pod'name with v2, we can't get the pod name from status. It's really inconvenient.

mweibel added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Nov 25, 2022

mweibel mentioned this issue Nov 25, 2022

Wrong Pod name in argo get command result from CLI #9906

Closed

3 tasks

sarabala1979 closed this as completed Nov 28, 2022

chensun mentioned this issue May 5, 2023

feat!: Upgrade argo to v3.4.7 kubeflow/pipelines#9301

Closed

1 task

terrytangyuan reopened this May 5, 2023

JPZ13 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label May 11, 2023

This comment was marked as resolved.

Sign in to view

stale bot added the problem/stale This has not had a response in some time label Jun 18, 2023

SlaterByte mentioned this issue Jun 21, 2023

argo get displays incorrect PODNAME when Node has a templateRef #11250

Closed

3 tasks

terrytangyuan mentioned this issue Jun 22, 2023

Ensure single source of truth for generated values. #10267

Open

This comment was marked as resolved.

Sign in to view

stale bot removed the problem/stale This has not had a response in some time label Jul 11, 2023

aaron-arellano mentioned this issue Aug 21, 2023

WIP: Add Task names to Pod names #1320

Closed

juliev0 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Sep 7, 2023

chensun mentioned this issue Jan 18, 2024

Kubeflow Pipelines open /var/run/argo/outputs/parameters/tmp/outputs/execution-id kubeflow/pipelines#10365

Closed

HumairAK mentioned this issue Apr 14, 2024

feat: store podname in nodestatus #12503

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid node IDs in workflow.status #10107

Invalid node IDs in workflow.status #10107

mweibel commented Nov 25, 2022

ognjen-it commented Nov 28, 2022

alexec commented Nov 28, 2022

ognjen-it commented Nov 28, 2022 •

edited

Loading

alexec commented Nov 28, 2022

ognjen-it commented Nov 28, 2022

caelan-io commented Nov 28, 2022

chensun commented May 5, 2023

chensun commented May 5, 2023

terrytangyuan commented May 5, 2023

terrytangyuan commented May 9, 2023 •

edited by agilgur5

Loading

terrytangyuan commented May 9, 2023

chensun commented May 9, 2023

terrytangyuan commented May 10, 2023

juliev0 commented May 10, 2023

terrytangyuan commented May 11, 2023

JPZ13 commented May 11, 2023 •

edited by agilgur5

Loading

JPZ13 commented May 11, 2023

chensun commented May 15, 2023

JPZ13 commented May 15, 2023

This comment was marked as resolved.

This comment was marked as resolved.

aaron-arellano commented Aug 21, 2023 •

edited

Loading

henrywangx commented May 28, 2024

Invalid node IDs in workflow.status #10107

Invalid node IDs in workflow.status #10107

Comments

mweibel commented Nov 25, 2022

Pre-requisites

What happened/what you expected to happen?

Version

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

ognjen-it commented Nov 28, 2022

alexec commented Nov 28, 2022

ognjen-it commented Nov 28, 2022 • edited Loading

alexec commented Nov 28, 2022

ognjen-it commented Nov 28, 2022

caelan-io commented Nov 28, 2022

chensun commented May 5, 2023

chensun commented May 5, 2023

terrytangyuan commented May 5, 2023

terrytangyuan commented May 9, 2023 • edited by agilgur5 Loading

terrytangyuan commented May 9, 2023

chensun commented May 9, 2023

terrytangyuan commented May 10, 2023

juliev0 commented May 10, 2023

terrytangyuan commented May 11, 2023

JPZ13 commented May 11, 2023 • edited by agilgur5 Loading

JPZ13 commented May 11, 2023

chensun commented May 15, 2023

JPZ13 commented May 15, 2023

This comment was marked as resolved.

This comment was marked as resolved.

aaron-arellano commented Aug 21, 2023 • edited Loading

henrywangx commented May 28, 2024

ognjen-it commented Nov 28, 2022 •

edited

Loading

terrytangyuan commented May 9, 2023 •

edited by agilgur5

Loading

JPZ13 commented May 11, 2023 •

edited by agilgur5

Loading

aaron-arellano commented Aug 21, 2023 •

edited

Loading