Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] pipelines flaky on GKE after upgrading to argo v3.1+ docker executor #5943

Closed
Tracked by #5779
Bobgy opened this issue Jun 29, 2021 · 5 comments · Fixed by #6228
Closed
Tracked by #5779

[bug] pipelines flaky on GKE after upgrading to argo v3.1+ docker executor #5943

Bobgy opened this issue Jun 29, 2021 · 5 comments · Fixed by #6228

Comments

@Bobgy
Copy link
Contributor

Bobgy commented Jun 29, 2021

What steps did you take

pipelines flaky after upgrading to argo v3.1.0, when using argo docker executor

  1. install kfp 1.7.0-alpha.1
  2. run v2 sample tests in the local cluster

What happened:

Some pipelines fail randomly with several types of error:

This step is in Error state with this message: Error (exit code 1): path /tmp/outputs/metrics/data does not exist in archive /tmp/argo/outputs/artifacts/output-named-tuple-metrics.tgz

This step is in Error state with this message: Error (exit code 1): Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar: Exiting with failure status due to previous errors
https://1bbe723ceaf1ede1-dot-asia-east1.pipelines.googleusercontent.com/#/runs/details/da866240-98f4-42d1-86e1-2c6b17a7b944

This step is in Error state with this message: Error (exit code 1): failed to wait for main container to complete: timed out waiting for the condition: Error response from daemon: No such container: 2723a1dcd93b23f100a96bd82480b8917b8f805827d41e1f70f515655cb1d9e1
https://1bbe723ceaf1ede1-dot-asia-east1.pipelines.googleusercontent.com/#/runs/details/1f5a60a1-885a-4de5-8222-d906e92130bc

Both type of errors have very similar logs in argo wait container:

time="2021-06-29T11:14:29.706Z" level=info msg="Starting annotations monitor"
time="2021-06-29T11:14:29.706Z" level=info msg="Starting deadline monitor"
time="2021-06-29T11:14:29.707Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:29.790Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Created {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:29.790Z" level=info msg="mapped container name \"wait\" to container ID \"1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431\" (created at 2021-06-29 11:14:17 +0000 UTC, status Created)"
time="2021-06-29T11:14:29.790Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:30.791Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:30.862Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:30.862Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:31.862Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:31.990Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:31.990Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:32.990Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:33.067Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:33.067Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:34.067Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:34.158Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:34.158Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:35.159Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:35.286Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:35.286Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:36.286Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:36.359Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:36.359Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:37.360Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:37.535Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:37.535Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:38.535Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:38.598Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:38.598Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:39.598Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:39.661Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:39.661Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:40.661Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:40.782Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:40.782Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:41.782Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:41.877Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:41.877Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:41.877Z" level=info msg="mapped container name \"main\" to container ID \"32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5\" (created at 2021-06-29 11:14:32 +0000 UTC, status Created)"
time="2021-06-29T11:14:42.710Z" level=info msg="docker wait 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5"
time="2021-06-29T11:14:42.768Z" level=error msg="`docker wait 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5` failed: Error response from daemon: No such container: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5\n"
time="2021-06-29T11:14:42.877Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:42.935Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:42.935Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:43.935Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:44.012Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:44.012Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:44.071Z" level=info msg="docker wait 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5"
time="2021-06-29T11:14:44.071Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:44.265Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:44.265Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:44.302Z" level=info msg="Main container completed"
time="2021-06-29T11:14:44.302Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-06-29T11:14:44.302Z" level=info msg="Saving logs"
time="2021-06-29T11:14:44.303Z" level=info msg="[docker logs 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5]"
time="2021-06-29T11:14:45.013Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:45.112Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:45.112Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:45.268Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:45.423Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:45.423Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:46.113Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:46.167Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:46.168Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:46.423Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:46.589Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Created {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:46.589Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:46.878Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/add-pipeline-fz5cn/2021/06/29/add-pipeline-fz5cn-2565795640/main.log"
time="2021-06-29T11:14:46.878Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-06-29T11:14:46.879Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/add-pipeline-fz5cn/2021/06/29/add-pipeline-fz5cn-2565795640/main.log)"
time="2021-06-29T11:14:47.168Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:47.259Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2021-06-29T11:14:47.259Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2021-06-29T11:14:47.259Z" level=info msg="Saving output parameters"
time="2021-06-29T11:14:47.259Z" level=info msg="Saving path output parameter: add-2-sum"
time="2021-06-29T11:14:47.259Z" level=info msg="Copying /tmp/outputs/sum/data from base image layer"
time="2021-06-29T11:14:47.259Z" level=info msg="sh -c docker cp -a 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data - | tar -ax -O"
time="2021-06-29T11:14:47.284Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Up {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:47.284Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:47.413Z" level=error msg="`sh -c docker cp -a 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data - | tar -ax -O` failed: Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\n"
time="2021-06-29T11:14:47.413Z" level=error msg="executor error: Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.InternalError\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:60\ngithub.com/argoproj/argo-workflows/v3/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo-workflows/workflow/common/util.go:238\ngithub.com/argoproj/argo-workflows/v3/workflow/common.RunShellCommand\n\t/go/src/github.com/argoproj/argo-workflows/workflow/common/util.go:255\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(*DockerExecutor).GetFileContents\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:58\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).SaveParameters\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:492\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:57\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"
time="2021-06-29T11:14:47.413Z" level=info msg="Saving output artifacts"
time="2021-06-29T11:14:47.414Z" level=info msg="Staging artifact: add-2-sum"
time="2021-06-29T11:14:47.414Z" level=info msg="Copying /tmp/outputs/sum/data from container base image layer to /tmp/argo/outputs/artifacts/add-2-sum.tgz"
time="2021-06-29T11:14:47.414Z" level=info msg="Archiving main:/tmp/outputs/sum/data to /tmp/argo/outputs/artifacts/add-2-sum.tgz"
time="2021-06-29T11:14:47.414Z" level=info msg="sh -c docker cp -a 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data - | gzip  > /tmp/argo/outputs/artifacts/add-2-sum.tgz"
time="2021-06-29T11:14:47.589Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:47.680Z" level=warning msg="path /tmp/outputs/sum/data does not exist in archive /tmp/argo/outputs/artifacts/add-2-sum.tgz"
time="2021-06-29T11:14:47.680Z" level=error msg="executor error: path /tmp/outputs/sum/data does not exist in archive /tmp/argo/outputs/artifacts/add-2-sum.tgz\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.Errorf\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(*DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:88\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:402\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:290\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:276\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:62\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"
time="2021-06-29T11:14:47.681Z" level=info msg="Annotating pod with output"
time="2021-06-29T11:14:47.694Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Up {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:47.694Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-06-29 11:13:20 +0000 UTC, too long before process started"
time="2021-06-29T11:14:47.739Z" level=info msg="Patch pods 200"
time="2021-06-29T11:14:47.898Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=add-pipeline-fz5cn-2565795640"
time="2021-06-29T11:14:48.025Z" level=info msg="listed containers" containers="map[kfp-launcher:{7a39cf1866758e4c53dd138c38f90b2a683ae0059a3ecc2ea0b439a28dbb38c1 Exited {0 63760562000 <nil>}} main:{32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5 Up {0 63760562072 <nil>}} wait:{1531d46bf944fcdb3ac8ea356e0536bb2a01ba9791e9007699f6d195401fc431 Up {0 63760562057 <nil>}}]"
time="2021-06-29T11:14:48.025Z" level=info msg="Killing sidecars [\"kfp-launcher\"]"
time="2021-06-29T11:14:48.032Z" level=info msg="Get pods 200"
time="2021-06-29T11:14:48.034Z" level=info msg="zero container IDs, assuming all containers have exited successfully"
time="2021-06-29T11:14:48.034Z" level=info msg="Alloc=10350 TotalAlloc=17307 Sys=73297 NumGC=5 Goroutines=13"
time="2021-06-29T11:14:48.173Z" level=fatal msg="Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.InternalError\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:60\ngithub.com/argoproj/argo-workflows/v3/workflow/common.RunCommand\n\t/go/src/github.com/argoproj/argo-workflows/workflow/common/util.go:238\ngithub.com/argoproj/argo-workflows/v3/workflow/common.RunShellCommand\n\t/go/src/github.com/argoproj/argo-workflows/workflow/common/util.go:255\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(*DockerExecutor).GetFileContents\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:58\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).SaveParameters\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:492\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:57\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

I did some investigation, /tmp/outputs/metrics/data is path in the container where output artifacts are expected to be emitted. The archive path is an implementation detail of docker executor, so we can ignore it here. To verify, we may need to confirm whether inside the container the artifact was correctly generated.

What did you expect to happen:

Pipelines run successfully

Labels

/area backend


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@Bobgy Bobgy added the kind/bug label Jun 29, 2021
@Bobgy Bobgy changed the title [bug/v2compat] pipelines flaky after upgrading to argo v3.1.0 [bug] pipelines flaky on GKE Jun 29, 2021
@Bobgy
Copy link
Contributor Author

Bobgy commented Jun 29, 2021

Downgraded to KFP 1.6.0 and verified the same problems, they seem to be related to my GKE cluster than argo/kfp.

@Bobgy
Copy link
Contributor Author

Bobgy commented Jul 9, 2021

I'm seeing this again on test infra:
https://4e18c21c9d33d20f-dot-datalab-vm-staging.googleusercontent.com/#/runs/details/d249c419-a9a7-4653-8212-856f87ad1e91
https://4e18c21c9d33d20f-dot-datalab-vm-staging.googleusercontent.com/#/runs/details/cb35184f-60a5-44e1-832d-6ec1d7866c98

again on

This step is in Error state with this message: Error (exit code 1): path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/output-artifact-Output.tgz

https://4e18c21c9d33d20f-dot-datalab-vm-staging.googleusercontent.com/#/runs/details/dfb5fe83-dd21-4a26-b494-c9382e700de5

@Bobgy
Copy link
Contributor Author

Bobgy commented Jul 9, 2021

Restarted nodepools and the issue is temporarily resolved

@Bobgy
Copy link
Contributor Author

Bobgy commented Jul 9, 2021

Just saw a different type of error:
#5970 (comment)

F0709 05:29:17.078200 1 main.go:56] Failed to execute component: failed to download input artifact "input_dir" from remote storage URI "gs://kfp-ci/4ca1dd2d702e106386af42120c73df46ed8c0cc5/v2-sample-test/data/samples_config-loop-item/dir-pipeline-v2/dir-pipeline-v2-hd2tw/produce-dir-with-files-v2-python-op/output_dir": failed to list objects in remote storage "dir-pipeline-v2/dir-pipeline-v2-hd2tw/produce-dir-with-files-v2-python-op/output_dir": blob (code=Unknown): Get "

@Bobgy Bobgy changed the title [bug] pipelines flaky on GKE [bug] pipelines flaky on GKE after upgrading to argo v3.1+ Jul 16, 2021
@Bobgy Bobgy changed the title [bug] pipelines flaky on GKE after upgrading to argo v3.1+ [bug] pipelines flaky on GKE after upgrading to argo v3.1+ docker executor Jul 16, 2021
@Bobgy
Copy link
Contributor Author

Bobgy commented Jul 22, 2021

Decided to stop working on this issue, because the problems proves to be issues in argo docker executor itself.
Our plan has been upgrading to emissary executor, so there's not much value stabilizing argo docker executor if it isn't stable already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants