[bug] pipelines flaky on GKE after upgrading to argo v3.1+ docker executor #5943
Comments
Downgraded to KFP 1.6.0 and verified the same problems; they seem to be related to my GKE cluster rather than argo/kfp.
I'm seeing this again on test infra:
Restarted nodepools and the issue is temporarily resolved.
Just saw a different type of error:
Decided to stop working on this issue, because the problems prove to be issues in the argo docker executor itself.
What steps did you take:
Pipelines became flaky after upgrading to argo v3.1.0 when using the argo docker executor.
What happened:
Some pipelines fail randomly with several types of errors:
This step is in Error state with this message: Error (exit code 1): path /tmp/outputs/metrics/data does not exist in archive /tmp/argo/outputs/artifacts/output-named-tuple-metrics.tgz
This step is in Error state with this message: Error (exit code 1): Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar: Exiting with failure status due to previous errors
https://1bbe723ceaf1ede1-dot-asia-east1.pipelines.googleusercontent.com/#/runs/details/da866240-98f4-42d1-86e1-2c6b17a7b944
This step is in Error state with this message: Error (exit code 1): failed to wait for main container to complete: timed out waiting for the condition: Error response from daemon: No such container: 2723a1dcd93b23f100a96bd82480b8917b8f805827d41e1f70f515655cb1d9e1
https://1bbe723ceaf1ede1-dot-asia-east1.pipelines.googleusercontent.com/#/runs/details/1f5a60a1-885a-4de5-8222-d906e92130bc
Both types of errors have very similar logs in the argo wait container:
I did some investigation. The path /tmp/outputs/metrics/data is where output artifacts are expected to be emitted in the container. The archive path is an implementation detail of the docker executor, so we can ignore it here. To verify, we may need to confirm whether the artifact was correctly generated inside the container.
What did you expect to happen:
Pipelines run successfully
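For the verification step mentioned in the investigation above, one option is to inspect a local copy of the output archive and check whether it is a readable tar file and whether it contains the expected entry. The following is only a minimal sketch: the helper names `list_archive_members` and `archive_has_member` are hypothetical, and it assumes the archive has already been copied out of the pod to a local path. It does not reproduce what the argo docker executor does internally.

```python
import posixpath
import tarfile


def list_archive_members(archive_path):
    """Return member names of a tar archive, or None if it cannot be read as tar.

    Hypothetical helper: archive_path is assumed to be a local copy of the
    output archive (for example a .tgz file).
    """
    try:
        # "r:*" lets tarfile detect gzip/bz2/xz/plain tar transparently.
        with tarfile.open(archive_path, mode="r:*") as archive:
            return archive.getnames()
    except (tarfile.ReadError, FileNotFoundError):
        # Covers the "does not look like a tar archive" case and a missing file.
        return None


def archive_has_member(archive_path, member_name):
    """Check whether the archive contains an entry with the given name."""
    names = list_archive_members(archive_path)
    if names is None:
        return False

    def normalize(name):
        # Treat "data", "./data" and "/data" as the same entry name.
        return posixpath.normpath(name).lstrip("/")

    wanted = normalize(member_name)
    return any(normalize(name) == wanted for name in names)
```

If `list_archive_members` returns `None`, the file was not written as a valid tar archive at all, which would correspond to the "This does not look like a tar archive" symptom. If it returns a list that lacks the expected entry, the artifact was probably never produced at the expected path.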
Labels
/area backend
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.