Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transient network errors are not handled properly for plugin API calls #9664

Closed
3 tasks done
Suckzoo opened this issue Sep 23, 2022 · 0 comments · Fixed by #9665
Closed
3 tasks done

Transient network errors are not handled properly for plugin API calls #9664

Suckzoo opened this issue Sep 23, 2022 · 0 comments · Fixed by #9665

Comments

@Suckzoo
Copy link
Contributor

Suckzoo commented Sep 23, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

1. Transient url.Error is not handled properly.

While every transient network error is *url.Error when using net/http's http.Client.Do() (https://github.com/golang/go/blob/master/src/net/http/client.go#L579-L580), the only *url.Error that is considered transient is ECONNCLOSED (source code).
#9017 deals with the ECONNRESET error, but this pull is not very effective on http request because ECONNRESET is an *url.Error actually.

2. argoexec does not retry ExecutorPlugin API call on transient network errors.

The document describes that any connection/socket error occurred during a plugin API call is considered transient, therefore triggers retry logic. Temporary(this term is arguable. golang/go#45729) network errors are handled as of now, but this does not fully cover every transient network error. ECONNRESET is one of the example. Generally it is not a temporary network error(golang/go#24808), but in the context of argo-workflow we can consider it as a transient error, i.e. we may retry and get the successful reply from the plugin.

3. EOF error during http request also should be considered as a transient network errors.

I accidentally set the requeueing period same as the keep-alive timeout of the ExecutorPlugin and I found EOF could also happen as well as ECONNRESET. When invoking read() syscall inside the http request module, EOF wrapped with *url.Error could be returned if peer has disconnected.

How to reproduce:

Implement a simple ExecutorPlugin with nodejs(e.g. with expressjs), without keep-alive timeout tuning. The server should respond with requeue: 5s, node.phase: Running.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

See the description above.

Logs from the workflow controller

Nothing special.

Logs from in your workflow's wait container

time="2022-09-23T03:40:01.131Z" level=info msg="Processing task" nodeID=hello-ptjn9
time="2022-09-23T03:40:01.132Z" level=info msg="Sending result" message="Post \"http://localhost:4355/api/v1/template.execute\": read tcp 127.0.0.1:48594->127.0.0.1:4355: read: connection reset by peer" nodeID=hello-ptjn9 phase=Failed requeue=0s
time="2022-09-23T03:40:04.547Z" level=info msg="Processing Patch" workflow=hello-ptjn9
time="2022-09-23T03:40:04.559Z" level=info msg="TaskSet Event" event_type=MODIFIED workflow=hello-ptjn9
time="2022-09-23T03:40:04.559Z" level=info msg="Task is already considered" nodeID=hello-ptjn9
time="2022-09-23T03:40:04.560Z" level=info msg="Patch workflowtasksets/status 200"
time="2022-09-23T03:40:04.561Z" level=info msg="Patched TaskSet"
time="2022-09-23T03:40:04.917Z" level=info msg="TaskSet Event" event_type=MODIFIED workflow=hello-ptjn9
time="2022-09-23T03:40:14.949Z" level=info msg="TaskSet Event" event_type=MODIFIED workflow=hello-ptjn9
time="2022-09-23T03:40:14.949Z" level=info msg="Workflow completed... stopping agent" workflow=hello-ptjn9
chenyangxueHDU pushed a commit to chenyangxueHDU/argo that referenced this issue Sep 29, 2022
juchaosong pushed a commit to juchaosong/argo-workflows that referenced this issue Nov 3, 2022
@agilgur5 agilgur5 changed the title Transient network errors are not handled properly Transient network errors are not handled properly for plugin API calls Apr 26, 2024
@argoproj argoproj locked as resolved and limited conversation to collaborators Apr 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants