fix(controller): Report reconciliation errors better #4877

Merged: 17 commits, Jan 21, 2021

alexec (Contributor) commented Jan 14, 2021

Signed-off-by: Alex Collins <alex_collins@intuit.com>

Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
workflow/controller/dag.go (outdated, resolved)
workflow/controller/operator.go (resolved)
  woc.addChildNode(sgNodeName, childNodeName)
- return woc.markNodePhase(node.Name, wfv1.NodeError, errMsg)
+ return woc.markNodeError(node.Name, fmt.Errorf("step group deemed errored due to child %s error: %w", childNodeName, err))
Contributor Author

again, obfuscating the error to the user

workflow/controller/operator.go (outdated, resolved)
@alexec alexec marked this pull request as ready for review January 14, 2021 18:09
alexec and others added 3 commits January 15, 2021 08:20
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Contributor Author

alexec commented Jan 17, 2021

TestContinueOnFailDag

@alexec alexec marked this pull request as draft January 17, 2021 18:27
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@alexec alexec marked this pull request as ready for review January 19, 2021 17:14
case ErrTimeout:
woc.markWorkflowFailed(ctx, x.Error())
return
default:
Contributor Author

guard with transient check

Contributor Author

add feature flag (envvar)

Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Contributor Author

alexec commented Jan 19, 2021

12/22/20 4:02:13.525 AM | time="2020-12-22T12:02:13Z" level=error msg="error in entry template execution" error="pods \"canary-6kwhk\" is forbidden: rpc error: code = Unavailable desc = transport is closing" namespace=argo workflow=canary-6kwhk host = workflow-controller-79dd88b9f-k2b4d source = kubernetes.var.log.containers.workflow-controller-79dd88b9f-k2b4d_argo_workflow-controller-29a517c3a32db2760ee89950d138e1bd47b97438ef955d4fdab2abebc54093a0.log sourcetype = fluent
12/20/20 7:38:30.026 PM | time="2020-12-21T03:38:30Z" level=error msg="error in entry template execution" error="Internal error occurred: failed calling webhook \"mutating-webhook.openpolicyagent.org\": Post https://opa.opa.svc:443/?timeout=30s: context deadline exceeded" namespace=argo workflow=canary-wwhpj host = workflow-controller-647fd7778f-525tm source = kubernetes.var.log.containers.workflow-controller-647fd7778f-525tm_argo_workflow-controller-08eb6221b4257b23a598a11961475ab0125dcb1ce765e4314445d7d1b468b92b.log sourcetype = fluent
12/20/20 7:37:30.024 PM | time="2020-12-21T03:37:30Z" level=error msg="error in entry template execution" error="Internal error occurred: failed calling webhook \"mutating-webhook.openpolicyagent.org\": Post https://opa.opa.svc:443/?timeout=30s: context deadline exceeded" namespace=argo workflow=canary-r7nq2 host = workflow-controller-647fd7778f-525tm source = kubernetes.var.log.containers.workflow-controller-647fd7778f-525tm_argo_workflow-controller-08eb6221b4257b23a598a11961475ab0125dcb1ce765e4314445d7d1b468b92b.log sourcetype = fluent

Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@simster7 simster7 self-assigned this Jan 20, 2021
util/diff/diff.go (outdated, resolved)
  switch err {
  case ErrDeadlineExceeded:
- woc.eventRecorder.Event(woc.wf, apiv1.EventTypeWarning, "WorkflowTimedOut", msg)
+ woc.eventRecorder.Event(woc.wf, apiv1.EventTypeWarning, "WorkflowTimedOut", x.Error())
  case ErrParallelismReached:
Member

Currently we won't log ErrParallelismReached. Is this intended or did you mean to add a fallthrough?

Contributor Author

not quite, we will log it (just before the switch statement) - but we don't want to error the workflow

Co-authored-by: Simon Behar <simbeh7@gmail.com>
@alexec alexec merged commit f872366 into argoproj:master Jan 21, 2021
@alexec alexec deleted the err branch January 21, 2021 00:37
This was referenced Jan 25, 2021
2 participants