-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix nexus_task_execution_failed to include OperationError outcome in start requests #1664
Conversation
Do we not have any tests for metrics emitted that need to be updated? |
No tests for metrics in this case. I guess I can add some. |
// Increment failure in all forms of errors: | ||
// Internal error processing the task. | ||
// Failure from user handler. | ||
// Special case for the start response with operation error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a task failure if the process task call doesn't return a failure? I wouldn't expect failure to start things like a child workflow to increment this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is when a handler returns an UnsuccessfulOperationError. A user would consider it a failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, may need a definition for this metric. We have temporal_workflow_task_execution_failed
which is actually a workflow task failure (a defined Temporal term) and we have temporal_activity_execution_failed
which is an activity attempt/execution failure (we don't call this a task failure).
Is this expected to be a task failure (i.e. unexpected, non-user-returned/panic, failure to process the task) or is this an execution failure (i.e. expected, user-returned error)? I wonder if there should be two metrics, temporal_nexus_task_execution_failed
and temporal_nexus_operation_execution_failed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine renaming the metric to something like temporal_nexus_operation_execution_failed
not sure why we need two metrics if activity only needs one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issue with your suggestion is that there are other ways that the execution would fail, e.g. by returning a non retryable error in your handler or the workflow in the workflow run operation fails asynchronously.
For activities we don't distinguish between retryable and non retryable errors, so I figured this model is good enough.
What we have now maps to whether a handler method returned an error or not (or there was an internal error within the SDK).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definition is, it's incremented if any of the following are true:
- The user defined handler returns any error (or panics)
- There's an internal error in the SDK (e.g. bug) while processing the task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user defined handler returns any error (or panics)
Does this include UnsuccessfulOperationError
and ErrOperationStillRunning
errors on GetResult
? Do we really want a metric that can't differentiate between a proper failure and an internal one? Or should we not count operation failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't implemented GetResult
but I think we could exclude ErrOperationStillRunning
. This change includes UnsuccessfulOperationError
in the metric, where it didn't used to be included before.
The rationale is that all of HandlerErrorTypeBadRequest
, non-retryable ApplicationError
, and UnsuccessfulOperationError
are non-retryable errors from the caller perspective and essentially fail the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I agree that it's not perfect. It wasn't perfect before either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the metrics should be split between user-returned operation errors and unexpected RPC failures similar to workflows. People often just stop using activity failure metrics because they are using intentional failures for some things, and now they can't even know what panics. They have to create their own metrics to differentiate. The difference is more obvious in other languages where unexpected failures/exceptions are more common and the languages are used to differentiating explicit RPC failure from accidental one.
But if the definition of this metric is decided otherwise, ok.
Discussed offline, we'll add a label to this metric to help distinguish between the different error types in a followup PR. |
Why?
Consistency with other failure metrics.