[Monitoring] Partition uworker_output.ErrorType conditions into success, maybe_retry and failure outcomes #4499

vitorguidi · 2024-12-13T02:02:04Z

Motivation

#4458 implemented a task outcome metric, so we can track error rates in utasks, by job/task/subtask.

As failures are expected for ClusterFuzz, initially only unhandled exceptions would be considered as actual errors. Chrome folks asked for a better partitioning of error codes, which is implemented here as the following outcomes:

* success: the task has unequivocally succeeded, producing a sane result.
* maybe_retry: some transient error happened, and the task is potentially being retried. This might capture some unretriable failure condition, but it is a compromise we are willing to make in order to decrease false positives.
* failure: the task has unequivocally failed.
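
A minimal sketch of the three-way partition described above, assuming error types are identified by name. The `Outcome` enum, the `classify` helper and the error type names are hypothetical and chosen only for illustration; the real values live in uworker_output.ErrorType and the actual change may be organized differently.

```python
from enum import Enum


class Outcome(Enum):
  """The three outcomes proposed in this PR."""
  SUCCESS = 'success'
  MAYBE_RETRY = 'maybe_retry'
  FAILURE = 'failure'


# Hypothetical error type names, for illustration only.
_SUCCESS_TYPES = frozenset({'NO_ERROR'})
_MAYBE_RETRY_TYPES = frozenset({'TRANSIENT_BUILD_SETUP', 'TRANSIENT_DOWNLOAD'})


def classify(error_type_name: str) -> Outcome:
  """Maps an error type name to an outcome; anything unlisted is a failure."""
  if error_type_name in _SUCCESS_TYPES:
    return Outcome.SUCCESS
  if error_type_name in _MAYBE_RETRY_TYPES:
    return Outcome.MAYBE_RETRY
  return Outcome.FAILURE
```

Defaulting unlisted values to failure means a newly added error type is counted as an error until someone classifies it explicitly.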

Part of #4271


jonathanmetzman

Please revert this PR.
I don't want to have to categorize these conditions, especially not where they are defined.

vitorguidi · 2024-12-16T21:14:58Z

There is a cleaner way to do this. Add two boolean fields to uworker_output.ErrorType:

is_success
is_retryable

This way we can propagate the data to _MetricRecorder without listing out enums. Will revert this and implement the alternative approach.
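
A rough sketch of the idea above, assuming the two flags can be attached to each error type. A plain Python `Enum` stands in for uworker_output.ErrorType here, and the member names, the `code` attribute and the `outcome_label` helper are hypothetical; this is not the actual definition.

```python
from enum import Enum


class ErrorType(Enum):
  """Stand-in for uworker_output.ErrorType with hypothetical members."""

  def __init__(self, code, is_success, is_retryable):
    # Each member's value is a (code, is_success, is_retryable) tuple.
    self.code = code
    self.is_success = is_success
    self.is_retryable = is_retryable

  NO_ERROR = (0, True, False)
  TRANSIENT_SETUP_ERROR = (1, False, True)
  UNHANDLED = (2, False, False)


def outcome_label(error_type):
  """Derives the outcome from the flags, without listing enum members."""
  if error_type.is_success:
    return 'success'
  if error_type.is_retryable:
    return 'maybe_retry'
  return 'failure'
```

With this shape, the recorder only reads the two flags, so the categorization stays out of the metrics code.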


vitorguidi · 2024-12-17T21:04:46Z

The maybe_retry outcome is kind of useless: it comprises a tiny, tiny volume among all outcomes. It is better to split between success and error only.

@letitz @alhijazi

vitorguidi · 2024-12-17T21:35:20Z

The maybe_retry outcome is kind of useless: it comprises a tiny, tiny volume among all outcomes. It is better to split between success and error only.

@letitz @alhijazi

Even discarding the fuzz task (which does not retry and comprises most of the task volume), this is still a very small fraction.


letitz · 2024-12-18T08:37:45Z

Fair enough, thanks for trying!


#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions. In #4499, outcomes were split between success, failure and maybe_retry conditions. There we learned that the volume of retryable outcomes is negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not desirable. However, we are consciously taking on this technical debt so we can deliver #4271. A refactor of uworker main will be performed later, so we can split the success and failure conditions, both of which are mixed in uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
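
The two-way split described in the #4516 message could look roughly like the sketch below. It assumes success conditions are listed by name and that everything else, including retryable conditions, counts as an error. The names `_SUCCESS_ERROR_TYPES`, `task_outcome_count` and `record_outcome` are hypothetical, and a `collections.Counter` stands in for the real monitoring metric.

```python
from collections import Counter

# Hypothetical success conditions, for illustration only; the real values
# are mixed with failures in uworker_output.ErrorType.
_SUCCESS_ERROR_TYPES = frozenset({'NO_ERROR', 'NO_CRASH_FOUND'})

# Stand-in for the task outcome metric, keyed by (task, outcome).
task_outcome_count = Counter()


def record_outcome(task, error_type_name):
  """Counts one task run as 'success' or 'error' and returns the label."""
  if error_type_name in _SUCCESS_ERROR_TYPES:
    outcome = 'success'
  else:
    # Retryable conditions are deliberately counted as errors too.
    outcome = 'error'
  task_outcome_count[(task, outcome)] += 1
  return outcome
```

Keeping the success list in one place makes the acknowledged tech debt easy to find once success and failure conditions are split apart.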


vitorguidi added 2 commits December 13, 2024 00:52

partition uworker error types in success, maybe retry, and failure

1511f6f

Emit TASK_OUTCOME_COUNT by using the best effort error type partitioning

7202e98

vitorguidi requested review from jonathanmetzman, alhijazi and letitz December 13, 2024 02:02

vitorguidi and others added 4 commits December 13, 2024 02:05

Delete temporary logging

1582287

Attempt to fix lint

6b3c119

Fix lint again

f909755

Merge branch 'master' into chore/expand-task-error-rate-metric

8473f3f

vitorguidi merged commit d3d1b76 into master Dec 16, 2024
7 checks passed

vitorguidi deleted the chore/expand-task-error-rate-metric branch December 16, 2024 13:23

vitorguidi added a commit that referenced this pull request Dec 16, 2024

Merge #4499 and #4481 into chrome branch (#4505)

19fea40

Running CI checks with a PR prior to deployment

jonathanmetzman reviewed Dec 16, 2024


vitorguidi added a commit that referenced this pull request Dec 16, 2024

Revert "[Monitoring] Partition uworker_output.ErrorType conditions in…

a39956d

…to success, maybe_retry and failure outcomes (#4499)" This reverts commit d3d1b76.

vitorguidi added a commit that referenced this pull request Dec 17, 2024

Revert #4499 (#4512)

f9d516c

There is a clear way to partition ErrorType enums in the criteria proposed, reverting this one. Part of #4271

vitorguidi added a commit that referenced this pull request Dec 17, 2024

Revert "Revert #4499 (#4512)"

38c603c

This reverts commit f9d516c.

vitorguidi mentioned this pull request Dec 17, 2024

[Monitoring] Partition UTask outcomes correctly into success and error #4516

Merged

jonathanmetzman pushed a commit that referenced this pull request Dec 20, 2024

Revert #4499 (#4512)

c412e33

There is a clear way to partition ErrorType enums in the criteria proposed, reverting this one. Part of #4271

jonathanmetzman added a commit that referenced this pull request Dec 20, 2024

Revert #4499 (#4512) (#4534)

3201444

There is a clear way to partition ErrorType enums in the criteria proposed, reverting this one. Part of #4271 Co-authored-by: Vitor Guidi <vitorguidi@gmail.com>

vitorguidi added a commit that referenced this pull request Dec 27, 2024

Revert #4499 (#4512)

3997898

There is a clear way to partition ErrorType enums in the criteria proposed, reverting this one. Part of #4271

jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025

Revert #4499 (#4512)

70b2fd3

There is a clear way to partition ErrorType enums in the criteria proposed, reverting this one. Part of #4271
