Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring] Partition uworker_output.ErrorType conditions into success, maybe_retry and failure outcomes #4499

Merged
merged 6 commits into from
Dec 16, 2024

Conversation

vitorguidi
Copy link
Collaborator

Motivation

#4458 implemented a task outcome metric, so we can track error rates in utasks, by job/task/subtask.

As failures are expected for ClusterFuzz, initially only unhandled exceptions would be considered as actual errors. Chrome folks asked for a better partitioning of error codes, which is implemented here as the following outcomes:

  • success: the task has unequivocally succeeded, producing a sane result
  • maybe_retry: some transient error happened, and the task is potentially being retried. This might capture some unretriable failure condition, but it is a compromise we are willing to make in order to decrease false positives.
  • failure: the task has unequivocally failed.

Part of #4271

@vitorguidi vitorguidi merged commit d3d1b76 into master Dec 16, 2024
7 checks passed
@vitorguidi vitorguidi deleted the chore/expand-task-error-rate-metric branch December 16, 2024 13:23
vitorguidi added a commit that referenced this pull request Dec 16, 2024
…ss, maybe_retry and failure outcomes (#4499)

### Motivation

#4458 implemented a task outcome metric, so we can track error rates in
utasks, by job/task/subtask.

As failures are expected for ClusterFuzz, initially only unhandled
exceptions would be considered as actual errors. Chrome folks asked for
a better partitioning of error codes, which is implemented here as the
following outcomes:

* success: the task has unequivocally succeeded, producing a sane result
* maybe_retry: some transient error happened, and the task is
potentially being retried. This might capture some unretriable failure
condition, but it is a compromise we are willing to make in order to
decrease false positives.
* failure: the task has unequivocally failed.

Part of #4271
vitorguidi added a commit that referenced this pull request Dec 16, 2024
Running CI checks with a PR prior to deployment
Copy link
Collaborator

@jonathanmetzman jonathanmetzman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this PR.
I don't want to have categorize these conditions, especially not where they are defined.

@vitorguidi
Copy link
Collaborator Author

vitorguidi commented Dec 16, 2024

There is a cleaner way to do this. Add two boolean fields to uworker_output.ErrorType:

  • is_success
  • is_retryable

This way we can propagate the data to _MetricRecorder without listing out enums. Will revert this and implement the alternative approach

vitorguidi added a commit that referenced this pull request Dec 16, 2024
…to success, maybe_retry and failure outcomes (#4499)"

This reverts commit d3d1b76.
vitorguidi added a commit that referenced this pull request Dec 17, 2024
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271
@vitorguidi
Copy link
Collaborator Author

vitorguidi commented Dec 17, 2024

The maybe_retry outcome is kind of useless, it comprises a tiny tiny volume among all outcomes. It is better to split between success and error only

@letitz @alhijazi
image

@vitorguidi
Copy link
Collaborator Author

The maybe_retry outcome is kind of useless, it comprises a tiny tiny volume among all outcomes. It is better to split between success and error only

@letitz @alhijazi image

Even discarding fuzz task (which does not retry and comprises most of the task volume), this is still a very small fraction
image

vitorguidi added a commit that referenced this pull request Dec 17, 2024
This reverts commit f9d516c.
@letitz
Copy link
Collaborator

letitz commented Dec 18, 2024

Fair enough, thanks for trying!

jonathanmetzman pushed a commit that referenced this pull request Dec 20, 2024
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271
jonathanmetzman added a commit that referenced this pull request Dec 20, 2024
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271

Co-authored-by: Vitor Guidi <vitorguidi@gmail.com>
vitorguidi added a commit that referenced this pull request Dec 23, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499 , outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking this technical debt so we
can deliver #4271 .

A refactor of uworker main will be later performed, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this pull request Dec 26, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499 , outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking this technical debt so we
can deliver #4271 .

A refactor of uworker main will be later performed, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this pull request Dec 27, 2024
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271
vitorguidi added a commit that referenced this pull request Dec 27, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499 , outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking this technical debt so we
can deliver #4271 .

A refactor of uworker main will be later performed, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
…ss, maybe_retry and failure outcomes (#4499)

### Motivation

#4458 implemented a task outcome metric, so we can track error rates in
utasks, by job/task/subtask.

As failures are expected for ClusterFuzz, initially only unhandled
exceptions would be considered as actual errors. Chrome folks asked for
a better partitioning of error codes, which is implemented here as the
following outcomes:

* success: the task has unequivocally succeeded, producing a sane result
* maybe_retry: some transient error happened, and the task is
potentially being retried. This might capture some unretriable failure
condition, but it is a compromise we are willing to make in order to
decrease false positives.
* failure: the task has unequivocally failed.

Part of #4271
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499 , outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking this technical debt so we
can deliver #4271 .

A refactor of uworker main will be later performed, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants