base task: attempt to log CheckpointLogger errors for on_timeout and on_failure #794
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files
@@ Coverage Diff @@
## main #794 +/- ##
==========================================
- Coverage 98.01% 98.00% -0.01%
==========================================
Files 443 443
Lines 36615 36637 +22
==========================================
+ Hits 35887 35905 +18
- Misses 728 732 +4
the `UploadFlow` reliability metrics rely on there being a checkpoint logged for every way the flow can end, whether it's a success or a failure. looking at the data, it appears 40-60% of flow endings aren't captured. this means we can't totally trust our calculated reliability rate.

i am not totally clear on how celery calls `on_failure()` and `on_timeout()`, but this PR attempts to load any checkpoints data it can find in them and log an error event. ideally this will increase our share of captured endings and make the metrics more accurate.

`on_timeout()` is a member on a celery `Request`, which we've subclassed as `BaseCodecovRequest`. it doesn't receive `kwargs` as an argument, but according to docs it has a `kwargs` property which presumably holds the kwargs that the task was scheduled with.

`on_failure()` is a member on the task and one of its arguments is `kwargs`. a non-retried `UploadTask` will not have any checkpoints data in its kwargs, but i think a retried `UploadTask` and future tasks should at least have a beginning checkpoint.

doing this creates a possibility that some endings will be double-counted:
- `SoftTimeLimitExceeded` may result in both `on_timeout()` and `on_failure()` being called
- `SoftTimeLimitExceeded` allows the task to clean up gracefully. if it takes too long, a hard `TimeLimitExceeded` is raised? and it's possible if this happens that `on_timeout()` will be called twice
- `on_success()`, but i don't know for sure that there are not cases where we log an error and then also wind up in `on_failure()` or `on_retry()`
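
one way to limit the double-counting described above would be to send every error ending through a single helper that logs at most once per task. this is only a sketch, not what the PR implements: `log_error_ending`, `CHECKPOINTS_KEY` and the `seen` set are hypothetical names, and the kwargs key that `CheckpointLogger` actually uses isn't shown here.

```python
import logging
from typing import Any, Mapping, Optional, Set

log = logging.getLogger(__name__)

# Hypothetical key: where checkpoints data is assumed to live in the task kwargs.
CHECKPOINTS_KEY = "checkpoints_data"


def log_error_ending(
    task_id: str,
    kwargs: Optional[Mapping[str, Any]],
    hook: str,
    seen: Set[str],
) -> bool:
    """Log at most one error ending per task; return True if one was logged."""
    checkpoints = (kwargs or {}).get(CHECKPOINTS_KEY)
    if not checkpoints:
        # e.g. a non-retried UploadTask with no checkpoints data in its kwargs
        return False
    if task_id in seen:
        # another hook already recorded an ending for this task
        return False
    seen.add(task_id)
    log.error("flow ended with an error (task_id=%s, hook=%s)", task_id, hook)
    return True
```

`on_failure(self, exc, task_id, args, kwargs, einfo)` could pass its `task_id` and `kwargs` straight through. `on_timeout(self, soft, timeout)` would have to read both from the request, assuming the `kwargs` property holds what the task was scheduled with. the in-memory `seen` set is a stand-in: the two hooks may run in different worker processes, so real deduplication would likely need shared storage.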