Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CheckpointLogger changes to improve signal #823

Merged
merged 4 commits into from
Nov 19, 2024

Conversation

matt-codecov
Copy link
Contributor

@matt-codecov matt-codecov commented Oct 26, 2024

UploadFlow metrics are not very trustworthy. there have always been "flow endings" that have been missed by my first pass at instrumentation (on_timeout and on_failure task callbacks), and recent refactors of the upload flow and integration of new features have added a bunch more + logged some non-coverage endings in the coverage flow

because of the missing endings, we are probably underreporting errors and the reliability rate appears higher than it should be. also the notification_latency metric is probably inaccurate, but i don't know whether that is biased high or low

metrics that require all exit paths from such a huge expanse of code to be instrumented manually are weak, but they are all we have in today's platform for this sort of topline. this PR attempts to improve the signal so we can better gauge the health of today's platform, and we can use that insight to set goals for future work

full details

review by commit, not whole PR

  • move CheckpointLogger state to LogContext, auto-load from kwargs in base task
    • simplifies usage. don't need to pass state around anymore
    # new
    UploadFlow.log(UploadFlow.UPLOAD_TASK_BEGIN)
    other_function_with_checkpoints()
    # old
    checkpoints = from_kwargs(UploadFlow, kwargs)
    checkpoints.log(UploadFlow.UPLOAD_TASK_BEGIN)
    other_function_with_checkpoints(checkpoints)
    
    • i think it will allow the celery on_failure() and on_timeout() handlers to access the latest checkpoints data and not just what was initially passed in kwargs
    • there's actually no CheckpointLogger class anymore lol
  • ignore checkpoints logged before the first checkpoint or after a terminal checkpoint
    • the "% endings captured" metric is not really trustworthy. a flow can mistakenly log 3 terminal events and they'll count as 3 endings, or some shared code could accidentally log a terminal UploadFlow event outside of an upload flow
    • this PR simply ignores these bad checkpoints, but if we want to take these metrics seriously i will make these raise sentry errors
  • attempt to close UploadFlow and TestResultFlow flows in celery on_failure() and on_timeout() handlers
  • log terminal checkpoints in a bunch of places that lacked them
  • make UploadTask only log UploadFlow checkpoints for coverage uploads (test results are a different flow)

background:
UploadFlow's reliability metrics are useful to track but the actual current value is probably not accurate. some 50% of the flow endings are apparently not captured and we have never dug into why. the missing endings are probably more slanted towards errors which means the reliability in our dashboards appears higher than it really is

a more accurate understanding of our reliability will help us decide what/how much work to do to revamp the platform

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

Copy link

codecov bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 93.49845% with 21 lines in your changes missing coverage. Please review.

Project coverage is 98.01%. Comparing base (03c73da) to head (6f07d01).
Report is 1 commits behind head on main.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/base.py 55.55% 12 Missing ⚠️
helpers/log_context.py 70.00% 3 Missing ⚠️
tasks/upload_finisher.py 72.72% 3 Missing ⚠️
tasks/upload.py 90.00% 2 Missing ⚠️
helpers/checkpoint_logger/__init__.py 99.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #823      +/-   ##
==========================================
- Coverage   98.06%   98.01%   -0.05%     
==========================================
  Files         444      444              
  Lines       35474    35540      +66     
==========================================
+ Hits        34786    34835      +49     
- Misses        688      705      +17     
Flag Coverage Δ
integration 41.99% <50.15%> (+0.05%) ⬆️
unit 90.87% <87.92%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-qa
Copy link

codecov-qa bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 98.46939% with 6 lines in your changes missing coverage. Please review.

Project coverage is 98.03%. Comparing base (03c73da) to head (f3bace2).

Current head f3bace2 differs from pull request most recent head 6f07d01

Please upload reports for the commit 6f07d01 to get more accurate results.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/base.py 75.00% 4 Missing ⚠️
helpers/checkpoint_logger/__init__.py 99.00% 1 Missing ⚠️
tasks/upload.py 94.11% 1 Missing ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #823      +/-   ##
==========================================
- Coverage   98.06%   98.03%   -0.03%     
==========================================
  Files         444      444              
  Lines       35474    36477    +1003     
==========================================
+ Hits        34786    35760     +974     
- Misses        688      717      +29     
Flag Coverage Δ
integration 98.03% <98.46%> (+56.09%) ⬆️
unit 98.03% <98.46%> (+7.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 95.95% <97.10%> (-0.21%) ⬇️
OutsideTasks 98.03% <99.60%> (+2.08%) ⬆️
Files with missing lines Coverage Δ
conftest.py 94.70% <100.00%> (-0.30%) ⬇️
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/log_context.py 98.80% <100.00%> (+0.12%) ⬆️
helpers/logging_config.py 81.39% <100.00%> (ø)
helpers/telemetry.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 99.59% <100.00%> (-0.03%) ⬇️
helpers/tests/unit/test_log_context.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_logging_config.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_telemetry.py 100.00% <100.00%> (ø)
services/report/__init__.py 97.18% <100.00%> (+0.21%) ⬆️
... and 16 more

... and 112 files with indirect coverage changes

Copy link

codecov-public-qa bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 98.46939% with 6 lines in your changes missing coverage. Please review.

Project coverage is 98.03%. Comparing base (03c73da) to head (f3bace2).

Current head f3bace2 differs from pull request most recent head 6f07d01

Please upload reports for the commit 6f07d01 to get more accurate results.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/base.py 75.00% 4 Missing ⚠️
helpers/checkpoint_logger/__init__.py 99.00% 1 Missing ⚠️
tasks/upload.py 94.11% 1 Missing ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #823      +/-   ##
==========================================
- Coverage   98.06%   98.03%   -0.03%     
==========================================
  Files         444      444              
  Lines       35474    36477    +1003     
==========================================
+ Hits        34786    35760     +974     
- Misses        688      717      +29     
Flag Coverage Δ
integration 98.03% <98.46%> (+56.09%) ⬆️
unit 98.03% <98.46%> (+7.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 95.95% <97.10%> (-0.21%) ⬇️
OutsideTasks 98.03% <99.60%> (+2.08%) ⬆️
Files with missing lines Coverage Δ
conftest.py 94.70% <100.00%> (-0.30%) ⬇️
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/log_context.py 98.80% <100.00%> (+0.12%) ⬆️
helpers/logging_config.py 81.39% <100.00%> (ø)
helpers/telemetry.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 99.59% <100.00%> (-0.03%) ⬇️
helpers/tests/unit/test_log_context.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_logging_config.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_telemetry.py 100.00% <100.00%> (ø)
services/report/__init__.py 97.18% <100.00%> (+0.21%) ⬆️
... and 16 more

... and 112 files with indirect coverage changes

@codecov-notifications
Copy link

codecov-notifications bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 93.49845% with 21 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/base.py 55.55% 12 Missing ⚠️
helpers/log_context.py 70.00% 3 Missing ⚠️
tasks/upload_finisher.py 72.72% 3 Missing ⚠️
tasks/upload.py 90.00% 2 Missing ⚠️
helpers/checkpoint_logger/__init__.py 99.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link

✅ All tests successful. No failed tests were found.

📣 Thoughts on this report? Let Codecov know! | Powered by Codecov

@matt-codecov matt-codecov requested a review from a team November 19, 2024 02:56
@matt-codecov matt-codecov marked this pull request as ready for review November 19, 2024 02:56
@matt-codecov matt-codecov added this pull request to the merge queue Nov 19, 2024
Merged via the queue into main with commit 3676a01 Nov 19, 2024
18 of 27 checks passed
@matt-codecov matt-codecov deleted the matt/checkpoint-logger-updates branch November 19, 2024 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants