fix(monitor): prevent sample loss during disruption sampler shutdown #30771
mgencur wants to merge 1 commit into openshift:main
Add graceful shutdown logic to ensure all samples are processed before the disruption sampler terminates. Previously, samples could be lost when the context was cancelled, as the consumer would exit immediately without draining the remaining queue.

Changes:
- Add 30-second timeout when waiting for consumer to finish
- Implement contextCancelled flag to track cancellation state
- Continue processing remaining samples after context cancellation
- Remove early returns that would skip sample processing
- Ensure current sample completes even if context is cancelled

This prevents the "not finished writing all samples" error and ensures data integrity during shutdown.
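For readers without the diff at hand, the shutdown behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `sample`, `result`, `consume`, and `runSampler` names and the channel layout are assumptions made for the example. Only the `contextCancelled` flag, the 30-second wait, and the error text come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownTimeout mirrors the 30-second wait named in the description.
const shutdownTimeout = 30 * time.Second

// sample is a hypothetical stand-in for one disruption measurement.
type sample struct{ id int }

// result is a hypothetical summary reported by the consumer when it exits.
type result struct {
	processed   int // all samples handled
	afterCancel int // samples handled after cancellation was observed
}

// consume drains queue until it is closed. Cancellation only sets a flag;
// it never triggers an early return, so queued samples are not dropped.
func consume(ctx context.Context, queue <-chan sample, done chan<- result) {
	var r result
	contextCancelled := false
	ctxDone := ctx.Done()
	for {
		select {
		case s, ok := <-queue:
			if !ok {
				done <- r // queue closed and fully drained
				return
			}
			_ = s.id // placeholder for writing the sample out
			r.processed++
			if contextCancelled {
				r.afterCancel++
			}
		case <-ctxDone:
			contextCancelled = true
			ctxDone = nil // a nil channel is never ready, so this case fires once
		}
	}
}

// runSampler queues total samples, cancels the context while they are still
// queued, and waits up to shutdownTimeout for the consumer to finish.
func runSampler(total int) (result, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	queue := make(chan sample, total)
	done := make(chan result, 1)
	for i := 0; i < total; i++ {
		queue <- sample{id: i}
	}
	close(queue) // the producer has stopped
	cancel()     // shutdown is requested with samples still pending

	go consume(ctx, queue, done)

	select {
	case r := <-done:
		return r, nil
	case <-time.After(shutdownTimeout):
		return result{}, errors.New("not finished writing all samples")
	}
}

func main() {
	r, err := runSampler(5)
	fmt.Println(r.processed, err)
}
```

The point of the sketch is that the consumer's only exit path is the closed, drained queue. Cancellation changes state but does not end the loop, and the timeout bounds how long shutdown waits if the consumer stalls.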
Tested the solution against my dev cluster:

/assign @xueqzhan Please.

/test ci/prow/okd-scos-images

/test okd-scos-images

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, xueqzhan

Seems there is a known issue for the

/retest-required

@mgencur: all tests passed!
The "not finished writing all samples" error was spotted in multiple runs for Hypershift on AWS, for example: 2020875118166675456
Root cause analysis:
The problematic sequence: