Stabilise time-based tests: polling with sleeps → async primitives #534
What do these changes do?
Refactor some of the time-based tests whose timings were distorted by unnecessary sleeps while polling the job scheduler. They now use synchronisation primitives, so the sleeps of 0.2-0.5s are no longer needed.
Description
More and more often, the CI pipelines fail because tests with timed blocks miss their allowed thresholds. While there is no clear way to refactor these tests with any form of artificial asyncio time (see #212), some remedy can be applied now.
For these specific tests of the watcher-workers' queueing & batching of the events from the API's watch-stream, one of the biggest contributors to time discrepancies is the exit timeout. It is performed as one hundred polls of the scheduler's readiness at regular intervals of `exit_timeout / 100` seconds.

This is fine for the regular operators' flow, but is a problem for the tests: when the timeout is artificially increased to e.g. 100 seconds, the sleeps become 1 second long; even if shortened to 1/1000th of the timeout, each sleep would be 0.1 second, while the test is expected to finish within 0.2-0.3 seconds of the batch timeout. This is the most frequent cause of failures.
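To illustrate the effect described above, here is a minimal sketch of a polling wait. It is not the actual scheduler code: `wait_by_polling`, `is_done`, and the `polls` parameter are hypothetical names, and only the `exit_timeout / 100` interval is taken from the description. The awaited event happens after roughly 10 ms, yet the waiter notices it only after a full poll interval.

```python
import asyncio
import time


async def wait_by_polling(is_done, exit_timeout: float, polls: int = 100) -> bool:
    """Hypothetical polling wait: check readiness `polls` times over `exit_timeout`."""
    interval = exit_timeout / polls  # e.g. 10s / 100 = 0.1s per sleep
    for _ in range(polls):
        if is_done():
            return True
        await asyncio.sleep(interval)  # time is wasted here even if the work is done
    return is_done()


async def main() -> float:
    done = False

    def finish() -> None:
        nonlocal done
        done = True

    # The "scheduler" becomes ready after ~10 ms.
    asyncio.get_running_loop().call_later(0.01, finish)

    started = time.perf_counter()
    await wait_by_polling(lambda: done, exit_timeout=10.0)
    return time.perf_counter() - started


# The wait takes about one full poll interval (~0.1s) instead of ~0.01s.
elapsed = asyncio.run(main())
print(f"polling wait took {elapsed:.3f}s")
```

With a test threshold of 0.2-0.3 seconds, an extra 0.1 second of this kind is enough to push a timed block over its limit on a slow CI machine.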
There were a few attempts to solve this problem, e.g. by measuring the code overhead in the runtime environment (#522, #528), but neither helped.
With this PR, the watcher's finalization, the scheduler's closure, the workers' exits, and the queues' depletion are synchronised via asynchronous primitives (`asyncio.Condition` in this case): whenever a worker exits, it wakes up the depletion routine. As a result, there is no need for polling and thus for sleeps. The watcher's exit is no longer delayed by the queue depletion routine, not even by 0.1-1s, so the time measurements of the watcher are more reliable and can be kept small (to keep the tests fast).

PS: On a side note, this is also the reason why this 1/100th fraction or the interval was not moved to the settings (unlike `exit_timeout`, `batch_window`, etc.): it was known from the start to be an internal hack that should be refactored into something normal. Now, the time has come.
PPS: This was also the reason why `exit_timeout` was set to 0.5 seconds in the tests, even though the timeout itself should never be hit: 1/100th of it had to be low enough too. Now, the timeout can be 100 seconds, with no influence on the tests' outcomes.
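The condition-based approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the code of this PR: `worker`, `wait_for_depletion`, the `workers` set, and the `depleted` condition are hypothetical names; only the use of `asyncio.Condition` and the idea that an exiting worker wakes up the depletion routine come from the description.

```python
import asyncio
import time


async def worker(key: str, workers: set, depleted: asyncio.Condition, delay: float) -> None:
    """Hypothetical worker: does some work, then deregisters and notifies on exit."""
    try:
        await asyncio.sleep(delay)  # stands in for processing the last batch of events
    finally:
        async with depleted:
            workers.discard(key)
            depleted.notify_all()  # wake up the depletion routine immediately


async def wait_for_depletion(workers: set, depleted: asyncio.Condition, exit_timeout: float) -> bool:
    """Wait until all workers are gone, or until the timeout; no polling, no sleeps."""
    async def _wait() -> None:
        async with depleted:
            await depleted.wait_for(lambda: not workers)

    try:
        await asyncio.wait_for(_wait(), timeout=exit_timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def main() -> tuple:
    depleted = asyncio.Condition()
    workers = {"a", "b"}
    tasks = [asyncio.create_task(worker(k, workers, depleted, 0.01)) for k in sorted(workers)]

    started = time.perf_counter()
    ok = await wait_for_depletion(workers, depleted, exit_timeout=100.0)
    elapsed = time.perf_counter() - started

    await asyncio.gather(*tasks)
    return ok, elapsed


# Even with a 100-second timeout, the wait ends as soon as the last worker exits.
ok, elapsed = asyncio.run(main())
print(f"depleted={ok} after {elapsed:.3f}s")
```

In this sketch, the timeout only bounds the worst case; it no longer determines how long a successful exit takes, which is why a large `exit_timeout` stops affecting the tests' timings.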