Stabilise time-based tests: polling with sleeps → async primitives #534

nolar · 2020-09-06T18:31:34Z

What do these changes do?

Refactor some of the time-based tests that were influenced by extra time wasted on unnecessary sleeps while polling the job scheduler. Now, they use sync primitives, so the sleeps of 0.2-0.5s are not needed anymore.

Description

More and more often, the CI pipelines fail due to tests with timed blocks missing their allowed thresholds. While there is no clear and obvious way how to refactor these tests with any form of artificial asyncio time (see #212), some remedy can be applied now.

Numerous branch/PR builds are failed: https://travis-ci.org/github/nolar/kopf/jobs/722895190 — though, they are usually restarted a few times until success, so the links are lost.
Sometimes, master branch failed after a successful PR run: https://travis-ci.org/github/nolar/kopf/jobs/723969381
Even the sped-up K3d/K3s branch fails with these errors: https://travis-ci.org/github/nolar/kopf/builds/724528729

For these specific tests of watcher-workers queueing & batching of the events from the API's watch-stream, one of the biggest contributors to time discrepancies is the exit timeout. It is performed as one hundred polls of the scheduler's readiness at regular intervals, which are exit_timeout / 100 seconds.

This is fine for the regular operators' flow, but is a problem for the tests: when the timeout is artificially increased to e.g. 100 seconds, the sleeps become 1-second long; even if shortened to 1/1000th of the timeout, it will be 0.1 second, while the test is expected to finish in 0.2-0.3 seconds of the batch timeout — and becomes the most often cause of failures.

There were a few attempts to solve this problem: e.g. by measuring the code overhead in the runtime environment — #522 #528 — neither did help.

With this PR, the watcher's finalization, scheduler's closure, workers' exits, and the queues' depletion are synchronised via asynchronous primitives (asyncio.Condition in this case): whenever a worker exits, it wakes up the depletion routine. As a result, there is no need for polling and thus sleeps. And so, the watcher's exit is not delayed by the queue depletion routine anymore, even of 0.1-1s. And so, the time measurements of the watcher are more reliable and can be kept small (to keep the tests fast).

PS: On a side-note: this is also a reason why this 1/100th fraction or an interval were not moved to the settings (unlike exit_timeout, batch_window, etc): it was originally known as an internal hack that should be refactored to something normal. Now, the time has come.

PPS: This was also a reason why exit_timeout was set to 0.5 seconds in the tests, while it should never be used: because 1/100th of it should be low enough too. Now, the timeout can be 100 seconds, with no influence on the tests' outcomes.

Issues/PRs

Issues: #212 #338

Type of changes

Mostly CI/CD automation, contribution experience

Checklist

The code addresses only the mentioned problem, and this problem only
I think the code is well written
Unit tests for the changes exist
Documentation reflects the changes
If you provide code modification, please add yourself to CONTRIBUTORS.txt

…tives Previously, the 1/100th of the exit timeout could be a significant part of overall timing of the watchers; especially in the unit-tests, where the exit timeout is artificially set high enough (e.g. 100 seconds). Now, asynchronous primitives are used to signal that the streams/jobs have changed, and it is worth re-checking the streams depletion, and exit the watcher if all the streams are depleted. One special edge-case to consider: the last worker's signal will finish the depletion waiter due to empty streams (the predicate becomes true). However, when the signaller is notified, the worker job itself is still running. It is assumed that the job finishes immediately after the notification, specifically within the scheduler's `close_timeout` (0.1s), so that the watcher can close the scheduler without any issues or leftovers. This change allows the watcher-workers-queueing tests to be adjusted for proper timing, so that we ignore the overhead introduced by polling the scheduler at regular time intervals.

lgtm-com · 2020-09-06T18:40:56Z

This pull request introduces 1 alert when merging 6d8985e into a168268 - view on LGTM.com

new alerts:

1 for Unused import

nolar added 2 commits September 6, 2020 19:36

Remove code overhead measurements and comparison

6d8985e

nolar added automation CI/CD: testing, linting, releasing automatically refactoring Code cleanup without new features added labels Sep 6, 2020

nolar merged commit 3c0974f into master Sep 6, 2020

nolar deleted the refactor-watcher-depletion branch September 6, 2020 19:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilise time-based tests: polling with sleeps → async primitives #534

Stabilise time-based tests: polling with sleeps → async primitives #534

nolar commented Sep 6, 2020 •

edited

Loading

lgtm-com bot commented Sep 6, 2020

Stabilise time-based tests: polling with sleeps → async primitives #534

Stabilise time-based tests: polling with sleeps → async primitives #534

Conversation

nolar commented Sep 6, 2020 • edited Loading

What do these changes do?

Description

Issues/PRs

Type of changes

Checklist

lgtm-com bot commented Sep 6, 2020

nolar commented Sep 6, 2020 •

edited

Loading