
Stabilise time-based tests: polling with sleeps → async primitives #534

Merged: 2 commits, Sep 6, 2020

Conversation

nolar (Owner) commented on Sep 6, 2020

What do these changes do?

Refactor some of the time-based tests that were slowed down by unnecessary sleeps while polling the job scheduler. They now use async primitives, so the extra sleeps of 0.2–0.5s are no longer needed.

Description

More and more often, the CI pipelines fail because tests with timed blocks miss their allowed thresholds. While there is no clear and obvious way to refactor these tests with any form of artificial asyncio time (see #212), some remedy can be applied now.

For these specific tests of watcher-workers queueing & batching of the events from the API's watch-stream, one of the biggest contributors to time discrepancies is the exit timeout. It is implemented as one hundred polls of the scheduler's readiness at regular intervals of exit_timeout / 100 seconds each.

This is fine for the regular operators' flow, but it is a problem for the tests: when the timeout is artificially increased to e.g. 100 seconds, the sleeps become 1 second long. Even if the interval is shortened to 1/1000th of the timeout, it is still 0.1 seconds, while the test is expected to finish within the batch timeout of 0.2–0.3 seconds. This polling latency became the most frequent cause of failures.
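For illustration, the old exit-waiting logic can be sketched roughly like this (the names `scheduler`, `empty()`, and `wait_for_depletion_by_polling` are hypothetical stand-ins for this sketch, not Kopf's actual API):

```python
import asyncio

async def wait_for_depletion_by_polling(scheduler, *, exit_timeout: float) -> None:
    # Hypothetical sketch of the pre-PR behaviour: the scheduler's emptiness
    # is polled up to 100 times, sleeping exit_timeout / 100 between polls.
    # With exit_timeout=100s, each sleep is 1s: far above a test's time budget.
    for _ in range(100):
        if scheduler.empty():
            return
        await asyncio.sleep(exit_timeout / 100)
```

Even when the scheduler empties early, the waiter only notices on the next poll, so up to one full polling interval of latency is added to the watcher's exit.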

There were a few attempts to solve this problem, e.g. by measuring the code overhead in the runtime environment (#522, #528), but neither helped.

With this PR, the watcher's finalization, the scheduler's closure, the workers' exits, and the queues' depletion are synchronised via asynchronous primitives (an asyncio.Condition in this case): whenever a worker exits, it wakes up the depletion routine. As a result, there is no need for polling, and thus no need for sleeps. The watcher's exit is no longer delayed by the queue-depletion routine, even by 0.1–1s, so the watcher's time measurements are more reliable and the test thresholds can be kept small (to keep the tests fast).
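A minimal sketch of the condition-based approach (the names `streams`, `worker`, and `wait_for_depletion` are illustrative, not the exact Kopf internals): the waiter blocks on an asyncio.Condition, each exiting worker notifies it, and the predicate is re-checked only when something actually changed:

```python
import asyncio

async def wait_for_depletion(streams: dict, signaller: asyncio.Condition) -> None:
    # Instead of polling with sleeps, block until a worker notifies us,
    # then re-check the predicate; the waiter wakes with no added latency.
    async with signaller:
        await signaller.wait_for(lambda: not streams)

async def worker(key: str, streams: dict, signaller: asyncio.Condition) -> None:
    try:
        streams[key].clear()  # stand-in for processing the queued events
    finally:
        async with signaller:
            del streams[key]        # this stream is depleted
            signaller.notify_all()  # wake the depletion waiter immediately
```

`Condition.wait_for()` re-evaluates the predicate on every notification, so the waiter exits exactly when the last worker signals, not one polling interval later.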

PS: On a side note, this is also the reason why this 1/100th fraction (or the polling interval) was never moved to the settings (unlike exit_timeout, batch_window, etc.): it was always known to be an internal hack that should eventually be refactored into something proper. Now, the time has come.

PPS: This was also the reason why exit_timeout was set to 0.5 seconds in the tests, even though it was never expected to be reached: 1/100th of it had to be low enough too. Now, the timeout can be 100 seconds with no influence on the tests' outcomes.

Issues/PRs

Issues: #212 #338

Type of changes

  • Mostly CI/CD automation, contribution experience

Checklist

  • The code addresses only the mentioned problem, and this problem only
  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt

Stabilise time-based tests: polling with sleeps → async primitives

Previously, 1/100th of the exit timeout could be a significant part
of the overall timing of the watchers, especially in the unit tests, where
the exit timeout is artificially set high (e.g. 100 seconds).

Now, asynchronous primitives are used to signal that the streams/jobs
have changed, so that the depletion of the streams can be re-checked,
and the watcher can exit once all the streams are depleted.

One special edge case to consider: the last worker's signal wakes the
depletion waiter because the streams are empty (the predicate becomes true).
However, at the moment the signaller is notified, the worker job itself is
still running. It is assumed that the job finishes immediately after the
notification, specifically within the scheduler's `close_timeout` (0.1s),
so that the watcher can close the scheduler without any issues or leftovers.

This change allows the watcher-workers-queueing tests to be adjusted for
proper timing, so that they no longer include the overhead of polling the
scheduler at regular time intervals.
@nolar added the labels automation (CI/CD: testing, linting, releasing automatically) and refactoring (Code cleanup without new features added) on Sep 6, 2020
lgtm-com bot commented on Sep 6, 2020

This pull request introduces 1 alert when merging 6d8985e into a168268 - view on LGTM.com

new alerts:

  • 1 for Unused import

@nolar nolar merged commit 3c0974f into master Sep 6, 2020
@nolar nolar deleted the refactor-watcher-depletion branch September 6, 2020 19:33