Prevent loss of events under stress-load or sync-blockers #732
A potential race condition is described in #718: when an "idle timeout" happens in a worker, it takes a moment before the worker's stream is deleted; the multiplexer can add new events during this tiny time window, and such events will be lost. (UPD: Not fully correct: the multiplexer does nothing until control is given back to the asyncio event loop; still, in some cases the timeout happens even if the queue is filled.)
Similarly in #784: the queue-getting timeout happens because of lengthy sync operations in the operator's callback filters executed in the framework's stack (this can be simulated with `time.sleep(1)`). They were blocking the asyncio loop long enough for an event to arrive at the I/O socket, be noticed, and be queued immediately once control returned to the asyncio loop, but too late for the worker's queue-getting call, which had already timed out. This resulted in the worker exiting with `backlog.qsize() == 1`.
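The blocking effect itself is easy to demonstrate with a plain asyncio loop (a self-contained illustration, not Kopf code): a timer comparable to the worker's queue-getting timeout cannot fire while a sync call blocks the loop, so it fires only after control returns.

```python
import asyncio
import time

async def main():
    loop = asyncio.get_running_loop()
    start = loop.time()
    fired = []
    # A 50 ms timer: a stand-in for the worker's queue-getting timeout.
    loop.call_later(0.05, lambda: fired.append(loop.time() - start))
    time.sleep(0.3)             # a sync blocker, as in the operator's filters
    await asyncio.sleep(0.01)   # give control back; the overdue timer fires now
    return fired[0]

delay = asyncio.run(main())
print(f'the 0.05s timer actually fired after {delay:.2f}s')
```

The timer is hopelessly late: it fires only after the sync blocker releases the loop, just as the queued event becomes visible only after the worker's timeout has already expired.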
This PR hypothetically fixes the issue by doing an extra check on the queue's emptiness before actually exiting the worker. However, the fix is difficult to verify, as the issue is difficult to reproduce.

UPD: Both the issue and the fix are confirmed by manual tests. However, there are no automated tests: it seems impossible to simulate a scenario where `queue.get()` times out while the queue is non-empty, neither with pure Python nor with mocks.

Locks are intentionally not used to avoid locking overhead: acquiring locks is quite expensive for this performance-critical part of the code, and the issue happens only in rare circumstances.
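The shape of the fix can be sketched as follows. This is a minimal illustration, not Kopf's actual code; the `FlakyQueue` helper is invented here to artificially fake the spurious timeout, since (as noted) the real race cannot be simulated:

```python
import asyncio

class FlakyQueue(asyncio.Queue):
    """An asyncio.Queue whose first get() raises TimeoutError despite holding
    an item: an artificial stand-in for the non-simulatable race."""
    async def get(self):
        if not getattr(self, '_tripped', False):
            self._tripped = True
            raise asyncio.TimeoutError
        return await super().get()

async def worker(backlog: asyncio.Queue, idle_timeout: float = 0.1):
    processed = []
    while True:
        try:
            event = await asyncio.wait_for(backlog.get(), timeout=idle_timeout)
        except asyncio.TimeoutError:
            # The extra check: a timeout alone does not prove emptiness.
            # No locks are taken; a plain emptiness check is cheap enough.
            if backlog.empty():
                break       # truly idle: exit the worker
            continue        # spurious timeout: loop again and drain the backlog
        processed.append(event)
    return processed

async def main():
    backlog = FlakyQueue()
    backlog.put_nowait('event-2')   # queued just before the (faked) timeout
    return await worker(backlog)

print(asyncio.run(main()))  # the event is drained, not lost: ['event-2']
```

Without the `backlog.empty()` re-check, the first (faked) timeout would end the worker with the event still queued; with it, the worker loops around, drains the event, and only exits on a genuinely idle timeout.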
A manual repro:
Here, we use namespaces as the resource whose events will be lost (they are easier to create from the command line: no YAMLs are needed). Specifically, the namespace "simulated-123" is supposed to get 2 events:

- The 1st event is never lost, as it creates a worker. We intentionally give control back to asyncio to ensure that the worker is started (because the bug is in the worker).
- The 2nd event is lost if the sync delay in the async event loop exceeds the worker's idle timeout. If it is below it (taking the first `asyncio.sleep()` into account too), then the event is not lost.

Before the fix:
After the fix:
When diving deeper, additional `print()` statements in `kopf._core.reactor.queueing.worker()` can show that the `finally:` clause is executed with `backlog.qsize() == 1` before the fix, i.e. the queue is not truly empty on a timeout of getting from it.
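For illustration, the pre-fix shape of the worker can be mimicked with an artificially faked timeout (the `FlakyQueue` helper is invented here; this is not Kopf's actual code, and the real race, as noted, cannot be simulated):

```python
import asyncio

class FlakyQueue(asyncio.Queue):
    """An asyncio.Queue whose first get() raises TimeoutError despite holding
    an item: an artificial stand-in for the non-simulatable race."""
    async def get(self):
        if not getattr(self, '_tripped', False):
            self._tripped = True
            raise asyncio.TimeoutError
        return await super().get()

async def buggy_worker(backlog: asyncio.Queue):
    try:
        while True:
            # Pre-fix logic: a get() timeout is treated as proof of idleness.
            event = await backlog.get()
            print(f'processing {event!r}')
    except asyncio.TimeoutError:
        pass
    finally:
        # This is what the extra print() statements reveal before the fix:
        print(f'finally: backlog.qsize() == {backlog.qsize()}')

backlog = FlakyQueue()
backlog.put_nowait('event-2')       # queued, but never seen by the worker
asyncio.run(buggy_worker(backlog))  # prints: finally: backlog.qsize() == 1
```

The worker exits its `finally:` clause with one event still in the backlog, which is exactly the lost-event symptom the debug prints exposed.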