Race Condition in Stream Multiplexing #718
Comments
@paxbit Thanks for reporting this issue with such detail. That let me fully understand the issue directly from its description, without a repro. Indeed, this might be the case. May I ask what the scale of the cluster you are operating is? Only roughly, by the number of resources being operated: tens, hundreds, thousands? I am trying to understand in which circumstances this event loss becomes realistic. Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732 — can you take a look, please? Will it fix the issue? Could you try this patch or branch in your environment?
The cluster is 24 nodes, each with 96-128 SMT cores and 1.5T RAM, and in production the operator will probably have to handle 500-1000 resources. But on the day I wrote this issue I could reproduce it reliably with only about 5 managed resources. That day the various latencies added up in such a way that it took only 2-4 runs to hit the situation.
Yeah, that's basically how my monkey patch is doing it currently. Seems to work.
Good news! In #784, a similar issue with event loss was reported. Unlike here, it was happening at a relatively small scale, but with multiple synchronous functions inside async coroutines — which indirectly simulated an extremely laggy network. As a result, I was able to capture it in an artificial snippet with 100% reproducibility, even with only 1 object involved — see the manual scenario in #732. After thorough consideration, that hypothetical fix can now be considered a real and proper fix (though still untestable). The fix is released together with some other improvements as version 1.33rc1. As with any RC (release candidate), be careful with testing: do not do this on real production or staging clusters; start with local isolated environments if possible. However, the changeset is not big, so the risk seems low this time. The expected release time is in 1-2 weeks from now.
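For illustration, this is roughly the effect described above: a synchronous call inside an async coroutine stalls the whole event loop, so every other coroutine (such as a worker waiting on a queue) observes what looks like an extremely laggy network. A minimal sketch with illustrative names, not kopf code:

```python
import asyncio
import time

async def handler_with_sync_call() -> None:
    # A blocking sleep freezes the entire event loop for 1 second:
    # no other coroutine can make progress in the meantime.
    time.sleep(1.0)

async def handler_with_async_call() -> None:
    # A non-blocking sleep yields control back to the loop.
    await asyncio.sleep(1.0)

async def main() -> None:
    started = time.monotonic()
    await asyncio.gather(handler_with_sync_call(), handler_with_async_call())
    # ~2.0s instead of ~1.0s: the blocking call serialised the two handlers.
    print(f"elapsed: {time.monotonic() - started:.1f}s")

asyncio.run(main())
```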
Long story short

If one is lucky enough to hit the default `settings.batching.batch_window` for pause times on established multiplexed streams, events may not get dispatched. Instead they'll be lost and not processed at all. There is a race condition when dealing with the `backlog` queue.
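To make the failure mode concrete, here is a contrived, deterministic re-enactment of the race using plain asyncio primitives. The names (`streams`, `backlog`, `worker`, `watcher`) are simplified stand-ins for kopf's internals, not the actual kopf code, and the race window is held open artificially:

```python
import asyncio

async def main() -> None:
    streams = {}                   # per-object backlog queues, keyed by object
    timed_out = asyncio.Event()    # "worker decided to exit but has not cleaned up yet"

    async def worker(key: str) -> None:
        backlog = streams[key]
        while True:
            try:
                event = await asyncio.wait_for(backlog.get(), timeout=0.05)
            except asyncio.TimeoutError:
                timed_out.set()            # the race window opens here...
                await asyncio.sleep(0.05)  # ...and is held open for the demo
                break                      # exits even though qsize() may now be > 0
            print(f"processed: {event!r}")
        del streams[key]                   # pending events vanish together with the queue

    async def watcher(key: str) -> None:
        streams[key] = asyncio.Queue()
        worker_task = asyncio.create_task(worker(key))
        await timed_out.wait()             # deliberately hit the window
        # The key is still present, so (as in the real code path) no KeyError is
        # raised and no replacement worker is spawned -- yet nobody will ever
        # get() this event:
        streams[key].put_nowait("some-watch-event")
        await worker_task
        print(f"worker finished, event lost; streams={streams}")

    await watcher("obj-1")

asyncio.run(main())
```

The "processed" line never prints: the event sits in a queue whose only consumer is already on its way out, and it is dropped together with the queue.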
Description

`kopf.reactor.queueing.watcher(...)` and `kopf.reactor.queueing.worker(...)` do not properly lock each other against `backlog` queue modifications, leading to lost events.

After changing to a different environment with the operator, I began seeing a lot of seemingly missed events. At first I thought the cluster API had issues, because twice the missed events correlated with a 500 Internal Server Error from the kube API very shortly before.
However, after seeing many more missed events without any kube API server error before or after, I started debugging `kopf.reactor.queueing.watcher(...)` and `kopf.reactor.queueing.worker(...)`.

What I found:
The watcher at kopf/kopf/reactor/queueing.py, line 199 (commit e90ceaa) can `put` an event without triggering a `KeyError` (which would spawn a new worker) while the worker task waiting for a new event to arrive at kopf/kopf/reactor/queueing.py, lines 288 to 290 (commit e90ceaa) times out.

This leads to situations where `backlog.qsize() > 0` at the `break` in kopf/kopf/reactor/queueing.py, lines 291 to 292 (commit e90ceaa). `break`-ing at this point with a non-zero qsize will then discard the event, leaving it unprocessed, through the queue deletion happening at kopf/kopf/reactor/queueing.py, lines 324 to 330 (commit e90ceaa).
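Under the simplified model sketched earlier, one possible mitigation — a sketch only, not necessarily what #732 actually implements — is to never exit the worker while the backlog still holds items, and to keep the final emptiness check and the stream deletion free of any `await` in between, so the single-threaded event loop cannot interleave a watcher's `put` into a queue that is about to be dropped:

```python
import asyncio

async def worker(key: str, streams: dict) -> None:
    """A more careful worker loop (illustrative only, not kopf's code)."""
    backlog: asyncio.Queue = streams[key]
    try:
        while True:
            try:
                event = await asyncio.wait_for(backlog.get(), timeout=0.1)
            except asyncio.TimeoutError:
                # Re-check instead of breaking blindly: an event may have
                # arrived while the timeout was being handled.
                if backlog.empty():
                    break
                continue
            print(f"processed: {event!r}")
    finally:
        # No `await` between this check and the deletion: in a cooperative
        # event loop, the watcher cannot slip a put() in between. If it puts
        # right after the deletion, it gets a KeyError and spawns a new worker.
        if backlog.empty():
            del streams[key]
```

The important property is not the exact structure but that the decision to exit and the removal of the stream happen atomically with respect to the watcher's `put`: either the event lands in a queue that still has a worker, or the `put` raises `KeyError` and a fresh worker is spawned.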
Environment