Race Condition in Stream Multiplexing #718

Open
paxbit opened this issue Mar 19, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@paxbit

paxbit commented Mar 19, 2021

Long story short

If one is lucky enough to hit pause times on established multiplexed streams that line up with the default settings.batching.batch_window, events may not get dispatched. Instead, they are lost and never processed. There is a race condition in the handling of the backlog queue.

Description

kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...) do not properly lock each other against backlog queue modifications, leading to lost events.

After moving the operator to a different environment, I began seeing a lot of seemingly missed events. At first I thought the cluster API had issues, because twice the missed events correlated with a 500 Internal Server Error from the kube API very shortly before.

However, after seeing many more missed events without any kube API server error before or after, I started debugging kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...).

What I found:

    await streams[key].backlog.put(raw_event)

might successfully put an event without triggering a KeyError (the KeyError is what would cause a new worker to be spawned), while the task waiting for a new event to arrive at

    raw_event = await asyncio.wait_for(
        backlog.get(),
        timeout=settings.batching.idle_timeout)

might have already timed out but has not yet raised, because the event loop has not come back to it yet.

This leads to situations where backlog.qsize() > 0 inside

    except asyncio.TimeoutError:
        break

is possible, and this is also what I am seeing for each of the missed events in the operator. Breaking at this point with a non-zero qsize then discards the event, leaving it unprocessed, via the queue deletion happening at
    finally:
        # Whether an exception or a break or a success, notify the caller, and garbage-collect our queue.
        # The queue must not be left in the queue-cache without a corresponding job handling this queue.
        try:
            del streams[key]
        except KeyError:
            pass
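
To make the interleaving above concrete, here is a simplified, self-contained model of the watcher/worker pair (names, structure, and the timeout value are illustrative, not kopf's actual code); the comments mark the window in which an event can be accepted and then silently dropped:

    import asyncio

    STREAMS: dict[str, asyncio.Queue] = {}

    async def watcher_put(key: str, raw_event: object) -> None:
        try:
            # Succeeds as long as the worker has not yet deleted its stream,
            # even if that worker's wait_for() has already timed out internally.
            await STREAMS[key].put(raw_event)
        except KeyError:
            # Only this branch spawns a replacement worker. If the put() above
            # succeeded against a dying worker, nobody will ever drain the queue.
            STREAMS[key] = asyncio.Queue()
            await STREAMS[key].put(raw_event)
            asyncio.create_task(worker(key))

    async def worker(key: str, idle_timeout: float = 5.0) -> None:
        backlog = STREAMS[key]
        try:
            while True:
                try:
                    raw_event = await asyncio.wait_for(backlog.get(), timeout=idle_timeout)
                except asyncio.TimeoutError:
                    # The race: the timeout may have expired while the event loop was
                    # busy elsewhere; a put() can slip in before this line runs, so
                    # backlog.qsize() can be > 0 at this point.
                    break
                print("processing", raw_event)
        finally:
            # Deleting the stream here silently discards whatever slipped into the
            # backlog after the timeout fired -- these are the lost events.
            del STREAMS[key]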

Environment

  • Kopf version: 1.30.2
  • Kubernetes version: 1.17
  • Python version: 3.9.1rc1
  • OS/platform: Linux
@nolar
Owner

nolar commented Apr 3, 2021

@paxbit Thanks for reporting this issue in such detail. That let me fully understand it directly from the description, without a repro. Indeed, this might be the case.

May I ask what the scale of the cluster you are operating is? Roughly by the number of resources being operated: tens, hundreds, thousands? I am trying to understand in which circumstances this event loss becomes realistic.

Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Can you please try this patch or branch in your environment?

@paxbit
Author

paxbit commented Apr 6, 2021

@nolar

> May I ask what the scale of the cluster you are operating is? Roughly by the number of resources being operated: tens, hundreds, thousands? I am trying to understand in which circumstances this event loss becomes realistic.

The cluster is 24 nodes, each with 96-128 SMT cores and 1.5 TB RAM, and in production the operator will probably have to handle 500-1000 resources. But on the day I wrote this issue, I could reproduce it reliably with something like 5 managed resources. That day the various latencies added up in such a way that it took only 2-4 runs to hit the situation.

> Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Can you please try this patch or branch in your environment?

Yeah, that's basically how my monkey patch is doing it currently. Seems to work.
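
For the record, one possible shape of such a guard (illustrative only; not necessarily what #732 or my monkey patch do verbatim) is to treat the idle timeout as an exit condition only when the backlog is verifiably empty:

    import asyncio

    async def worker(streams: dict, key: str, idle_timeout: float = 5.0) -> None:
        backlog = streams[key]
        try:
            while True:
                try:
                    raw_event = await asyncio.wait_for(backlog.get(), timeout=idle_timeout)
                except asyncio.TimeoutError:
                    if backlog.empty():
                        break      # truly idle: safe to exit and garbage-collect
                    continue       # an event slipped in around the timeout: keep consuming
                print("processing", raw_event)
        finally:
            # No await between the empty() check and this deletion, so no other
            # coroutine can sneak a put() in on the same event loop in between.
            del streams[key]

Since asyncio is cooperative, the empty() check and the deletion run without an intervening suspension point, so the watcher cannot enqueue anything in between on the same loop.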

@nolar
Owner

nolar commented Jul 4, 2021

Good news! In #784, a similar issue with event loss was reported. Unlike here, it was happening not at a big scale but at a relatively small one, with multiple synchronous functions inside async coroutines, which indirectly simulated an extremely laggy network. As a result, I was able to capture it in an artificial snippet with 100% reproducibility even with 1 object involved; see the manual scenario in #732. After thorough consideration, that hypothetical fix can now be considered a real and proper fix (though still untestable).
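
To illustrate why this matters (a standalone toy example, not the snippet from #784 or #732): a blocking synchronous call inside a coroutine stalls the whole event loop, so a wait_for() timeout becomes overdue and is only delivered when the loop resumes, right next to whatever else queued up in the meantime:

    import asyncio
    import time

    async def slow_handler() -> None:
        time.sleep(0.5)   # synchronous call inside async code: the whole loop stalls

    async def idle_worker() -> None:
        backlog: asyncio.Queue = asyncio.Queue()
        start = time.monotonic()
        try:
            await asyncio.wait_for(backlog.get(), timeout=0.1)
        except asyncio.TimeoutError:
            # Expected after ~0.1s, but observed only after ~0.5s: the timeout
            # could not be delivered while slow_handler() was holding the loop.
            print(f"timed out after {time.monotonic() - start:.2f}s")

    async def main() -> None:
        await asyncio.gather(idle_worker(), slow_handler())

    asyncio.run(main())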

The fix is released together with some other improvements as version 1.33rc1. As with any RC (release candidate), be careful with testing: do not do it on real production or staging clusters; start with local isolated environments if possible. However, the changeset is not big, so the risk seems low this time. The full release is expected in 1-2 weeks from now.
