Race Condition in Stream Multiplexing #718

Open
paxbit opened this issue Mar 19, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@paxbit

paxbit commented Mar 19, 2021

Long story short

If one is lucky enough to hit pause times on established multiplexed streams that line up with the default settings.batching.batch_window, events may not get dispatched. Instead, they are lost and never processed. There is a race condition in the handling of the backlog queue.

Description

kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...) do not properly lock each other against backlog queue modifications, leading to lost events.

After moving the operator to a different environment, I began seeing a lot of seemingly missed events. At first I thought the cluster API had issues, because twice the missed events correlated with a 500 Internal Server Error from the kube API very shortly before.

However, after seeing many more missed events without any kube API server error before or after, I started debugging kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...).

What I found:

    await streams[key].backlog.put(raw_event)

might successfully put an event without triggering a KeyError (the KeyError is what would cause a new worker to be spawned), while the task waiting for a new event to arrive at

    raw_event = await asyncio.wait_for(
        backlog.get(),
        timeout=settings.batching.idle_timeout)

might have already timed out but has not yet raised, because the event loop has not come back to it yet.

This leads to situations where backlog.qsize() > 0 inside

    except asyncio.TimeoutError:
        break

is possible, and this is also what I am seeing for each of the missed events in the operator. Breaking at this point with a non-zero qsize then discards the event, leaving it unprocessed, via the queue deletion happening at
    finally:
        # Whether an exception or a break or a success, notify the caller, and garbage-collect our queue.
        # The queue must not be left in the queue-cache without a corresponding job handling this queue.
        try:
            del streams[key]
        except KeyError:
            pass
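
To make the interleaving above concrete, here is a simplified, self-contained model of the watcher/worker pair (names, structure, and the timeout value are illustrative, not kopf's actual code); the comments mark the window in which an event can be accepted and then silently dropped:

    import asyncio

    STREAMS: dict[str, asyncio.Queue] = {}

    async def watcher_put(key: str, raw_event: object) -> None:
        try:
            # Succeeds as long as the worker has not yet deleted its stream,
            # even if that worker's wait_for() has already timed out internally.
            await STREAMS[key].put(raw_event)
        except KeyError:
            # Only this branch spawns a replacement worker. If the put() above
            # succeeded against a dying worker, nobody will ever drain the queue.
            STREAMS[key] = asyncio.Queue()
            await STREAMS[key].put(raw_event)
            asyncio.create_task(worker(key))

    async def worker(key: str, idle_timeout: float = 5.0) -> None:
        backlog = STREAMS[key]
        try:
            while True:
                try:
                    raw_event = await asyncio.wait_for(backlog.get(), timeout=idle_timeout)
                except asyncio.TimeoutError:
                    # The race: the timeout may have expired while the event loop was
                    # busy elsewhere; a put() can slip in before this line runs, so
                    # backlog.qsize() can be > 0 at this point.
                    break
                print("processing", raw_event)
        finally:
            # Deleting the stream here silently discards whatever slipped into the
            # backlog after the timeout fired -- these are the lost events.
            del STREAMS[key]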

Environment

  • Kopf version: 1.30.2
  • Kubernetes version: 1.17
  • Python version: 3.9.1rc1
  • OS/platform: Linux
@nolar
Owner

nolar commented Apr 3, 2021

@paxbit Thanks for reporting this issue in such detail. That let me fully understand it directly from the description, without a repro. Indeed, this might be the case.

May I ask what the scale of the cluster you are operating is? Roughly by the number of resources being operated: tens, hundreds, thousands? I am trying to understand in which circumstances this event loss becomes realistic.

Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Can you please try this patch or branch in your environment?

@paxbit
Author

paxbit commented Apr 6, 2021

@nolar

> May I ask what the scale of the cluster you are operating is? Roughly by the number of resources being operated: tens, hundreds, thousands? I am trying to understand in which circumstances this event loss becomes realistic.

The cluster is 24 nodes, each with 96-128 SMT cores and 1.5 TB RAM, and in production the operator will probably have to handle 500-1000 resources. But on the day I wrote this issue, I could reproduce it reliably with something like 5 managed resources. That day the various latencies added up in such a way that it took only 2-4 runs to hit the situation.

> Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Can you please try this patch or branch in your environment?

Yeah, that's basically how my monkey patch is doing it currently. Seems to work.
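
For the record, one possible shape of such a guard (illustrative only; not necessarily what #732 or my monkey patch do verbatim) is to treat the idle timeout as an exit condition only when the backlog is verifiably empty:

    import asyncio

    async def worker(streams: dict, key: str, idle_timeout: float = 5.0) -> None:
        backlog = streams[key]
        try:
            while True:
                try:
                    raw_event = await asyncio.wait_for(backlog.get(), timeout=idle_timeout)
                except asyncio.TimeoutError:
                    if backlog.empty():
                        break      # truly idle: safe to exit and garbage-collect
                    continue       # an event slipped in around the timeout: keep consuming
                print("processing", raw_event)
        finally:
            # No await between the empty() check and this deletion, so no other
            # coroutine can sneak a put() in on the same event loop in between.
            del streams[key]

Since asyncio is cooperative, the empty() check and the deletion run without an intervening suspension point, so the watcher cannot enqueue anything in between on the same loop.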

@nolar
Owner

nolar commented Jul 4, 2021

Good news! In #784, a similar issue with event loss was reported. Unlike here, it was happening not at a big scale but at a relatively small one, with multiple synchronous functions inside async coroutines, which indirectly simulated an extremely laggy network. As a result, I was able to capture it in an artificial snippet with 100% reproducibility even with 1 object involved; see the manual scenario in #732. After thorough consideration, that hypothetical fix can now be considered a real and proper fix (though still untestable).
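
To illustrate why this matters (a standalone toy example, not the snippet from #784 or #732): a blocking synchronous call inside a coroutine stalls the whole event loop, so a wait_for() timeout becomes overdue and is only delivered when the loop resumes, right next to whatever else queued up in the meantime:

    import asyncio
    import time

    async def slow_handler() -> None:
        time.sleep(0.5)   # synchronous call inside async code: the whole loop stalls

    async def idle_worker() -> None:
        backlog: asyncio.Queue = asyncio.Queue()
        start = time.monotonic()
        try:
            await asyncio.wait_for(backlog.get(), timeout=0.1)
        except asyncio.TimeoutError:
            # Expected after ~0.1s, but observed only after ~0.5s: the timeout
            # could not be delivered while slow_handler() was holding the loop.
            print(f"timed out after {time.monotonic() - start:.2f}s")

    async def main() -> None:
        await asyncio.gather(idle_worker(), slow_handler())

    asyncio.run(main())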

The fix is released together with some other improvements as version 1.33rc1. As with any RC (release candidate), be careful with testing: do not do it on real production or staging clusters; start with local isolated environments if possible. However, the changeset is not big, so the risk seems low this time. The full release is expected in 1-2 weeks from now.
