It is possible for a CCF node to deadlock when both ringbuffers are full #628

Closed
eddyashton opened this issue Dec 11, 2019 · 0 comments

Comments

@eddyashton
Member

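CCF nodes can deadlock when both the inbound and outbound ringbuffers are full. Our assumption that the host will always clear the outbound ringbuffer, allowing the whole system to make progress, does not hold. We have a single host-side uv thread making blocking write attempts, and while that thread is blocked in a write it cannot drain the outbound ringbuffer. If both ringbuffers fill, each writer waits for space that only the other, equally blocked, side could free, and the node gets stuck.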
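To make the failure mode concrete, here is a toy model of it. This is not CCF code: `BoundedBuffer`, `Party` and `simulate_deadlock` are hypothetical names, the buffers only count bytes, and the "enclave" label for the side that writes to the outbound ringbuffer is an assumption. Each party produces faster than it consumes and uses a blocking write, so it reads nothing while a write is pending.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a ringbuffer: tracks only how many bytes are used.
struct BoundedBuffer
{
  std::size_t capacity;
  std::size_t used = 0;

  // Non-blocking attempt: fails if there is not enough free space.
  bool try_write(std::size_t n)
  {
    assert(used <= capacity);
    if (n > capacity - used)
      return false;
    used += n;
    return true;
  }

  // Frees up to n bytes.
  void consume(std::size_t n)
  {
    used -= std::min(n, used);
  }
};

// One side of the node. It reads from `in` and does *blocking* writes to
// `out`: while a write is pending it retries that write and never reads.
struct Party
{
  BoundedBuffer& in;
  BoundedBuffer& out;
  std::size_t pending = 0; // size of the write we are stuck on, 0 if none

  bool blocked() const
  {
    return pending != 0;
  }

  // One scheduling slice: consume some input, then produce some output.
  void step(std::size_t consume_bytes, std::size_t produce_bytes)
  {
    if (blocked())
    {
      // Still inside the blocking write: retry it and do nothing else.
      if (out.try_write(pending))
        pending = 0;
      return;
    }
    in.consume(consume_bytes);
    if (!out.try_write(produce_bytes))
      pending = produce_bytes;
  }
};

// Returns true if both parties are blocked once the simulation ends.
bool simulate_deadlock()
{
  BoundedBuffer inbound{32};  // host -> enclave (assumed direction)
  BoundedBuffer outbound{32}; // enclave -> host (assumed direction)
  Party host{outbound, inbound};
  Party enclave{inbound, outbound};

  for (int i = 0; i < 100; ++i)
  {
    host.step(8, 16);    // e.g. incoming TLS data outpacing reads
    enclave.step(8, 16); // e.g. verbose logging outpacing reads
  }
  return host.blocked() && enclave.blocked();
}
```

Once both parties are blocked, neither calls `consume` again, so no later retry can succeed. That matches the gdb state below, where both threads sit in a waiting `prepare` call.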

We can get a consistent repro in a debug, verbose-logging build by running the logging_scenario_perf_test. This test passes no --max-writes-ahead argument, so the incoming TLS messages can overwhelm the inbound ringbuffer (while verbose logging fills the outbound ringbuffer). We should be able to reproduce this in other perf tests by removing their --max-writes-ahead argument.

Here's a representative deadlocked state from gdb, with both threads stuck in ringbuffer::Writer::prepare with wait=true:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f4a151c5b80 (LWP 9168) "cchost.virtual" ringbuffer::Writer::prepare (this=0x8e2fd0, m=1271228756, size=16392, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254
  2    Thread 0x7f4a0c7f1700 (LWP 9169) "cchost.virtual" ringbuffer::Writer::prepare (this=0x88c140, m=1225314783, size=86, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254

I see 3 potential fixes:

  1. Add a second host thread, responsible for processing the outbound messages. Currently we do this in HandleRingbufferImpl::every(), on the main uv loop thread. This may introduce synchronization issues (for example, accessing sockets from multiple threads).
  2. Switch from WRITE to TRY_WRITE on the host when writing to the ringbuffers. Each call site would decide individually how to handle a failed write, and in some cases we might need to drop inbound data entirely.
  3. Replace the blocking WRITEs on the host with a TRY_WRITE that queues any work it could not write. On each uv loop iteration, we first try to drain this queue with further TRY_WRITE attempts; if the queue cannot be emptied, all new writes in that iteration are appended to it instead of being attempted. Some care may be needed to ensure we don't break the ordering of message writes, but this should be possible while the host remains single-threaded.

Option 3 looks the most promising.
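A rough sketch of how option 3 might look, to make the ordering concern concrete. Everything here is an assumption for illustration: `QueuedWriter`, `Message` and `TryWriteFn` are hypothetical names, and `TryWriteFn` merely stands in for whatever TRY_WRITE does on the real ringbuffer (returning false when there is no room).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Message = std::vector<std::uint8_t>;

// Hypothetical non-blocking write primitive: returns false when the
// ringbuffer has no room for the message.
using TryWriteFn = std::function<bool(const Message&)>;

class QueuedWriter
{
  TryWriteFn try_write;
  std::deque<Message> pending; // messages not yet written, oldest first

public:
  explicit QueuedWriter(TryWriteFn fn) : try_write(std::move(fn))
  {
    assert(static_cast<bool>(try_write));
  }

  // Call once per uv loop iteration, before any new writes.
  // Returns true if the queue is now empty.
  bool flush()
  {
    while (!pending.empty())
    {
      if (!try_write(pending.front()))
        return false; // still full: keep the remaining order intact
      pending.pop_front();
    }
    return true;
  }

  // Never blocks. If older messages are still queued, the new message
  // queues behind them so that write order is preserved.
  void write(Message msg)
  {
    if (!pending.empty() || !try_write(msg))
      pending.push_back(std::move(msg));
  }

  std::size_t queued() const
  {
    return pending.size();
  }
};
```

The key design choice is that `write` never attempts a direct write while anything is queued, even if the ringbuffer has room again by then. That is what keeps messages in order on a single thread. The queue is unbounded in this sketch, so a real version would presumably need a cap or some form of backpressure.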
