It is possible for a CCF node to deadlock when both ringbuffers are full #628

eddyashton · 2019-12-11T11:49:04Z

CCF nodes can deadlock when both the inbound and outbound ringbuffers are full. Our assumption that the host will always clear the outbound ringbuffer, allowing the whole system to progress, does not hold. We have a single host-side uv thread making blocking write attempts, so if both ringbuffers fill, the host blocks on a write to the inbound ringbuffer, never drains the outbound one, and the node gets stuck.

We can get a consistent repro in a debug, verbose-logging build by running the logging_scenario_perf_test. This test passes no --max-writes-ahead argument, so incoming TLS messages can overwhelm the inbound ringbuffer (while verbose logging fills the outbound ringbuffer). We should be able to reproduce this in other perf tests by removing their --max-writes-ahead argument.

Here's a representative locked point from gdb:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f4a151c5b80 (LWP 9168) "cchost.virtual" ringbuffer::Writer::prepare (this=0x8e2fd0, m=1271228756, size=16392, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254
  2    Thread 0x7f4a0c7f1700 (LWP 9169) "cchost.virtual" ringbuffer::Writer::prepare (this=0x88c140, m=1225314783, size=86, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254

I see 3 potential fixes:

1. Add a second host thread, responsible for processing the outbound messages. Currently we do this in HandleRingbufferImpl::every(), on the main uv loop thread. This may introduce synchronization issues (trying to access sockets from multiple threads).
2. Swap to using TRY_WRITE rather than WRITE on the host when writing to the ringbuffers. Each call site would decide individually how to handle failures, and in some cases we might need to drop inbound data entirely.
3. Replace the blocking WRITEs on the host with a TRY_WRITE followed by queuing outstanding work. On each uv loop iteration, we try to empty this queue with further TRY_WRITE attempts; if the queue cannot be emptied, all new writes in that iteration are appended to it. Some care may be needed to ensure we don't break the ordering of message writes, but this should be possible while the host remains single-threaded.

Option 3 looks the most promising.
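
A minimal sketch of the idea behind option 3, not the actual CCF change. BoundedBuffer and QueuedWriter are hypothetical names invented for illustration; BoundedBuffer stands in for a ringbuffer with a non-blocking try_write, and is not CCF's ringbuffer::Writer. The point it tries to show is that a write never blocks, and that once anything is queued, later writes go behind it so message order is preserved.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a fixed-capacity ringbuffer (assumption, not CCF code).
class BoundedBuffer
{
public:
  explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Non-blocking write: returns false instead of waiting when there is no room.
  bool try_write(const std::string& msg)
  {
    if (used_ + msg.size() > capacity_)
      return false;
    used_ += msg.size();
    messages_.push_back(msg);
    return true;
  }

  // Consumer side: removes the oldest message, if any.
  bool read(std::string& out)
  {
    if (messages_.empty())
      return false;
    out = std::move(messages_.front());
    messages_.pop_front();
    used_ -= out.size();
    return true;
  }

private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::deque<std::string> messages_;
};

// Hypothetical host-side wrapper: try to write, otherwise queue for later.
class QueuedWriter
{
public:
  explicit QueuedWriter(BoundedBuffer& target) : target_(target) {}

  // Never blocks. If older messages are still pending, the new message is
  // queued behind them even if it would fit, so ordering is kept.
  void write(std::string msg)
  {
    if (!pending_.empty() || !target_.try_write(msg))
      pending_.push_back(std::move(msg));
  }

  // Intended to be called once per loop iteration: drains the queue in order
  // and stops at the first message that still does not fit.
  void flush()
  {
    while (!pending_.empty() && target_.try_write(pending_.front()))
      pending_.pop_front();
  }

  std::size_t pending() const
  {
    return pending_.size();
  }

private:
  BoundedBuffer& target_;
  std::deque<std::string> pending_;
};
```

In this sketch the pending queue is unbounded, so a real implementation would presumably need some limit or back-pressure, and a message larger than the buffer's capacity would stay queued forever. Because the host thread always returns to its loop, it can keep draining the outbound ringbuffer, which is what should break the circular wait.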

eddyashton added the bug label Dec 11, 2019

achamayou added the liveness label Dec 11, 2019

eddyashton mentioned this issue Dec 12, 2019

Avoid deadlock by queuing message writes on the host #645

Merged

achamayou closed this as completed Jan 5, 2020
