CCF nodes can deadlock when both the inbound and outbound ringbuffers are full. Our assumption that the host will always clear the outbound ringbuffer, allowing the full system to progress, does not hold. We have a single host-side uv thread making blocking write attempts, so if both ringbuffers fill, the node will get stuck.
We can get a consistent repro in a debug, verbose-logging build, running the logging_scenario_perf_test. This test has no --max-writes-ahead argument, so it allows the incoming TLS messages to overwhelm the inbound ringbuffer (while verbose logging fills the outbound buffer). We should be able to replicate this in other perf tests by removing the --max-writes-ahead argument.
Here's a representative locked point from gdb:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f4a151c5b80 (LWP 9168) "cchost.virtual" ringbuffer::Writer::prepare (this=0x8e2fd0, m=1271228756, size=16392, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254
2 Thread 0x7f4a0c7f1700 (LWP 9169) "cchost.virtual" ringbuffer::Writer::prepare (this=0x88c140, m=1225314783, size=86, wait=true, identifier=0x0) at ../src/ds/ringbuffer.h:254
I see 3 potential fixes:
1. Add a second host thread, responsible for processing the outbound messages. Currently we do this in HandleRingbufferImpl::every(), on the main uv loop thread. This may introduce synchronization issues (trying to access sockets from multiple threads).
2. Swap to using TRY_WRITE rather than WRITE on the host, when writing to the ringbuffers. Each call site would decide individually how to handle failures, and in some cases we might need to drop inbound data entirely.
3. Replace the blocking WRITEs on the host with a TRY_WRITE followed by queuing outstanding work. On each uv loop iteration, we attempt to empty this queue with another TRY_WRITE; if that fails, all writes made during this iteration are added to the queue. Some care may be needed to ensure we don't break the ordering of message writes, but this should be possible while the host remains single-threaded.
Option 3 looks the most promising.
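To make the try-then-queue idea concrete, here is a minimal single-threaded sketch. It is not CCF code: `BoundedBuffer` is a hypothetical stand-in for a fixed-capacity ringbuffer, and `QueueingWriter`, `write`, `flush` and `pending_count` are names invented for illustration. The assumption is that the host calls `flush()` once per uv loop iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for a fixed-capacity ringbuffer (not ringbuffer::Writer).
class BoundedBuffer
{
  std::deque<std::vector<uint8_t>> messages;
  size_t capacity_bytes;
  size_t used_bytes = 0;

public:
  explicit BoundedBuffer(size_t capacity) : capacity_bytes(capacity) {}

  // Non-blocking write: returns false instead of waiting when there is no room.
  bool try_write(const std::vector<uint8_t>& msg)
  {
    if (msg.size() > capacity_bytes - used_bytes)
      return false;
    used_bytes += msg.size();
    messages.push_back(msg);
    return true;
  }

  // Reader side: pops the oldest message, if any, and frees its space.
  bool try_read(std::vector<uint8_t>& out)
  {
    if (messages.empty())
      return false;
    out = std::move(messages.front());
    messages.pop_front();
    assert(used_bytes >= out.size());
    used_bytes -= out.size();
    return true;
  }
};

// Hypothetical wrapper: never blocks, queues whatever could not be written.
class QueueingWriter
{
  BoundedBuffer& target;
  std::deque<std::vector<uint8_t>> pending;

public:
  explicit QueueingWriter(BoundedBuffer& b) : target(b) {}

  void write(std::vector<uint8_t> msg)
  {
    // If older messages are still pending, queue behind them even when this
    // one would fit, so that message ordering is preserved.
    if (!pending.empty() || !target.try_write(msg))
      pending.push_back(std::move(msg));
  }

  // Intended to run once per loop iteration: drain in order until the
  // buffer refuses a message. Returns how many messages were written.
  size_t flush()
  {
    size_t written = 0;
    while (!pending.empty() && target.try_write(pending.front()))
    {
      pending.pop_front();
      ++written;
    }
    return written;
  }

  size_t pending_count() const
  {
    return pending.size();
  }
};
```

Because the caller never blocks, the loop thread stays free to drain the other ringbuffer, which is what lets the system make progress again. The trade-off in this sketch is that the pending queue is unbounded, so a real implementation would likely need a cap or some form of backpressure.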