Skip to content

test flake: local bgp tests occasionally timeout after FSM death spiral #563

@taspelund

Description

@taspelund

BGP Test Flakiness: Send Failures Causing Idle/Connect Death Spiral

Overview

BGP tests exhibit rare (~1 in 500 iterations, not 5%) race conditions where connections fail with send errors, leading to Idle ↔ Connect oscillation loops that prevent session establishment. This is observed most often in Channel-based tests, but is sometimes seen with TCP-based tests as well. In either case, the symptom is a send error -- for channels this is usually "sending on a closed channel" while TCP shows IO errors.

Attached Examples:

  • test_basic_update (channel-based)
  • test_basic_peering_active (channel-based)

Symptom: FSM attempts to send messages on connection handles where the underlying channel has been closed, causing immediate send failures and FSM transitions to Idle state. While in Idle, the peer attempts to do the same thing which results in them moving into Idle. The two peers continue to chase each other back and forth between Connect and Idle FSM states.


Test Failure #1: basic_update (Simple Race During Initial Connection)

R1 Perspective (10.0.0.1)

First Death Spiral Cycle:

[FSM LAYER]
06:49:11.369859 - Event: manual start (in Idle)
06:49:11.379350 - Starting connect attempt
06:49:11.380668 - FSM transition: Idle → Connect

[CONNECTION LAYER]
06:49:11.381494 - Event: tcp connection confirmed
06:49:11.381763 - Outbound connection accepted (conn_id: 43e72604)
                  Channel created, recv loop started

[FAILURE POINT]
06:49:11.383416 - ERROR: send_open failed: "sending on a closed channel"
                  State: Connect, conn_id: 43e72604
06:49:11.383761 - Failed to send open, FSM transition to Idle

[FSM LAYER]
06:49:11.384097 - FSM transition: Connect → Idle
06:49:11.384586 - Inbound connection rejected: "not allowed in idle" (conn_id: 00bcde1d)

Second Cycle (5.6 seconds later):

[FSM LAYER]
06:49:17.008047 - Idle hold timer expires
06:49:17.009329 - FSM transition: Idle → Connect

[CONNECTION LAYER]
06:49:17.019837 - Event: tcp connection confirmed
06:49:17.020205 - Outbound connection accepted (conn_id: 90a97553)

[SUCCESS - BUT THEN COLLISION]
06:49:17.021361 - FSM transition: Connect → Open Sent
06:49:17.022540 - FSM transition: Open Sent → Open Confirm
06:49:17.023992 - FSM transition: Open Confirm → Established

[COLLISION NOTIFICATION]
06:49:17.022870 - Received notification: Connection Collision Resolution
06:49:17.024372 - Notification in Established state, FSM → Idle
06:49:17.024608 - FSM transition: Established → Idle

Third Cycle (5 seconds later):

[FSM LAYER]
06:49:22.026535 - Idle hold timer expires
06:49:22.027407 - FSM transition: Idle → Connect

[CONNECTION LAYER]
06:49:22.032936 - Event: tcp connection confirmed
06:49:22.033163 - Outbound connection accepted (conn_id: f639b835)

[FAILURE AGAIN]
06:49:22.034335 - ERROR: send_open failed: "sending on a closed channel"
06:49:22.034619 - Failed to send open, FSM transition to Idle
06:49:22.034903 - FSM transition: Connect → Idle

R2 Perspective (10.0.0.2)

Matching Timeline:

[FSM LAYER]
06:49:11.385698 - Event: manual start (in Idle)
06:49:11.385942 - Starting connect attempt
06:49:11.386585 - FSM transition: Idle → Connect

[CONNECTION LAYER]
06:49:11.387070 - Event: tcp connection confirmed
06:49:11.387328 - Outbound connection accepted (conn_id: 13c5a1a4)

[FAILURE]
06:49:11.388492 - ERROR: send_open failed: "sending on a closed channel"
06:49:11.388829 - Failed to send open, FSM transition to Idle
06:49:11.389079 - FSM transition: Connect → Idle

[COLLISION HANDLING - 5.6 seconds later]
06:49:17.011160 - In Open Sent when collision detected
06:49:17.012213 - FSM transition: Open Sent → Connection Collision
06:49:17.017979 - Collision resolution: local id 2 > remote id 1 (wins)
06:49:17.018488 - Sending notification to close losing connection
06:49:17.018905 - FSM transition: Connection Collision → Open Sent

Analysis

Both peers experience immediate channel closure (~1-2ms after connection acceptance), suggesting the recv loop detects closure and exits before the FSM can send the Open message. The death spiral occurs because:

  1. Connection accepted → recv loop started
  2. Channel closes almost immediately (peer rejection or network layer issue)
  3. Recv loop exits, drops Receiver
  4. FSM tries send_open() → fails with "closed channel"
  5. FSM transitions to Idle
  6. Peer's connection attempts get rejected ("not allowed in idle")
  7. Cycle repeats every ~5-6 seconds (idle hold timer)

Test Failure #2: basic_peering_active (Collision Resolution Confusion)

R1 Perspective (12.0.0.1)

Initial Collision Detection:

[FSM LAYER]
16:07:36.837799 - FSM transition: Idle → Connect
16:07:36.839872 - FSM transition: Connect → Open Sent

[COLLISION DETECTED]
16:07:36.841572 - FSM transition: Open Sent → Connection Collision
16:07:36.849090 - Sending notification: Connection Collision Resolution (conn_id: 47ec682d)
16:07:36.852014 - FSM transition: Connection Collision → Open Sent

[UNEXPECTED NOTIFICATION]
16:07:36.854861 - Received notification on WRONG connection (conn_id: c5e7c8c9)
                  This should have been on 47ec682d!
16:07:36.862389 - FSM transition: Open Sent → Open Confirm
16:07:36.863459 - Unexpected notification in Open Confirm
16:07:36.863815 - FSM transition: Open Confirm → Idle

Death Spiral Begins:

[5.9 seconds later]
16:07:42.771088 - FSM transition: Idle → Connect
16:07:42.777016 - FSM transition: Connect → Open Sent

[FAILURE - 8.5 seconds later]
16:07:51.319843 - ERROR: "sending on a closed channel" (conn_id: 64ff1607)
16:07:51.320625 - Failed to send notification

[11.2 seconds later]
16:08:02.490402 - ERROR: "sending on a closed channel" (conn_id: b1f1a0d1)
16:08:02.491964 - Failed to send notification

R2 Perspective (12.0.0.2)

Collision from R2's View:

[CONNECTION SETUP]
16:07:36.824482 - Connector spawned (outbound to R1)
16:07:36.825311 - Outbound connection accepted (conn_id: a079ac2c)

[COLLISION DETECTED]
16:07:36.848735 - Collision detected: exist=connector, new=dispatcher
16:07:36.849486 - FSM transition: Open Sent → Connection Collision
16:07:36.856723 - Resolving collision: connector wins (opposite of R1!)
16:07:36.859992 - Sending notification to close dispatcher connection
16:07:36.862619 - FSM transition: Connection Collision → Open Confirm

[PROBLEM]
16:07:36.863107 - Received notification (unexpected)
16:07:36.866189 - FSM transition: Established → Idle

Critical Finding: Contradictory Collision Decisions

R1 decided: Dispatcher connection wins (keep 47ec682d, close c5e7c8c9)
R2 decided: Connector connection wins (keep a079ac2c, close 3cc022fe)

But these are THE SAME connections viewed from opposite sides:

  • R1's connector (c5e7c8c9) = R2's dispatcher connection
  • R1's dispatcher (47ec682d) = R2's connector connection

Both peers send notifications to close what they think is the "losing" connection, but they're actually closing BOTH connections, leading to total connection loss.


Root Cause Analysis

Primary Issue: Connection Handle Lifecycle Race

The race occurs at the boundary between asynchronous connection operations and the serialized FSM event queue:

Time   Connection Layer           FSM Event Queue        FSM Processing
─────────────────────────────────────────────────────────────────────
T0:    Channel closes
T1:    Recv loop detects, exits   [Pending Open msg]
T2:    Receiver dropped            [Pending Open msg]
T3:                                                      Process Open
T4:                                                      send_open() FAILS
T5:                                                      FSM → Idle

Secondary Issue: Collision Resolution Logic Error

The collision resolution makes decisions based on local perspective without accounting for connection directionality:

  • Each peer sees "connector" (outbound) and "dispatcher" (inbound) from their own perspective
  • The same physical connection has opposite roles at each end
  • Current logic compares BGP IDs but applies decision to locally-perceived connection roles

Why Channel Tests Fail More Often

  1. Instantaneous closure: No TCP graceful shutdown delays
  2. Tighter timing: Microsecond-level race windows
  3. Test setup: Simultaneous router startup increases collision probability

Impact

  • Frequency: ~0.2% of test runs (1 in ~500 iterations)
  • Recovery: FSM eventually recovers via idle hold timer (5-6 seconds per attempt), but not before test timeout
  • Production Impact: Minimal - TCP connections have different timing characteristics
  • Test Impact: Causes spurious CI failures

Attempted Fixes (Why They Didn't Work)

  1. Silent drop in Idle state: Only handles inbound connections; doesn't prevent the FSM from reaching Idle due to send failures
  2. Non-standard FSM transition to Active on collision notifications: Handles async collision scenarios but not the initial send failure. RFC states Notifications should be handled by transitioning to Idle. This is a narrowly scoped change to handle Connection Collision Resolution notifications by moving into Active.

Reproduction Steps

cargo nextest run -p bgp --stress-duration 12h

Almost always fails in less than 1k iterations. Look for failed assertions where one router should have moved into Established state.
In the logs, search for the string "->" to find FSM transitions. Look for several oscillations between Idle and Connect, usually preceded by a send error.

Failure Logs:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bgpBorder Gateway ProtocolmgdMaghemite daemon

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions