Connection Timeout Networking Bug [All versions] #6822

PettitWesley · 2023-02-10T07:16:17Z

Connection Timeout Networking Bug

Impact

If this bug occurs, a coroutine that is waiting on a timed-out connection may never wake up. The work that the coroutine was spawned for (flushing data to some output) would then never be completed.

Background: How the code works

The following are key to understanding this bug:

Coroutine yield and resume: Fluent Bit uses "coroutines", a concurrent programming model in which subroutines can be paused and resumed. Coroutines are cooperative routines: instead of blocking, they cooperatively pass execution between each other. Coroutines are implemented as part of Fluent Bit's core network IO libraries. When a blocking network IO operation is made (for example, waiting for a response on a socket), a routine will cooperatively yield (pause itself) and pass execution to the Fluent Bit engine, which will schedule (activate) other routines. Once the blocking IO operation is complete, the sleeping coroutine will be scheduled again (resumed). This model allows Fluent Bit to achieve performance benefits without the headaches that often come from having multiple active threads.
When async networking is used and a new connection must be created, a coroutine attempts to establish the connection and then yields. (This is a simplification: if TLS is used, it may yield and resume many times as it performs the handshake.)
- 2.0.9: net_connect_async: https://github.com/fluent/fluent-bit/blob/v2.0.9/src/flb_network.c#L456
- 1.9.10: net_connect_async: https://github.com/fluent/fluent-bit/blob/v1.9.10/src/flb_network.c#L442
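To make the yield/resume idea concrete, here is a minimal sketch using POSIX <ucontext.h> (available on Linux/glibc). It is not Fluent Bit's coroutine implementation; every name in it (run_flush, flush_coroutine, demo_yield, demo_resume) is hypothetical. It only shows that a coroutine which yields makes no further progress unless something resumes it.

```c
#include <assert.h>
#include <ucontext.h>

/* Hypothetical demo names; Fluent Bit uses its own coroutine library. */
static ucontext_t engine_ctx;          /* stands in for the engine/event loop */
static ucontext_t co_ctx;              /* the "flush" coroutine */
static char co_stack[64 * 1024];       /* private stack for the coroutine */
static volatile int steps_done;        /* progress made by the coroutine */

/* Pause the coroutine and hand control back to the engine. */
static void demo_yield(void)  { swapcontext(&co_ctx, &engine_ctx); }

/* Switch from the engine into the coroutine. */
static void demo_resume(void) { swapcontext(&engine_ctx, &co_ctx); }

static void flush_coroutine(void)
{
    steps_done = 1;   /* start the "connect" */
    demo_yield();     /* would block: wait to be woken up */
    steps_done = 2;   /* woken up: finish the work */
}

/*
 * Start the coroutine and let it run until it yields. If wake_up is
 * non-zero, resume it once so that it can finish. Returns the progress.
 */
int run_flush(int wake_up)
{
    int ret;

    steps_done = 0;
    ret = getcontext(&co_ctx);
    assert(ret == 0);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &engine_ctx;      /* go back to the engine on return */
    makecontext(&co_ctx, flush_coroutine, 0);

    demo_resume();                     /* runs until the first yield */
    if (wake_up) {
        demo_resume();                 /* resumes after the yield */
    }
    return steps_done;
}
```

Calling run_flush(1) lets the coroutine finish, while run_flush(0) leaves it parked at its yield point with the work half done, which is the situation this issue describes.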
To check for timed-out network connections that have not been successfully established, Fluent Bit runs a timer event every 1.5 seconds to check all connections. If a connection is found to be timed out, it is shut down and marked for destruction. The connection’s event is injected into the event loop to ensure that the coroutine waiting for the connection is woken up to perform cleanup. This occurs in flb_upstream_conn_timeouts.
- 2.0.9: https://github.com/fluent/fluent-bit/blob/v2.0.9/src/flb_upstream.c#L842
- 1.9.10: https://github.com/fluent/fluent-bit/blob/v1.9.10/src/flb_upstream.c#L846
prepare_destroy_conn, which marks connections for deletion, calls mk_event_del immediately.
- 2.0.9: https://github.com/fluent/fluent-bit/blob/v2.0.9/src/flb_upstream.c#L440
- 1.9.10: https://github.com/fluent/fluent-bit/blob/v1.9.10/src/flb_upstream.c#L433
When the prevent_duplication flag is set, mk_event_inject only injects an event into the event loop if the event is not already in the triggered list of events.
- 2.0.9: https://github.com/fluent/fluent-bit/blob/v2.0.9/lib/monkey/mk_core/mk_event_epoll.c#L406
- 1.9.10: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/mk_core/mk_event_epoll.c#L371
mk_event_del calls mk_list_del on the _priority_head of the event if it is set. This means that it removes the event from the priority bucket queue in the priority event loop.
- 2.0.9: https://github.com/fluent/fluent-bit/blob/v2.0.9/lib/monkey/mk_core/mk_event_epoll.c#L174
- 1.9.10: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/mk_core/mk_event_epoll.c#L166

The key to understanding this bug is to notice the ordering of the code in flb_upstream_conn_timeouts and notice that the prevent_duplication flag is set for the mk_event_inject call:

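if (drop == FLB_TRUE) {
    if (u_conn->event.status != MK_EVENT_NONE) {
        /* Last argument is prevent_duplication: the inject is skipped if
         * the event is already in the triggered list */
        mk_event_inject(u_conn->evl, &u_conn->event,
                        MK_EVENT_READ | MK_EVENT_WRITE,
                        FLB_TRUE);
    }
    u_conn->net_error = ETIMEDOUT;
    /* Calls mk_event_del, removing the event from its priority bucket */
    prepare_destroy_conn(u_conn);
}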

How the bug occurs

A coroutine attempts to establish a connection and then yields as it waits.
The mk_event for the connection is triggered and placed on the event loop in the priority queue.
The flb_upstream_conn_timeouts code runs and marks the connection as dropped. This can occur even if the connection has a triggered event on the event loop. The u_conn->ts_connect_timeout is only set when the connection creation begins and is not updated anywhere else in the code. So, if Fluent Bit is slow and/or the configured net.connect_timeout is low, it is possible for the socket to have an event on the event loop and for the timeout to have expired at the same time.
1. First the timeout code calls mk_event_inject with the prevent_duplication flag set to FLB_TRUE.
2. mk_event_inject sees the event already in the triggered list of events and thus returns without doing anything.
3. Next, the timeout code calls prepare_destroy_conn which calls mk_event_del on the connection event. The connection event is removed from its priority bucket in the priority event loop.
Outcome: The connection is destroyed and the event is removed and destroyed without ever running. The coroutine which had yielded on that event is never resumed.
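The sequence above can be sketched with a toy model. This is not the Monkey or Fluent Bit code: struct toy_event and the toy_* functions are hypothetical, and the model assumes that "already in the triggered list" and "sitting in a priority bucket" can be collapsed into a single queued flag. Under those assumptions it reproduces the lost wake-up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a connection event; NOT the real struct mk_event. */
struct toy_event {
    bool queued;     /* already triggered and waiting in a priority bucket */
    bool injected;   /* manually injected, not yet moved into a bucket */
};

/* Models mk_event_inject: with prevent_duplication set, an event that is
 * already triggered is left alone. */
static void toy_inject(struct toy_event *ev, bool prevent_duplication)
{
    assert(ev != NULL);
    if (prevent_duplication && ev->queued) {
        return;
    }
    ev->injected = true;
}

/* Models mk_event_del: unlink the event from its priority bucket. */
static void toy_del(struct toy_event *ev)
{
    assert(ev != NULL);
    ev->queued = false;
}

/* Models one pass of the priority event loop: injected events are moved
 * into a bucket, then bucketed events run. Returns true if the event ran,
 * i.e. the coroutine waiting on it was resumed. */
static bool toy_loop_pass(struct toy_event *ev)
{
    if (ev->injected) {
        ev->injected = false;
        ev->queued = true;
    }
    if (ev->queued) {
        ev->queued = false;
        return true;
    }
    return false;
}

/* The current ordering: inject (with prevent_duplication), then destroy. */
bool buggy_timeout_resumes_coroutine(bool already_triggered)
{
    struct toy_event ev = { .queued = already_triggered, .injected = false };

    toy_inject(&ev, true);   /* step 1 and 2: no-op if already triggered */
    toy_del(&ev);            /* step 3: event leaves its priority bucket */
    return toy_loop_pass(&ev);
}
```

In this model, buggy_timeout_resumes_coroutine(false) reports that the coroutine is resumed, while buggy_timeout_resumes_coroutine(true) reports that it is not: the inject was skipped and the only pending copy of the event was deleted.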

Solution

Option 1: Re-order code

Simply call prepare_destroy_conn first, and then inject the event. prepare_destroy_conn first removes the event from the priority queue if it was already triggered. The conn event is then injected. The priority event loop will next process the injected event and add it to a priority bucket. When the event runs, the coroutine that yielded on it will be resumed.

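if (drop == FLB_TRUE) {
    /* Decide whether to inject before the connection is torn down */
    inject = FLB_FALSE;
    if (u_conn->event.status != MK_EVENT_NONE) {
        inject = FLB_TRUE;
    }
    u_conn->net_error = ETIMEDOUT;
    /* Remove any already-triggered event from the priority queue first */
    prepare_destroy_conn(u_conn);
    /* Then inject, so the waiting coroutine is woken up for cleanup */
    if (inject == FLB_TRUE) {
        mk_event_inject(u_conn->evl, &u_conn->event,
                        MK_EVENT_READ | MK_EVENT_WRITE,
                        FLB_TRUE);
    }
}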

Option 2: Set prevent_duplication to FLB_FALSE in the mk_event_inject call.

This ensures that the event is always added as a triggered event. prepare_destroy_conn can and should then run afterwards to remove the event from the priority queue if it was already triggered. The priority event loop will then process the injected event and add it to a priority bucket. When the event runs, the coroutine that yielded on it will be resumed.
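As a rough check of both options, here is a toy model of the two orderings. It is an illustration only, not the Monkey or Fluent Bit code: struct toy_event and the toy_* functions are hypothetical, and the model assumes a single queued flag can stand for "already triggered and in a priority bucket", with injected events tracked separately until the loop picks them up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a connection event; NOT the real struct mk_event. */
struct toy_event {
    bool queued;     /* already triggered and waiting in a priority bucket */
    bool injected;   /* manually injected, not yet moved into a bucket */
};

/* Models mk_event_inject and its prevent_duplication flag. */
static void toy_inject(struct toy_event *ev, bool prevent_duplication)
{
    assert(ev != NULL);
    if (prevent_duplication && ev->queued) {
        return;
    }
    ev->injected = true;
}

/* Models mk_event_del: unlink the event from its priority bucket. */
static void toy_del(struct toy_event *ev)
{
    assert(ev != NULL);
    ev->queued = false;
}

/* Models one loop pass; returns true if the event ran (coroutine resumed). */
static bool toy_loop_pass(struct toy_event *ev)
{
    if (ev->injected) {
        ev->injected = false;
        ev->queued = true;
    }
    if (ev->queued) {
        ev->queued = false;
        return true;
    }
    return false;
}

/* Option 1: destroy first, then inject with prevent_duplication set. */
bool option1_resumes_coroutine(bool already_triggered)
{
    struct toy_event ev = { .queued = already_triggered, .injected = false };

    toy_del(&ev);
    toy_inject(&ev, true);
    return toy_loop_pass(&ev);
}

/* Option 2: inject with prevent_duplication cleared, then destroy. */
bool option2_resumes_coroutine(bool already_triggered)
{
    struct toy_event ev = { .queued = already_triggered, .injected = false };

    toy_inject(&ev, false);
    toy_del(&ev);
    return toy_loop_pass(&ev);
}
```

In this model both options resume the coroutine whether or not the event had already been triggered. The model says nothing about side effects in the real priority event loop, so it does not settle which option is safer.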


PettitWesley · 2023-02-10T07:17:40Z

@leonardo-albertovich @edsiper Sorry to ping you twice in one night. But @matthewfala and I think we discovered 4 different bugs in the networking and event loop code this week. I wrote this and #6821 up first since they are the simplest to explain.

Anyway, let us know if you agree this could be an issue.

leonardo-albertovich · 2023-02-10T10:21:47Z

I think option 1 would be safer but I'm not familiar enough with the priority system to make a firm statement. Considering that @matthewfala created it I think if he's sure there will be no side effects then that's the option I would prefer.

As a side note, I don't think I would vouch for option 2 even if option 1 was not viable.

PettitWesley added status: waiting-for-triage bug and removed status: waiting-for-triage labels Feb 10, 2023

PettitWesley mentioned this issue Feb 11, 2023

2023 High Impact Issues Notice/Catalogue Ticket aws/aws-for-fluent-bit#542

Open

PettitWesley self-assigned this Feb 13, 2023

This was referenced Feb 13, 2023

upstream_conn: fix ordering of mk_event_inject and prepare_conn_destroy #6842

Merged

upstream_conn: fix ordering of mk_event_inject and prepare_conn_destroy #6843

Merged

PettitWesley mentioned this issue Mar 30, 2023

event_loop: Check loop condition before removing event from bucket queue #5649

Open

3 tasks

PettitWesley closed this as completed in #6843 Apr 11, 2023
