Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection Timeout Networking Bug [All versions] #6822

Closed
PettitWesley opened this issue Feb 10, 2023 · 2 comments · Fixed by #6843
Closed

Connection Timeout Networking Bug [All versions] #6822

PettitWesley opened this issue Feb 10, 2023 · 2 comments · Fixed by #6843
Assignees
Labels

Comments

@PettitWesley
Copy link
Contributor

Connection Timeout Networking Bug

Impact

If this bug occurs it could lead a coroutine that is waiting on a timed out connection to never wake up. This work that the coroutine was spawned for (flushing data to some output) would not be completed.

Background: How the code works

The following are key to understanding this bug:

The key to understanding this bug is to notice the ordering of the code in flb_upstream_conn_timeouts and notice that the prevent_duplication flag is set for the mk_event_inject call:

if (drop == FLB_TRUE) {
    if (u_conn->event.status != MK_EVENT_NONE) {
        mk_event_inject(u_conn->evl, &u_conn->event,
                        MK_EVENT_READ | MK_EVENT_WRITE,
                        FLB_TRUE);
    }
    u_conn->net_error = ETIMEDOUT;
    prepare_destroy_conn(u_conn);
}

How the bug occurs

  1. A coroutine attempts to establish a connection and then yields as it waits.
  2. The mk_event for the connection is triggered and placed on the event loop in the priority queue.
  3. The flb_upstream_conn_timeouts code runs and marks the connection as dropped. This can occur even if the connection has a triggered event on the event loop. The u_conn->ts_connect_timeout is only set when the connection creation begins and is not updated by anywhere in the code. So, if Fluent Bit is slow and/or the configured net.connect_timeout is low then it is possible for the socket to have an event on the event loop and for the timeout to be expired.
    1. First the timeout code calls mk_event_inject with the prevent_duplication flag set to FLB_TRUE.
    2. mk_event_inject sees the event already in the triggered list of events and thus returns without doing anything.
    3. Next, the timeout code calls prepare_destroy_conn which calls mk_event_del on the connection event. The connection event is removed from its priority bucket in the priority event loop.
  4. Outcome: The connection is destroyed and the event is removed and destroyed without ever running. The coroutine which had yielded on that event is never resumed.

Solution

Option 1: Re-order code

Simply call prepare_destroy_conn first, and then inject the event. First, it will remove the event from the priority queue if it was already triggered. It will then next inject the conn event. The priority event loop will then next process the injected event and add it to a priority bucket. When it runs, the coroutine yielded on it will be resumed.

if (drop == FLB_TRUE) {
    inject = FLB_FALSE;
    if (u_conn->event.status != MK_EVENT_NONE) {
        inject = FLB_TRUE;
    }
    u_conn->net_error = ETIMEDOUT;
    prepare_destroy_conn(u_conn);
    if (inject == FLB_TRUE) {
        mk_event_inject(u_conn->evl, &u_conn->event,
                        MK_EVENT_READ | MK_EVENT_WRITE,
                        FLB_TRUE);
    }
}

Option 2: Set prevent_duplication to FLB_FALSE in the mk_event_inject call.

This ensures that the event is always added as a triggered event. The prepare_destroy_conn can and should then run afterwards to remove the event from the priority queue if it was already triggered. The priority event loop will then next process the injected event and add it to a priority bucket. When it runs, the coroutine yielded on it will be resumed.

@PettitWesley
Copy link
Contributor Author

@leonardo-albertovich @edsiper Sorry to ping you twice in one night. But @matthewfala and I think we discovered 4 different bugs in the networking and event loop code this week. I wrote this and #6821 up first since they are the simplest to explain.

Anyway, let us know if you agree this could be an issue.

@leonardo-albertovich
Copy link
Collaborator

I think option 1 would be safer but I'm not familiar enough with the priority system to make a firm statement. Considering that @matthewfala created it I think if he's sure there will be no side effects then that's the option I would prefer.

As a side note, I don't think I would vouch for option 2 even if option 1 was not viable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants