Synchronous Keepalive Networking Bug [All versions] #6821

Closed
@PettitWesley

Credit

I want to be clear that I only wrote this explainer; @matthewfala discovered this issue.

Impact

Fluent Bit can crash when net.keepalive is enabled (the default) and the synchronous networking stack is used. Currently, the Amazon S3 plugin is the main output plugin that uses sync networking.

The stack trace below was obtained from a customer case and shows what can occur when this bug surfaces.

(gdb) bt
#0  0x00007fd0bba0bca0 in raise () from /lib64/libc.so.6
#1  0x00007fd0bba0d148 in abort () from /lib64/libc.so.6
#2  0x000000000045599e in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3  <signal handler called>
#4  0x00000000004fd80e in __mk_list_del (prev=0x0, next=0x0) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:87
#5  0x00000000004fd846 in mk_list_del (entry=0x7fd0b4a42a60) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:93
#6  0x00000000004fe703 in prepare_destroy_conn (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:443
#7  0x00000000004fe786 in prepare_destroy_conn_safe (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:469
#8  0x00000000004ff04b in cb_upstream_conn_ka_dropped (data=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:724
#9  0x00000000004e7cf5 in output_thread (data=0x7fd0b612e100) at /tmp/fluent-bit-1.9.10/src/flb_output_thread.c:298
#10 0x0000000000500712 in step_callback (data=0x7fd0b60f4ac0) at /tmp/fluent-bit-1.9.10/src/flb_worker.c:43
#11 0x00007fd0bd6cc44b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd0bbac752f in clone () from /lib64/libc.so.6

Root Cause

Relevant code in 2.0.9:

Relevant code in 1.9.10:

When net.keepalive is enabled (the default), Fluent Bit tries to keep connections open so that they can be reused. If a connection is closed for any reason (by the remote server, or because of a networking issue), Fluent Bit can no longer reuse it. Therefore, it always adds an event to the event loop to monitor for connection close:

/*
 * The socket at this point is not longer monitored, so if we want to be
 * notified if the 'available keepalive connection' gets disconnected by
 * the remote endpoint we need to add it again.
 */
conn->event.handler = cb_upstream_conn_ka_dropped;

ret = mk_event_add(conn->evl,
                   conn->fd,
                   FLB_ENGINE_EV_CUSTOM,
                   MK_EVENT_CLOSE,
                   &conn->event);

When the connection is freed or cleaned up for any reason, the event must be removed. Currently, in prepare_destroy_conn the event is only removed for the async case:

if (flb_stream_is_async(&u->base)) {
    mk_event_del(u_conn->evl, &u_conn->event);
}

This means that in the sync case, the event can remain on the event loop after the connection is destroyed. If the event was already triggered and is pending processing, or is triggered later, its handler (cb_upstream_conn_ka_dropped) runs on the already freed connection, leading to an invalid memory access and a SIGSEGV crash like the one shown above.
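The failure mode described above can be illustrated with a small standalone model. This is only a sketch, not Fluent Bit code: `toy_conn`, `toy_loop`, and every `toy_*` function are hypothetical stand-ins, and the connection is marked with a `destroyed` flag instead of being freed so that the example stays well-defined.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_EVENTS 8

/* Hypothetical stand-in for a connection; not Fluent Bit's real struct. */
struct toy_conn {
    int is_async;   /* 1 if the stream uses async networking */
    int destroyed;  /* set instead of calling free(), to avoid real UB */
};

/* Hypothetical stand-in for the event loop: a table of registered events. */
struct toy_loop {
    struct toy_conn *events[TOY_MAX_EVENTS];
};

/* Register a close-monitoring event for conn; returns 0 on success. */
static int toy_event_add(struct toy_loop *loop, struct toy_conn *conn)
{
    size_t i;

    assert(loop != NULL && conn != NULL);
    for (i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->events[i] == NULL) {
            loop->events[i] = conn;
            return 0;
        }
    }
    return -1;
}

/* Remove every event that references conn. */
static void toy_event_del(struct toy_loop *loop, const struct toy_conn *conn)
{
    size_t i;

    for (i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->events[i] == conn) {
            loop->events[i] = NULL;
        }
    }
}

/* Mirrors the buggy cleanup: the event is removed only in the async case. */
static void toy_destroy_buggy(struct toy_loop *loop, struct toy_conn *conn)
{
    if (conn->is_async) {
        toy_event_del(loop, conn);
    }
    conn->destroyed = 1;
}

/* Count events that still point at a destroyed connection. */
static int toy_count_stale(const struct toy_loop *loop)
{
    size_t i;
    int stale = 0;

    for (i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->events[i] != NULL && loop->events[i]->destroyed) {
            stale++;
        }
    }
    return stale;
}

/* Destroy one async and one sync connection; return the stale event count. */
static int toy_demo_buggy(void)
{
    struct toy_loop loop = { { NULL } };
    struct toy_conn async_conn = { 1, 0 };
    struct toy_conn sync_conn = { 0, 0 };

    if (toy_event_add(&loop, &async_conn) != 0 ||
        toy_event_add(&loop, &sync_conn) != 0) {
        return -1;
    }
    toy_destroy_buggy(&loop, &async_conn);
    toy_destroy_buggy(&loop, &sync_conn);
    return toy_count_stale(&loop);
}
```

In this model, `toy_demo_buggy()` returns 1: the async connection's event is removed, while the sync connection's event stays registered and still points at a destroyed connection. In the real code, that leftover event is what later fires on freed memory.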

We suspect the code ended up in this state for two reasons: the keepalive code is newer, and the sync case covers connections created by filters and in output plugin init callbacks, neither of which can use async networking.

Solution

Add a flag on the connection that tracks whether the keepalive close event was added, and check it to determine whether the event must be removed. Since an mk_event is only added for async networking, for keepalive connection close monitoring, or for both, this covers all cases.

if (flb_stream_is_async(&u->base) 
    || u_conn->ka_dropped_event_added == FLB_TRUE) {
    mk_event_del(u_conn->evl, &u_conn->event);
}
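The proposed check can be sketched in the same kind of standalone model. Again, this is an illustration under assumptions rather than the actual patch: `sketch_conn`, `sketch_loop`, and the `sketch_*` functions are hypothetical, and only the flag name `ka_dropped_event_added` is taken from the snippet above. Where exactly the real code would set and clear the flag is not specified in this issue.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_EVENTS 8

/* Hypothetical stand-in for a connection; not Fluent Bit's real struct. */
struct sketch_conn {
    int is_async;                /* 1 if the stream uses async networking */
    int ka_dropped_event_added;  /* 1 while the keepalive event is registered */
    int destroyed;               /* set instead of calling free() */
};

/* Hypothetical stand-in for the event loop. */
struct sketch_loop {
    struct sketch_conn *events[SKETCH_MAX_EVENTS];
};

/* Register the keepalive close event and record that fact on the conn. */
static int sketch_ka_event_add(struct sketch_loop *loop,
                               struct sketch_conn *conn)
{
    size_t i;

    assert(loop != NULL && conn != NULL);
    for (i = 0; i < SKETCH_MAX_EVENTS; i++) {
        if (loop->events[i] == NULL) {
            loop->events[i] = conn;
            conn->ka_dropped_event_added = 1;
            return 0;
        }
    }
    return -1;
}

/* Remove every event that references conn. */
static void sketch_event_del(struct sketch_loop *loop,
                             const struct sketch_conn *conn)
{
    size_t i;

    for (i = 0; i < SKETCH_MAX_EVENTS; i++) {
        if (loop->events[i] == conn) {
            loop->events[i] = NULL;
        }
    }
}

/* Cleanup with the proposed check: async OR keepalive event registered. */
static void sketch_destroy_fixed(struct sketch_loop *loop,
                                 struct sketch_conn *conn)
{
    if (conn->is_async || conn->ka_dropped_event_added) {
        sketch_event_del(loop, conn);
        conn->ka_dropped_event_added = 0;
    }
    conn->destroyed = 1;
}

/* Count events that still point at a destroyed connection. */
static int sketch_count_stale(const struct sketch_loop *loop)
{
    size_t i;
    int stale = 0;

    for (i = 0; i < SKETCH_MAX_EVENTS; i++) {
        if (loop->events[i] != NULL && loop->events[i]->destroyed) {
            stale++;
        }
    }
    return stale;
}

/* Destroy a sync keepalive connection; return the stale event count. */
static int sketch_demo_fixed(void)
{
    struct sketch_loop loop = { { NULL } };
    struct sketch_conn sync_conn = { 0, 0, 0 };

    if (sketch_ka_event_add(&loop, &sync_conn) != 0) {
        return -1;
    }
    sketch_destroy_fixed(&loop, &sync_conn);
    return sketch_count_stale(&loop);
}
```

Here `sketch_demo_fixed()` returns 0: even though the connection is synchronous, the flag causes the event to be removed before the connection is destroyed, so nothing on the loop references it afterwards.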
