
in_kubernetes_events: support for net.* options including TCP keepalive settings #10487

Open · wants to merge 1 commit into master

Conversation

@multi-io multi-io commented Jun 17, 2025

Support net.* options including TCP keepalive settings in the kubernetes_events plugin

This allows the user to set net.* options in the kubernetes_events input plugin config. This is particularly useful for configuring TCP keepalive settings because kubernetes_events opens a watch on the Kubernetes API, which is a long-running connection that might see long periods of inactivity during which intermediate networking infrastructure like proxies might drop the connection silently. The Go K8s client sends keepalives automatically, e.g. kubectl get event -w (which opens a watch on k8s events similar to the kubernetes_events plugin) will send keepalives every 30s without the user having to configure anything (those will be HTTP/2 pings rather than raw zero-length TCP keepalives, but serves the same purpose).

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change

Sample config:

    [INPUT]
        name kubernetes_events
        tag k8s_events
        kube_url https://kubernetes.default.svc
        interval_sec 120
        net.keepalive on
        # TCP keepalives every 20s, drop the connection after 2 failed probes
        net.tcp_keepalive on
        net.tcp_keepalive_time 20
        net.tcp_keepalive_interval 20
        net.tcp_keepalive_probes 2
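On Linux, these net.tcp_keepalive_* settings map onto the standard TCP keepalive socket options (SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT). A minimal sketch of that mapping, independent of the actual fluent-bit code (the function name is illustrative):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket (Linux-specific option names).
 * idle:   seconds of inactivity before the first probe (net.tcp_keepalive_time)
 * intvl:  seconds between unanswered probes (net.tcp_keepalive_interval)
 * probes: failed probes before the connection is dropped (net.tcp_keepalive_probes)
 * Returns 0 on success, -1 on error. */
int enable_tcp_keepalive(int fd, int idle, int intvl, int probes)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0)
        return -1;
    return 0;
}
```

With the sample config above, this would be called with idle=20, intvl=20, probes=2, which matches the 20s probe cadence visible in the tcpdump extract below.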

tcpdump extract:

# client (fluent-bit): fbit-5948f95f98-4frst.47272
# server (K8s API): kubernetes.default.svc.cluster.local.https

# regular event being reported by the API 
15:14:34.051110 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [P.], seq 436095:437316, ack 3239, win 490, options [nop,nop,TS val 2459710268 ecr 707590365], length 1221
15:14:34.051134 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707603915 ecr 2459710268], length 0

# TCP keepalives sent by fluent-bit and ACKed by API after 20s of inactivity, and then every 20s afterwards
15:14:54.500436 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707624365 ecr 2459710268], length 0
15:14:54.502250 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459730719 ecr 707603915], length 0
15:15:14.980418 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707644845 ecr 2459730719], length 0
15:15:14.981103 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459751198 ecr 707603915], length 0
15:15:35.460342 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707665325 ecr 2459751198], length 0
15:15:35.462084 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459771679 ecr 707603915], length 0

Other config: a longer keepalive time (120s), used to test reconnection after a silent drop

    net.tcp_keepalive on
    net.tcp_keepalive_time 120
    net.tcp_keepalive_interval 120
    net.tcp_keepalive_probes 1
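Under the usual Linux keepalive semantics, a silently dead peer is detected at worst after roughly keepalive_time + probes × interval seconds of inactivity. A quick sanity check of that arithmetic for the two configs in this PR (the helper is illustrative, not fluent-bit code):

```c
/* Worst-case seconds until the kernel gives up on a silent peer:
 * idle time before the first probe, plus one interval per allowed probe. */
int keepalive_worst_case(int time_s, int intvl_s, int probes)
{
    return time_s + probes * intvl_s;
}
/* First config:  keepalive_worst_case(20, 20, 2)   -> 60s  */
/* Second config: keepalive_worst_case(120, 120, 1) -> 240s */
```

In the test below the drop is detected even faster, because the proxy answers the very first probe with a Reset instead of staying silent.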

tcpdump:

# regular event being reported by the API
15:30:26.795408 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [P.], seq 4202315726:4202316947, ack 1626483173, win 524, options [nop,nop,TS val 1489545531 ecr 2059981423], length 1221
15:30:26.795440 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2059994680 ecr 1489545531], length 0

# Somewhere during this period, the proxy that's being used in this test drops the connection silently.

# After 120s of inactivity, fluent-bit sends a keepalive probe, to which the proxy replies with a Reset packet
# because it no longer knows about the connection:
15:32:27.172331 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2060115057 ecr 1489545531], length 0
15:32:27.174135 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [R], seq 4202316947, win 0, length 0

As a consequence, fluent-bit recreates the connection and catches up on the events that might have happened in the meantime:

[2025/06/17 15:32:27] [error] [/src/fluent-bit/src/tls/openssl.c:904 errno=104] Connection reset by peer
[2025/06/17 15:32:27] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
[2025/06/17 15:32:27] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
[2025/06/17 15:32:27] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
[2025/06/17 15:32:27] [ info] [input:kubernetes_events:kubernetes_events.0] kubernetes stream disconnected, ret=-1
[2025/06/17 15:33:01] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=7869278
k8s_events: [1750174158.000000000, {"metadata":{"name":"myevent-1750174158","namespace":"..


Signed-off-by: Olaf Klischat <olaf.klischat@gmail.com>
@multi-io multi-io force-pushed the k8s_events_net_setup branch from 0699d1b to 5036307 on June 18, 2025 07:00
@multi-io multi-io changed the title from "kubernetes_events: support for net.* options including TCP keepalive settings" to "in_kubernetes_events: support for net.* options including TCP keepalive settings" on Jun 18, 2025
@edsiper edsiper added this to the Fluent Bit v4.0.4 milestone Jun 19, 2025
@multi-io (Author) commented:

Anything I can do here? I'm not sure what the failed test does, but I don't think it's caused by the PR change.
