
in_kubernetes_events: support for net.* options including TCP keepalive settings #10487

Open · wants to merge 1 commit into master

Conversation

@multi-io multi-io commented Jun 17, 2025

Support net.* options including TCP keepalive settings in the kubernetes_events plugin

This allows the user to set net.* options in the kubernetes_events input plugin config. This is particularly useful for configuring TCP keepalive settings because kubernetes_events opens a watch on the Kubernetes API, which is a long-running connection that might see long periods of inactivity during which intermediate networking infrastructure like proxies might drop the connection silently. The Go K8s client sends keepalives automatically, e.g. kubectl get event -w (which opens a watch on k8s events similar to the kubernetes_events plugin) will send keepalives every 30s without the user having to configure anything (those will be HTTP/2 pings rather than raw zero-length TCP keepalives, but serves the same purpose).

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change

Sample config:

    [INPUT]
        name kubernetes_events
        tag k8s_events
        kube_url https://kubernetes.default.svc
        interval_sec 120
        net.keepalive on
        # TCP keepalives every 20s, drop the connection after 2 failed probes
        net.tcp_keepalive on
        net.tcp_keepalive_time 20
        net.tcp_keepalive_interval 20
        net.tcp_keepalive_probes 2
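On Linux, these net.tcp_keepalive_* settings map onto the standard TCP keepalive socket options (SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT). A minimal sketch of that mapping, independent of the actual fluent-bit code (the function name is illustrative):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket (Linux-specific option names).
 * idle:   seconds of inactivity before the first probe (net.tcp_keepalive_time)
 * intvl:  seconds between unanswered probes (net.tcp_keepalive_interval)
 * probes: failed probes before the connection is dropped (net.tcp_keepalive_probes)
 * Returns 0 on success, -1 on error. */
int enable_tcp_keepalive(int fd, int idle, int intvl, int probes)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0)
        return -1;
    return 0;
}
```

With the sample config above, this would be called with idle=20, intvl=20, probes=2, which matches the 20s probe cadence visible in the tcpdump extract below.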

tcpdump extract:

# client (fluent-bit): fbit-5948f95f98-4frst.47272
# server (K8s API): kubernetes.default.svc.cluster.local.https

# regular event being reported by the API 
15:14:34.051110 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [P.], seq 436095:437316, ack 3239, win 490, options [nop,nop,TS val 2459710268 ecr 707590365], length 1221
15:14:34.051134 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707603915 ecr 2459710268], length 0

# TCP keepalives sent by fluent-bit and ACKed by API after 20s of inactivity, and then every 20s afterwards
15:14:54.500436 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707624365 ecr 2459710268], length 0
15:14:54.502250 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459730719 ecr 707603915], length 0
15:15:14.980418 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707644845 ecr 2459730719], length 0
15:15:14.981103 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459751198 ecr 707603915], length 0
15:15:35.460342 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707665325 ecr 2459751198], length 0
15:15:35.462084 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459771679 ecr 707603915], length 0

Other config: a longer keepalive time (120s), used to test reconnection after a silent drop

    net.tcp_keepalive on
    net.tcp_keepalive_time 120
    net.tcp_keepalive_interval 120
    net.tcp_keepalive_probes 1
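Under the usual Linux keepalive semantics, a silently dead peer is detected at worst after roughly keepalive_time + probes × interval seconds of inactivity. A quick sanity check of that arithmetic for the two configs in this PR (the helper is illustrative, not fluent-bit code):

```c
/* Worst-case seconds until the kernel gives up on a silent peer:
 * idle time before the first probe, plus one interval per allowed probe. */
int keepalive_worst_case(int time_s, int intvl_s, int probes)
{
    return time_s + probes * intvl_s;
}
/* First config:  keepalive_worst_case(20, 20, 2)   -> 60s  */
/* Second config: keepalive_worst_case(120, 120, 1) -> 240s */
```

In the test below the drop is detected even faster, because the proxy answers the very first probe with a Reset instead of staying silent.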

tcpdump:

# regular event being reported by the API
15:30:26.795408 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [P.], seq 4202315726:4202316947, ack 1626483173, win 524, options [nop,nop,TS val 1489545531 ecr 2059981423], length 1221
15:30:26.795440 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2059994680 ecr 1489545531], length 0

# Somewhere during this period, the proxy that's being used in this test drops the connection silently.

# After 120s of inactivity, fluent-bit sends a keepalive probe, to which the proxy replies with a Reset packet
# because it no longer knows about the connection:
15:32:27.172331 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2060115057 ecr 1489545531], length 0
15:32:27.174135 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [R], seq 4202316947, win 0, length 0

As a consequence, fluent-bit recreates the connection and catches up on the events that might have happened in the meantime:

[2025/06/17 15:32:27] [error] [/src/fluent-bit/src/tls/openssl.c:904 errno=104] Connection reset by peer
[2025/06/17 15:32:27] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
[2025/06/17 15:32:27] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
[2025/06/17 15:32:27] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
[2025/06/17 15:32:27] [ info] [input:kubernetes_events:kubernetes_events.0] kubernetes stream disconnected, ret=-1
[2025/06/17 15:33:01] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=7869278
k8s_events: [1750174158.000000000, {"metadata":{"name":"myevent-1750174158","namespace":"..


Signed-off-by: Olaf Klischat <olaf.klischat@gmail.com>
@multi-io multi-io force-pushed the k8s_events_net_setup branch from 0699d1b to 5036307 on June 18, 2025 07:00
@multi-io multi-io changed the title from "kubernetes_events: support for net.* options including TCP keepalive settings" to "in_kubernetes_events: support for net.* options including TCP keepalive settings" on Jun 18, 2025
@edsiper edsiper added this to the Fluent Bit v4.0.4 milestone Jun 19, 2025
@multi-io (Author) commented:

Anything I can do here? I'm not sure what the failed test does, but I don't think it's caused by the PR change.
