Add tcp window clamp example #172
Conversation
Well behaving download from the linked report:

```
ivan@vm:~$ curl -o /dev/null http://localhost:4444
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3616M    0 3616M    0     0  33.7M      0 --:--:--  0:01:47 --:--:-- 44.7M
```

When it gets clamped:

```
ivan@vm:~$ curl -o /dev/null http://localhost:4444
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4378M    0 4378M    0     0  30.3M      0 --:--:--  0:02:24 --:--:-- 9832k
```

The corresponding counter:

```
# HELP ebpf_exporter_tcp_window_clamps_total Number of times that TCP window was clamped to a low value
# TYPE ebpf_exporter_tcp_window_clamps_total counter
ebpf_exporter_tcp_window_clamps_total 19668
```
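For context, here is a minimal sketch of what a `kprobe`/`kretprobe` pair counting these clamps could look like. It is not necessarily the code in this PR: the 64 KiB threshold, the `pid_tgid`-keyed scratch map and the counter map layout are illustrative assumptions; only the probed function and the program names (`tcp_try_rmem_schedule` / `tcp_try_rmem_schedule_ret`) are taken from the metrics quoted later in this thread. The wiring of the counter map to the exported Prometheus metric lives in the exporter's YAML config and is omitted here.

```c
// SPDX-License-Identifier: GPL-2.0
// Illustrative sketch only: threshold, map layout and key choice are
// assumptions, not necessarily what this PR actually does.
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

// Treat anything below 64 KiB as "clamped to a low value" (assumed threshold).
#define MIN_CLAMP (64 * 1024)

// Scratch map to carry the socket pointer from entry to return,
// keyed by pid_tgid (assumption; the PR may key this differently).
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u64);
	__type(value, struct sock *);
} socks SEC(".maps");

// Counter that the exporter config would surface as
// ebpf_exporter_tcp_window_clamps_total.
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u64);
} tcp_window_clamps_total SEC(".maps");

SEC("kprobe/tcp_try_rmem_schedule")
int BPF_KPROBE(tcp_try_rmem_schedule, struct sock *sk)
{
	u64 id = bpf_get_current_pid_tgid();

	// Remember which socket this invocation is about.
	bpf_map_update_elem(&socks, &id, &sk, BPF_ANY);
	return 0;
}

SEC("kretprobe/tcp_try_rmem_schedule")
int BPF_KRETPROBE(tcp_try_rmem_schedule_ret)
{
	u64 id = bpf_get_current_pid_tgid();
	u32 zero = 0;
	struct sock **skp = bpf_map_lookup_elem(&socks, &id);
	struct tcp_sock *tp;
	u64 *count;

	if (!skp)
		return 0;

	tp = (struct tcp_sock *)*skp;
	bpf_map_delete_elem(&socks, &id);

	// rcv_ssthresh bounds the advertised receive window and is what gets
	// shrunk under rmem pressure; a low value here counts as a clamp.
	if (BPF_CORE_READ(tp, rcv_ssthresh) >= MIN_CLAMP)
		return 0;

	count = bpf_map_lookup_elem(&tcp_window_clamps_total, &zero);
	if (count)
		__sync_fetch_and_add(count, 1);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```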
Force-pushed from a66d87b to 9f44f30.
```c
	return 0;
}

SEC("kretprobe/tcp_try_rmem_schedule")
```
I ran into a performance issue when using `kretprobe` to probe `ipt_do_table()` on kernel 5.4.0, so I'd suggest running an LnP test before enabling this in production.
What do you mean by LNP test?
Load and performance testing, i.e. a stress test.
That's a given. I did a quick test with a kernel source download in a VM:

```
ivan@vm:~$ curl -o /dev/null https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.0.11.tar.xz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  127M  100  127M    0     0  44.5M      0  0:00:02  0:00:02 --:--:-- 44.5M
```
I see ~4k packets going via iptables for this:
```
[3973:134454161] -A INPUT -i eth0 -p tcp -m tcp --sport 443 -j ACCEPT
```
There are 1286 bpf program runs and ~8ms of CPU time spent:
```
# HELP ebpf_exporter_ebpf_program_info Info about ebpf programs
# TYPE ebpf_exporter_ebpf_program_info gauge
ebpf_exporter_ebpf_program_info{config="tcp-window-clamps",id="552",program="tcp_try_rmem_schedule",tag="f14b021593f58e05"} 1
ebpf_exporter_ebpf_program_info{config="tcp-window-clamps",id="553",program="tcp_try_rmem_schedule_ret",tag="d88afa963de02adb"} 1
# HELP ebpf_exporter_ebpf_program_run_count_total How many times has the program been executed
# TYPE ebpf_exporter_ebpf_program_run_count_total counter
ebpf_exporter_ebpf_program_run_count_total{id="552"} 1286
ebpf_exporter_ebpf_program_run_count_total{id="553"} 1286
# HELP ebpf_exporter_ebpf_program_run_time_seconds How long has the program been executing
# TYPE ebpf_exporter_ebpf_program_run_time_seconds counter
ebpf_exporter_ebpf_program_run_time_seconds{id="552"} 0.00434047
ebpf_exporter_ebpf_program_run_time_seconds{id="553"} 0.003490548
```
With an `fexit` based probe it drops to 1190 runs and 3ms of CPU time:
```
# HELP ebpf_exporter_ebpf_program_run_count_total How many times has the program been executed
# TYPE ebpf_exporter_ebpf_program_run_count_total counter
ebpf_exporter_ebpf_program_run_count_total{id="547"} 1190
# HELP ebpf_exporter_ebpf_program_run_time_seconds How long has the program been executing
# TYPE ebpf_exporter_ebpf_program_run_time_seconds counter
ebpf_exporter_ebpf_program_run_time_seconds{id="547"} 0.002882893
```
The number of runs depends on how buffers are drained.
With curl spending 711ms of combined system and user time, that comes down to 1.13% for `kprobe` + `kretprobe` and 0.42% for `fexit`. Whether that's an acceptable overhead is up to each consumer (as with any other config).
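For reference, the `fexit` variant mentioned above can fold the entry and return halves into a single program, since an `fexit` program sees both the function arguments and its return value. The sketch below reuses the illustrative `MIN_CLAMP` threshold and `tcp_window_clamps_total` map from the earlier kprobe sketch and assumes the upstream signature `int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb, unsigned int size)`; it is not the exact program behind the numbers above.

```c
// Illustrative fexit counterpart to the kprobe/kretprobe sketch earlier in
// this thread: one program sees both the arguments and the return value, so
// no entry/exit scratch map is needed. MIN_CLAMP and the
// tcp_window_clamps_total map are the same assumptions as before.
SEC("fexit/tcp_try_rmem_schedule")
int BPF_PROG(tcp_try_rmem_schedule_exit, struct sock *sk, struct sk_buff *skb,
	     unsigned int size, int ret)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;
	u32 zero = 0;
	u64 *count;

	// Same illustrative check as in the kretprobe sketch: a low
	// rcv_ssthresh after the call counts as a clamp.
	if (BPF_CORE_READ(tp, rcv_ssthresh) >= MIN_CLAMP)
		return 0;

	count = bpf_map_lookup_elem(&tcp_window_clamps_total, &zero);
	if (count)
		__sync_fetch_and_add(count, 1);

	return 0;
}
```

Note that `fexit` attachment requires a BTF-enabled kernel (BPF trampolines landed around 5.5), which is presumably why a `kprobe`-based example still matters for older kernels like the 5.4 mentioned above.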
@bobrik Thanks for the data.
In old kernel versions (at least 5.4.0), kretprobe acquires a global lock via `raw_spin_lock_irqsave(&rp->lock, flags)` before getting a free instance in `pre_handler_kretprobe()`. Under high traffic (many different 5-tuples, handled by different CPUs), this can trigger the issue: you can see high si (softirq) CPU usage and high latency for TCP packets.
This prevents low-tgid tasks from colliding with the CPU id, however unlikely that is.