Add tcp window clamp example #172

Merged: @bobrik merged 4 commits into cloudflare:master from ivan/tcp-clamping on Dec 7, 2022

Conversation

@bobrik (Contributor) commented on Nov 25, 2022:


A well-behaved download from the linked report:

```
ivan@vm:~$ curl -o /dev/null http://localhost:4444
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3616M    0 3616M    0     0  33.7M      0 --:--:--  0:01:47 --:--:-- 44.7M
```

When it gets clamped:

```
ivan@vm:~$ curl -o /dev/null http://localhost:4444
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4378M    0 4378M    0     0  30.3M      0 --:--:--  0:02:24 --:--:-- 9832k
```

The corresponding counter:

```
  # HELP ebpf_exporter_tcp_window_clamps_total Number of times that TCP window was clamped to a low value
  # TYPE ebpf_exporter_tcp_window_clamps_total counter
  ebpf_exporter_tcp_window_clamps_total 19668
```
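For readers following along, here is a minimal sketch of how a counter like this can be wired up with a kprobe/kretprobe pair on tcp_try_rmem_schedule. The map names, the clamp threshold, and the exact clamp check are illustrative assumptions, not the code from this PR:

```c
// Illustrative sketch only; map names and the threshold are assumptions,
// not the actual code added in this PR.
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

// Window values below this are treated as "clamped to a low value" (assumed).
#define LOW_CLAMP_THRESHOLD (64 * 1024)

// Remember which socket each thread entered the function with,
// so the kretprobe can find it again on return.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u64);
    __type(value, struct sock *);
} entry_socks SEC(".maps");

// The counter map, assumed to be scraped by the exporter as
// ebpf_exporter_tcp_window_clamps_total.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} tcp_window_clamps_total SEC(".maps");

SEC("kprobe/tcp_try_rmem_schedule")
int BPF_KPROBE(tcp_try_rmem_schedule, struct sock *sk)
{
    u64 tid = bpf_get_current_pid_tgid();

    bpf_map_update_elem(&entry_socks, &tid, &sk, BPF_ANY);
    return 0;
}

SEC("kretprobe/tcp_try_rmem_schedule")
int BPF_KRETPROBE(tcp_try_rmem_schedule_ret)
{
    u64 tid = bpf_get_current_pid_tgid();
    u32 zero = 0;
    struct sock **skp = bpf_map_lookup_elem(&entry_socks, &tid);

    if (!skp)
        return 0;

    bpf_map_delete_elem(&entry_socks, &tid);

    // If the receive window ended up below the threshold, count a clamp.
    if (BPF_CORE_READ((struct tcp_sock *)*skp, rcv_ssthresh) < LOW_CLAMP_THRESHOLD) {
        u64 *count = bpf_map_lookup_elem(&tcp_window_clamps_total, &zero);
        if (count)
            __sync_fetch_and_add(count, 1);
    }

    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```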
The review comment thread below is anchored on this part of the diff:

```c
	return 0;
}

SEC("kretprobe/tcp_try_rmem_schedule")
```
@wenlxie (Contributor) commented on Nov 28, 2022:

I ran into a performance issue when using a kretprobe to probe ipt_do_table() on kernel 5.4.0, so I'd suggest having an LnP test before this is enabled in production.

@bobrik (Contributor, Author) commented:

What do you mean by LNP test?

@wenlxie (Contributor) commented:

Load and performance, i.e. a stress test.

@bobrik (Contributor, Author) commented:

That's a given. I did a quick test with a kernel source download in a VM:

```
ivan@vm:~$ curl -o /dev/null https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.0.11.tar.xz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  127M  100  127M    0     0  44.5M      0  0:00:02  0:00:02 --:--:-- 44.5M
```

I see ~4k packets going via iptables for this:

```
[3973:134454161] -A INPUT -i eth0 -p tcp -m tcp --sport 443 -j ACCEPT
```

There are 1286 bpf program runs and ~8ms of CPU time spent:

```
# HELP ebpf_exporter_ebpf_program_info Info about ebpf programs
# TYPE ebpf_exporter_ebpf_program_info gauge
ebpf_exporter_ebpf_program_info{config="tcp-window-clamps",id="552",program="tcp_try_rmem_schedule",tag="f14b021593f58e05"} 1
ebpf_exporter_ebpf_program_info{config="tcp-window-clamps",id="553",program="tcp_try_rmem_schedule_ret",tag="d88afa963de02adb"} 1

# HELP ebpf_exporter_ebpf_program_run_count_total How many times has the program been executed
# TYPE ebpf_exporter_ebpf_program_run_count_total counter
ebpf_exporter_ebpf_program_run_count_total{id="552"} 1286
ebpf_exporter_ebpf_program_run_count_total{id="553"} 1286

# HELP ebpf_exporter_ebpf_program_run_time_seconds How long has the program been executing
# TYPE ebpf_exporter_ebpf_program_run_time_seconds counter
ebpf_exporter_ebpf_program_run_time_seconds{id="552"} 0.00434047
ebpf_exporter_ebpf_program_run_time_seconds{id="553"} 0.003490548
```

With an fexit-based probe it drops to 1190 runs and ~3ms of CPU time:

```
# HELP ebpf_exporter_ebpf_program_run_count_total How many times has the program been executed
# TYPE ebpf_exporter_ebpf_program_run_count_total counter
ebpf_exporter_ebpf_program_run_count_total{id="547"} 1190

# HELP ebpf_exporter_ebpf_program_run_time_seconds How long has the program been executing
# TYPE ebpf_exporter_ebpf_program_run_time_seconds counter
ebpf_exporter_ebpf_program_run_time_seconds{id="547"} 0.002882893
```

The number of runs depends on how buffers are drained.

With curl spending 711ms of combined system and user time, that comes out to 1.13% overhead for kprobe + kretprobe and 0.42% for fexit. Whether that's an acceptable overhead is up to each consumer (as with any other config).
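For reference, a sketch of what the fexit variant might look like, reusing the map and threshold from the sketch above. Assuming the upstream signature int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb, unsigned int size), an fexit program sees the arguments and the return value in a single invocation, so no per-thread entry map is needed. The body here is illustrative, not the PR's code:

```c
// Illustrative fexit sketch; one program per function return replaces
// the kprobe + kretprobe pair, which is where the run-count and CPU
// savings come from.
SEC("fexit/tcp_try_rmem_schedule")
int BPF_PROG(tcp_try_rmem_schedule_exit, struct sock *sk, struct sk_buff *skb,
             unsigned int size, int ret)
{
    u32 zero = 0;

    // rcv_ssthresh after the function ran tells us whether the window
    // was clamped below our (assumed) threshold.
    if (BPF_CORE_READ((struct tcp_sock *)sk, rcv_ssthresh) < LOW_CLAMP_THRESHOLD) {
        u64 *count = bpf_map_lookup_elem(&tcp_window_clamps_total, &zero);
        if (count)
            __sync_fetch_and_add(count, 1);
    }

    return 0;
}
```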

@wenlxie (Contributor) commented on Dec 9, 2022:

@bobrik Thanks for the data.

In old kernel versions (at least 5.4.0), kretprobe acquires a global lock (raw_spin_lock_irqsave(&rp->lock, flags)) before grabbing a free instance in pre_handler_kretprobe(). Under high traffic (many distinct 5-tuples, handled on different CPUs), this can trigger the issue: you can see high si usage and high latency on TCP packets.

@bobrik merged commit e833ea8 into cloudflare:master on Dec 7, 2022 and deleted the ivan/tcp-clamping branch.