libbpf-tools/klockstat: Search for correct stack offset when printing #5203

Merged: 1 commit into iovisor:master on Feb 27, 2025

Conversation

tohojo
Contributor

@tohojo tohojo commented Jan 30, 2025

The klockstat BPF code contains a hard-coded offset (of 4) for the
captured stack entries, which is supposed to get rid of the topmost
entries that are just locking functions. However, when running the
klockstat tool on RHEL, this offset is not correct, leading to output
like the following:

                           Caller  Avg Wait    Count   Max Wait   Total Wait
                   mutex_lock+0x5    1.1 us        3     1.8 us       3.2 us
                   mutex_lock+0x5    825 ns        5     1.6 us       4.1 us
                   mutex_lock+0x5    1.1 us        4     1.5 us       4.3 us

                           Caller  Avg Hold    Count   Max Hold   Total Hold
                   mutex_lock+0x5   18.0 ms        3    23.0 ms      54.0 ms
                   mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms
                   mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms

To fix this, add logic in the userspace component of the utility to
walk the stack frames, skipping any entries that correspond to any of
the kprobe hooks that the tool attaches to. This fixes the output
on RHEL, without affecting other systems where the offset works
correctly.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>


	for (j = 0; j < ARRAY_SIZE(internal_ksym_names); j++)
		if (!strcmp(ksym->name, internal_ksym_names[j]))
			ret = i + 1;
Collaborator

Should we break out if 'ret = i + 1' is not triggered?

Contributor Author

Hmm, no, I don't think so; the loop is trying to find the top-most instance of one of the names in the list, so to do that we'll keep matching while walking the whole stack. I.e., I'm not assuming that it's definitely an off-by-one thing here...

Collaborator

@tohojo Could you show a stack example before and after the change for the above case? It will give us a visible example so we can decide on the proper way to handle the stack trace.

Contributor Author

Sorry for the delay. While investigating this further I discovered that this only affects older versions of RHEL (9.4 in this case), which output a somewhat odd-looking upper stack trace that also happens to be one entry longer before getting to the mutex_lock entry.

I'm including three different stack traces below (from klockstat -L ftrace_lock), obtained by modifying the klockstat in-kernel program to change the captured stack offset to 0 (and disable the userspace filtering in this PR).

In all cases, the change in this PR correctly trims the stack trace (even when it's not done in the kernel) to start just above mutex_lock. If we break out of the loop as you suggest, the filter only works if the mutex_lock entry is the topmost one. So I think it makes the most sense to keep looping; WDYT?

For RHEL-9.4:

                              Caller  Avg Hold    Count   Max Hold   Total Hold
         __init_scratch_end+0xe3361f5    387 ns        5     629 ns       1.9 us
          bpf_get_stackid_raw_tp+0x73
         __init_scratch_end+0xe3361f5
    virtio_balloon_driver_init+0x2042
                       mutex_lock+0x5
__modify_ftrace_direct_multi.part.0+0x1
      modify_ftrace_direct_multi+0x8c
          bpf_trampoline_update+0x183
      __bpf_trampoline_link_prog+0xcc
        bpf_trampoline_link_prog+0x27
        bpf_tracing_prog_attach+0x312
                    link_create+0x196
                      __sys_bpf+0x5fe
                   __x64_sys_bpf+0x1a
                   do_syscall_64+0x59
  entry_SYSCALL_64_after_hwframe+0x77
                              Max PID 23590, COMM klockstat, Lock ftrace_lock (0xffffffffb084fd00)

For RHEL-9.6:

                               Caller  Avg Hold    Count   Max Hold   Total Hold
         __init_scratch_end+0xc4cc92e    258 ns        5     278 ns       1.3 us
         __init_scratch_end+0xc4cc92e
         __init_scratch_end+0xc4ce752
                       mutex_lock+0x5
__modify_ftrace_direct_multi.part.0+0x1
      modify_ftrace_direct_multi+0x8c
          bpf_trampoline_update+0x192
      __bpf_trampoline_link_prog+0xcc
        bpf_trampoline_link_prog+0x27
        bpf_tracing_prog_attach+0x312
                    link_create+0x19f
                      __sys_bpf+0x8d2
                   __x64_sys_bpf+0x1a
                   do_syscall_64+0x5f
  entry_SYSCALL_64_after_hwframe+0x78
                              Max PID 23109, COMM klockstat, Lock ftrace_lock (0xffffffffb2658cc0)

Upstream kernel 6.13:

                               Caller  Avg Hold    Count   Max Hold   Total Hold
bpf_prog_6deef7357e7b4530_sd_fw_ingress    667 ns        5     796 ns       3.3 us
bpf_prog_6deef7357e7b4530_sd_fw_ingress
bpf_prog_6deef7357e7b4530_sd_fw_ingress
                       mutex_lock+0x9
   __modify_ftrace_direct.part.0+0x18
            modify_ftrace_direct+0xa9
          bpf_trampoline_update+0x266
      __bpf_trampoline_link_prog+0xdd
        bpf_trampoline_link_prog+0x32
        bpf_tracing_prog_attach+0x329
                     __sys_bpf+0x190f
                   __x64_sys_bpf+0x21
                   do_syscall_64+0x82
  entry_SYSCALL_64_after_hwframe+0x76
                              Max PID 427582, COMM klockstat, Lock ftrace_lock (0xffffffff96f43480)

@tohojo tohojo force-pushed the klockstat-stack-fix branch from 447cd67 to de3cde1 Compare February 3, 2025 10:58
Contributor

@ekyooo ekyooo left a comment

FYI, it seems that other methods are also being attempted to resolve the same issue. I'm not sure if it's in progress, but you might want to check it out:
https://www.youtube.com/watch?v=P3KgDhnYohY&list=PLTfHzETcek4S1XwF8KRXGDpr93j5lNnyc&t=526s

	int ret = 0;
	int i, j;

	for (i = 0; i < PERF_MAX_STACK_DEPTH; i++) {
Contributor

@ekyooo ekyooo Feb 25, 2025

This patch retains offset 4 in the BPF program and adds find_stack_start.
On systems where the offset works correctly, bt[0] is already the caller. Therefore, running this loop seems to present the following issue:
if a stack trace contains none of the internal_ksym_names entries, the loop traverses every entry of the entire stack trace without gaining anything in locating the caller.

Contributor Author

Hmm, I don't think running through a dozen or so entries on each print is a big deal? But we could certainly get rid of the filtering on the kernel side to cut down on this. @yonghong-song WDYT?

Contributor

Does 'dozen on each print' take into account the maximum number of entries in the map (or env.nr_locks) and the maximum depth?

@@ -93,6 +94,20 @@ static const char program_doc[] =
" klockstat -P # print stats per thread\n"
;

static const char *internal_ksym_names[] = {
Contributor

Comparing the address of each symbol appears to be more cost-effective than comparing the symbol strings.

Contributor Author

Hmm, don't think so? We would have to resolve all the lock names to symbol addresses instead, and that does a linear scan through all the kernel symbols, doing a strcmp() on each. Whereas ksyms__map_addr() uses a binary search...

Contributor

Yes, I have confirmed that ksyms__get_symbol operates linearly. While it would be possible to cache the results in klockstat, I won't insist on it.
However, if the -i (interval) option of this tool is used for an extended period, the cost of this patch may add up.
Thank you.

@@ -31,6 +31,7 @@
#include "trace_helpers.h"

Contributor

It would be good to check if the -c option works correctly with this patch. Please refer to line 578.

Contributor Author

Ah yes, good point, it doesn't; will fix!

@tohojo tohojo force-pushed the klockstat-stack-fix branch from de3cde1 to 2243cb5 Compare February 26, 2025 12:31
@tohojo
Contributor Author

tohojo commented Feb 26, 2025

Alright, switched to pre-resolving the lock names and doing a symbol address comparison in the loop, and got rid of the offset from the kernel side. Also fixed -c - PTAL :)

The klockstat BPF code contains a hard-coded offset (of 4) for the
captured stack entries, which is supposed to get rid of the topmost
entries that are just locking functions. However, when running the
klockstat tool on RHEL, this offset is not correct, leading to output
like the following:

                               Caller  Avg Wait    Count   Max Wait   Total Wait
                       mutex_lock+0x5    1.1 us        3     1.8 us       3.2 us
                       mutex_lock+0x5    825 ns        5     1.6 us       4.1 us
                       mutex_lock+0x5    1.1 us        4     1.5 us       4.3 us

                               Caller  Avg Hold    Count   Max Hold   Total Hold
                       mutex_lock+0x5   18.0 ms        3    23.0 ms      54.0 ms
                       mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms
                       mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms

To fix this, remove the offset in the BPF code and add logic in the
userspace component to walk the stack frames and find the right offset.
We do this by keeping a list of all the possible lock function names we
attach to, pre-resolving them into kernel symbol addresses. Then, each
time we process a stack trace obtained from the kernel, we walk the
stack until we find one of those symbols, and use that (+1) as the stack
offset to process.

This makes the output work correctly regardless of the stack offset on
a given system.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
@tohojo tohojo force-pushed the klockstat-stack-fix branch from 2243cb5 to 22cca3c Compare February 26, 2025 15:58
@yonghong-song yonghong-song merged commit f449d05 into iovisor:master Feb 27, 2025
1 of 12 checks passed
@tohojo tohojo deleted the klockstat-stack-fix branch March 3, 2025 15:08