libbpf-tools/klockstat: Search for correct stack offset when printing #5203
Conversation
libbpf-tools/klockstat.c (Outdated)

```c
		for (j = 0; j < ARRAY_SIZE(internal_ksym_names); j++)
			if (!strcmp(ksym->name, internal_ksym_names[j]))
				ret = i + 1;
```
Should we break out of the loop if 'ret = i + 1' is not triggered?
Hmm, no, I don't think so; the loop is trying to find the top-most instance of one of the names in the list, so to do that we'll keep matching while walking the whole stack. I.e., I'm not assuming that it's definitely an off-by-one thing here...
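For context, here's a minimal sketch of what the full helper might look like (assuming the find_stack_start() name from this PR, the ksyms__map_addr()/struct ksym API from trace_helpers.h, and the tool's ARRAY_SIZE and PERF_MAX_STACK_DEPTH defines; not the exact PR code):

```c
#include <string.h>
#include "trace_helpers.h" /* ksyms__map_addr(), struct ksym */

/* Walk the whole captured stack, remembering the index just past the
 * last entry whose symbol matches a known lock function. Bailing out
 * at the first non-matching entry would only work when the lock
 * entries sit at the very top of the stack.
 */
static int find_stack_start(struct ksyms *ksyms, const __u64 *bt)
{
	const struct ksym *ksym;
	int ret = 0;
	int i, j;

	for (i = 0; i < PERF_MAX_STACK_DEPTH; i++) {
		if (!bt[i])
			break; /* end of the captured stack */
		ksym = ksyms__map_addr(ksyms, bt[i]);
		if (!ksym)
			continue;
		for (j = 0; j < ARRAY_SIZE(internal_ksym_names); j++)
			if (!strcmp(ksym->name, internal_ksym_names[j]))
				ret = i + 1; /* keep scanning; later matches win */
	}
	return ret;
}
```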
@tohojo Could you show a stack example before and after the change for the above case? It would give us a visible example so we can decide on the proper way to handle the stack trace.
Sorry for the delay. While investigating this further I discovered that this only affects older versions of RHEL (9.4 in this case), which output a somewhat odd-looking upper stack trace that also happens to be one entry longer before getting to the mutex_lock entry.

I'm including three different stack traces below (from `klockstat -L ftrace_lock`), obtained by modifying the klockstat in-kernel program to change the captured stack offset to 0 (and disabling the userspace filtering in this PR).

In all cases, the change in this PR correctly trims the stack trace (even when the trimming is not done in the kernel) to start just above mutex_lock. If we break out of the loop as you suggest, the filter only works if the mutex_lock entry is the topmost one. So I think it makes the most sense to keep looping; WDYT?
For RHEL-9.4:
```
Caller                          Avg Hold  Count  Max Hold  Total Hold
__init_scratch_end+0xe3361f5    387 ns    5      629 ns    1.9 us
    bpf_get_stackid_raw_tp+0x73
    __init_scratch_end+0xe3361f5
    virtio_balloon_driver_init+0x2042
    mutex_lock+0x5
    __modify_ftrace_direct_multi.part.0+0x1
    modify_ftrace_direct_multi+0x8c
    bpf_trampoline_update+0x183
    __bpf_trampoline_link_prog+0xcc
    bpf_trampoline_link_prog+0x27
    bpf_tracing_prog_attach+0x312
    link_create+0x196
    __sys_bpf+0x5fe
    __x64_sys_bpf+0x1a
    do_syscall_64+0x59
    entry_SYSCALL_64_after_hwframe+0x77

Max PID 23590, COMM klockstat, Lock ftrace_lock (0xffffffffb084fd00)
```
For RHEL-9.6:
```
Caller                          Avg Hold  Count  Max Hold  Total Hold
__init_scratch_end+0xc4cc92e    258 ns    5      278 ns    1.3 us
    __init_scratch_end+0xc4cc92e
    __init_scratch_end+0xc4ce752
    mutex_lock+0x5
    __modify_ftrace_direct_multi.part.0+0x1
    modify_ftrace_direct_multi+0x8c
    bpf_trampoline_update+0x192
    __bpf_trampoline_link_prog+0xcc
    bpf_trampoline_link_prog+0x27
    bpf_tracing_prog_attach+0x312
    link_create+0x19f
    __sys_bpf+0x8d2
    __x64_sys_bpf+0x1a
    do_syscall_64+0x5f
    entry_SYSCALL_64_after_hwframe+0x78

Max PID 23109, COMM klockstat, Lock ftrace_lock (0xffffffffb2658cc0)
```
Upstream kernel 6.13:
```
Caller                                      Avg Hold  Count  Max Hold  Total Hold
bpf_prog_6deef7357e7b4530_sd_fw_ingress     667 ns    5      796 ns    3.3 us
    bpf_prog_6deef7357e7b4530_sd_fw_ingress
    bpf_prog_6deef7357e7b4530_sd_fw_ingress
    mutex_lock+0x9
    __modify_ftrace_direct.part.0+0x18
    modify_ftrace_direct+0xa9
    bpf_trampoline_update+0x266
    __bpf_trampoline_link_prog+0xdd
    bpf_trampoline_link_prog+0x32
    bpf_tracing_prog_attach+0x329
    __sys_bpf+0x190f
    __x64_sys_bpf+0x21
    do_syscall_64+0x82
    entry_SYSCALL_64_after_hwframe+0x76

Max PID 427582, COMM klockstat, Lock ftrace_lock (0xffffffff96f43480)
```
Force-pushed from 447cd67 to de3cde1
FYI, it seems that other approaches to the same issue are also being attempted. I'm not sure whether that work is still in progress, but you might want to check it out:
https://www.youtube.com/watch?v=P3KgDhnYohY&list=PLTfHzETcek4S1XwF8KRXGDpr93j5lNnyc&t=526s
libbpf-tools/klockstat.c (Outdated)

```c
	int ret = 0;
	int i, j;

	for (i = 0; i < PERF_MAX_STACK_DEPTH; i++) {
```
This patch retains offset 4 in the BPF program and adds find_stack_start.

On systems where the offset works correctly, bt[0] is already the caller, so running this loop presents the following issue: if no stack trace entry matches internal_ksym_names, the loop traverses every entry of the entire stack trace without gaining anything toward finding the caller.
Hmm, I don't think running through a dozen or so entries on each print is a big deal? But we could certainly get rid of the filtering on the kernel side to cut down on this. @yonghong-song WDYT?
Does 'dozen on each print' take into account the maximum number of entries in the map (or env.nr_locks) and the maximum depth?
libbpf-tools/klockstat.c
Outdated
@@ -93,6 +94,20 @@ static const char program_doc[] = | |||
" klockstat -P # print stats per thread\n" | |||
; | |||
|
|||
static const char *internal_ksym_names[] = { |
Comparing the address of each symbol appears to be more cost-effective than comparing the symbol strings.
Hmm, don't think so? We would have to resolve all the lock names to symbol addresses instead, and that does a linear scan through all the kernel symbols, doing a strcmp() on each. Whereas ksyms__map_addr() uses a binary search...
Yes, I have confirmed that ksyms__get_symbol operates linearly. While it would be possible to cache the results in klockstat, I won't insist on it. However, if the -i (interval) option of this tool is used for an extended period, the cost of this patch may add up.

Thank you.
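For reference, a minimal sketch of the pre-resolving variant being discussed (internal_ksym_addrs and resolve_internal_ksyms() are illustrative names, not the exact PR code): each lock-function name is resolved once at startup via the linear ksyms__get_symbol(), so repeated prints under -i only pay for address comparisons afterwards.

```c
/* Illustrative: filled once at startup, reused on every print. */
static unsigned long internal_ksym_addrs[ARRAY_SIZE(internal_ksym_names)];

static int resolve_internal_ksyms(struct ksyms *ksyms)
{
	const struct ksym *ksym;
	int i;

	for (i = 0; i < ARRAY_SIZE(internal_ksym_names); i++) {
		/* Linear scan over all kernel symbols, paid only once here */
		ksym = ksyms__get_symbol(ksyms, internal_ksym_names[i]);
		if (!ksym)
			return -1; /* symbol not present on this kernel */
		internal_ksym_addrs[i] = ksym->addr;
	}
	return 0;
}
```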
```
@@ -31,6 +31,7 @@
 #include "trace_helpers.h"
```
It would be good to check if the -c option works correctly with this patch. Please refer to line 578.
Ah yes, good point, it doesn't; will fix!
Force-pushed from de3cde1 to 2243cb5
Alright, switched to pre-resolving the lock names and doing a symbol address comparison in the loop, and got rid of the offset from the kernel side. Also fixed the -c option handling mentioned above.
The klockstat BPF code contains a hard-coded offset (of 4) for the captured stack entries, which is supposed to get rid of the topmost entries that are just locking functions. However, when running the klockstat tool on RHEL, this offset is not correct, leading to output like the following:

  Caller           Avg Wait  Count  Max Wait  Total Wait
  mutex_lock+0x5   1.1 us    3      1.8 us    3.2 us
  mutex_lock+0x5   825 ns    5      1.6 us    4.1 us
  mutex_lock+0x5   1.1 us    4      1.5 us    4.3 us

  Caller           Avg Hold  Count  Max Hold  Total Hold
  mutex_lock+0x5   18.0 ms   3      23.0 ms   54.0 ms
  mutex_lock+0x5   1.7 ms    15     2.2 ms    25.1 ms
  mutex_lock+0x5   1.7 ms    15     2.2 ms    25.1 ms

To fix this, remove the offset in the BPF code and add logic in the userspace component to walk the stack frames and find the right offset. We do this by keeping a list of all the possible lock function names we attach to, pre-resolving them into kernel symbol addresses. Then, each time we process a stack trace obtained from the kernel, we walk the stack until we find one of those symbols, and use that (+1) as the stack offset to process. This makes the output work correctly regardless of the stack offset on a given system.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
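And a sketch of the per-trace walk this commit message describes (reusing the illustrative internal_ksym_addrs array from the sketch above; not the exact PR code). ksyms__map_addr() does a binary search over the sorted symbol table, so the walk stays cheap even when nothing matches:

```c
/* Return the index just past the last frame that resolves to one of
 * the pre-resolved lock-function addresses, or 0 if none match.
 */
static int find_stack_start(struct ksyms *ksyms, const __u64 *bt)
{
	const struct ksym *ksym;
	int ret = 0;
	int i, j;

	for (i = 0; i < PERF_MAX_STACK_DEPTH && bt[i]; i++) {
		ksym = ksyms__map_addr(ksyms, bt[i]); /* binary search */
		if (!ksym)
			continue;
		for (j = 0; j < ARRAY_SIZE(internal_ksym_addrs); j++)
			if (ksym->addr == internal_ksym_addrs[j])
				ret = i + 1; /* printing starts just above this frame */
	}
	return ret;
}
```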
Force-pushed from 2243cb5 to 22cca3c
The klockstat BPF code contains a hard-coded offset (of 4) for the
captured stack entries, which is supposed to get rid of the topmost
entries that are just locking functions. However, when running the
klockstat tool on RHEL, this offset is not correct, leading to output
like the following:

  Caller           Avg Wait  Count  Max Wait  Total Wait
  mutex_lock+0x5   1.1 us    3      1.8 us    3.2 us
  mutex_lock+0x5   825 ns    5      1.6 us    4.1 us
  mutex_lock+0x5   1.1 us    4      1.5 us    4.3 us

  Caller           Avg Hold  Count  Max Hold  Total Hold
  mutex_lock+0x5   18.0 ms   3      23.0 ms   54.0 ms
  mutex_lock+0x5   1.7 ms    15     2.2 ms    25.1 ms
  mutex_lock+0x5   1.7 ms    15     2.2 ms    25.1 ms
To fix this, add logic in the userspace component of the utility to
walk the stack frames, skipping any entries that correspond to any of
the kprobe hooks that the tool attaches to. This fixes the output
on RHEL, without affecting other systems where the offset works
correctly.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>