libbpf-tools/klockstat: Search for correct stack offset when printing #5203

Merged: 1 commit into iovisor:master on Feb 27, 2025

Conversation

tohojo
Contributor

@tohojo tohojo commented Jan 30, 2025

The klockstat BPF code contains a hard-coded offset (of 4) for the
captured stack entries, which is supposed to get rid of the topmost
entries that are just locking functions. However, when running the
klockstat tool on RHEL, this offset is not correct, leading to output
like the following:

                           Caller  Avg Wait    Count   Max Wait   Total Wait
                   mutex_lock+0x5    1.1 us        3     1.8 us       3.2 us
                   mutex_lock+0x5    825 ns        5     1.6 us       4.1 us
                   mutex_lock+0x5    1.1 us        4     1.5 us       4.3 us

                           Caller  Avg Hold    Count   Max Hold   Total Hold
                   mutex_lock+0x5   18.0 ms        3    23.0 ms      54.0 ms
                   mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms
                   mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms

To fix this, add logic in the userspace component of the utility to
walk the stack frames, skipping any entries that correspond to any of
the kprobe hooks that the tool attaches to. This fixes the output
on RHEL, without affecting other systems where the offset works
correctly.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>


	for (j = 0; j < ARRAY_SIZE(internal_ksym_names); j++)
		if (!strcmp(ksym->name, internal_ksym_names[j]))
			ret = i + 1;
Collaborator

Should we break out if 'ret = i + 1' is not triggered?

Contributor Author

Hmm, no, I don't think so; the loop is trying to find the top-most instance of one of the names in the list, so to do that we'll keep matching while walking the whole stack. I.e., I'm not assuming that it's definitely an off-by-one thing here...

Collaborator

@tohojo Could you show a stack example before and after the change for the above case? It will give us a visible example so we can decide on the proper way to handle the stack trace.

Contributor Author

Sorry for the delay. While investigating this further I discovered that this only affects older versions of RHEL (9.4 in this case), which output a somewhat odd-looking upper stack trace that also happens to be one entry longer before getting to the mutex_lock entry.

I'm including three different stack traces below (from klockstat -L ftrace_lock), obtained by modifying the klockstat in-kernel program to change the captured stack offset to 0 (and disable the userspace filtering in this PR).

In all cases, the change in this PR correctly trims the stack trace (even when it's not done in the kernel) to start just above mutex_lock. If we break out of the loop as you suggest, the filter only works if the mutex_lock entry is the topmost one. So I think it makes the most sense to keep looping; WDYT?

For RHEL-9.4:

                              Caller  Avg Hold    Count   Max Hold   Total Hold
         __init_scratch_end+0xe3361f5    387 ns        5     629 ns       1.9 us
          bpf_get_stackid_raw_tp+0x73
         __init_scratch_end+0xe3361f5
    virtio_balloon_driver_init+0x2042
                       mutex_lock+0x5
__modify_ftrace_direct_multi.part.0+0x1
      modify_ftrace_direct_multi+0x8c
          bpf_trampoline_update+0x183
      __bpf_trampoline_link_prog+0xcc
        bpf_trampoline_link_prog+0x27
        bpf_tracing_prog_attach+0x312
                    link_create+0x196
                      __sys_bpf+0x5fe
                   __x64_sys_bpf+0x1a
                   do_syscall_64+0x59
  entry_SYSCALL_64_after_hwframe+0x77
                              Max PID 23590, COMM klockstat, Lock ftrace_lock (0xffffffffb084fd00)

For RHEL-9.6:

                               Caller  Avg Hold    Count   Max Hold   Total Hold
         __init_scratch_end+0xc4cc92e    258 ns        5     278 ns       1.3 us
         __init_scratch_end+0xc4cc92e
         __init_scratch_end+0xc4ce752
                       mutex_lock+0x5
__modify_ftrace_direct_multi.part.0+0x1
      modify_ftrace_direct_multi+0x8c
          bpf_trampoline_update+0x192
      __bpf_trampoline_link_prog+0xcc
        bpf_trampoline_link_prog+0x27
        bpf_tracing_prog_attach+0x312
                    link_create+0x19f
                      __sys_bpf+0x8d2
                   __x64_sys_bpf+0x1a
                   do_syscall_64+0x5f
  entry_SYSCALL_64_after_hwframe+0x78
                              Max PID 23109, COMM klockstat, Lock ftrace_lock (0xffffffffb2658cc0)

Upstream kernel 6.13:

                               Caller  Avg Hold    Count   Max Hold   Total Hold
bpf_prog_6deef7357e7b4530_sd_fw_ingress    667 ns        5     796 ns       3.3 us
bpf_prog_6deef7357e7b4530_sd_fw_ingress
bpf_prog_6deef7357e7b4530_sd_fw_ingress
                       mutex_lock+0x9
   __modify_ftrace_direct.part.0+0x18
            modify_ftrace_direct+0xa9
          bpf_trampoline_update+0x266
      __bpf_trampoline_link_prog+0xdd
        bpf_trampoline_link_prog+0x32
        bpf_tracing_prog_attach+0x329
                     __sys_bpf+0x190f
                   __x64_sys_bpf+0x21
                   do_syscall_64+0x82
  entry_SYSCALL_64_after_hwframe+0x76
                              Max PID 427582, COMM klockstat, Lock ftrace_lock (0xffffffff96f43480)

@tohojo tohojo force-pushed the klockstat-stack-fix branch from 447cd67 to de3cde1 Compare February 3, 2025 10:58
Contributor

@ekyooo ekyooo left a comment

FYI, it seems that other methods are also being attempted to resolve the same issue. I'm not sure if it's in progress, but you might want to check it out:
https://www.youtube.com/watch?v=P3KgDhnYohY&list=PLTfHzETcek4S1XwF8KRXGDpr93j5lNnyc&t=526s

	int ret = 0;
	int i, j;

	for (i = 0; i < PERF_MAX_STACK_DEPTH; i++) {
Contributor

@ekyooo ekyooo Feb 25, 2025

This patch retains offset 4 in the BPF program and adds find_stack_start.
On systems where the offset works correctly, bt[0] is already the caller. Therefore, running this loop seems to present the following issue:
if a stack trace contains none of the internal_ksym_names entries, the loop traverses every entry of the entire stack trace without gaining anything in locating the caller.

Contributor Author

Hmm, I don't think running through a dozen or so entries on each print is a big deal? But we could certainly get rid of the filtering on the kernel side to cut down on this. @yonghong-song WDYT?

Contributor

Does 'dozen on each print' take into account the maximum number of entries in the map (or env.nr_locks) and the maximum depth?

@@ -93,6 +94,20 @@ static const char program_doc[] =
" klockstat -P # print stats per thread\n"
;

static const char *internal_ksym_names[] = {
Contributor

Comparing the address of each symbol appears to be more cost-effective than comparing the symbol strings.

Contributor Author

Hmm, don't think so? We would have to resolve all the lock names to symbol addresses instead, and that does a linear scan through all the kernel symbols, doing a strcmp() on each. Whereas ksyms__map_addr() uses a binary search...

Contributor

Yes, I have confirmed that ksyms__get_symbol operates linearly. While it would be possible to cache the results in klockstat, I won't insist on it.
However, if the -i (interval) option of this tool is used for an extended period, the cost of this patch may add up.
Thank you.

@@ -31,6 +31,7 @@
#include "trace_helpers.h"

Contributor

It would be good to check if the -c option works correctly with this patch. Please refer to line 578.

Contributor Author

Ah yes, good point, it doesn't; will fix!

@tohojo tohojo force-pushed the klockstat-stack-fix branch from de3cde1 to 2243cb5 Compare February 26, 2025 12:31
@tohojo
Contributor Author

tohojo commented Feb 26, 2025

Alright, switched to pre-resolving the lock names and doing a symbol address comparison in the loop, and got rid of the offset from the kernel side. Also fixed -c - PTAL :)

The klockstat BPF code contains a hard-coded offset (of 4) for the
captured stack entries, which is supposed to get rid of the topmost
entries that are just locking functions. However, when running the
klockstat tool on RHEL, this offset is not correct, leading to output
like the following:

                               Caller  Avg Wait    Count   Max Wait   Total Wait
                       mutex_lock+0x5    1.1 us        3     1.8 us       3.2 us
                       mutex_lock+0x5    825 ns        5     1.6 us       4.1 us
                       mutex_lock+0x5    1.1 us        4     1.5 us       4.3 us

                               Caller  Avg Hold    Count   Max Hold   Total Hold
                       mutex_lock+0x5   18.0 ms        3    23.0 ms      54.0 ms
                       mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms
                       mutex_lock+0x5    1.7 ms       15     2.2 ms      25.1 ms

To fix this, remove the offset in the BPF code and add logic in the
userspace component to walk the stack frames and find the right offset.
We do this by keeping a list of all the possible lock function names we
attach to, pre-resolving them into kernel symbol addresses. Then, each
time we process a stack trace obtained from the kernel, we walk the
stack until we find one of those symbols, and use that (+1) as the stack
offset to process.

This makes the output work correctly regardless of the stack offset on
a given system.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
@tohojo tohojo force-pushed the klockstat-stack-fix branch from 2243cb5 to 22cca3c Compare February 26, 2025 15:58
@yonghong-song yonghong-song merged commit f449d05 into iovisor:master Feb 27, 2025
1 of 12 checks passed
@tohojo tohojo deleted the klockstat-stack-fix branch March 3, 2025 15:08