
per_cpu() does not work for external module percpu variables #185

Closed
brenns10 opened this issue Jun 29, 2022 · 5 comments

@brenns10
Contributor

For external kernel modules, it seems that percpu variables have symbol values starting from zero. When a module is loaded, its percpu memory is allocated, and the symbol addresses are adjusted based on the allocated percpu area (see simplify_symbols() in kernel/module/main.c). Surprisingly, this patch from 2003 is simple and still mostly relevant enough to show how it works.

This means that the per_cpu() helper is broken for module percpu variables. For example, see rds_stats in net/rds/stats.c, assuming it's built as a module.

>>> repr(prog["rds_stats"])
"Object(prog, 'struct rds_statistics', address=0x0)"

The per_cpu() helper happily uses offset=0 and reads out the wrong value. In reality, the module's percpu allocation is stored in the percpu field of struct module, and you need to add the address for the module's percpu variables to that offset. Thus, with the following two helper functions, it's fixed:

import drgn
from drgn.helpers.linux.list import list_for_each_entry
from drgn.helpers.linux.percpu import per_cpu


def get_module(prog, name):
    """Look up and return the struct module * for the given name."""
    for module in list_for_each_entry(
        "struct module", prog["modules"].address_of_(), "list"
    ):
        if module.name.string_().decode() == name:
            return module
    raise ValueError(f"no module {name} found!")


def module_per_cpu(mod, var, cpu):
    """Return a per-cpu variable that lives in a module."""
    real_pcpu = drgn.Object(
        var.prog_, var.type_, address=mod.percpu.value_() + var.address_,
    )
    return per_cpu(real_pcpu, cpu)

With these, we can do the following (I verified the computation against crash):

>>> mod = get_module(prog, "rds")
>>> percpu = module_per_cpu(mod, prog["rds_stats"], 0)
>>> percpu
(struct rds_statistics){
 ... correct data ...
}
>>> repr(percpu)
"Object(prog, 'struct rds_statistics', address=0xffff9394c0825080)"
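The address arithmetic the workaround performs can be sketched in plain Python. The concrete numbers below are hypothetical (a made-up split of the address shown above into a base and an offset); on a real system they come from struct module::percpu and the per-CPU offset table:

```python
# Sketch of the address arithmetic in module_per_cpu() above.
# All concrete numbers are hypothetical, for illustration only.

def module_percpu_address(module_percpu_base: int, symbol_value: int) -> int:
    """Relocate a module percpu symbol (whose value starts from 0)
    into the module's allocated percpu area."""
    return module_percpu_base + symbol_value

def per_cpu_address(percpu_address: int, cpu_offset: int) -> int:
    """Apply a given CPU's per-cpu offset, as the per_cpu() helper does."""
    return percpu_address + cpu_offset

# Hypothetical values for the module's percpu base and CPU 0's offset.
base = 0x25080
offset = 0xFFFF9394C0800000
addr = per_cpu_address(module_percpu_address(base, 0), offset)
```

With this made-up decomposition, `addr` comes out to the 0xffff9394c0825080 value shown in the repr above.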

I don't think that's the right approach, though, since we shouldn't need to know whether something was built as a module or not. Maybe we need to adjust the symbol lookup routines for modules, so that they fix up the percpu variable addresses at load time, or as they're looked up?

@osandov
Owner

osandov commented Jun 30, 2022

I've run into this before but haven't gotten around to fixing it. I think that the correct way to fix this is to apply ELF relocations to the .data..percpu section using the module's per-cpu offset as the section address. drgn gets section addresses on the live kernel by walking /sys/module/$mod/sections, and on a core dump by walking struct module::sect_attrs (which is the backing structure for that sysfs directory). The latter can also be used for the live kernel by setting the DRGN_USE_PROC_AND_SYS_MODULES environment variable to 0.
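The sysfs walk described above can be sketched as follows. The function name and error handling are my own; the sketch assumes, as the sysfs layout suggests, that each file under /sys/module/$mod/sections is named after a section and contains its load address as a hex string:

```python
import os

def module_section_addresses(mod: str, sysfs: str = "/sys/module") -> dict:
    """Read a loaded module's section load addresses from sysfs.

    Each file under $sysfs/$mod/sections/ is named after an ELF
    section and contains that section's address as a hex string.
    (Per the bug discussed here, .data..percpu never shows up.)
    """
    sections_dir = os.path.join(sysfs, mod, "sections")
    addrs = {}
    for name in os.listdir(sections_dir):
        with open(os.path.join(sections_dir, name)) as f:
            addrs[name] = int(f.read().strip(), 16)
    return addrs
```

On a real system the section files are typically only readable by root, which fits drgn's live-kernel use case.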

Unfortunately, neither approach finds the address of .data..percpu. This is because add_sect_attrs() skips "empty" sections, which are defined as sections without the SHF_ALLOC flag or with sh_size == 0. .data..percpu has SHF_ALLOC set and a non-zero size, but module loading clears SHF_ALLOC.
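The "empty" test boils down to a flags-and-size check. A Python transcription of that predicate (the SHF_ALLOC value is from the ELF specification; the function name mirrors the kernel's sect_empty() helper):

```python
SHF_ALLOC = 0x2  # ELF section flag: section occupies memory at run time

def sect_empty(sh_flags: int, sh_size: int) -> bool:
    """A section is treated as "empty" (and skipped by add_sect_attrs())
    if it lacks SHF_ALLOC or has zero size."""
    return not (sh_flags & SHF_ALLOC) or sh_size == 0

# .data..percpu starts out allocatable and non-empty in the ELF file...
assert not sect_empty(SHF_ALLOC, 0x1000)
# ...but module loading clears SHF_ALLOC, so sect_attrs/sysfs skip it.
assert sect_empty(0, 0x1000)
```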

As you pointed out, we can get the "address" of .data..percpu from struct module::percpu instead. I pushed a proof of concept of this fix here: https://github.com/osandov/drgn/tree/module-percpu-hack. Does that fix the issue for you?

The downside of this fix is that we can no longer use the sysfs-based fast path for the live kernel. The struct module-based path is slower than the sysfs path for a couple of reasons: accessing sysfs is faster than reading from /proc/kcore, and, more importantly, depending on struct module means that vmlinux must be fully indexed before modules can be found, which means that we can't parallelize indexing vmlinux and kernel modules together like we can for the sysfs fast path. The impact varies based on the number of CPUs on the host: with just a few CPUs, there is little to no difference, but with many, startup takes almost 1.5x as long (e.g., from ~0.35s to ~0.52s on my 56-thread development server).

Correctness should of course take priority over performance here, but I was hoping to come up with some way to avoid this performance hit and hence kept this on the backburner.

@brenns10
Contributor Author

brenns10 commented Jul 3, 2022

Thanks for the detailed explanation, and I share your disappointment. Is there a way to make this work at runtime, only when a module percpu variable is accessed? For instance, when the object finder loads a symbol located in the .data..percpu section, it could walk the modules list and set the section offset for each module?

I don't know if that's even feasible since I don't really know the code well.

Unfortunately I won't be able to test your module-percpu-hack branch for a while, since I took a trip of my own recently and left my vmcores on my work laptop :) Thankfully the folks impacted on my team have the above module_per_cpu() workaround, so I think we can let this one sit a bit if you prefer. I'll set a reminder to test the branch when I return.

@osandov
Owner

osandov commented Sep 12, 2022

So I revisited this, and I found that my initial analysis was incorrect -- the extra overhead from disabling the /proc/modules and /sys/module fast path isn't from reducing the parallelism, but rather almost entirely from reading from /proc/kcore (which is stupidly slow, see #106). But I was able to reduce the necessary number of reads to 1 per module, which reduces the performance hit from ~100ms to ~13ms in my benchmark. I have some ideas to squeeze a little bit more out, but even if those don't pan out, I'm happy to give up 13ms to fix this bug. I'll probably finish this in the next couple of days.
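The "one read per module" idea can be illustrated with a toy buffer-backed reader (class names and layout are my own invention; in drgn the single expensive read would pull the whole struct module out of /proc/kcore, after which individual fields are sliced from the cached bytes):

```python
class FakeKcore:
    """Stand-in for a slow memory source such as /proc/kcore,
    counting how many reads are issued against it."""

    def __init__(self, memory: bytes):
        self.memory = memory
        self.reads = 0

    def read(self, offset: int, size: int) -> bytes:
        self.reads += 1
        return self.memory[offset:offset + size]


class CachedStruct:
    """Fetch an entire structure with one expensive read, then serve
    individual field accesses from the cached copy."""

    def __init__(self, source: FakeKcore, offset: int, size: int):
        self.buf = source.read(offset, size)  # the single expensive read

    def field(self, off: int, size: int) -> bytes:
        # Served from the cache; no further reads from the slow source.
        return self.buf[off:off + size]
```

With this shape, reading N fields of a structure costs one slow read instead of N, which is where the ~100ms-to-~13ms improvement comes from.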

@osandov
Owner

osandov commented Sep 12, 2022

According to my test case, this is now fixed, but please re-open the issue if you test it and it doesn't fix things for you.

@brenns10
Contributor Author

Thanks for doing this - I read through the commits and it makes sense. It's unfortunate to lose the /proc/modules fast path, but reading the whole struct module at once is a nice optimization :)

I'll verify this fix momentarily on a similar vmcore, but it looks right to me too.
