Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for built-in ORC information #460

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

brenns10
Copy link
Contributor

Thanks to your work today with 3ce0fee ("tests: don't clobber file in use by libelf") and a1869f9 ("Make StackFrame.name fall back to symbol/PC and add StackFrame.function_name"), this branch is fully unblocked!

Now that the module API has landed, we can usually rely on having a struct drgn_module available for a kernel module, even if the module debuginfo is not loaded (either because you have only loaded debuginfo for the kernel, or because you are using CTF). My previous ORC support required essentially re-implementing a small portion of the module API, specifically a tree that mapped address ranges to ORC data. Now, I can drop all that complexity and share this branch which I think is ready for consideration.

This branch's main goal is to enable using built-in ORC information for stack unwinding, for the cases where a module does not have debuginfo. I think the commit messages here describe everything that's important. I think the testing is okay -- I wish I could test the vmlinux ORC, but I think I would need a type finder to allow the module API to even get initialized.

This will allow orc_version_from_header() to be reused for upcoming ORC
integration that does not use libelf.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
This allows users to get the object which the module was created from.
The primary use case is for Linux kernel modules, to return the "struct
module" associated with the drgn module object.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
@brenns10
Copy link
Contributor Author

So, the error here was that the 4.9 kernel does not support ORC, and I was passing through the lookup error for "num_orcs" in struct mod_arch_specific, rather than swallowing it (since that shouldn't interrupt a stack trace).

The interesting thing is that after correcting that, the test passes, because on 4.9, unwind can be done via frame pointers rather than ORC. So really, my stack tracing test isn't (necessarily) testing the ability to unwind with ORC, it's testing the ability to unwind with anything other than DWARF. I don't really have a knob to test this in the Python API. But unfortunately, this isn't really something easy to test in a C unit test either.

Since drgn typically assumes the presence of debug files, ORC has always
been loaded from the ELF file, rather than the program. However, ORC is
present in kernel core dumps, so it can be used when the debug file is
unavailable. Implement the ability to load built-in ORC for vmlinux and
kernel modules. We still prefer to load ORC from the debug file wherever
possible, because this is almost certainly faster.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
When looking up CFI rules using ORC, we use module->debug_file_bias
unconditionally. This makes sense when the ORC is always loaded from an
ELF debug file. However, now that built-in ORC can be loaded, it is
possible that:

1. ORC is loaded from the built-in source, prior to loading the debug
   file. The module->debug_file_bias == 0, so the ORC is interpreted
   correctly.
2. Later, a debug file is loaded, updating debug_file_bias. However, the
   ORC hasn't been loaded from the debug file, so the bias is not
   applicable.
3. Future CFI lookups using ORC fail due to the extra bias.

To avoid this, apply the debug_file_bias once to module->orc.pc_base,
at the time we load the ORC sections out of the debug file. This ensures
that the bias is only applied to the ORC data when we know we need it.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
@brenns10
Copy link
Contributor Author

I made a couple changes:

  1. Now, we log whenever we load built-in ORC.
  2. The tests now detect the log message to verify that ORC is getting used for unwinding.
  3. The tests also skip kernel 4.9 by detecting whether the kernel has the __start_orc_unwind symbol.

@brenns10
Copy link
Contributor Author

Interestingly, my test above that tested log messages failed on Python 3.6. I guess there is a difference in how the logging got initialized. I've added a context manager to explicitly set drgn's log level to DEBUG for the duration of the test, which I believe resolves the issue.

Finally, I've gone ahead and added one more test, which actually does test vmlinux ORC. It creates a Program with no debuginfo, and then copies a struct pt_regs from the Program which has debuginfo, and asks to unwind that. It's quite funky, but it does work, and I tested it on 4.9 as well as 6.12.

With that, I do feel happy about the testing now!

Copy link
Owner

@osandov osandov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I perused the ORC changes and they looked sane overall, but I'll have to give them a closer look after the holidays. I looked over the Module.object change more closely, so some comments there.

The Python 3.6 logging issues are probably due to our very sketchy integration with Python logging:

// This is slightly heinous. We need to sync the Python logging configuration
// with libdrgn, but the Python log level and handlers can change at any time,
// and there are no APIs to be notified of this.
//
// To sync the log level, we monkey patch logger._cache.clear() to update the
// libdrgn log level on every live program. This only works since CPython commit
// 78c18a9b9a14 ("bpo-30962: Added caching to Logger.isEnabledFor() (GH-2752)")
// (in v3.7), though. Before that, the best we can do is sync the level at the
// time that the program is created.
//
// We also check handlers in that monkey patch, which isn't the right place to
// hook but should work in practice in most cases.
.

Comment on lines +1506 to +1507
For Linux kernel loadable modules, this is a reference to the ``struct
module`` this was created from. For other kinds, this is currently an absent
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pointer (i.e. struct module *) feels slightly more natural to me (following the precedent of drgn.Thread.object). I'd be fine restricting drgn.Program.linux_kernel_loadable_module() to only accept struct module * and not struct module to make that simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about that as well. I get that value pointers vs reference objects are always confusing to users, so I agree it would probably be better to stick with the convention that's already established.

I'd actually love to try to keep linux_kernel_loadable_module() accepting struct module. If we can't create a pointer to the object, then we can just leave the Module.object field absent as before. If you're ok with that.

The reason being that I really enjoyed the possibility of somebody constructing a struct module value Object, and then passing it in. I don't actually know how feasible that would be, but it's kinda fun :)

Comment on lines +292 to +293
struct drgn_program *prog = drgn_module_program(self->module);
Program *prog_obj = container_of(prog, Program, prog);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are enough places that do this that we probably need a Program *Module_prog(Module *module) akin to DrgnObject_prog() and DrgnType_prog().

* @param[out] ret Initialized object where the module object is placed
*/
struct drgn_error *
drgn_module_object(struct drgn_module *module, struct drgn_object *ret);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add a drgn_module_set_object() (and the corresponding Python binding setter)? The module API in general lets the user say it knows better than drgn, so we should have a way to override this (for any kind of module).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It strikes me that this opens up the possibility for a user to assign a completely irrelevant object to the module, thus making it much more likely for an error to occur when we're using it. I don't have a problem with that, of course: good error handling is important regardless.

But in my experience, sometimes errors that come from too deep within the bowels of drgn are no longer helpful, because they don't come with a stack trace like Python errors. For example here: if a user ran prog.thread(1).stack_trace(), and got an error related to the ORC unwinder, it may not be obvious where it came from.

I know we can catch errors that we expect to be likely, and replace them with more informative ones. But I wonder if it wouldn't be worthwhile to add some sort of error-chaining functionality for drgn? That way, we could implement it so that any error returned by drgn_read_builtin_orc() might get wrapped with the context "encountered error while loading built-in ORC for module foo", without eliminating the specifics of the error that is wrapped.

That's just an oddball idea which doesn't have to be part of this PR, in any case. And I could make the case against it as well: maybe we should rely on logging to give us the context for the error! Just an idea to ponder over the holidays.

@brenns10
Copy link
Contributor Author

Ahh, thank you! I knew about the loglevel monkey-patching, but I hadn't read the full comment, so I missed the bit about syncing the log level when the Program is created. That makes sense, and I can easily fix it.

And yeah, I should have made more clear on this that I wasn't hoping for quick action before the holidays :) All the quick updates have been because I've been nerd-sniped by the tests, which are really fun to make work.

@brenns10
Copy link
Contributor Author

At your leisure: How do you feel about the API choice of Module.object having an absent object, instead of None, when there is no object present? On the one hand, it seems nice to not have an Optional[Object] type annotation. But on the other, part of me feels like it's overloading the concept of an absent object.

@osandov
Copy link
Owner

osandov commented Dec 21, 2024

Stick with the absent object, I prefer that over None in this case.

Loading built-in ORC is a difficult functionality to test: it is best
tested when there is no debuginfo file. Thus, we add two tests: one
simpler test in which the kernel has debuginfo, but a module does not,
and we must unwind a stack with functions from the module. The second
test is more complex, where we create a program with no debuginfo at
all, and provide it just enough data to initialize the module API and
unwind with built-in ORC.

In both cases, to verify that drgn is actually using ORC, we capture its
log messages.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants