Support for built-in ORC information #460

brenns10 · 2024-12-20T00:19:34Z

Thanks to your work today with 3ce0fee ("tests: don't clobber file in use by libelf") and a1869f9 ("Make StackFrame.name fall back to symbol/PC and add StackFrame.function_name"), this branch is fully unblocked!

Now that the module API has landed, we can usually rely on having a struct drgn_module available for a kernel module, even if the module debuginfo is not loaded (either because you have only loaded debuginfo for the kernel, or because you are using CTF). My previous ORC support required essentially re-implementing a small portion of the module API, specifically a tree that mapped address ranges to ORC data. Now, I can drop all that complexity and share this branch which I think is ready for consideration.

This branch's main goal is to enable using built-in ORC information for stack unwinding, for the cases where a module does not have debuginfo. I think the commit messages here describe everything that's important. I think the testing is okay -- I wish I could test the vmlinux ORC, but I think I would need a type finder to allow the module API to even get initialized.

This will allow orc_version_from_header() to be reused for upcoming ORC integration that does not use libelf. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

This allows users to get the object which the module was created from. The primary use case is for Linux kernel modules, to return the "struct module" associated with the drgn module object. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

brenns10 · 2024-12-20T07:09:51Z

So, the error here was that the 4.9 kernel does not support ORC, and I was passing through the lookup error for "num_orcs" in struct mod_arch_specific, rather than swallowing it (since that shouldn't interrupt a stack trace).

The interesting thing is that after correcting that, the test passes, because on 4.9, unwind can be done via frame pointers rather than ORC. So really, my stack tracing test isn't (necessarily) testing the ability to unwind with ORC, it's testing the ability to unwind with anything other than DWARF. I don't really have a knob to test this in the Python API. But unfortunately, this isn't really something easy to test in a C unit test either.

Since drgn typically assumes the presence of debug files, ORC has always been loaded from the ELF file, rather than the program. However, ORC is present in kernel core dumps, so it can be used when the debug file is unavailable. Implement the ability to load built-in ORC for vmlinux and kernel modules. We still prefer to load ORC from the debug file wherever possible, because this is almost certainly faster. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

When looking up CFI rules using ORC, we use module->debug_file_bias unconditionally. This makes sense when the ORC is always loaded from an ELF debug file. However, now that built-in ORC can be loaded, it is possible that: 1. ORC is loaded from the built-in source, prior to loading the debug file. The module->debug_file_bias == 0, so the ORC is interpreted correctly. 2. Later, a debug file is loaded, updating debug_file_bias. However, the ORC hasn't been loaded from the debug file, so the bias is not applicable. 3. Future CFI lookups using ORC fail due to the extra bias. To avoid this, apply the debug_file_bias once to module->orc.pc_base, at the time we load the ORC sections out of the debug file. This ensures that the bias is only applied to the ORC data when we know we need it. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

brenns10 · 2024-12-20T18:55:10Z

I made a couple changes:

Now, we log whenever we load built-in ORC.
The tests now detect the log message to verify that ORC is getting used for unwinding.
The tests also skip kernel 4.9 by detecting whether the kernel has the __start_orc_unwind symbol.

brenns10 · 2024-12-20T23:02:56Z

Interestingly, my test above that tested log messages failed on Python 3.6. I guess there is a difference in how the logging got initialized. I've added a context manager to explicitly set drgn's log level to DEBUG for the duration of the test, which I believe resolves the issue.

Finally, I've gone ahead and added one more test, which actually does test vmlinux ORC. It creates a Program with no debuginfo, and then copies a struct pt_regs from the Program which has debuginfo, and asks to unwind that. It's quite funky, but it does work, and I tested it on 4.9 as well as 6.12.

With that, I do feel happy about the testing now!

osandov

I perused the ORC changes and they looked sane overall, but I'll have to give them a closer look after the holidays. I looked over the Module.object change more closely, so some comments there.

The Python 3.6 logging issues are probably due to our very sketchy integration with Python logging:

drgn/libdrgn/python/program.c

Lines 144 to 155 in a1869f9

    
           // This is slightly heinous. We need to sync the Python logging configuration 
        
           // with libdrgn, but the Python log level and handlers can change at any time, 
        
           // and there are no APIs to be notified of this. 
        
           // 
        
           // To sync the log level, we monkey patch logger._cache.clear() to update the 
        
           // libdrgn log level on every live program. This only works since CPython commit 
        
           // 78c18a9b9a14 ("bpo-30962: Added caching to Logger.isEnabledFor() (GH-2752)") 
        
           // (in v3.7), though. Before that, the best we can do is sync the level at the 
        
           // time that the program is created. 
        
           // 
        
           // We also check handlers in that monkey patch, which isn't the right place to 
        
           // hook but should work in practice in most cases.

.

osandov · 2024-12-20T23:32:28Z

_drgn.pyi

+    For Linux kernel loadable modules, this is a reference to the ``struct
+    module`` this was created from. For other kinds, this is currently an absent


A pointer (i.e. struct module *) feels slightly more natural to me (following the precedent of drgn.Thread.object). I'd be fine restricting drgn.Program.linux_kernel_loadable_module() to only accept struct module * and not struct module to make that simpler.

Yeah, I was thinking about that as well. I get that value pointers vs reference objects are always confusing to users, so I agree it would probably be better to stick with the convention that's already established.

I'd actually love to try to keep linux_kernel_loadable_module() accepting struct module. If we can't create a pointer to the object, then we can just leave the Module.object field absent as before. If you're ok with that.

The reason being that I really enjoyed the possibility of somebody constructing a struct module value Object, and then passing it in. I don't actually know how feasible that would be, but it's kinda fun :)

osandov · 2024-12-20T23:37:12Z

libdrgn/python/module.c

+	struct drgn_program *prog = drgn_module_program(self->module);
+	Program *prog_obj = container_of(prog, Program, prog);


There are enough places that do this that we probably need a Program *Module_prog(Module *module) akin to DrgnObject_prog() and DrgnType_prog().

osandov · 2024-12-20T23:39:55Z

libdrgn/drgn.h

+ * @param[out] ret Initialized object where the module object is placed
+ */
+struct drgn_error *
+drgn_module_object(struct drgn_module *module, struct drgn_object *ret);


Can you please also add a drgn_module_set_object() (and the corresponding Python binding setter)? The module API in general lets the user say it knows better than drgn, so we should have a way to override this (for any kind of module).

It strikes me that this opens up the possibility for a user to assign a completely irrelevant object to the module, thus making it much more likely for an error to occur when we're using it. I don't have a problem with that, of course: good error handling is important regardless.

But in my experience, sometimes errors that come from too deep within the bowels of drgn are no longer helpful, because they don't come with a stack trace like Python errors. For example here: if a user ran prog.thread(1).stack_trace(), and got an error related to the ORC unwinder, it may not be obvious where it came from.

I know we can catch errors that we expect to be likely, and replace them with more informative ones. But I wonder if it wouldn't be worthwhile to add some sort of error-chaining functionality for drgn? That way, we could implement it so that any error returned by drgn_read_builtin_orc() might get wrapped with the context "encountered error while loading built-in ORC for module foo", without eliminating the specifics of the error that is wrapped.

That's just an oddball idea which doesn't have to be part of this PR, in any case. And I could make the case against it as well: maybe we should rely on logging to give us the context for the error! Just an idea to ponder over the holidays.

brenns10 · 2024-12-21T00:03:52Z

Ahh, thank you! I knew about the loglevel monkey-patching, but I hadn't read the full comment, so I missed the bit about syncing the log level when the Program is created. That makes sense, and I can easily fix it.

And yeah, I should have made more clear on this that I wasn't hoping for quick action before the holidays :) All the quick updates have been because I've been nerd-sniped by the tests, which are really fun to make work.

brenns10 · 2024-12-21T00:08:08Z

At your leisure: How do you feel about the API choice of Module.object having an absent object, instead of None, when there is no object present? On the one hand, it seems nice to not have an Optional[Object] type annotation. But on the other, part of me feels like it's overloading the concept of an absent object.

osandov · 2024-12-21T00:11:46Z

Stick with the absent object, I prefer that over None in this case.

Loading built-in ORC is a difficult functionality to test: it is best tested when there is no debuginfo file. Thus, we add two tests: one simpler test in which the kernel has debuginfo, but a module does not, and we must unwind a stack with functions from the module. The second test is more complex, where we create a program with no debuginfo at all, and provide it just enough data to initialize the module API and unwind with built-in ORC. In both cases, to verify that drgn is actually using ORC, we capture its log messages. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

brenns10 added 2 commits December 19, 2024 16:10

orc_info: don't use ELF_Data when processing orc header

7e72870

This will allow orc_version_from_header() to be reused for upcoming ORC integration that does not use libelf. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

module: add object

278f004

This allows users to get the object which the module was created from. The primary use case is for Linux kernel modules, to return the "struct module" associated with the drgn module object. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

brenns10 force-pushed the orc branch from 17872c3 to f39f386 Compare December 20, 2024 07:05

brenns10 added 2 commits December 20, 2024 10:53

brenns10 force-pushed the orc branch from f39f386 to 58de01c Compare December 20, 2024 18:53

brenns10 force-pushed the orc branch from 58de01c to 5a5582b Compare December 20, 2024 22:59

osandov reviewed Dec 20, 2024

View reviewed changes

brenns10 force-pushed the orc branch from 5a5582b to 44e10fd Compare December 21, 2024 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for built-in ORC information #460

Support for built-in ORC information #460

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

osandov left a comment

osandov Dec 20, 2024

brenns10 Dec 21, 2024

osandov Dec 20, 2024

osandov Dec 20, 2024

brenns10 Dec 21, 2024

brenns10 commented Dec 21, 2024

brenns10 commented Dec 21, 2024

osandov commented Dec 21, 2024

	// This is slightly heinous. We need to sync the Python logging configuration
	// with libdrgn, but the Python log level and handlers can change at any time,
	// and there are no APIs to be notified of this.
	//
	// To sync the log level, we monkey patch logger._cache.clear() to update the
	// libdrgn log level on every live program. This only works since CPython commit
	// 78c18a9b9a14 ("bpo-30962: Added caching to Logger.isEnabledFor() (GH-2752)")
	// (in v3.7), though. Before that, the best we can do is sync the level at the
	// time that the program is created.
	//
	// We also check handlers in that monkey patch, which isn't the right place to
	// hook but should work in practice in most cases.

		For Linux kernel loadable modules, this is a reference to the ``struct
		module`` this was created from. For other kinds, this is currently an absent

		struct drgn_program *prog = drgn_module_program(self->module);
		Program *prog_obj = container_of(prog, Program, prog);

Support for built-in ORC information #460

Are you sure you want to change the base?

Support for built-in ORC information #460

Conversation

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

brenns10 commented Dec 20, 2024

osandov left a comment

Choose a reason for hiding this comment

osandov Dec 20, 2024

Choose a reason for hiding this comment

brenns10 Dec 21, 2024

Choose a reason for hiding this comment

osandov Dec 20, 2024

Choose a reason for hiding this comment

osandov Dec 20, 2024

Choose a reason for hiding this comment

brenns10 Dec 21, 2024

Choose a reason for hiding this comment

brenns10 commented Dec 21, 2024

brenns10 commented Dec 21, 2024

osandov commented Dec 21, 2024