Supporting stack unwinding in the JIT compiler #126910

pablogsal · 2024-11-16T19:33:22Z

TLDR

This is a lot of text because the issue is complex but if you want the gist:

To avoid breaking the many tools that rely on unwinding, I propose compiling the JIT stencils with frame pointers. This has a trivial maintenance cost (2 lines) and only involves a 2% hit on speed when using the JIT, while making almost all existing debuggers and profilers just work™ in the presence of the JIT.

Although this sadly doesn't fix everything, I think it is the best compromise I can find (and we are quite lucky to have it, as fixing this is normally a nightmare: thousands of complex lines of code and a 50% slowdown).

The issue + proposal

CPython's JIT compiler must provide robust stack unwinding support to maintain compatibility with the Python ecosystem's debugging and profiling tools. This requirement is particularly critical given Python's reliance on native extensions written in C, C++, and Rust. As more performance-critical code moves to these native implementations, the ability to properly unwind through mixed Python and native frames becomes fundamental for effective debugging and profiling. This capability is even more crucial with the introduction of both JIT compilation and free-threaded modes, which significantly increase the complexity of runtime behavior. Without proper unwinding support, debugging the issues that can appear in the presence of these new modes becomes extremely challenging: developers would be unable to generate meaningful stack traces, hampering their ability to understand where and why their programs failed. Consider debugging a deadlock in free-threaded code and a Rust extension, or investigating a crash in native code called from JIT-compiled Python: without proper unwinding support, error reports would be incomplete or misleading, making production issues significantly harder to diagnose and fix. The ability to get complete, accurate stack traces in these scenarios is not just a convenience; it's essential for maintaining production applications where Python increasingly interacts with native code through multiple execution modes.

Profiling is equally affected – modern performance analysis tools like py-spy, Austin, or eBPF-based solutions need to understand the complete call stack, including JIT-compiled code, native extensions, and regular Python frames, to provide accurate performance insights.

Some of the popular tools that use native unwinding in some way or form:

Austin (statistical profiler): uses libunwind
py-spy (statistical profiler): uses libunwind
memray: uses the C++ unwinder on macOS and libunwind on Linux
pystack: uses elfutils (as it is the only one supporting core files)
eBPF unwinders: normally rely on frame pointers or on the C++ unwinder (https://israelo.io/blog/ehframes/, https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-based-stack-walking-using-ebpf)
GDB and gdb-helpers: a combination of libdw, libunwind or custom
perf: frame pointers, libunwind or libdw
LLDB: libunwind or custom
valgrind: uses a custom unwinder based on libdw

This doesn't include the considerable amount of custom tools that are not open source either in the debugging or profiling space.

Unwinding Libraries and Their Capabilities

Several libraries provide stack unwinding capabilities, each with its own strengths and limitations.

| Library | Platforms | Local Unwinding | Remote Unwinding | Core Files | Registration API | Remote Registration Support | Main Users |
| --- | --- | --- | --- | --- | --- | --- | --- |
| libunwind | Linux only | Yes (fast) | Yes | No | `_U_dyn_register` | No | Profilers, debuggers |
| Native Unwinder (LLVM/GDB) | Linux, macOS | Yes (optimized) | No | No | `__register_frame` | No | C++ exceptions, `backtrace()` |
| libdw (elfutils) | Linux only | Yes | Yes | Yes | None (ELF parsing) | No | GDB, debuggers like pystack |
| GDB/LLDB built-in | Cross-platform | Yes | Yes | Yes | `__jit_debug_register_code` | No | Debug tools |

Implementation Plan for Stack Unwinding Support

After extensive experimentation and analysis (believe me, this was a lot of very difficult research work), @brandtbucher and I found that since we have preserve_none and LLVM 19 we can just compile the JIT stencils with frame pointers and that makes most tools just work. Based on this, I propose implementing frame pointer support as the primary strategy for stack unwinding in CPython's JIT. This approach has proven to be the most pragmatic and effective solution, providing broad compatibility with existing tools while maintaining reasonable performance characteristics. I do think we got very lucky that this fixes most of the tools, since adding unwinding support for JITs otherwise is very challenging (see following sections).

Frame pointer support can be enabled through a minimal change to the JIT compiler flags:

diff --git a/Tools/jit/_targets.py b/Tools/jit/_targets.py
index d8dce0a905c..7c3b2e5aab7 100644
--- a/Tools/jit/_targets.py
+++ b/Tools/jit/_targets.py
@@ -135,6 +135,8 @@ async def _compile(
             # Don't call stack-smashing canaries that we can't find or patch:
             "-fno-stack-protector",
             "-std=c11",
+            "-fno-omit-frame-pointer",
+            "-mno-omit-leaf-frame-pointer",
             "-o",
             f"{o}",
             f"{c}",

This simple change enables compatibility with a wide range of tools:

libunwind (both local and remote unwinding)
libdw (supporting core file analysis, remote and local unwinding)
GDB and LLDB
eBPF-based unwinders

Most of the popular tools will just work with this change, instead of us having to implement tons of different kinds of unwinding information support (which is a nightmare).

Performance Impact

https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20241114-3.14.0a1%2B-925b70b-JIT/bm-20241114-linux-x86_64-brandtbucher-justin_frame_pointer-3.14.0a1%2B-925b70b-vs-base.md

The performance impact of this change has been carefully measured:

AMD64: Approximately 2% overhead, as shown in recent benchmarks
ARM64 and macOS: Even lower overhead due to the presence of dedicated link registers

This modest performance cost is significantly outweighed by the expected JIT performance improvements. Moreover, the maintenance burden is minimal compared to alternative approaches like implementing custom unwinding support for each tool.

Additional Optimization

We can further enhance compatibility by generating eh-frames in the JIT stencils and calling __register_frame_table (a bulk version of __register_frame). Benchmarks show this additional feature has neutral performance impact while enabling native unwinder support. This makes the solution even more robust without additional overhead.

Given these considerations - the minimal implementation complexity, broad tool compatibility, reasonable performance characteristics, and the ability to extend support to native unwinders - frame pointer support represents the optimal path forward for CPython's JIT implementation.

Why this must be activated by default

The decision to enable frame pointers by default is driven by two critical production requirements.

When applications crash or hang, tools like pystack need to analyze program state without the luxury of reproduction – you cannot simply "run it again with debug options enabled." You need to unwind the stack either remotely or from a core file.
Performance profiling in production requires safely inspecting program state from another process or the kernel. Tools like Austin, py-spy, and eBPF-based profilers need to periodically sample stack traces without interrupting the target process. Both scenarios require frame pointers to work reliably.

This must be enabled by default because most Python users don't compile Python themselves: they install it through package managers or tools like uv. Unless frame pointers are enabled by default, these critical debugging and profiling capabilities won't be available in most Python installations, severely impacting the quality of bug reports to both CPython and C extensions. Given the minimal performance impact of 2%, frame pointers should be enabled by default, following the principle of being "safe by default." While we should provide the option to disable them, this should be a conscious decision made by end users who understand they're trading away the ability to properly profile and debug their applications in production, not a choice made by intermediate distributors or package managers.

Additional information

Using frame pointers aligns with a broader industry trend where major tech companies and Linux distributions are reverting previous optimization decisions and re-enabling frame pointers. Companies like Meta and Google now compile their entire software stack, including their Python installations, with frame pointers enabled. This shift is driven by the recognition that observability and debugging capabilities in production environments far outweigh the minor performance impact of frame pointers. Ubuntu has also adopted this approach, now compiling virtually all packages with frame pointers enabled (with Python being a notable exception we aim to address). This industry-wide move reflects a fundamental reality: in modern production environments, the ability to profile, monitor, and debug applications effectively is crucial, and frame pointers provide the most reliable and universal way to achieve this. The performance cost, typically around 1-5%, is normally considered to be a worthwhile trade-off for the improved observability they enable. This is particularly relevant for server workloads where understanding performance characteristics and debugging production issues is far more valuable than the small overhead frame pointers introduce. The fact that major tech companies maintain this policy even for performance-critical services demonstrates that the benefits of comprehensive profiling and debugging support outweigh the minimal performance impact. Some links:

https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-by-default
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
https://www.polarsignals.com/blog/posts/2023/12/13/embracing-frame-pointers-in-ubuntu-24-04-lts

Background on Stack Unwinding Support Requirements

Adding JIT compilation to CPython requires careful consideration of stack unwinding support. Without proper unwinding capabilities, we risk breaking compatibility with essential development tools that Python developers rely on daily. To understand the scope of this requirement, we need to examine the landscape of tools that depend on stack unwinding.

Tools Requiring Stack Unwinding

Stack unwinding is a critical capability used by three major categories of tools in the Python ecosystem:

Debuggers (GDB, LLDB, pystack) rely on unwinding to show developers where their program is currently executing. These tools deal with remote processes or core files.
Profilers use unwinding to understand program performance.
C++ Exception Handling depends on unwinding information to:
- Propagate exceptions through the stack
- Clean up resources during stack unwinding

Unwinding Mechanisms

Stack unwinding occurs in two fundamentally different contexts:

In-process unwinding:
- Code examines its own stack
- Used by tracing profilers and exception handlers
- Has direct access to program memory and CPU registers
- Primarily used for live analysis
Out-of-process unwinding:
- External program examines another program's stack
- Used by debuggers and statistical profilers
- Includes core dump analysis

Types of Analysis Tools

The tools that perform unwinding can be categorized by their data collection method:

Tracing Profilers:
- Run inside the target process
- Instrument function entries and exits
- Collect detailed execution information
- Higher overhead but more precise
- Rely on local unwinding capabilities
- Examples: cProfile, memray
Statistical Profilers:
- Run outside the target process
- Sample program state periodically
- Much lower overhead
- Suitable for production use
- Growing eBPF ecosystem:
  - Enables efficient system-wide analysis
  - Becoming standard for production monitoring
  - Requires reliable stack unwinding support
  - Examples: perf, py-spy, bcc tools

Unwinding Libraries and Their Capabilities

Several libraries provide stack unwinding capabilities, each with its own strengths and limitations. The strategies to support unwinding for JIT compiled code are:

Allow frame pointers: makes libdw, gdb, lldb and libunwind work. libunwind and libdw work in remote mode, including core files.
Per-unwinder support for dynamically generated code using DWARF (only in local mode, and only if the unwinder supports it; libdw has no support):
- libunwind: Construct ELF .debug_frame data structures, and table_entry data structures, and a unw_dyn_table_info data structure, and a unw_dyn_info_t structure, then call _U_dyn_register.
- C++ exception unwinder: Construct ELF .eh_frame data structures, then call __register_frame
- GDB: Construct a full in-memory ELF object, manually maintain a doubly-linked list of all such objects in a global variable called __jit_debug_descriptor, and call a global function called __jit_debug_register_code when this list is changed

No other library has support for this.

Other JITs

Most JIT compilers implement not only frame pointers to make it possible to profile and debug, but also implement either C++ exception unwinder via eh_frames or even more hardcore support for gdb and other debuggers. Some links:


brandtbucher · 2024-11-16T20:16:07Z

Just to clarify for anyone reading, it is sufficient to only enable frame pointers in JIT code, not necessarily the entire interpreter.

After spending a week with @pablogsal digging deep into this issue and prototyping and evaluating the different options available to us (including doing nothing), my personal opinion is that this issue is fixing a behavioral change, not adding a feature. It's also my personal opinion that the two-line diff above and the 2% hit are a good compromise that unblocks adoption of the JIT, and I think it's the best path forward. We have explored strategies to emit DWARF as well; copy-and-patch makes harvesting and emitting correct DWARF not-too-difficult, but I also see it as an optional, lower priority nice-to-have.

Frame pointers are a convenient escape hatch; for many tools that still don't work, we can at least point at frame pointers and argue that what the tools want should be possible using them. Currently, what they want is impossible without disabling the JIT... and we want people to turn it on!

Turning on frame pointers in the naive way above also does not preclude the possibility of improving the way we emit frame pointer code in the future should we decide it's worth clawing back that 2% in exchange for additional implementation complexity; such a patch can stand on its own merits. It also shouldn't preclude a runtime option to opt in or out of them using an environment variable (again, though, I'm not sure this is worth additional complexity right now to compile two versions of every stencil to make this work, but it's certainly possible).

I'd just like to thank @pablogsal for patiently educating me on this deep, dark, expansive area (and the people who rely on it) and working with me to explore solutions and compromise on something workable.

brandtbucher · 2024-11-16T22:21:27Z

Also, it's kind of neat that we've pretty much followed through with what PEP 744 had to say about this issue:

Since the code templates emitted by the JIT are compiled by Clang, it may be possible to allow JIT frames to be traced through by simply modifying the compiler flags to use frame pointers more carefully. It may also be possible to harvest and emit the debugging information produced by Clang. Neither of these ideas have been explored very deeply.

While this is an issue that should be fixed, fixing it is not a particularly high priority at this time. This is probably a problem best explored by somebody with more domain expertise in collaboration with those maintaining the JIT, who have little experience with the inner workings of these tools.

pablogsal · 2024-11-16T22:26:44Z

Also, it's kind of neat that we've pretty much followed through with what PEP 744 had to say about this issue:

Maybe except this part 😆 :

Neither of these ideas have been explored very deeply.

Now all of these ideas have been explored deeply. We have been through some stuff 😨

brandtbucher · 2024-11-16T22:35:42Z

Benchmarks on JIT frame pointers are in:

aarch64-apple-darwin: 2.1% slower
aarch64-unknown-linux-gnu: 6.1% slower
x86_64-unknown-linux-gnu: 2.1-3.1% slower
x86_64-pc-windows-msvc: 2.6% slower
i686-pc-windows-msvc: 2.2% slower

Looks like a 2-3% performance hit, with the clear outlier of aarch64-unknown-linux-gnu at around a 6% hit.

markshannon · 2024-11-18T11:23:14Z

According to our profiling, jitted code is only ~14% of the execution time, so a 3% slowdown is a ~20% slowdown in the jitted code.

I think that's too slow.
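
For anyone who wants to check the arithmetic behind that estimate, here is a quick sketch. It assumes (a simplification not stated in the comment) that the entire slowdown is incurred inside jitted code and that the rest of the runtime is unaffected. The function name is made up for illustration.

```python
def jit_local_slowdown(overall_slowdown: float, jit_time_fraction: float) -> float:
    """Slowdown of the jitted code alone, assuming all extra time is spent there."""
    if not 0.0 < jit_time_fraction <= 1.0:
        raise ValueError("jit_time_fraction must be in (0, 1]")
    return overall_slowdown / jit_time_fraction


# Figures quoted in the comment: ~3% overall slowdown, ~14% of time in jitted code.
print(f"{jit_local_slowdown(0.03, 0.14):.1%}")
```

Under the same assumption, the 2% figure from the issue text would correspond to roughly a 14% slowdown of the jitted code itself.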

pablogsal · 2024-11-18T11:45:21Z

Then what do you propose? I don't think we have an alternative here. This is just the cost of not breaking existing tools and honestly I think we are extremely lucky to have a way to fix it that's just 2 lines with 2% hit.

cfbolz · 2024-11-18T11:59:34Z

Amazing research work for a super important feature. Thank you @pablogsal and @brandtbucher for this effort! I'm a strong +1.

diegorusso · 2024-11-18T15:22:23Z

In the text of the issue you state:

ARM64 and macOS: Even lower overhead due to the presence of dedicated link registers

and later in the comments:

aarch64-apple-darwin: 2.1% slower
aarch64-unknown-linux-gnu: 6.1% slower

which contradicts what you said earlier. Also that outlier sounds very odd to me as all the AArch64 platforms should behave similarly.

pablogsal · 2024-11-18T15:42:34Z

which contradicts what you said earlier. Also that outlier sounds very odd to me as all the AArch64 platforms should behave similarly.

The original sentence was based on some test runs I did originally, but the second is a full pyperformance run. I also agree it looks off because macOS is using the same instruction set.

We need to run again and confirm, because it makes no sense to me and also contradicts the small tests I did separately.

brandtbucher · 2024-11-18T15:49:43Z

Yeah, the run had some wider min/max values than I'm used to seeing, so it might have been a fluke: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20241114-3.14.0a1%2B-b1f0a4e-JIT/bm-20241114-arminc-aarch64-brandtbucher-justin_frame_pointer-3.14.0a1%2B-b1f0a4e-vs-base.svg

zooba · 2024-11-18T16:11:37Z

I'm +1 on this. Frame pointers are basically the only viable way to handle stack unwinding here, and it's absolutely worth the cost. At the very least, we would need a runtime option to enable them (i.e. without recompiling CPython) because there are going to be essential scenarios that need them.

FWIW, MSVC doesn't allow enabling frame pointers for x64 or ARM64, but the function unwinding tables allow specifying them. I'm not sure what LLVM does, but it should be possible to have unwinding via a frame pointer work on these platforms too.

pablogsal · 2024-11-18T16:35:30Z

At the very least, we would need a runtime option to enable them (i.e. without recompiling CPython) because there are going to be essential scenarios that need them.

The idea is to have them activated by default (safety/debug should be the default) and maybe an env variable or some other way to opt out if you don't care, but this should be set by the end user with knowledge of what the consequences are. Another reason we want this by default is that we don't want to make it impossible for users to send us backtraces when the interpreter or a C extension crashes or hangs, because reproducing this can be quite challenging (all of this is covered in the text).

markshannon · 2024-11-18T16:47:17Z

(safety/debug should be the default)

In what way is using frame pointers safer?

markshannon · 2024-11-18T16:58:29Z

Then what do you propose?

The interpreter is compiled without using the frame pointer. So unwinding clearly doesn't need frame pointers.
So use whatever is used to unwind through the interpreter.

pablogsal · 2024-11-18T16:58:31Z

(safety/debug should be the default)

In what way is using frame pointers safer?

Because it allows debuggers to work.

markshannon · 2024-11-18T16:59:26Z

What do you mean by "safety" in this context?

brandtbucher · 2024-11-18T17:00:22Z

A GitHub issue isn't good for real-time back-and-forth chat.

pablogsal · 2024-11-18T17:00:34Z

What do you mean by "safety" in this context?

That if your application crashes, hangs or generates a core, you can actually use a debugger with that and not get wrong stacks because the JIT makes them choke.

pablogsal · 2024-11-18T17:03:14Z

Then what do you propose?

The interpreter is compiled without using the frame pointer. So unwinding clearly doesn't need frame pointers.

It needs frame pointers if you have a JIT compiler in the middle because the JIT doesn't have DWARF (debug information). This is explained in the issue.

So use whatever is used to unwind through the interpreter.

You cannot, because the JIT doesn't have DWARF and a backing ELF file that unwinders can use. It's just a random string of bytes.

pablogsal · 2024-11-18T17:05:37Z

A GitHub issue isn't good for real-time back-and-forth chat.

Agreed, let's chat on Wednesday.

brandtbucher · 2024-11-20T15:13:23Z

Some good news: with this fix (that I plan to upstream) for LLVM's existing "reserved frame pointers" functionality...

diff --git a/llvm/lib/Target/X86/X86RegisterInfo.cpp b/llvm/lib/Target/X86/X86RegisterInfo.cpp
index 50db211c99d8..9b8652b7e302 100644
--- a/llvm/lib/Target/X86/X86RegisterInfo.cpp
+++ b/llvm/lib/Target/X86/X86RegisterInfo.cpp
@@ -563,7 +563,7 @@ BitVector X86RegisterInfo::getReservedRegs(const MachineFunction &MF) const {
     Reserved.set(SubReg);
 
   // Set the frame-pointer register and its aliases as reserved if needed.
-  if (TFI->hasFP(MF)) {
+  if (TFI->hasFP(MF) || MF.getTarget().Options.FramePointerIsReserved(MF)) {
     if (MF.getInfo<X86MachineFunctionInfo>()->getFPClobberedByInvoke())
       MF.getContext().reportError(
           SMLoc(),

...and this change to CPython (replacing the 2-line change @pablogsal suggests above)...

diff --git a/Tools/jit/_targets.py b/Tools/jit/_targets.py
index d8dce0a905c..4e898a86f86 100644
--- a/Tools/jit/_targets.py
+++ b/Tools/jit/_targets.py
@@ -121,6 +121,8 @@ async def _compile(
             f"-I{CPYTHON / 'Python'}",
             f"-I{CPYTHON / 'Tools' / 'jit'}",
             "-O3",
+            "-Xclang",
+            f"-mframe-pointer={'all' if opname == 'shim' else 'reserved'}",
             "-c",
             # This debug info isn't necessary, and bloats out the JIT'ed code.
             # We *may* be able to re-enable this, process it, and JIT it for a

...frame-pointer-based unwinding works, with 0% slowdown on benchmarks!

More context:

There are two reasons why frame pointers are slow: you lose a register, and you must save and restore your caller's frame pointer state at the beginning and end of each function. It's the latter that's causing a slowdown for us; we compile each uop as its own function (which tail-calls into the next) and concatenate the bodies. So incrementing a fast local, x += 1, should look something like this:

push and incref x
push 1
guard that x is an int
add them
decref and store x

With frame pointers, this instead becomes:

save and set rbp
push and incref x
restore rbp
save and set rbp
push 1
restore rbp
save and set rbp
guard that x is an int
restore rbp
save and set rbp
add them
restore rbp
save and set rbp
decref and store x
restore rbp

All of the frame pointer shuffling obviously isn't necessary; we really only need to do it once at the beginning, and once at the end. But some templates have multiple "returns", so finding and removing all of these is hard. And if we compile without frame pointers, the compiler uses the frame pointer register as scratch space, and clobbers whatever value we put there manually.
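
To make the overhead concrete, here is a toy model (not the real JIT) that renders the pseudo-instruction sequence for the uops above under each strategy and counts the frame pointer operations. The function names and the string form of the "instructions" are assumptions made purely for illustration.

```python
UOPS = [
    "push and incref x",
    "push 1",
    "guard that x is an int",
    "add them",
    "decref and store x",
]

SAVE, RESTORE = "save and set rbp", "restore rbp"


def per_uop_frame_pointers(uops: list[str]) -> list[str]:
    """Each uop is compiled as its own function, so each one saves and restores rbp."""
    out: list[str] = []
    for uop in uops:
        out += [SAVE, uop, RESTORE]
    return out


def single_frame(uops: list[str]) -> list[str]:
    """Ideal case: set up the frame once on entry and tear it down once on exit."""
    return [SAVE, *uops, RESTORE]


def frame_pointer_ops(sequence: list[str]) -> int:
    """Count the instructions that only exist to maintain the frame pointer."""
    return sum(op in (SAVE, RESTORE) for op in sequence)


naive = per_uop_frame_pointers(UOPS)
ideal = single_frame(UOPS)
print(frame_pointer_ops(naive), "vs", frame_pointer_ops(ideal))
```

In this model the naive approach spends two frame pointer operations per uop, while the ideal one spends two per trace regardless of its length.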

However, LLVM does have (broken on main, seemingly fixable with the change I found above) functionality to "reserve" the frame pointer register, meaning it pretends it isn't even there. This is perfect, since it means that as long as we know the frame pointer register is a valid value on entry to our concatenated sequence of code, it will remain valid throughout.

Whenever we call into JIT code, we already push a "shim" frame between the interpreter and the JIT code, to convert between the platform calling convention and the one used for the JIT's tail calls. We can compile just this with frame pointers, since it's very cheap to do so, and compile all of the other JIT code with the frame pointer register reserved. So when the shim calls into the JIT code, the frame pointer register remains valid, and the two frames appear as one to unwinders.

zooba · 2024-11-20T15:42:55Z

and the two frames appear as one to unwinders.

This may not be the case on Windows (I trust you on other platforms), where the IP is used to look up a table embedded in the executable (or registered dynamically) to decide how to unwind. It looks like the shim function is compiled once, which means it gets its own entry that will unwind correctly, but the jitted code still needs a way to unwind.

Unless you generate/copy the shim as part of the rest of the function and it's all contiguous, then both frames really would just be a single frame (and you can probably also just copy the function entry from the shim to apply to the rest of the code, since the unwinding procedure will be identical).

There's some relevant code in https://github.com/microsoft/python-etwtrace/blob/main/src/etwtrace/_etwtrace.c that does this for x64 and ARM64, if you prefer a real example. Note that _thunk is never called directly by this code - that's the most not-obvious part.

brandtbucher · 2024-11-20T16:04:09Z

Yeah, we haven't checked either approach on platforms other than Linux yet.

It looks like the shim function is compiled once, which means it gets its own entry that will unwind correctly, but the jitted code still needs a way to unwind.

Unless you generate/copy the shim as part of the rest of the function and it's all contiguous, then both frames really would just be a single frame (and you can probably also just copy the function entry from the shim to apply to the rest of the code, since the unwinding procedure will be identical).

It's sort of a mix of the two. We JIT a copy of the shim for each trace, but if one trace jumps into another trace the shim frame from the first remains above it, and the second trace's shim is never used.

So if trace A side-exits to B which side-exits to C, then the actual stack will be [..., _PyEval_EvalFrameDefault, <shim A>, <trace C>, ...]. With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].
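
A toy model of that scenario, in case it helps. It only encodes the two rules described here: the shim frame is pushed once on entry from the interpreter, and a side exit replaces the running trace without pushing anything. The function names are hypothetical, and the real stack obviously contains more than strings.

```python
def physical_stack(entry_trace: str, side_exits: list[str]) -> list[str]:
    """Frames actually on the stack after entering a trace and taking side exits."""
    stack = ["_PyEval_EvalFrameDefault", f"<shim {entry_trace}>"]
    current = entry_trace
    for target in side_exits:
        # A side exit tail-calls into the next trace: no new shim is pushed,
        # and the previous trace's code is no longer on the stack.
        current = target
    stack.append(f"<trace {current}>")
    return stack


def unwinder_view(stack: list[str]) -> list[str]:
    """What a frame-pointer unwinder reports: trace code leaves rbp alone,
    so it is folded into the shim's frame and does not show up separately."""
    return [frame for frame in stack if not frame.startswith("<trace ")]


stack = physical_stack("A", ["B", "C"])
print(stack)
print(unwinder_view(stack))
```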

Not sure if that helps at all?

brandtbucher · 2024-11-20T16:10:06Z

This may not be the case on Windows (I trust you on other platforms), where the IP is used to look up a table embedded in the executable (or registered dynamically) to decide how to unwind.

Based on our experiments, this seems to be how other unwinding tools work on Linux (@pablogsal can correct me if I'm wrong). If DWARF unwind info for the IP has been registered with the tool (either from loading the executable or through explicit runtime APIs), it will use that. Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.
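
Here is a minimal sketch of that lookup-then-fallback behavior, using a simulated stack rather than a real process. Everything in it is an assumption for illustration: the `UnwindRule` table is a heavily simplified stand-in for DWARF CFI, the layout is x86-64 style (`[fp]` holds the caller's frame pointer and `[fp + 8]` the return address), and all addresses are made up. Real unwinders do considerably more validation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UnwindRule:
    """Toy stand-in for registered unwind info covering the code range [start, end)."""
    start: int
    end: int
    frame_size: int              # CFA = sp + frame_size
    saved_fp_offset: int | None  # caller's fp is stored at CFA + offset, if saved


def find_rule(rules: list[UnwindRule], pc: int) -> UnwindRule | None:
    return next((r for r in rules if r.start <= pc < r.end), None)


def unwind(memory: dict[int, int], rules: list[UnwindRule],
           pc: int, sp: int, fp: int, max_frames: int = 64) -> list[int]:
    """Return the program counter of each frame, innermost first."""
    pcs: list[int] = []
    while pc != 0 and len(pcs) < max_frames:
        pcs.append(pc)
        rule = find_rule(rules, pc)
        if rule is not None:
            # Registered unwind info: compute the caller's state from the table.
            cfa = sp + rule.frame_size
            pc = memory[cfa - 8]
            if rule.saved_fp_offset is not None:
                fp = memory[cfa + rule.saved_fp_offset]
            sp = cfa
        else:
            # No info for this pc (e.g. JIT code): fall back to the frame pointer.
            pc = memory[fp + 8]
            sp = fp + 16
            fp = memory[fp]
    return pcs


# Hypothetical scenario: "main" -> JIT code (no unwind info) -> "c_helper".
RULES = [
    UnwindRule(0x400000, 0x401000, frame_size=0x20, saved_fp_offset=None),  # main
    UnwindRule(0x500000, 0x501000, frame_size=0x20, saved_fp_offset=-16),   # c_helper
]
MEMORY = {
    0x1000: 0,         # main's return address (0 ends the walk)
    0xFE0: 0x400010,   # return address into main, pushed when calling JIT code
    0xFD8: 0,          # caller's fp, saved by the JIT entry frame
    0xFC0: 0x700020,   # return address into JIT code
    0xFB8: 0xFD8,      # JIT frame's fp, saved by c_helper before it reuses the register
}
frames = unwind(MEMORY, RULES, pc=0x500030, sp=0xFA8, fp=0x1234)
print([hex(p) for p in frames])
```

Note that `c_helper` and `main` are unwound without frame pointers, and only the JIT frame in the middle needs one. If the JIT code had clobbered the frame pointer register instead, the fallback branch would read garbage and the walk would stop or go wrong at that frame.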

zooba · 2024-11-20T16:38:47Z

With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].

This is fine, I'm sure. It makes it a little harder to figure out exactly which Python code led to the code being executed, but stack unwinding should work (and unwinding is far more important).

Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.

I'm 95% sure Windows doesn't fall back to frame pointers, because as I mentioned earlier they're not even enabled by the system compilers. I'm only 50% sure that registering (through those runtime APIs) that a function does use a frame pointer will even work, but at least in that case I'm prepared to report it to the OS as a bug and get it fixed.

pablogsal · 2024-11-20T17:00:56Z

With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].

This is fine, I'm sure. It makes it a little harder to figure out exactly which Python code led to the code being executed, but stack unwinding should work (and unwinding is far more important).

Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.

I'm 95% sure Windows doesn't fall back to frame pointers, because as I mentioned earlier they're not even enabled by the system compilers. I'm only 50% sure that registering (through those runtime APIs) that a function does use a frame pointer will even work, but at least in that case I'm prepared to report it to the OS as a bug and get it fixed.

I spent some time investigating Windows and trying stuff. It seems that there are two options:

Construct some RUNTIME_FUNCTION data structures up front, and then call RtlAddFunctionTable.
Call RtlInstallFunctionTableCallback, and then construct RUNTIME_FUNCTION data structures lazily on demand.

So it is very similar to __register_frame on Linux.

brandtbucher · 2024-11-20T19:28:19Z

Our plan now is:

I'll open an issue about the reserved frame pointers fix for LLVM's x86 backend, just so we can get an idea of the timeline / likelihood of a fix.
We'll test that the reserved frame pointers work on AArch64 Linux with an unmodified LLVM 19. If not, we'll dig into why. If they do work (expected), we'll benchmark it and open a PR with a test (assuming performance doesn't take a hit).
In the meantime, don't make any deep changes to the JIT build that may break frame-pointer-based unwinding, since it's untested currently.

zooba · 2024-11-20T19:44:52Z

I spent some time investigating windows and trying stuff. Seems that there are two options:

For whatever horrible reason, RtlAddGrowableFunctionTable (and the associated RtlGrowFunctionTable) is actually the only one that works. It was reported over a year ago (after I spent way too much time figuring it out), so it might get fixed, but I'd start with those APIs.

(And it's probably not safe to actually grow a table, since you can't just append, you have to sort the table yourself. But you can provide the entire table and then never grow it. It's just that this is the only API that actually tells anyone that you added a table.)

Eclips4 added topic-JIT interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 16, 2024

pablogsal removed the type-feature A feature request or enhancement label Nov 16, 2024

pablogsal assigned pablogsal and brandtbucher Nov 16, 2024

brandtbucher added the 3.14 new features, bugs and security fixes label Nov 16, 2024

python deleted a comment from maleycl Nov 16, 2024

pablogsal changed the title ~~Unwinding support for the JIT compiler~~ Supporting stack unwinding in the JIT compiler Nov 19, 2024

brandtbucher mentioned this issue Nov 21, 2024: Reserved frame pointers are broken on x86 llvm/llvm-project#117178 (Open)

github-actions bot mentioned this issue Dec 1, 2024: Monthly issue metrics report hugovk/test#88 (Closed)
