Skip to content

Conversation

ghost
Copy link

@ghost ghost commented Jun 11, 2025

Because it frees up a register it can have performance implications on the program and because we're optimizing for speed and not debugging, I think it's a no-brainer to make this true.
In general, I expect performance with this change to be strictly better or unchanged, both at the cost of it being more difficult to obtain a stack trace but that's not the goal of ReleaseFast.

How can performance be possibly worse if you have an additional register unless the code generator's register allocator has some bug where that somehow makes it perform worse? Can't really imagine that.

Wanted to include some kind of benchmark here but from the benchmarking I was able to do on the compiler, there doesn't seem to be a difference for zig ast-check with or without the frame pointer (anything beyond -Ddev=ast_gen takes too long to build with ReleaseFast...)

Because it frees up a register it can have performance implications on
the program and because we're optimizing for speed and not debugging,
I think it's a no-brainer to make this true.
In general, I expect performance with this change to be strictly better or unchanged,
both at the cost of it being more difficult to obtain a stack trace but
that's not the goal of ReleaseFast.
red_zone: ?bool = null,
/// Whether to omit the stack frame pointer. Frees up a register and makes it
/// Whether to omit the stack frame pointer. Frees up a register but makes it
/// more difficult to obtain stack traces. Has target-dependent effects.
Copy link
Author

@ghost ghost Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any "target-dependent effects" that can possibly be bad for performance? In that case it might make sense to leave this up to the user after all, or decide based on the target.

Copy link
Contributor

@RetroDev256 RetroDev256 Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omitting the frame pointer on i386 and riscv64 appear to result in worse performance:

image
image

Here's the test code I wrote generated:
mandelbrot.zig.txt

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Well I really seem to get absolutely no difference on x86_64:

$ zig build-exe mandelbrot.zig -fno-omit-frame-pointer -OReleaseFast --name keep
$ zig build-exe mandelbrot.zig -fomit-frame-pointer -OReleaseFast --name omit
$ poop ./omit ./keep
Benchmark 1 (4 runs): ./omit
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.49s  ± 53.8ms    1.45s  … 1.57s           0 ( 0%)        0%
  peak_rss            319KB ±    0       319KB …  319KB          0 ( 0%)        0%
  cpu_cycles         3.49G  ± 6.55M     3.49G  … 3.50G           0 ( 0%)        0%
  instructions       5.94G  ± 77.0      5.94G  … 5.94G           0 ( 0%)        0%
  cache_references   55.8K  ± 45.2K     30.4K  …  124K           0 ( 0%)        0%
  cache_misses       46.9K  ± 43.4K     24.4K  …  112K           0 ( 0%)        0%
  branch_misses      56.2M  ± 43.3K     56.2M  … 56.3M           0 ( 0%)        0%
Benchmark 2 (4 runs): ./keep
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.45s  ± 48.7ms    1.40s  … 1.52s           0 ( 0%)          -  2.5% ±  6.0%
  peak_rss            319KB ±    0       319KB …  319KB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         3.52G  ± 5.95M     3.51G  … 3.52G           0 ( 0%)          +  0.7% ±  0.3%
  instructions       5.94G  ±  115      5.94G  … 5.94G           1 (25%)          +  0.0% ±  0.0%
  cache_references   72.1K  ± 32.4K     28.7K  …  107K           0 ( 0%)          + 29.3% ± 122.1%
  cache_misses       58.8K  ± 24.9K     25.1K  … 84.9K           0 ( 0%)          + 25.3% ± 130.5%
  branch_misses      56.4M  ± 66.4K     56.4M  … 56.5M           0 ( 0%)          +  0.4% ±  0.2%
$ ../zig/poop ./keep ./omit
Benchmark 1 (4 runs): ./keep
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.48s  ± 71.7ms    1.39s  … 1.57s           0 ( 0%)        0%
  peak_rss            319KB ±    0       319KB …  319KB          0 ( 0%)        0%
  cpu_cycles         3.51G  ± 3.47M     3.51G  … 3.51G           0 ( 0%)        0%
  instructions       5.94G  ± 59.2      5.94G  … 5.94G           0 ( 0%)        0%
  cache_references   37.0K  ± 6.45K     31.1K  … 44.5K           0 ( 0%)        0%
  cache_misses       27.6K  ± 4.20K     22.1K  … 32.3K           0 ( 0%)        0%
  branch_misses      56.3M  ± 99.0K     56.2M  … 56.4M           0 ( 0%)        0%
Benchmark 2 (4 runs): ./omit
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.39s  ± 41.1ms    1.35s  … 1.44s           0 ( 0%)          -  5.6% ±  6.8%
  peak_rss            319KB ±    0       319KB …  319KB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         3.49G  ± 1.53M     3.49G  … 3.49G           0 ( 0%)          -  0.5% ±  0.1%
  instructions       5.94G  ± 39.1      5.94G  … 5.94G           0 ( 0%)          -  0.0% ±  0.0%
  cache_references   33.4K  ± 3.26K     28.8K  … 36.5K           1 (25%)          -  9.7% ± 23.9%
  cache_misses       26.3K  ± 2.85K     23.3K  … 30.2K           0 ( 0%)          -  4.8% ± 22.5%
  branch_misses      56.2M  ± 42.9K     56.2M  … 56.3M           0 ( 0%)          -  0.2% ±  0.2%

Also are you sure that's worse? Is it still worse if you run multiple times and swap the order of the commands?

Copy link
Contributor

@RetroDev256 RetroDev256 Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test x86_64, I just tested x86 (32 bit) there. I'm pretty confident in my benchmark of x86, but I could test further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. And really doesn't make sense.
Does that mean they should in fact be included only for those targets by default?
From what I'm reading though, especially on 32-bit x86 it was popular to omit the frame pointer because it has far less registers. It was faster to omit.
On x86_64 though there are enough registers that it really doesn't seem to have any impact.

@190n
Copy link
Contributor

190n commented Jun 11, 2025

See #22180 which has some reasoning from when frame pointers were added by default in ReleaseFast and ReleaseSafe

@candrewlee14
Copy link
Contributor

See https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html

As I've seen frame pointers help find performance wins ranging from 5% to 500%, the typical "less than 1%" cost (or even 1% or 2% cost) is easily justified. But I'd rather the cost be zero, of course! We may get there with future technologies I'll cover later. In the meantime, frame pointers are the most practical way to find performance wins today.

@ghost
Copy link
Author

ghost commented Jun 11, 2025

Ah, well, no matter how small the overhead, there's an overhead and ReleaseFast is a kind of all-or-nothing option where you either tune everything specifically for speed or you don't. You don't have -O1, -O2, or -O3 or whatever to make it "just a bit fast" "medium level fast" or whatever. It's an all or nothing. That's what the option is. So for what the option is it only makes sense to omit frame pointers.

If you want to do "debugging and profiling work" either use debug mode or enable frame pointers on your own if you need them anyway.

In the case of that article, it seems like looking at flame graphs would be difficult if frame pointers were disabled by default in ReleaseFast. So how about how enable them if you do benchmarking with flame graphs? A lot of Zig users seem to use tools like poop to benchmark where you don't need a frame pointer. And while I have no idea how much flame graphs are used in the Zig community I at least don't remember ever seeing anyone use them.

Maybe if something like flamegraph (the tool written in Rust) were written in Zig, it could integrate with the build system and turn on frame pointers automatically.
I don't know. Doesn't really matter though because what I already said above.

@ghost
Copy link
Author

ghost commented Jun 11, 2025

Anyone here that actually uses flame graphs or some profiling tool that would break if frame pointers were omitted in ReleaseFast builds by default? And what would inconvenience you so much to enable them on your own when you do profiling?

@190n
Copy link
Contributor

190n commented Jun 11, 2025

Regarding flamegraphs, my opinion is simply that software that is released should support being profiled. For that to happen, either the toolchain needs to not omit frame pointers by default, or every distribution and every project needs to override the default. I shouldn't have to go recompile everything. This is particularly important for whole-system profiling where you are looking at every program on the system and hence they would all have to have frame pointers.

Frame pointers are also very useful in debugging (yes, sometimes you have to debug a release binary) and crash reporting. Suppose I release a Zig program which has a crash. If frame pointers are omitted by default, then the default UX is:

  • someone runs into the crash
  • they send me an unhelpful crash report
  • I have to ask them to recompile with different options, or send them a new binary, and try to repro the crash again

This is far more disruptive than the program being slightly slower.

@190n
Copy link
Contributor

190n commented Jun 11, 2025

A lot of Zig users seem to use tools like poop to benchmark where you don't need a frame pointer. And while I have no idea how much flame graphs are used in the Zig community I at least don't remember ever seeing anyone use them.

They're for different purposes. Poop, hyperfine, and similar tools are good at measuring outcomes: once I've made some change, they can tell me it is X% faster or slower. Flame graphs are a tool for investigating: they will show you which parts of a program are taking a long time to run and therefore which parts it might be profitable to try optimizing.

@ghost
Copy link
Author

ghost commented Jun 11, 2025

Yeah, I can't really see this PR being merged anyway so no point. I was just going by what the ReleaseFast option is supposed to be (an all-or-nothing).

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants