Add type-erased writer and GenericWriter #17634

Merged: 3 commits, Feb 8, 2024

ianprime0509
Contributor

@ianprime0509 ianprime0509 commented Oct 20, 2023

This is a companion to #17344 to apply the same change to the std.io.Writer interface.


Latest benchmarks here: #17634 (comment). Old benchmarks follow:

Benchmarks

Building tetris

I tried three different variations of this, shown below in the wrapper script I used. The first variation yields results similar to the compiler benchmark I tried in my comment linked above, while the other two variations do not show as much difference (implying that the main performance hit here might be coming from the build runner and not the actual compiler itself).

#!/bin/sh
cd "$1"
rm -rf zig-cache zig-out
# 1. Build with zig build:
# "$2"/build/stage3/bin/zig build
# 2. Compile directly with zig build-exe:
# "$2"/build/stage3/bin/zig build-exe src/main.zig -lc -lglfw -lepoxy
# 3. Same as above, but do not emit anything:
# "$2"/build/stage3/bin/zig build-exe src/main.zig -lc -lglfw -lepoxy -fno-emit-bin
Benchmark 1 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.50s  ± 27.1ms    5.48s  … 5.53s           0 ( 0%)        0%
  peak_rss            380MB ±  103KB     380MB …  380MB          0 ( 0%)        0%
  cpu_cycles         20.1G  ± 79.4M     20.0G  … 20.2G           0 ( 0%)        0%
  instructions       26.5G  ± 7.56M     26.5G  … 26.5G           0 ( 0%)        0%
  cache_references   2.08G  ± 4.26M     2.07G  … 2.08G           0 ( 0%)        0%
  cache_misses        490M  ± 1.26M      488M  …  491M           0 ( 0%)        0%
  branch_misses       134M  ±  195K      134M  …  134M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.84s  ± 11.8ms    5.83s  … 5.85s           0 ( 0%)        💩+  6.2% ±  0.9%
  peak_rss            396MB ±  332KB     396MB …  396MB          0 ( 0%)        💩+  4.2% ±  0.1%
  cpu_cycles         21.5G  ± 74.3M     21.5G  … 21.6G           0 ( 0%)        💩+  7.3% ±  0.9%
  instructions       28.2G  ± 5.91M     28.2G  … 28.2G           0 ( 0%)        💩+  6.3% ±  0.1%
  cache_references   2.19G  ± 5.56M     2.18G  … 2.19G           0 ( 0%)        💩+  5.4% ±  0.5%
  cache_misses        513M  ± 1.62M      511M  …  514M           0 ( 0%)        💩+  4.7% ±  0.7%
  branch_misses       142M  ±  216K      142M  …  143M           0 ( 0%)        💩+  6.5% ±  0.3%
zig build
Benchmark 1 (4 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.55s  ± 10.3ms    1.54s  … 1.56s           0 ( 0%)        0%
  peak_rss            237MB ±  196KB     236MB …  237MB          0 ( 0%)        0%
  cpu_cycles         5.39G  ± 11.7M     5.38G  … 5.40G           0 ( 0%)        0%
  instructions       7.68G  ± 1.61M     7.68G  … 7.68G           0 ( 0%)        0%
  cache_references    527M  ± 1.03M      526M  …  528M           0 ( 0%)        0%
  cache_misses        120M  ±  436K      120M  …  121M           0 ( 0%)        0%
  branch_misses      36.4M  ± 51.3K     36.3M  … 36.4M           1 (25%)        0%
Benchmark 2 (4 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.57s  ± 17.6ms    1.55s  … 1.59s           0 ( 0%)          +  0.9% ±  1.6%
  peak_rss            238MB ±  269KB     238MB …  238MB          1 (25%)          +  0.6% ±  0.2%
  cpu_cycles         5.54G  ± 6.51M     5.53G  … 5.54G           0 ( 0%)        💩+  2.6% ±  0.3%
  instructions       7.86G  ± 1.11M     7.85G  … 7.86G           0 ( 0%)        💩+  2.3% ±  0.0%
  cache_references    536M  ±  795K      535M  …  537M           0 ( 0%)        💩+  1.7% ±  0.3%
  cache_misses        122M  ±  152K      122M  …  123M           1 (25%)        💩+  2.0% ±  0.5%
  branch_misses      37.2M  ± 41.4K     37.2M  … 37.3M           0 ( 0%)        💩+  2.3% ±  0.2%
zig build-exe
Benchmark 1 (12 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           449ms ± 17.9ms     429ms …  484ms          0 ( 0%)        0%
  peak_rss            153MB ±  380KB     152MB …  154MB          0 ( 0%)        0%
  cpu_cycles         1.25G  ± 5.55M     1.24G  … 1.26G           1 ( 8%)        0%
  instructions       2.28G  ±  623K     2.28G  … 2.28G           0 ( 0%)        0%
  cache_references    127M  ±  384K      127M  …  128M           2 (17%)        0%
  cache_misses       12.0M  ±  149K     11.7M  … 12.2M           1 ( 8%)        0%
  branch_misses      4.93M  ± 17.4K     4.90M  … 4.96M           0 ( 0%)        0%
Benchmark 2 (11 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           457ms ± 17.2ms     435ms …  499ms          0 ( 0%)          +  1.9% ±  3.4%
  peak_rss            153MB ±  400KB     152MB …  154MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles         1.27G  ± 4.70M     1.26G  … 1.28G           0 ( 0%)        💩+  1.5% ±  0.4%
  instructions       2.33G  ±  463K     2.33G  … 2.33G           0 ( 0%)        💩+  2.3% ±  0.0%
  cache_references    128M  ±  369K      127M  …  128M           0 ( 0%)          +  0.2% ±  0.3%
  cache_misses       12.0M  ±  123K     11.8M  … 12.2M           0 ( 0%)          -  0.0% ±  1.0%
  branch_misses      4.98M  ± 14.5K     4.95M  … 5.00M           0 ( 0%)          +  1.0% ±  0.3%
zig build-exe -fno-emit-bin

Artificial benchmarks

I'd be happy to try out other benchmark ideas here as well. Both of these were compiled with ReleaseFast.

Write 1GB to a heap buffer one byte at a time

In my first attempt at this one, I was just writing 'A' bytes, but there wasn't a noticeable difference there. From what I could tell from the disassembly of that first version, LLVM had optimized both versions to the equivalent of a memset: the "old" version appeared to be an inlined vectorized memset, and the "new" version was literally a call to memset.

const std = @import("std");

pub fn main() !void {
    const buf = try std.heap.page_allocator.alloc(u8, 1 << 30);
    defer std.heap.page_allocator.free(buf);
    var buf_stream = std.io.fixedBufferStream(buf);
    var writer = buf_stream.writer();
    for (0..buf.len) |i| {
        try writer.writeByte(@intCast('A' + (i % 26)));
    }
}
Benchmark 1 (4 runs): ./write_bytes_old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.47s  ±  862us    1.47s  … 1.48s           0 ( 0%)        0%
  peak_rss           1.07GB ±    0      1.07GB … 1.07GB          0 ( 0%)        0%
  cpu_cycles         4.15G  ± 1.82M     4.15G  … 4.15G           0 ( 0%)        0%
  instructions       12.9G  ± 91.4      12.9G  … 12.9G           0 ( 0%)        0%
  cache_references   25.5M  ±  134K     25.3M  … 25.6M           0 ( 0%)        0%
  cache_misses       19.9K  ± 1.13K     18.3K  … 20.6K           1 (25%)        0%
  branch_misses       263K  ± 25.5       263K  …  264K           0 ( 0%)        0%
Benchmark 2 (3 runs): ./write_bytes_new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.68s  ± 17.9ms    1.66s  … 1.69s           0 ( 0%)        💩+ 13.7% ±  1.5%
  peak_rss           1.07GB ±    0      1.07GB … 1.07GB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         5.03G  ±  525K     5.03G  … 5.03G           0 ( 0%)        💩+ 21.1% ±  0.1%
  instructions       15.0G  ±  116      15.0G  … 15.0G           0 ( 0%)        💩+ 16.7% ±  0.0%
  cache_references   25.1M  ± 24.4K     25.1M  … 25.1M           0 ( 0%)          -  1.6% ±  0.8%
  cache_misses       21.9K  ±  328      21.5K  … 22.2K           0 ( 0%)          +  9.6% ±  8.9%
  branch_misses       264K  ± 67.9       264K  …  264K           0 ( 0%)          +  0.2% ±  0.0%
write_bytes

Write a 1MB file one byte at a time

Most of the impact the writer would have is likely dwarfed by the cost of the write syscalls. I'm just including this because it's something I initially thought might yield some useful results.

const std = @import("std");

pub fn main() !void {
    var file = try std.fs.cwd().createFile("tmp", .{});
    defer file.close();
    const writer = file.writer();
    for (0..(1 << 20)) |_| {
        try writer.writeByte('A');
    }
    try file.sync();
}
Benchmark 1 (3 runs): ./write_file_old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.04s  ± 32.1ms    3.00s  … 3.07s           0 ( 0%)        0%
  peak_rss            131KB ±    0       131KB …  131KB          0 ( 0%)        0%
  cpu_cycles         88.8M  ± 1.43M     87.2M  … 90.0M           0 ( 0%)        0%
  instructions       51.4M  ± 9.17      51.4M  … 51.4M           0 ( 0%)        0%
  cache_references   5.05M  ±  442K     4.57M  … 5.44M           0 ( 0%)        0%
  cache_misses        432   ±  115       341   …  561            0 ( 0%)        0%
  branch_misses      2.10M  ± 6.24      2.10M  … 2.10M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./write_file_new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.02s  ± 18.6ms    3.01s  … 3.04s           0 ( 0%)          -  0.5% ±  2.0%
  peak_rss            131KB ±    0       131KB …  131KB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         87.9M  ± 1.54M     86.7M  … 89.6M           0 ( 0%)          -  1.0% ±  3.8%
  instructions       54.5M  ± 7.64      54.5M  … 54.5M           0 ( 0%)        💩+  6.1% ±  0.0%
  cache_references   5.17M  ± 27.8K     5.14M  … 5.20M           0 ( 0%)          +  2.3% ± 14.1%
  cache_misses       2.31K  ± 1.59K      475   … 3.31K           0 ( 0%)          +434.2% ± 591.2%
  branch_misses      2.10M  ± 6.51      2.10M  … 2.10M           0 ( 0%)          +  0.0% ±  0.0%
write_file

@Hejsil
Contributor

Hejsil commented Oct 20, 2023

One optimisation that could be done is for allocPrint. Currently this function will do a generic instantiation of fmt.format twice. Once with a counting writer, another with a fixed buffer writer.

Type erasing the writer should eliminate one of these instantiations. The pro is hopefully better compile times (the zig compiler calls allocPrint quite a bit). The con is possibly worse runtime performance.
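To make the two-pass idea concrete, here is a minimal, self-contained sketch. It is not the std.fmt or std.io code: `Sink`, `Counter`, `Filler`, `greet` and `allocGreeting` are hypothetical names invented for illustration, and `greet` merely stands in for a formatting routine. Because `greet` takes one concrete type-erased sink instead of `anytype`, the same compiled function serves both the counting pass and the writing pass.

```zig
const std = @import("std");

/// Hypothetical, minimal type-erased sink; not the std.io API.
const Sink = struct {
    context: *anyopaque,
    writeFn: *const fn (context: *anyopaque, bytes: []const u8) void,

    fn write(self: Sink, bytes: []const u8) void {
        self.writeFn(self.context, bytes);
    }
};

/// First pass: only counts the bytes that would be written.
const Counter = struct {
    count: usize = 0,

    fn erasedWrite(context: *anyopaque, bytes: []const u8) void {
        const self: *Counter = @ptrCast(@alignCast(context));
        self.count += bytes.len;
    }

    fn sink(self: *Counter) Sink {
        return .{ .context = self, .writeFn = erasedWrite };
    }
};

/// Second pass: copies into a buffer already sized by the first pass.
const Filler = struct {
    buf: []u8,
    pos: usize = 0,

    fn erasedWrite(context: *anyopaque, bytes: []const u8) void {
        const self: *Filler = @ptrCast(@alignCast(context));
        @memcpy(self.buf[self.pos..][0..bytes.len], bytes);
        self.pos += bytes.len;
    }

    fn sink(self: *Filler) Sink {
        return .{ .context = self, .writeFn = erasedWrite };
    }
};

/// Stand-in for a formatting routine. It takes a concrete Sink rather than
/// `anytype`, so it is compiled once regardless of the destination.
fn greet(out: Sink, name: []const u8) void {
    out.write("hello, ");
    out.write(name);
    out.write("!");
}

/// allocPrint-style helper: count, allocate exactly, then write.
fn allocGreeting(allocator: std.mem.Allocator, name: []const u8) ![]u8 {
    var counter = Counter{};
    greet(counter.sink(), name);

    const buf = try allocator.alloc(u8, counter.count);
    var filler = Filler{ .buf = buf };
    greet(filler.sink(), name);
    return buf;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const s = try allocGreeting(allocator, "zig");
    defer allocator.free(s);
    std.debug.print("{s}\n", .{s});
}
```

The trade-off is visible in `Sink.write`: every write goes through a function pointer, which is where the possible runtime cost comes from.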

@ianprime0509
Contributor Author

Thanks, that's a great idea! To get an idea of the benefit of such a change, I applied the following patch (#17458 is relevant here: the any function being public is a footgun, but I'm not sure adding anyReader and anyWriter functions everywhere is the correct solution either, which is why this patch is not part of the PR yet):

diff --git a/lib/std/fmt.zig b/lib/std/fmt.zig
index 611737161..31ce66b7c 100644
--- a/lib/std/fmt.zig
+++ b/lib/std/fmt.zig
@@ -1993,7 +1993,10 @@ pub const BufPrintError = error{
 /// Returns a slice of the bytes printed to.
 pub fn bufPrint(buf: []u8, comptime fmt: []const u8, args: anytype) BufPrintError![]u8 {
     var fbs = std.io.fixedBufferStream(buf);
-    try format(fbs.writer(), fmt, args);
+    format(fbs.writer().any(), fmt, args) catch |e| switch (e) {
+        error.NoSpaceLeft => return error.NoSpaceLeft,
+        else => unreachable,
+    };
     return fbs.getWritten();
 }
 
@@ -2005,7 +2008,7 @@ pub fn bufPrintZ(buf: []u8, comptime fmt: []const u8, args: anytype) BufPrintErr
 /// Count the characters needed for format. Useful for preallocating memory
 pub fn count(comptime fmt: []const u8, args: anytype) u64 {
     var counting_writer = std.io.countingWriter(std.io.null_writer);
-    format(counting_writer.writer(), fmt, args) catch |err| switch (err) {};
+    format(counting_writer.writer().any(), fmt, args) catch unreachable;
     return counting_writer.bytes_written;
 }
 

The results are certainly positive: it seems like this patch is able to make up for much of the compilation performance regression noted in the PR. Some benchmark results:

Benchmark 1 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.77s  ±  128ms    5.62s  … 5.85s           0 ( 0%)        0%
  peak_rss            381MB ±  155KB     381MB …  382MB          0 ( 0%)        0%
  cpu_cycles         20.8G  ±  468M     20.3G  … 21.2G           0 ( 0%)        0%
  instructions       26.6G  ± 1.39M     26.6G  … 26.6G           0 ( 0%)        0%
  cache_references   2.07G  ± 2.93M     2.07G  … 2.08G           0 ( 0%)        0%
  cache_misses        489M  ± 1.90M      488M  …  491M           0 ( 0%)        0%
  branch_misses       134M  ±  309K      134M  …  135M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer-old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.07s  ± 18.2ms    6.05s  … 6.09s           0 ( 0%)        💩+  5.3% ±  3.6%
  peak_rss            397MB ± 68.9KB     397MB …  397MB          0 ( 0%)        💩+  4.0% ±  0.1%
  cpu_cycles         22.1G  ± 65.3M     22.0G  … 22.1G           0 ( 0%)        💩+  6.1% ±  3.6%
  instructions       28.3G  ± 7.90M     28.3G  … 28.3G           0 ( 0%)        💩+  6.3% ±  0.0%
  cache_references   2.19G  ±  979K     2.19G  … 2.19G           0 ( 0%)        💩+  5.5% ±  0.2%
  cache_misses        517M  ± 1.41M      516M  …  519M           0 ( 0%)        💩+  5.7% ±  0.8%
  branch_misses       144M  ±  217K      143M  …  144M           0 ( 0%)        💩+  7.0% ±  0.5%
tetris zig build - master vs unpatched PR
Benchmark 1 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.75s  ± 82.7ms    5.66s  … 5.83s           0 ( 0%)        0%
  peak_rss            381MB ±  373KB     381MB …  382MB          0 ( 0%)        0%
  cpu_cycles         20.8G  ±  358M     20.3G  … 21.0G           0 ( 0%)        0%
  instructions       26.6G  ± 1.13M     26.6G  … 26.6G           0 ( 0%)        0%
  cache_references   2.07G  ± 2.81M     2.07G  … 2.07G           0 ( 0%)        0%
  cache_misses        488M  ±  871K      488M  …  489M           0 ( 0%)        0%
  branch_misses       134M  ±  228K      134M  …  134M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.71s  ± 41.2ms    5.68s  … 5.76s           0 ( 0%)          -  0.8% ±  2.6%
  peak_rss            381MB ±  324KB     381MB …  382MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles         20.5G  ±  215M     20.3G  … 20.7G           0 ( 0%)          -  1.3% ±  3.2%
  instructions       26.9G  ± 2.09M     26.9G  … 26.9G           0 ( 0%)          +  1.0% ±  0.0%
  cache_references   2.07G  ± 1.28M     2.07G  … 2.07G           0 ( 0%)          +  0.0% ±  0.2%
  cache_misses        487M  ± 1.83M      485M  …  489M           0 ( 0%)          -  0.3% ±  0.7%
  branch_misses       135M  ±  341K      135M  …  136M           0 ( 0%)          +  1.0% ±  0.5%
tetris zig build - master vs patched PR
Benchmark 1 (3 runs): ./bench-compiler /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          90.9s  ±  924ms    89.8s  … 91.5s           0 ( 0%)        0%
  peak_rss           5.30GB ± 4.86MB    5.29GB … 5.30GB          0 ( 0%)        0%
  cpu_cycles          358G  ± 4.69G      352G  …  362G           0 ( 0%)        0%
  instructions        442G  ± 5.78G      438G  …  448G           0 ( 0%)        0%
  cache_references   33.8G  ±  337M     33.6G  … 34.2G           0 ( 0%)        0%
  cache_misses       8.13G  ± 91.2M     8.07G  … 8.24G           0 ( 0%)        0%
  branch_misses      2.18G  ± 33.5M     2.16G  … 2.22G           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench-compiler /var/home/ian/src/zig-worktrees/type-erased-writer-old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          95.2s  ±  817ms    94.7s  … 96.2s           0 ( 0%)        💩+  4.8% ±  2.2%
  peak_rss           5.57GB ± 1.08MB    5.56GB … 5.57GB          0 ( 0%)        💩+  5.1% ±  0.2%
  cpu_cycles          379G  ± 2.76G      377G  …  382G           0 ( 0%)        💩+  6.0% ±  2.4%
  instructions        473G  ± 2.09G      471G  …  475G           0 ( 0%)        💩+  7.0% ±  2.2%
  cache_references   35.7G  ± 58.5M     35.7G  … 35.8G           0 ( 0%)        💩+  5.8% ±  1.6%
  cache_misses       8.57G  ± 12.3M     8.56G  … 8.59G           0 ( 0%)        💩+  5.4% ±  1.8%
  branch_misses      2.36G  ± 13.0M     2.35G  … 2.37G           0 ( 0%)        💩+  8.0% ±  2.6%
stage4 compiler build - master vs unpatched PR
Benchmark 1 (3 runs): ./bench-compiler /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.9s  ±  843ms    88.9s  … 90.4s           0 ( 0%)        0%
  peak_rss           5.29GB ±  899KB    5.29GB … 5.29GB          0 ( 0%)        0%
  cpu_cycles          353G  ± 3.09G      349G  …  355G           0 ( 0%)        0%
  instructions        438G  ± 71.7M      438G  …  439G           0 ( 0%)        0%
  cache_references   33.7G  ± 78.5M     33.6G  … 33.8G           0 ( 0%)        0%
  cache_misses       8.10G  ± 27.5M     8.08G  … 8.13G           0 ( 0%)        0%
  branch_misses      2.17G  ± 8.18M     2.16G  … 2.18G           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench-compiler /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          92.0s  ± 1.52s     90.4s  … 93.5s           0 ( 0%)          +  2.3% ±  3.1%
  peak_rss           5.32GB ±  256KB    5.32GB … 5.32GB          0 ( 0%)          +  0.6% ±  0.0%
  cpu_cycles          358G  ± 1.98G      356G  …  360G           0 ( 0%)          +  1.5% ±  1.7%
  instructions        446G  ±  147M      445G  …  446G           0 ( 0%)        💩+  1.6% ±  0.1%
  cache_references   33.7G  ± 61.6M     33.7G  … 33.8G           0 ( 0%)          +  0.2% ±  0.5%
  cache_misses       8.11G  ± 37.2M     8.06G  … 8.13G           0 ( 0%)          +  0.1% ±  0.9%
  branch_misses      2.21G  ± 6.77M     2.21G  … 2.22G           0 ( 0%)        💩+  2.2% ±  0.8%
stage4 compiler build - master vs patched PR

What I'd really like to do is erase the writer type for format in general, rather than just doing it for these two functions. The issue there is specifying an error set while maintaining support for user-defined format functions. An initial solution might look like this:

pub fn format(writer: anytype, comptime fmt: []const u8, args: anytype) @TypeOf(writer).Error!void {
    return @errorCast(formatImpl(writer.any(), fmt, args));
}

fn formatImpl(writer: io.AnyWriter, comptime fmt: []const u8, args: anytype) anyerror!void {
    // Original format implementation goes here
}

But with this solution, it is safety-checked illegal behavior for any user-defined format function to return any error outside the error set of the writer, which is a regression from the current behavior of allowing this (and one which would not be caught at compile time). I personally don't like this, and feel like it would be more worthwhile to invest time in a better version of std.fmt designed to minimize code/compile bloat not only from the writer, but from the format string as well (inspiration: #7948 (comment)).

@andrewrk
Member

andrewrk commented Oct 24, 2023

But with this solution, it is safety-checked illegal behavior for any user-defined format function to return any error outside the error set of the writer, which is a regression from the current behavior of allowing this

Maybe I'm missing something obvious but couldn't you make the format function continue to have an inferred error set like status quo?

Edit: oh, I see the problem now - the inferred error set is anyerror so the caller of format would start seeing anyerror instead of an inferred error set.

@andrewrk
Member

I bet the C backend would benefit from a type-erased writer too because there are a lot of functions like this:

zig/src/codegen/c.zig

Lines 1771 to 1779 in 22a6a5d

fn renderTypeAndName(
    dg: *DeclGen,
    w: anytype,
    ty: Type,
    name: CValue,
    qualifiers: CQualifiers,
    alignment: Alignment,
    kind: CType.Kind,
) error{ OutOfMemory, AnalysisFail }!void {

And they are currently being instantiated 2x.

@ianprime0509
Contributor Author

ianprime0509 commented Oct 28, 2023

After thinking for a while about how to integrate a type-erased writer into the C backend without making the error handling unmanageable (anyerror is infectious when used with try), I decided to try an ErrorOnlyGenericWriter which wraps an AnyWriter with a predefined error set. This acts as a partial GenericWriter where only the error set is generic, and works well in the case of the C backend because both writer types being used before had the same error set.

Types of writers passed to renderTypeAndName before:

@as(type, io.writer.Writer(*array_list.ArrayListAligned(u8,null),error{OutOfMemory},(function 'appendWrite')))
@as(type, io.writer.Writer(*codegen.c.IndentWriter(io.writer.Writer(*array_list.ArrayListAligned(u8,null),error{OutOfMemory},(function 'appendWrite'))),error{OutOfMemory},(function 'write')))

After:

@as(type, io.GenericWriter(io.Writer,error{OutOfMemory},(function 'write')))
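As a rough illustration of the "only the error set is generic" idea, here is a self-contained sketch. It is not the code from this PR: `ErasedWriter`, `ErrorOnlyWriter`, `FixedBuffer` and `render` are hypothetical names, and the sketch assumes a Zig version that provides `@errorCast` for error unions, as used in the `format` snippet earlier in this thread. The wrapper narrows `anyerror` back to a declared error set, so callers can keep using `try` with a precise error set.

```zig
const std = @import("std");

/// Hypothetical type-erased writer: any error may come back from it.
const ErasedWriter = struct {
    context: *anyopaque,
    writeFn: *const fn (context: *anyopaque, bytes: []const u8) anyerror!usize,

    fn write(self: ErasedWriter, bytes: []const u8) anyerror!usize {
        return self.writeFn(self.context, bytes);
    }
};

/// Hypothetical wrapper that is generic over the error set only.
fn ErrorOnlyWriter(comptime Error: type) type {
    return struct {
        inner: ErasedWriter,

        const Self = @This();

        fn write(self: Self, bytes: []const u8) Error!usize {
            // Safety-checked: the erased writer must only return errors
            // that are members of `Error`.
            return @errorCast(self.inner.write(bytes));
        }
    };
}

/// Illustrative destination with a small, known error set.
const FixedBuffer = struct {
    buf: []u8,
    len: usize = 0,

    fn erasedWrite(context: *anyopaque, bytes: []const u8) anyerror!usize {
        const self: *FixedBuffer = @ptrCast(@alignCast(context));
        if (bytes.len > self.buf.len - self.len) return error.NoSpaceLeft;
        @memcpy(self.buf[self.len..][0..bytes.len], bytes);
        self.len += bytes.len;
        return bytes.len;
    }

    fn writer(self: *FixedBuffer) ErrorOnlyWriter(error{NoSpaceLeft}) {
        return .{ .inner = .{ .context = self, .writeFn = erasedWrite } };
    }
};

/// Not `anytype`: instantiated once, and the error set stays explicit.
fn render(w: ErrorOnlyWriter(error{NoSpaceLeft}), name: []const u8) error{NoSpaceLeft}!void {
    _ = try w.write("int ");
    _ = try w.write(name);
}

pub fn main() !void {
    var storage: [32]u8 = undefined;
    var fb = FixedBuffer{ .buf = &storage };
    try render(fb.writer(), "x");
    std.debug.print("{s}\n", .{storage[0..fb.len]});
}
```

This only works cleanly when every underlying writer shares the same error set, which matches the C backend case described above.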

Compared to current master, this does appear to have some benefit (this benchmark is for an only-c compiler build):

Benchmark 1 (3 runs): ./bench-compiler /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          17.2s  ± 12.1ms    17.2s  … 17.3s           0 ( 0%)        0%
  peak_rss            670MB ±  918KB     669MB …  671MB          0 ( 0%)        0%
  cpu_cycles         72.6G  ±  188M     72.4G  … 72.8G           0 ( 0%)        0%
  instructions        115G  ± 6.39M      115G  …  115G           0 ( 0%)        0%
  cache_references   5.65G  ± 8.46M     5.65G  … 5.66G           0 ( 0%)        0%
  cache_misses       1.08G  ± 5.93M     1.08G  … 1.09G           0 ( 0%)        0%
  branch_misses       402M  ±  545K      401M  …  402M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench-compiler /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          17.3s  ±  263ms    17.1s  … 17.6s           0 ( 0%)          +  0.6% ±  2.4%
  peak_rss            649MB ±  309KB     648MB …  649MB          0 ( 0%)        ⚡-  3.2% ±  0.2%
  cpu_cycles         72.4G  ±  792M     71.6G  … 73.1G           0 ( 0%)          -  0.3% ±  1.8%
  instructions        114G  ± 4.32M      114G  …  114G           0 ( 0%)        ⚡-  1.1% ±  0.0%
  cache_references   5.42G  ± 4.49M     5.41G  … 5.42G           0 ( 0%)        ⚡-  4.2% ±  0.3%
  cache_misses       1.05G  ± 5.64M     1.05G  … 1.06G           0 ( 0%)        ⚡-  2.8% ±  1.2%
  branch_misses       394M  ± 1.25M      393M  …  395M           0 ( 0%)        ⚡-  1.9% ±  0.5%

I also added the std.fmt changes to reduce generic instantiations with allocPrint to the PR after all, since the benefits of doing so are nice as a partial fix on the way to a more efficient std.fmt overall.


Unfortunately, one of the std.json tests for comptime stringify calls has broken after I rebased this branch (all tests were passing before):

/var/home/ian/src/zig-worktrees/type-erased-writer/lib/std/testing.zig:29:9: error: 
                                                                                    ====== expected this output: =========
                                                                                    
        @compileError(std.fmt.comptimePrint(fmt, args));
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/home/ian/src/zig-worktrees/type-erased-writer/lib/std/testing.zig:629:14: note: called from here
        print("\n====== expected this output: =========\n", .{});
        ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/home/ian/src/zig-worktrees/type-erased-writer/lib/std/json/stringify_test.zig:366:35: note: called from here
    try testing.expectEqualStrings(expected, got);
        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/var/home/ian/src/zig-worktrees/type-erased-writer/lib/std/json/stringify_test.zig:399:35: note: called from here
    comptime testStringifyMaxDepth("[{\"foo\":42},{\"foo\":100},{\"foo\":1000}]", [_]MyStruct{
             ~~~~~~~~~~~~~~~~~~~~~^

This is the only failing test; if I comment it out, zig build test -Dskip-non-native -Dskip-release runs successfully. I'll try to reduce this further and figure out what the underlying issue is.

Edit: bisected to 405ba26 (applying the first commit in this PR at each step), though this doesn't particularly help me reduce the issue yet.

@andrewrk
Member

I have a strong suspicion that a type erased writer is a good idea and we just need to find the handful of problematic places where some code is reading 1 byte at a time through a function pointer.

@ianprime0509
Contributor Author

I agree. Based on the benchmarks I did in the PR description, the bulk of the performance hit from the initial change was in the build runner, and invoking zig build-exe directly circumvented most of that, so that's where I'm planning to start looking as soon as I can figure out what's gone wrong with the std.json comptime stringify test.

@ianprime0509
Contributor Author

I rebased on master to resolve conflicts with the latest changes to Writer.writeInt.

Current benchmark results (including all commits in this PR so far) don't seem to show a significant performance difference vs master:

Benchmark 1 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.59s  ± 37.9ms    5.55s  … 5.63s           0 ( 0%)        0%
  peak_rss            368MB ±  511KB     368MB …  369MB          0 ( 0%)        0%
  cpu_cycles         19.9G  ±  111M     19.8G  … 20.0G           0 ( 0%)        0%
  instructions       26.2G  ± 4.89M     26.2G  … 26.2G           0 ( 0%)        0%
  cache_references   2.01G  ± 2.89M     2.01G  … 2.01G           0 ( 0%)        0%
  cache_misses        480M  ±  916K      479M  …  481M           0 ( 0%)        0%
  branch_misses       132M  ±  181K      132M  …  132M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench /var/home/ian/src/tetris /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          5.56s  ± 51.3ms    5.52s  … 5.62s           0 ( 0%)          -  0.4% ±  1.8%
  peak_rss            367MB ±  289KB     367MB …  367MB          0 ( 0%)          -  0.3% ±  0.3%
  cpu_cycles         20.1G  ± 93.3M     20.0G  … 20.2G           0 ( 0%)          +  0.7% ±  1.2%
  instructions       26.4G  ± 1.28M     26.4G  … 26.4G           0 ( 0%)          +  0.7% ±  0.0%
  cache_references   2.00G  ± 7.50M     1.99G  … 2.01G           0 ( 0%)          -  0.5% ±  0.6%
  cache_misses        475M  ±  880K      474M  …  476M           0 ( 0%)          -  1.2% ±  0.4%
  branch_misses       132M  ±  336K      132M  …  133M           0 ( 0%)          +  0.1% ±  0.5%
Tetris zig build
Benchmark 1 (3 runs): ./bench-compiler /var/home/ian/src/zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.4s  ± 1.37s     90.4s  … 92.9s           0 ( 0%)        0%
  peak_rss           5.23GB ± 3.53MB    5.23GB … 5.24GB          0 ( 0%)        0%
  cpu_cycles          360G  ± 7.76G      355G  …  369G           0 ( 0%)        0%
  instructions        444G  ± 6.05G      440G  …  451G           0 ( 0%)        0%
  cache_references   33.6G  ±  172M     33.4G  … 33.8G           0 ( 0%)        0%
  cache_misses       8.11G  ± 20.2M     8.09G  … 8.13G           0 ( 0%)        0%
  branch_misses      2.20G  ± 29.2M     2.18G  … 2.23G           0 ( 0%)        0%
Benchmark 2 (3 runs): ./bench-compiler /var/home/ian/src/zig-worktrees/type-erased-writer
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          92.3s  ± 3.89s     88.9s  … 96.5s           0 ( 0%)          +  1.0% ±  7.2%
  peak_rss           5.22GB ±  753KB    5.22GB … 5.22GB          0 ( 0%)          -  0.2% ±  0.1%
  cpu_cycles          358G  ± 8.13G      349G  …  366G           0 ( 0%)          -  0.6% ±  5.0%
  instructions        441G  ± 1.66G      440G  …  443G           0 ( 0%)          -  0.5% ±  2.3%
  cache_references   33.5G  ±  211M     33.3G  … 33.7G           0 ( 0%)          -  0.3% ±  1.3%
  cache_misses       8.15G  ± 55.3M     8.10G  … 8.21G           0 ( 0%)          +  0.5% ±  1.2%
  branch_misses      2.20G  ± 17.1M     2.19G  … 2.22G           0 ( 0%)          +  0.1% ±  2.5%
Compiler zig build

I tried using callgrind to look for performance hotspots related to writer usage, to see if there was anything being done inefficiently with writers when performing zig build. I didn't notice anything particularly out of the ordinary that I thought could be fixed in a natural way; the current std.fmt implementation naturally results in a decent number of calls to write, which I'm not sure is inherently bad, because they're writing to a buffer or in-memory list, such as this:

zig/lib/std/Build/Cache.zig

Lines 869 to 876 in f24ceec

try writer.print("{d} {d} {d} {} {d} {s}\n", .{
    file.stat.size,
    file.stat.inode,
    file.stat.mtime,
    fmt.fmtSliceHexLower(&file.bin_digest),
    file.prefixed_path.?.prefix,
    file.prefixed_path.?.sub_path,
});

This call to writer.print can lead to quite a few write calls under the hood: one for each formatted number and string, one for each substring between format specifiers, and one per digest byte for fmtSliceHexLower (n calls producing 2n hex characters, where n is the length of the digest in bytes):

zig/lib/std/fmt.zig

Lines 818 to 839 in f24ceec

fn formatSliceHexImpl(comptime case: Case) type {
    const charset = "0123456789" ++ if (case == .upper) "ABCDEF" else "abcdef";
    return struct {
        pub fn formatSliceHexImpl(
            bytes: []const u8,
            comptime fmt: []const u8,
            options: std.fmt.FormatOptions,
            writer: anytype,
        ) !void {
            _ = fmt;
            _ = options;
            var buf: [2]u8 = undefined;
            for (bytes) |c| {
                buf[0] = charset[c >> 4];
                buf[1] = charset[c & 15];
                try writer.writeAll(&buf);
            }
        }
    };
}
Again, though, I'm not sure this in itself is a bad thing, since it's still writing in memory in this case.
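To put a number on the call count, the following self-contained sketch reproduces the loop shape of formatSliceHexImpl against a sink that only counts calls. `CallCounter`, `hexPerByte` and `hexChunked` are hypothetical names for illustration, and the chunked variant is just one possible way to cut the number of calls, not something proposed in this PR.

```zig
const std = @import("std");

/// Hypothetical sink that records how often it is called.
const CallCounter = struct {
    calls: usize = 0,
    bytes: usize = 0,

    fn writeAll(self: *CallCounter, data: []const u8) void {
        self.calls += 1;
        self.bytes += data.len;
    }
};

const charset = "0123456789abcdef";

/// Same loop shape as formatSliceHexImpl: one writeAll per input byte.
fn hexPerByte(w: *CallCounter, input: []const u8) void {
    var buf: [2]u8 = undefined;
    for (input) |c| {
        buf[0] = charset[c >> 4];
        buf[1] = charset[c & 15];
        w.writeAll(&buf);
    }
}

/// Hypothetical variant: stage hex digits in a larger stack buffer and
/// flush only when it is full or the input is exhausted.
fn hexChunked(w: *CallCounter, input: []const u8) void {
    var buf: [64]u8 = undefined;
    var used: usize = 0;
    for (input) |c| {
        if (used == buf.len) {
            w.writeAll(buf[0..used]);
            used = 0;
        }
        buf[used] = charset[c >> 4];
        buf[used + 1] = charset[c & 15];
        used += 2;
    }
    if (used > 0) w.writeAll(buf[0..used]);
}

pub fn main() void {
    const digest = [_]u8{0xab} ** 16; // stand-in for a 16-byte digest

    var per_byte = CallCounter{};
    hexPerByte(&per_byte, &digest);

    var chunked = CallCounter{};
    hexChunked(&chunked, &digest);

    std.debug.print("per-byte: {d} calls, {d} bytes\n", .{ per_byte.calls, per_byte.bytes });
    std.debug.print("chunked:  {d} calls, {d} bytes\n", .{ chunked.calls, chunked.bytes });
}
```

Whether fewer calls matter in practice depends on the destination; for an in-memory buffer the difference may be negligible, as noted above.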

@Vexu Vexu mentioned this pull request Nov 5, 2023
@ikskuh
Contributor

ikskuh commented Nov 29, 2023

Current benchmark results (including all commits in this PR so far) don't seem to show a significant performance difference vs master:

Could you also share the code sizes of the generated executables? Code size is an important benchmark metric here as well. You can query it with the size tool; just use the GNU output format (the short one).

It might make sense to always use the generic reader/writer in print() for ReleaseSmall, but use several instantiations in ReleaseFast.

@ianprime0509
Contributor Author

Sorry for the delay. I rebased on the latest master (to resolve a minor conflict) and took some size measurements of the results. The differences look fairly small, although the type-erased writer for some reason seems to lead to slightly larger sizes than the generic version:

Tetris (comparing zig build -Doptimize=Release{Fast,Small}):

      text       data        bss      total filename
    770822     541338      77360    1389520 master-releasefast/bin/tetris
    776342     541510      77360    1395212 pr-releasefast/bin/tetris
    175273     463079      77284     715636 master-releasesmall/bin/tetris
    175289     463187      77284     715760 pr-releasesmall/bin/tetris

Zig (comparing zig build -Doptimize=Release{Fast,Small} -Dno-lib -Denable-llvm):

      text       data        bss      total filename
  12395693    3351171       9861   15756725 master-releasefast/bin/zig
  12429869    3371811       9861   15811541 pr-releasefast/bin/zig
   7370284    3196159       9847   10576290 master-releasesmall/bin/zig
   7420796    3222175       9847   10652818 pr-releasesmall/bin/zig

I also made a very simple implementation of cat for the purpose of attempting to compare symbol sizes: https://gist.github.com/ianprime0509/b74e6681919d4396d5a1127e954cec03 Unfortunately, the many anonymous symbols make it difficult to do a good comparison even with this small example.

I included a comparison of the disassembly for one of the writeAll calls before and after in a comment of the gist linked above; as far as I can tell, the increase in the number of instructions (and hence the size) of the call is due to the extra pointer manipulation required to set up a call to the new non-generic writeAll rather than the old generic writeAll.
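
As a rough illustration of where that extra setup comes from, the following is a minimal sketch of a type-erased writer. It is not the `std.io` implementation from this PR: `ErasedWriter` and `FixedSink` are hypothetical names, and the layout (an opaque context pointer plus a function pointer) is an assumption about the general technique. Every call goes through the stored function pointer and has to cast the opaque context back to the concrete type, which a generic, monomorphized call does not need to do.

```zig
const std = @import("std");

/// Hypothetical type-erased writer: one concrete type for all sinks.
const ErasedWriter = struct {
    context: *anyopaque,
    writeFn: *const fn (context: *anyopaque, bytes: []const u8) anyerror!usize,

    pub fn write(self: ErasedWriter, bytes: []const u8) anyerror!usize {
        // Indirect call through the stored function pointer.
        return self.writeFn(self.context, bytes);
    }

    /// Helper compiled once, shared by every sink behind an ErasedWriter.
    pub fn writeAll(self: ErasedWriter, bytes: []const u8) anyerror!void {
        var index: usize = 0;
        while (index < bytes.len) {
            index += try self.write(bytes[index..]);
        }
    }
};

/// Hypothetical concrete sink: a fixed 64-byte in-memory buffer.
const FixedSink = struct {
    buf: [64]u8 = undefined,
    len: usize = 0,

    fn erasedWrite(context: *anyopaque, bytes: []const u8) anyerror!usize {
        // Recover the concrete type from the opaque pointer.
        const self: *FixedSink = @ptrCast(@alignCast(context));
        if (self.len + bytes.len > self.buf.len) return error.NoSpaceLeft;
        @memcpy(self.buf[self.len..][0..bytes.len], bytes);
        self.len += bytes.len;
        return bytes.len;
    }

    pub fn writer(self: *FixedSink) ErasedWriter {
        return .{ .context = self, .writeFn = erasedWrite };
    }
};

pub fn main() !void {
    var sink = FixedSink{};
    const w = sink.writer();
    try w.writeAll("hello, ");
    try w.writeAll("world\n");
    std.debug.print("{s}", .{sink.buf[0..sink.len]});
}
```

Note that the erased interface can only report `anyerror`. Recovering a precise error set is the job of a generic wrapper around it, which this sketch leaves out.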

It's also interesting to note how many functions depend on writer functionality even though it might not seem like it at first glance, such as os.realpath (symbol size 578 -> 618 with this change). I suspect that while the non-generic version may decrease the overall size of the writer implementation code, the (small) additional overhead contributed by each call to a write function adds up and pushes the total size in the other direction.

As a final observation on size comparisons, the helper functions in the Writer interface are smaller and fewer than those in Reader, so there is less space to be saved by deduplicating copies of those functions.

@andrewrk
Member

Keeping open since it's blocking on #17761. I bumped the priority of #17761 up.

This is a companion to ziglang#17344 to apply the same change to the
`std.io.Writer` interface.

Since `bufPrint` and `count` both control the writers used internally,
they can leverage type-erased writers while maintaining correct error
handling. This reduces generic instantiations when using `allocPrint`,
which calls both `count` and `bufPrint` internally.

This reduces generic instantiations of several write functions.

Before:

```
@as(type, io.writer.Writer(*array_list.ArrayListAligned(u8,null),error{OutOfMemory},(function 'appendWrite')))
@as(type, io.writer.Writer(*codegen.c.IndentWriter(io.writer.Writer(*array_list.ArrayListAligned(u8,null),error{OutOfMemory},(function 'appendWrite'))),error{OutOfMemory},(function 'write')))
```

After:

```
@as(type, io.GenericWriter(io.Writer,error{OutOfMemory},(function 'write')))
```
@ianprime0509
Contributor Author

This is now unblocked thanks to #18729 🚀

@andrewrk
Member

andrewrk commented Feb 8, 2024

Thanks for all the investigations into this. Let's move forward with it. I think it's nice to have the option to provide a non-generic API. For example, I think I want to use it in std.crypto.tls.Client.

@andrewrk andrewrk merged commit 3122fd0 into ziglang:master Feb 8, 2024
10 checks passed
@ianprime0509 ianprime0509 deleted the type-erased-writer branch March 26, 2024 03:36