
nextest seems to handle stdio-intensive tests less efficiently than cargo test #1068

Closed
HadrienG2 opened this issue Oct 26, 2023 · 18 comments
Labels: deferred (Work deferred or blocked)

Comments

@HadrienG2

HadrienG2 commented Oct 26, 2023

I have some tests which assert that panics occur when invalid inputs are passed to certain functions. These tests run several times slower under nextest than under cargo test when RUST_BACKTRACE=1, and a quick perf profile suggests that they spend most of their time waiting for nextest to read out the data they have sent down the stdout/stderr pipe.

The most likely reason why these tests are the only ones affected is that the default Rust panic handler sends lots of text down the pipe when backtraces are enabled, so these are probably my most chatty tests by a large margin.

In this particular situation, I can sidestep the problem entirely by replacing the Rust panic handler with a no-op during execution of these tests, which eliminates the stdio traffic caused by the constant printing of backtraces. But it might be worth investigating whether nextest can handle heavy stdio traffic more efficiently, e.g. by using larger pipe buffer sizes.
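For reference, the workaround amounts to swapping the panic hook around the panic-asserting tests. This is only a sketch (the panic!() call stands in for the real functions under test):

```rust
#[test]
fn panics_on_invalid_input() {
    // Silence panic output (including the backtrace) while this test runs,
    // then restore the previous hook afterwards.
    let previous_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(|_| {}));

    let result = std::panic::catch_unwind(|| {
        // Stand-in for a call that is expected to panic on invalid input.
        panic!("invalid input");
    });

    std::panic::set_hook(previous_hook);
    assert!(result.is_err());
}
```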

@sunshowers
Member

Thanks for the report!

Huh, that's interesting -- how big are stdout and stderr for your tests? The default Linux x86_64 pipe size is 65536 bytes. Regardless, 700ms+ just in write and read is a lot and is definitely worth fixing.

@sunshowers
Member

sunshowers commented Oct 27, 2023

To be honest, I can't reproduce this. With a Ryzen 7950x and Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500MiB, I get:

  • cargo test: 0.903 seconds
  • cargo nextest run: 0.551 seconds

Replacing eprintln! with std::io::stderr().write_all() (which isn't intercepted by libtest by default) doesn't change things either -- cargo nextest run takes the same amount of time, while cargo test 2>test-file takes 0.740 seconds. In both cases, cargo nextest run is faster than cargo test.

I do believe the flamegraph, I'm just trying to figure out what might be going on.
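The linked test isn't reproduced in this thread; as a rough stand-in (an assumption about its shape, not the actual benchmark), a test that writes roughly 500 MB of line-sized chunks to stderr could look like:

```rust
#[test]
fn stderr_spam() {
    let line = "this line stands in for one frame of a panic backtrace";
    // ~56 bytes per line × 9 million lines ≈ 500 MB of output.
    for _ in 0..9_000_000 {
        // Swapping this for std::io::stderr().write_all(..) bypasses
        // libtest's output capture, as discussed above.
        eprintln!("{line}");
    }
}
```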

@sunshowers
Member

sunshowers commented Oct 27, 2023

I do see from perf that nextest is spending most of its time reading from the pipe, that's true.

I also tried applying this patch (Linux only) to make pipe buffers bigger, and it didn't appear to make things faster or slower -- I suspect the pipe buffer size is not the limit.
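The patch itself isn't inlined here; on Linux, a change of that shape would presumably boil down to an F_SETPIPE_SZ fcntl on the pipe file descriptors, roughly like the sketch below (grow_pipe_buffer is a hypothetical helper name, not nextest code):

```rust
#[cfg(target_os = "linux")]
fn grow_pipe_buffer(fd: std::os::fd::RawFd, bytes: i32) -> std::io::Result<()> {
    // F_SETPIPE_SZ asks the kernel to resize the pipe buffer; the kernel may
    // round the value up, and caps it at /proc/sys/fs/pipe-max-size for
    // unprivileged processes.
    let ret = unsafe { libc::fcntl(fd, libc::F_SETPIPE_SZ, bytes) };
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(())
    }
}
```

It would be called on the child's stdout/stderr fds right after spawning, e.g. `grow_pipe_buffer(child_stdout.as_raw_fd(), 1 << 20)`.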

This post suggests that tweaking buffer sizes etc. will not make a difference, and that the main way to go faster than plain read and write is vmsplice -- which wouldn't seem to help here, because vmsplice is equivalent to read when transferring data from a pipe to user memory.

Unless there are some easy wins elsewhere I don't think we can make this better unfortunately.

@NobodyXu
Contributor

Does nextest use a single-threaded tokio runtime, and does it spawn the wait tasks so that they can run in parallel?

@sunshowers
Member

Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr are not split up into their own tasks.

@NobodyXu
Contributor

I think splitting the stdout and stderr reads into separate tasks might be beneficial if a test writes a lot to both stdout and stderr -- roughly as sketched below.
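As an illustration of that idea (a sketch, not nextest's actual implementation; it assumes tokio with the process and io-util features enabled):

```rust
use tokio::io::AsyncReadExt;
use tokio::process::Command;

// Drain stdout and stderr in two independent tasks, so neither pipe can fill
// up and block the test while the other one is being read.
async fn run_capturing(mut cmd: Command) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let mut child = cmd
        .stdout(std::process::Stdio::piped())
        .stderr(std::process::Stdio::piped())
        .spawn()?;

    let mut stdout = child.stdout.take().expect("stdout was piped");
    let mut stderr = child.stderr.take().expect("stderr was piped");

    let out_task = tokio::spawn(async move {
        let mut buf = Vec::new();
        stdout.read_to_end(&mut buf).await.map(|_| buf)
    });
    let err_task = tokio::spawn(async move {
        let mut buf = Vec::new();
        stderr.read_to_end(&mut buf).await.map(|_| buf)
    });

    child.wait().await?;
    let stdout_buf = out_task.await.expect("task panicked")?;
    let stderr_buf = err_task.await.expect("task panicked")?;
    Ok((stdout_buf, stderr_buf))
}
```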

@HadrienG2
Author

HadrienG2 commented Oct 27, 2023

First things first: for a reproducer you can...

  • Install the libhwloc development package of your preferred package manager
  • Clone the repo of my project
  • Checkout commit 1a78fc223aebefe351566d76da1b91df55c8e6e6 (current master at the time of writing; the empty panic hook hasn't been merged yet)

...then run the following commands:

# Only observable with backtraces, which greatly increase the stdout/stderr output
export RUST_BACKTRACE=1

# Takes 2.3s on an AMD R5 5600G
cargo test ffi::int::tests -- --test-threads=1

# Takes 18.3s on the same CPU as above
cargo nextest run -j1 -E 'test(/ffi::int::tests/)'

I'm using only 1 test thread here because it makes timings more reproducible and reduces the perf workload during profiling (which enables going to higher sampling rates).


To be clear, the slowdown of enabling backtraces is already massive with cargo test (the same test run takes 0.3s without), and although it has implications that I don't like when running tests over multiple threads, replacing the standard panic hook with a no-op while running those tests is an acceptable workaround.

What I find suspicious here is that the slowdown is almost an order of magnitude higher with nextest than with cargo test, even though in this particular scenario neither of them should be doing anything with the test output: both should just be collecting it in a buffer in order to display it in case of failure. Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.

To investigate, I took a perf profile of the cargo test run (which unfortunately has much lower-quality data than my earlier profile, probably because I'm missing some debug info I can't identify on this machine); it shows that cargo has almost no CPU activity during this test run.

I wonder how they manage to avoid the pipe overhead, assuming that's the only issue here. Perhaps they redirect the test output to some tempfile instead of a pipe so the kernel can handle all the data transfers without needing round trips to user space? Or perhaps they manage to have libtest do the buffering in-process somehow, and only print the output when a failure occurs?


To answer your first question, cargo test ffi::int::tests:: -- --test-threads=1 --nocapture 2>&1 | wc -c tells me that all these tests combined produce a total of 173 MiB of output over stdout + stderr. Yikes, I know. But this is testing an integer newtype, and unfortunately integers use panic as a failure mode in a lot of different operations...

Also, knowing that modern RAM can transfer data at tens of GB/s on consumer hardware puts the ~10 MiB/s transfer rate implied by taking 18 s to do what amounts to filling a ~180 MB buffer in a less favorable light.

@HadrienG2
Author

HadrienG2 commented Oct 27, 2023

To be honest, I can't reproduce this. With a Ryzen 7950x and Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500MiB, I get:

  • cargo test: 0.903 seconds
  • cargo nextest run: 0.551 seconds

Replacing eprintln! with std::io::stderr().write_all() (which isn't intercepted by libtest by default) doesn't change things either -- cargo nextest run takes the same amount of time, while cargo test 2>test-file takes 0.740 seconds. In both cases, cargo nextest run is faster than cargo test.

I do believe the flamegraph, I'm just trying to figure out what might be going on.

Perhaps the difference is that the std::backtrace printer in the panic hook does finer-grained writes? Because stderr is unbuffered, its performance can vary greatly depending on write granularity.

@sunshowers
Member

Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.

cargo test does its output collection in-process, without using OS pipes. That might explain the difference.

Your point about stderr being unbuffered is well-taken (and is easy to verify by looking at https://doc.rust-lang.org/src/std/io/stdio.rs.html -- Stdout has a LineWriter in it, while Stderr does not). Maybe we need to add some buffering to std::backtrace in upstream Rust.
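For illustration, here is what buffering the backtrace output would look like if done in user code rather than inside std (a sketch built on the stable std::backtrace API, not a proposed std patch): format the panic message and backtrace into a single String, then emit it with one write instead of many small unbuffered ones.

```rust
use std::backtrace::Backtrace;
use std::io::Write;

fn install_buffered_panic_hook() {
    std::panic::set_hook(Box::new(|info| {
        // Build the whole report in memory first...
        let report = format!("{info}\n{}\n", Backtrace::force_capture());
        // ...then push it through stderr in one write instead of many.
        let _ = std::io::stderr().lock().write_all(report.as_bytes());
    }));
}
```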

@HadrienG2
Author

cargo test does its output collection in-process, without using OS pipes. That might explain the difference.

Sounds pretty neat -- is there a way nextest could leverage the same infrastructure? Or if not, is there a way the infrastructure could be modified to make that possible?

Maybe we need to add some buffering to std::backtrace in upstream Rust.

This could also be a fallback plan if the above, more general optimization can't be made to work. I'm wondering if this would fit the design constraints of std, though: the other day I remember seeing a rust-lang issue that proposed turning OOM errors into panics, and if that's the plan you probably don't want to allocate when printing the OOM backtrace...

Of course, this is assuming all other allocations can be removed from the backtrace computation, which is perhaps a little optimistic. And maybe a statically allocated or stack-allocated buffer could be made to work?
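To illustrate the stack-allocated idea (my own sketch, not anything proposed for std): a fixed-capacity writer over a caller-provided array can absorb formatted output without touching the heap, truncating whatever doesn't fit.

```rust
use std::io::{self, Write};

/// Fixed-capacity writer over a borrowed byte array; formatting into it
/// never allocates, and excess output is silently truncated.
struct StackBuf<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl Write for StackBuf<'_> {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let n = data.len().min(self.buf.len() - self.len);
        self.buf[self.len..self.len + n].copy_from_slice(&data[..n]);
        self.len += n;
        // Report the full length so callers using write! don't error out
        // when the buffer is full; the tail is simply dropped.
        Ok(data.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut storage = [0u8; 4096];
    let mut out = StackBuf { buf: &mut storage, len: 0 };
    write!(out, "panicked at 'example': a backtrace would go here")?;
    // Flush the accumulated report to stderr in a single write.
    io::stderr().write_all(&out.buf[..out.len])?;
    Ok(())
}
```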

@sunshowers
Member

sunshowers commented Oct 28, 2023

Sounds pretty neat, is there a way nextest could leverage the same infrastructure?

Hmm, one thing I realized is that nextest currently always passes --nocapture, whether or not it is interested in a test's output on success, even though most of the time users aren't. Nextest could simply not pass --nocapture when it doesn't need the output of a passing test.

@sunshowers
Member

sunshowers commented Oct 29, 2023

While writing this, I sadly realized that --nocapture is pretty load-bearing for tests that time out -- we want to show whatever output the test produced by the time it was killed, which means instructing libtest not to capture any output.

I think the only way to resolve this is to add buffering upstream (though maybe that's out of the question). There's nothing that can be done on nextest's side as far as I can tell.

@sunshowers added the deferred (Work deferred or blocked) label on Oct 29, 2023
@HadrienG2
Author

How would you feel about an upstream feature that dumps the stdio buffer when a certain signal (say, SIGUSR1) is received? I think it is the minimal upstream change that would preserve optimal I/O performance in the common case, while allowing for the buffer to be dumped in case of timeout.
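To make the proposal concrete (purely my own sketch of the idea, here using the signal-hook crate; the real feature would have to live inside libtest): the harness keeps its captured output in memory and only dumps it when SIGUSR1 is observed.

```rust
use std::io::Write;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() -> std::io::Result<()> {
    // Set once the runner sends SIGUSR1 (e.g. just before killing a
    // timed-out test binary).
    let dump_requested = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGUSR1, Arc::clone(&dump_requested))?;

    // Output captured in-process instead of being written to a pipe.
    let mut captured: Vec<u8> = Vec::new();
    captured.extend_from_slice(b"... test output would accumulate here ...\n");

    // The harness would check this flag periodically; shown once for brevity.
    if dump_requested.load(Ordering::SeqCst) {
        std::io::stderr().write_all(&captured)?;
    }
    Ok(())
}
```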

@sunshowers
Member

sunshowers commented Oct 30, 2023

What if a test sets a SIGUSR1 handler for itself? I don't know how common that is but it is a thing that's possible.

To be honest I still don't see how the effort is worth it. This seems to be an edge case that is addressed by setting your own panic handler.

  • If you're printing out tons of logs to stderr you're most likely doing line buffering internally.
  • If you're using serde_json, which writes stuff character by character, just wrap the writer in a buffer yourself.

We can document this and maybe even detect it if we're getting a ton of small reads.
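To illustrate the serde_json point above (a sketch; write_json is a hypothetical helper, not part of any library): wrap the unbuffered stderr lock in a BufWriter so the serializer's many tiny writes are coalesced before they cross the pipe.

```rust
use std::io::{BufWriter, Write};

// Serialize `value` to stderr through a buffer, so serde_json's
// character-by-character writes become a few large ones.
fn write_json<T: serde::Serialize>(value: &T) -> std::io::Result<()> {
    let stderr = std::io::stderr();
    let mut out = BufWriter::new(stderr.lock());
    serde_json::to_writer(&mut out, value)?;
    out.write_all(b"\n")?;
    out.flush()
}
```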

@HadrienG2
Author

I would obviously be fine with a "wontfix, please silence your panics" resolution, since I already have a panic handler that does this ;)

sunshowers added a commit that referenced this issue Nov 8, 2023
@sunshowers
Member

sunshowers commented Mar 4, 2025

Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr are not split up into their own tasks.

So I was actually wrong here -- nextest used to use the same task for all tests.

Curious how things look these days -- nextest now uses a separate task per test, and I suspect that improves your case considerably.

@HadrienG2
Author

HadrienG2 commented Mar 5, 2025

In the standard panic configuration that I used at the time, where a backtrace is printed even if the panic is caught (which, if you ask me, is still a bad default that should be changed at the std level), nextest handles the stdio spam much better than before, but still significantly worse than cargo test:

Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):      2.783 s ±  0.097 s    [User: 2.943 s, System: 0.311 s]
  Range (min … max):    2.687 s …  2.995 s    10 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):      6.281 s ±  0.110 s    [User: 11.765 s, System: 17.819 s]
  Range (min … max):    6.158 s …  6.470 s    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    2.26 ± 0.09 times faster than cargo nextest run --features=hwloc-latest int

Of course, suppressing the backtrace as I did at the time still greatly improves performance for both test harnesses:

❯ hyperfine --warmup 3 "cargo test --features=hwloc-latest int" "cargo nextest run --features=hwloc-latest int"
Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):     175.2 ms ±   2.4 ms    [User: 417.1 ms, System: 41.0 ms]
  Range (min … max):   172.1 ms … 180.8 ms    17 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):     310.2 ms ±  29.6 ms    [User: 541.8 ms, System: 187.4 ms]
  Range (min … max):   281.4 ms … 358.4 ms    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    1.77 ± 0.17 times faster than cargo nextest run --features=hwloc-latest int

Interestingly, proptest itself has since introduced backtrace suppression as part of its more general "handle-panics" feature. In my experience it's not as good as my tailor-made backtrace suppression hack: it also suppresses backtraces from unexpected panics (which is bad for debugging), and it performs worse for my use case. Its different performance characteristics do, however, put cargo test and cargo nextest on an equal footing:

❯ hyperfine --warmup 3 "cargo test --features=hwloc-latest int" "cargo nextest run --features=hwloc-latest int"
Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):     329.9 ms ±  57.5 ms    [User: 551.8 ms, System: 133.9 ms]
  Range (min … max):   246.3 ms … 413.4 ms    10 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):     380.5 ms ±  15.7 ms    [User: 682.3 ms, System: 294.0 ms]
  Range (min … max):   364.9 ms … 418.8 ms    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    1.15 ± 0.21 times faster than cargo nextest run --features=hwloc-latest int

@sunshowers
Member

Thanks. That's probably the best we can do for now -- this is the difference between cross-process pipe reads and in-process collection when there are lots of small writes. Glad that nextest is doing better now.
