
nextest seems to handle stdio-intensive tests less efficiently than cargo test #1068

Closed
HadrienG2 opened this issue Oct 26, 2023 · 18 comments
Labels: deferred (Work deferred or blocked)

Comments

@HadrienG2

HadrienG2 commented Oct 26, 2023

I have some tests which assert that panics occur when invalid inputs are passed to certain functions. These tests run several times slower under nextest than under cargo test when RUST_BACKTRACE=1, and a quick perf profile suggests that they spend most of their time waiting for nextest to read out the data they have sent down the stdout/stderr pipe.

The most likely reason why these tests are the only ones affected is that the default Rust panic handler sends lots of text down the pipe when backtraces are enabled, so these are probably my most chatty tests by a large margin.

In this particular situation, I can sidestep the problem entirely by replacing the Rust panic handler with a no-op during execution of these tests, which eliminates the stdio traffic caused by the constant printing of backtraces. But it might be worth investigating whether nextest can handle heavy stdio traffic more efficiently, e.g. by using larger pipe buffer sizes.
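For reference, the workaround amounts to swapping the panic hook around the panic-asserting tests. This is only a sketch (the panic!() call stands in for the real functions under test):

```rust
#[test]
fn panics_on_invalid_input() {
    // Silence panic output (including the backtrace) while this test runs,
    // then restore the previous hook afterwards.
    let previous_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(|_| {}));

    let result = std::panic::catch_unwind(|| {
        // Stand-in for a call that is expected to panic on invalid input.
        panic!("invalid input");
    });

    std::panic::set_hook(previous_hook);
    assert!(result.is_err());
}
```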

@sunshowers
Member

Thanks for the report!

Huh, that's interesting -- how big are stdout and stderr for your tests? The default Linux x86_64 pipe size is 65536 bytes. Regardless, 700ms+ just in write and read is a lot and is definitely worth fixing.

@sunshowers
Member

sunshowers commented Oct 27, 2023

To be honest, I can't reproduce this. With a Ryzen 7950x and Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500MiB, I get:

  • cargo test: 0.903 seconds
  • cargo nextest run: 0.551 seconds

Replacing eprintln! with std::io::stderr().write_all() (which isn't intercepted by libtest by default) doesn't change things either -- cargo nextest run takes the same amount of time, while cargo test 2>test-file takes 0.740 seconds. In both cases, cargo nextest run is faster than cargo test.

I do believe the flamegraph, I'm just trying to figure out what might be going on.
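The linked test isn't reproduced in this thread; as a rough stand-in (an assumption about its shape, not the actual benchmark), a test that writes roughly 500 MB of line-sized chunks to stderr could look like:

```rust
#[test]
fn stderr_spam() {
    let line = "this line stands in for one frame of a panic backtrace";
    // ~56 bytes per line × 9 million lines ≈ 500 MB of output.
    for _ in 0..9_000_000 {
        // Swapping this for std::io::stderr().write_all(..) bypasses
        // libtest's output capture, as discussed above.
        eprintln!("{line}");
    }
}
```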

@sunshowers
Member

sunshowers commented Oct 27, 2023

I do see from perf that nextest is spending most of its time reading from the pipe, that's true.

I also tried applying this patch (Linux only) to make pipe buffers bigger, and it didn't appear to make things faster or slower -- I suspect the pipe buffer size is not the limit.
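The patch itself isn't inlined here; on Linux, a change of that shape would presumably boil down to an F_SETPIPE_SZ fcntl on the pipe file descriptors, roughly like the sketch below (grow_pipe_buffer is a hypothetical helper name, not nextest code):

```rust
#[cfg(target_os = "linux")]
fn grow_pipe_buffer(fd: std::os::fd::RawFd, bytes: i32) -> std::io::Result<()> {
    // F_SETPIPE_SZ asks the kernel to resize the pipe buffer; the kernel may
    // round the value up, and caps it at /proc/sys/fs/pipe-max-size for
    // unprivileged processes.
    let ret = unsafe { libc::fcntl(fd, libc::F_SETPIPE_SZ, bytes) };
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(())
    }
}
```

It would be called on the child's stdout/stderr fds right after spawning, e.g. `grow_pipe_buffer(child_stdout.as_raw_fd(), 1 << 20)`.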

This post suggests that tweaking buffer sizes etc. will not make a difference, and that the main way to go faster than plain read and write is vmsplice -- which wouldn't seem to help here, because vmsplice is equivalent to read when transferring data from a pipe to user memory.

Unless there are some easy wins elsewhere I don't think we can make this better unfortunately.

@NobodyXu
Contributor

Does nextest use a single-threaded tokio runtime, and does it spawn the wait tasks so that they can run in parallel?

@sunshowers
Member

Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr are not split up into their own tasks.

@NobodyXu
Contributor

I think splitting the stdout and stderr reads into separate tasks might be beneficial if a test writes a lot to both stdout and stderr -- roughly as sketched below.
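As an illustration of that idea (a sketch, not nextest's actual implementation; it assumes tokio with the process and io-util features enabled):

```rust
use tokio::io::AsyncReadExt;
use tokio::process::Command;

// Drain stdout and stderr in two independent tasks, so neither pipe can fill
// up and block the test while the other one is being read.
async fn run_capturing(mut cmd: Command) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let mut child = cmd
        .stdout(std::process::Stdio::piped())
        .stderr(std::process::Stdio::piped())
        .spawn()?;

    let mut stdout = child.stdout.take().expect("stdout was piped");
    let mut stderr = child.stderr.take().expect("stderr was piped");

    let out_task = tokio::spawn(async move {
        let mut buf = Vec::new();
        stdout.read_to_end(&mut buf).await.map(|_| buf)
    });
    let err_task = tokio::spawn(async move {
        let mut buf = Vec::new();
        stderr.read_to_end(&mut buf).await.map(|_| buf)
    });

    child.wait().await?;
    let stdout_buf = out_task.await.expect("task panicked")?;
    let stderr_buf = err_task.await.expect("task panicked")?;
    Ok((stdout_buf, stderr_buf))
}
```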

@HadrienG2
Author

HadrienG2 commented Oct 27, 2023

First things first: for a reproducer you can...

  • Install the libhwloc development package of your preferred package manager
  • Clone the repo of my project
  • Checkout commit 1a78fc223aebefe351566d76da1b91df55c8e6e6 (current master at the time of writing; the empty panic hook hasn't been merged yet)

...then run the following commands:

# Only observable with backtraces, which greatly increase the stdout/stderr output
export RUST_BACKTRACE=1

# Takes 2.3s on an AMD R5 5600G
cargo test ffi::int::tests -- --test-threads=1

# Takes 18.3s on the same CPU as above
cargo nextest run -j1 -E 'test(/ffi::int::tests/)'

I'm using only 1 test thread here because it makes timings more reproducible and reduces the perf workload during profiling (which enables going to higher sampling rates).


To be clear, the slowdown of enabling backtraces is already massive with cargo test (the same test run takes 0.3s without), and although it has implications that I don't like when running tests over multiple threads, replacing the standard panic hook with a no-op while running those tests is an acceptable workaround.

What I find suspicious here is that the slowdown is almost an order of magnitude higher with nextest than with cargo test, even though in this particular scenario neither of them should be doing anything with the test output: both should just be collecting it in a buffer in order to display it in case of failure. Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.

To investigate, I took a perf profile of the cargo test run (which unfortunately has much lower-quality data than my earlier profile, probably because I'm missing some debug info I can't identify on this machine); it shows that cargo has almost no CPU activity during this test run.

I wonder how they manage to avoid the pipe overhead, assuming that's the only issue here. Perhaps they redirect the test output to some tempfile instead of a pipe so the kernel can handle all the data transfers without needing round trips to user space? Or perhaps they manage to have libtest do the buffering in-process somehow, and only print the output when a failure occurs?


To answer your first question, cargo test ffi::int::tests:: -- --test-threads=1 --nocapture 2>&1 | wc -c tells me that all these tests combined produce a total of 173 MiB of output over stdout + stderr. Yikes, I know. But this is testing an integer newtype, and unfortunately integers use panic as a failure mode in a lot of different operations...

Also, knowing that modern RAM can transfer data at tens of GB/s on consumer hardware puts the ~10 MiB/s transfer rate implied by taking 18 s to do what amounts to filling a ~180 MB buffer in a less favorable light.

@HadrienG2
Author

HadrienG2 commented Oct 27, 2023

To be honest, I can't reproduce this. With a Ryzen 7950x and Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500MiB, I get:

  • cargo test: 0.903 seconds
  • cargo nextest run: 0.551 seconds

Replacing eprintln! with std::io::stderr().write_all() (which isn't intercepted by libtest by default) doesn't change things either -- cargo nextest run takes the same amount of time, while cargo test 2>test-file takes 0.740 seconds. In both cases, cargo nextest run is faster than cargo test.

I do believe the flamegraph, I'm just trying to figure out what might be going on.

Perhaps the difference is that the std::backtrace printer in the panic hook does finer-grained writes? Because stderr is unbuffered, its performance can vary greatly depending on write granularity.

@sunshowers
Member

Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.

cargo test does its output collection in-process, without using OS pipes. That might explain the difference.

Your point about stderr being unbuffered is well-taken (and is easy to verify by looking at https://doc.rust-lang.org/src/std/io/stdio.rs.html -- Stdout has a LineWriter in it, while Stderr does not). Maybe we need to add some buffering to std::backtrace in upstream Rust.
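For illustration, here is what buffering the backtrace output would look like if done in user code rather than inside std (a sketch built on the stable std::backtrace API, not a proposed std patch): format the panic message and backtrace into a single String, then emit it with one write instead of many small unbuffered ones.

```rust
use std::backtrace::Backtrace;
use std::io::Write;

fn install_buffered_panic_hook() {
    std::panic::set_hook(Box::new(|info| {
        // Build the whole report in memory first...
        let report = format!("{info}\n{}\n", Backtrace::force_capture());
        // ...then push it through stderr in one write instead of many.
        let _ = std::io::stderr().lock().write_all(report.as_bytes());
    }));
}
```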

@HadrienG2
Author

cargo test does its output collection in-process, without using OS pipes. That might explain the difference.

Sounds pretty neat -- is there a way nextest could leverage the same infrastructure? Or if not, is there a way the infrastructure could be modified to make that possible?

Maybe we need to add some buffering to std::backtrace in upstream Rust.

This could also be a fallback plan if the above, more general optimization can't be made to work. I'm wondering if this would fit the design constraints of std, though: the other day I remember seeing a rust-lang issue that proposed turning OOM errors into panics, and if that's the plan you probably don't want to allocate when printing the OOM backtrace...

Of course, this is assuming all other allocations can be removed from the backtrace computation, which is perhaps a little optimistic. And maybe a statically allocated or stack-allocated buffer could be made to work?
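To illustrate the stack-allocated idea (my own sketch, not anything proposed for std): a fixed-capacity writer over a caller-provided array can absorb formatted output without touching the heap, truncating whatever doesn't fit.

```rust
use std::io::{self, Write};

/// Fixed-capacity writer over a borrowed byte array; formatting into it
/// never allocates, and excess output is silently truncated.
struct StackBuf<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl Write for StackBuf<'_> {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let n = data.len().min(self.buf.len() - self.len);
        self.buf[self.len..self.len + n].copy_from_slice(&data[..n]);
        self.len += n;
        // Report the full length so callers using write! don't error out
        // when the buffer is full; the tail is simply dropped.
        Ok(data.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut storage = [0u8; 4096];
    let mut out = StackBuf { buf: &mut storage, len: 0 };
    write!(out, "panicked at 'example': a backtrace would go here")?;
    // Flush the accumulated report to stderr in a single write.
    io::stderr().write_all(&out.buf[..out.len])?;
    Ok(())
}
```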

@sunshowers
Member

sunshowers commented Oct 28, 2023

Sounds pretty neat, is there a way nextest could leverage the same infrastructure?

Hmm, one thing I realized is that nextest currently always passes --nocapture, whether or not it is interested in a test's output on success, even though most of the time users aren't. Nextest could simply not pass --nocapture when it doesn't need the output of a passing test.

@sunshowers
Member

sunshowers commented Oct 29, 2023

While writing this, I sadly realized that --nocapture is pretty load-bearing for tests that time out -- we want to show whatever output the test produced by the time it was killed, which means instructing libtest not to capture any output.

I think the only way to resolve this is to add buffering upstream (though maybe that's out of the question). There's nothing that can be done on nextest's side as far as I can tell.

@sunshowers added the deferred (Work deferred or blocked) label on Oct 29, 2023
@HadrienG2
Author

How would you feel about an upstream feature that dumps the stdio buffer when a certain signal (say, SIGUSR1) is received? I think it is the minimal upstream change that would preserve optimal I/O performance in the common case, while allowing for the buffer to be dumped in case of timeout.
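To make the proposal concrete (purely my own sketch of the idea, here using the signal-hook crate; the real feature would have to live inside libtest): the harness keeps its captured output in memory and only dumps it when SIGUSR1 is observed.

```rust
use std::io::Write;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() -> std::io::Result<()> {
    // Set once the runner sends SIGUSR1 (e.g. just before killing a
    // timed-out test binary).
    let dump_requested = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGUSR1, Arc::clone(&dump_requested))?;

    // Output captured in-process instead of being written to a pipe.
    let mut captured: Vec<u8> = Vec::new();
    captured.extend_from_slice(b"... test output would accumulate here ...\n");

    // The harness would check this flag periodically; shown once for brevity.
    if dump_requested.load(Ordering::SeqCst) {
        std::io::stderr().write_all(&captured)?;
    }
    Ok(())
}
```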

@sunshowers
Member

sunshowers commented Oct 30, 2023

What if a test sets a SIGUSR1 handler for itself? I don't know how common that is but it is a thing that's possible.

To be honest I still don't see how the effort is worth it. This seems to be an edge case that is addressed by setting your own panic handler.

  • If you're printing out tons of logs to stderr you're most likely doing line buffering internally.
  • If you're using serde_json, which writes stuff character by character, just wrap the writer in a buffer yourself.

We can document this and maybe even detect it if we're getting a ton of small reads.
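To illustrate the serde_json point above (a sketch; write_json is a hypothetical helper, not part of any library): wrap the unbuffered stderr lock in a BufWriter so the serializer's many tiny writes are coalesced before they cross the pipe.

```rust
use std::io::{BufWriter, Write};

// Serialize `value` to stderr through a buffer, so serde_json's
// character-by-character writes become a few large ones.
fn write_json<T: serde::Serialize>(value: &T) -> std::io::Result<()> {
    let stderr = std::io::stderr();
    let mut out = BufWriter::new(stderr.lock());
    serde_json::to_writer(&mut out, value)?;
    out.write_all(b"\n")?;
    out.flush()
}
```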

@HadrienG2
Author

I would obviously be fine with a "wontfix, please silence your panics" resolution, since I already have a panic handler that does this ;)

sunshowers added a commit that referenced this issue Nov 8, 2023
@sunshowers
Member

sunshowers commented Mar 4, 2025

Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr are not split up into their own tasks.

So I was actually wrong here -- nextest used to use the same task for all tests.

Curious how things look these days -- nextest now uses a separate task per test, and I suspect that improves your case considerably.

@HadrienG2
Author

HadrienG2 commented Mar 5, 2025

In the standard panic configuration that I used at the time, where a backtrace is printed even if the panic is caught (which, if you ask me, is still a bad default that should be changed at the std level), nextest handles the stdio spam much better than before, but still significantly worse than cargo test:

Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):      2.783 s ±  0.097 s    [User: 2.943 s, System: 0.311 s]
  Range (min … max):    2.687 s …  2.995 s    10 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):      6.281 s ±  0.110 s    [User: 11.765 s, System: 17.819 s]
  Range (min … max):    6.158 s …  6.470 s    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    2.26 ± 0.09 times faster than cargo nextest run --features=hwloc-latest int

Of course, suppressing the backtrace as I did at the time still greatly improves performance for both test harnesses:

❯ hyperfine --warmup 3 "cargo test --features=hwloc-latest int" "cargo nextest run --features=hwloc-latest int"
Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):     175.2 ms ±   2.4 ms    [User: 417.1 ms, System: 41.0 ms]
  Range (min … max):   172.1 ms … 180.8 ms    17 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):     310.2 ms ±  29.6 ms    [User: 541.8 ms, System: 187.4 ms]
  Range (min … max):   281.4 ms … 358.4 ms    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    1.77 ± 0.17 times faster than cargo nextest run --features=hwloc-latest int

Interestingly, proptest itself has since introduced backtrace suppression as part of its more general "handle-panics" feature. In my experience it's not as good as my tailor-made backtrace suppression hack: it also suppresses backtraces from unexpected panics (which is bad for debugging), and it performs worse for my use case. Its different performance characteristics do, however, put cargo test and cargo nextest on an equal footing:

❯ hyperfine --warmup 3 "cargo test --features=hwloc-latest int" "cargo nextest run --features=hwloc-latest int"
Benchmark 1: cargo test --features=hwloc-latest int
  Time (mean ± σ):     329.9 ms ±  57.5 ms    [User: 551.8 ms, System: 133.9 ms]
  Range (min … max):   246.3 ms … 413.4 ms    10 runs
 
Benchmark 2: cargo nextest run --features=hwloc-latest int
  Time (mean ± σ):     380.5 ms ±  15.7 ms    [User: 682.3 ms, System: 294.0 ms]
  Range (min … max):   364.9 ms … 418.8 ms    10 runs
 
Summary
  cargo test --features=hwloc-latest int ran
    1.15 ± 0.21 times faster than cargo nextest run --features=hwloc-latest int

@sunshowers
Member

Thanks. That's probably the best we can do for now -- this is the difference between cross-process pipe reads and in-process collection when there are lots of small writes. Glad that nextest is doing better now.
