OOM/stack overflow crash observed when porting tokio 0.1 app to tokio 1.14 (uses tokio::process and grpcio) #4309
Comments
Can you elaborate on where exactly this happens?
@ipetkov This appears to be related to …
It looks effectively like this:

fn entrypoint() -> Result<(), AppError> {
    let tokio_runtime = tokio::runtime::Runtime::new()?;
    ...
    grpcio_server.start();
    let shutdown_fut = ctrlc_rx.map(|_| Ok(()));
    tokio_runtime.block_on(shutdown_fut)
}

fn main() {
    std::process::exit(match entrypoint() {
        Ok(_) => 0,
        Err(_) => ...
    })
}
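For reference, a runnable skeleton of this pattern might look like the sketch below. It is not the application's code: the ctrlc crate, the Box<dyn Error> return type (standing in for AppError), and the placement of the grpcio server startup are all assumptions made for illustration.

use futures::channel::oneshot;
use futures::FutureExt;

fn entrypoint() -> Result<(), Box<dyn std::error::Error>> {
    let tokio_runtime = tokio::runtime::Runtime::new()?;

    // Assumption: the ctrlc crate completes a oneshot channel when
    // ctrl-c is received; the thread only says the channel is
    // "completed after ctrlc signal is caught".
    let (ctrlc_tx, ctrlc_rx) = oneshot::channel::<()>();
    let mut ctrlc_tx = Some(ctrlc_tx);
    ctrlc::set_handler(move || {
        if let Some(tx) = ctrlc_tx.take() {
            let _ = tx.send(());
        }
    })?;

    // ... build and start the grpcio server here, handing it
    // tokio_runtime.handle().clone() so request handlers can spawn
    // onto the tokio runtime ...

    // Block the main thread until the shutdown signal arrives; the
    // runtime's worker threads keep driving spawned tasks meanwhile.
    let shutdown_fut = ctrlc_rx.map(|_| Ok(()));
    tokio_runtime.block_on(shutdown_fut)
}

fn main() {
    std::process::exit(match entrypoint() {
        Ok(()) => 0,
        Err(err) => {
            eprintln!("{}", err);
            1
        }
    })
}

(In this setup the grpcio server runs on its own threads, so block_on only parks the main thread until shutdown is requested.)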
Right, so …
How are you reimplementing wait_with_output()?
Here is what was working (exact code): https://gist.github.com/iitalics/710401de72725fa4b6db27742124ae24
The broken version using …

Anyway, it seems like one difference between the two snippets is that your original snippet using … wouldn't pass the borrow checker as written.
Right, that code is paraphrased; I forgot to make it borrow-check compliant. I fix the borrow-checking issue by using an async move block, so it's actually like this:

let fut = Box::pin(async move { stdin.write_all(input.as_bytes()).await });
fut.and_then(move |_| child.wait_with_output())
A version of …
First thing that comes to mind: have you tried splitting out the stdin.write and stdout.read into two separate futures? I suppose this is what you're essentially doing with the ….

My guidance for piping input/output of a child process is to always do them independently: if the child hangs because its stdout buffer is full (the parent isn't reading from it yet!), it may stop reading its input, in which case the parent could get stuck writing to it. And if the parent is buffering data to send to the child with no back-pressure, it could be holding that data for too long and running out of memory.

Could you try changing the code as follows and see if the OOM still reproduces?

let cmd = tokio::process::Command::new(...)...;
futures::future::lazy(|_| cmd.spawn())
    .and_then(|mut child| {
        let mut stdin = child.stdin.take().unwrap();
        tokio::spawn(async move { stdin.write_all(...).await.unwrap() });
        child.wait_with_output()
    })
    .then(|result| ...);
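For reference, a self-contained version of this "write stdin from a separate task" pattern might look like the following sketch; the cat command, the input string, and the #[tokio::main] entry point are placeholders, not code from this thread.

use std::process::Stdio;
use tokio::io::AsyncWriteExt;
use tokio::process::Command;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut child = Command::new("cat") // placeholder child process
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Move stdin into its own task so writing can never block the task
    // that is draining the child's output.
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let input = String::from("hello world\n"); // placeholder input
    tokio::spawn(async move {
        // A real application should surface write errors instead of ignoring them.
        let _ = stdin.write_all(input.as_bytes()).await;
        // stdin is dropped when this task ends, so the child sees EOF.
    });

    // Meanwhile wait_with_output() reaps the child and collects its output.
    let output = child.wait_with_output().await?;
    println!("child wrote {} bytes to stdout", output.stdout.len());
    Ok(())
}

The write and the read are driven by separate tasks, so a full pipe buffer on one side cannot stall the other, which is the property the suggestion above is after.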
I'm testing this right now. But the input buffer is completely in memory beforehand (see how my …).
I've just observed the crash again when using the following:

async move {
    tokio::spawn(async move {
        stdin.write_all(input.as_bytes()).await
    });
    child.wait_with_output().await
}
Okay, thanks for testing that. Next thing that comes to mind: do you have an idea of how much data the child is writing back on stdout/stderr? Do you need all of that data present within the parent process or could that be routed elsewhere?
My first recommendation is to avoid having the parent read any streams it doesn't care about (e.g. redirect stderr to /dev/null or to a file directly). My second recommendation would be to change the application so that it can process the data in a more streaming fashion instead of capturing all the output up front, so that it can discard data as quickly as possible.
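As an illustration of those two recommendations, a sketch along these lines could be the following; the some-tool command and the per-line handling are placeholders, not code from this thread.

use std::process::Stdio;
use tokio::io::{AsyncBufReadExt, BufReader};
use tokio::process::Command;

async fn run_streaming() -> std::io::Result<()> {
    let mut child = Command::new("some-tool") // placeholder child process
        .stdout(Stdio::piped())
        .stderr(Stdio::null()) // don't capture streams the parent doesn't need
        .spawn()?;

    let stdout = child.stdout.take().expect("stdout was requested as piped");
    let mut lines = BufReader::new(stdout).lines();

    // Process output incrementally instead of buffering it all up front,
    // so data can be discarded as soon as it has been handled.
    while let Some(line) = lines.next_line().await? {
        let _ = line; // placeholder for real per-line handling
    }

    let status = child.wait().await?;
    println!("child exited with {}", status);
    Ok(())
}

Stdio::null() keeps the parent from ever buffering stderr, and the line-by-line loop bounds how much stdout data is held at once.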
I mentioned before that this crash only appeared after upgrading the application's tokio infrastructure. Before, there was no problem with collecting all subprocess output -- behavior I'm not interested in changing.
We aren't saying that your application is doing something incorrect here; it certainly sounds like a bug in Tokio. But it would help narrow down the issue if you could test whether it still happens even when you stop reading from the pipe once the vector becomes very large. Something like this:

let read_stderr_fut = async {
    let mut buf = Vec::new();
    while buf.len() < 50_000_000 {
        if stderr.read_buf(&mut buf).await? == 0 {
            break; // EOF
        }
    }
    std::io::Result::<_>::Ok(buf)
};
I have opened a PR that changes …
Running some tests today, but it's taking forever because the problem tends to go away when I add more logs. FWIW the largest stderr we ever see is < 256 bytes.
Update: I was running a new test which was doing

loop {
    let n = stderr.read_buf(&mut buf).await?;
    warn!(...);
    if n == 0 {
        drop(stderr); // (implicit)
        return Ok(buf);
    }
}

The logging probably added a delay that made it so …
Version
Tokio 1.14.0
Platform
Linux [...] SMP Thu Dec 9 04:33:29 UTC 2021 armv7l GNU/Linux
Rust 1.55
Description
I'm going to do my best to report an issue I've been trying to track down for about the last month. I've been porting a medium-sized application from tokio 0.1 (rustc 1.46) to tokio 1.14 (rustc 1.55), and am now seeing rare but recurring OOM crashes; it looks like a stack overflow, but I have little to go on from looking at the core dumps (they come from a running armv7 device). The port itself is not very substantial, since most of the API is the same, but this crash is new. It happens infrequently, but often enough to suggest something is wrong, so I'm inclined to believe a race condition is involved.
I have been able to identify that the moment it crashes is during tokio::process::Child::wait_with_output(), which I invoke like this: …

I've observed that if I reimplement wait_with_output() but drop the stdout/stderr handles after wait() has finished, the crash stops. But Rust community Discord users have suggested to me that this is most likely just a symptom of the underlying problem.

One factor that may be contributing to this crash is that I am using grpcio alongside tokio in the application. I've attempted to take care not to mix executors, yet I am still seeing this issue. I'm going to try to explain how grpcio is used, for some extra context in diagnosing the problem.
I am instantiating the tokio runtime using tokio::runtime::Runtime::new(), and then passing the Handle to the gRPC service handler, which handles requests like this (simplified): …

The gRPC server is started by calling grpcio::Server::start() in main() (not using #[tokio::main]), which then blocks to run the event loop by passing a futures::channel::oneshot to Runtime::block_on() that is only ever completed after a ctrl-c signal is caught.

Let me know if you need any other details and I will try my best to provide them.
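For illustration, a wait_with_output()-style helper of the kind described earlier in this report (one that keeps the stdout/stderr handles alive until wait() has finished and only drops them afterwards) might look roughly like the sketch below. This is an illustrative reconstruction under those assumptions, not the application's actual code, and the function name is made up.

use std::process::{ExitStatus, Output};
use tokio::io::AsyncReadExt;
use tokio::process::Child;

// Sketch: collect stdout/stderr while waiting for the child, and only
// drop the pipe handles once wait() has returned.
async fn wait_with_output_drop_late(mut child: Child) -> std::io::Result<Output> {
    let mut stdout = child.stdout.take().expect("stdout piped");
    let mut stderr = child.stderr.take().expect("stderr piped");

    let mut out_buf = Vec::new();
    let mut err_buf = Vec::new();

    // Drain both pipes concurrently with wait(), so a full pipe buffer
    // cannot stall the child before it exits.
    let (status, out_res, err_res) = tokio::join!(
        child.wait(),
        stdout.read_to_end(&mut out_buf),
        stderr.read_to_end(&mut err_buf),
    );
    out_res?;
    err_res?;
    let status: ExitStatus = status?;

    // The handles are dropped here, after wait() has finished.
    drop(stdout);
    drop(stderr);

    Ok(Output { status, stdout: out_buf, stderr: err_buf })
}

tokio::join! drives all three futures on the same task, so no extra spawns are needed for this shape of the workaround.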
Misc. notes:
- … unsafe code.