appender: WorkerGuard flush guarantee breach #1120
Comments
cc @zekisherif

Was able to reproduce the problem on my local machine. Looking into fixing it now.
The problem appears to be that the spawned worker_thread is sometimes shutting down before it receives the shutdown signal. Adding a sleep within Drop would fix the problem in most scenarios and wouldn't risk creating a deadlock. It would have been nice to have a way to join the worker with a timeout. @hawkw Do you think it's fine just putting a sleep here? All other scenarios I can think of, join() or the worker_thread informing the guard that it received the shutdown, would introduce a possible deadlock.
Oh, it looks like in crossbeam, if you use a zero-capacity channel, the sender will wait until a receiver is ready to receive the message.
## Motivation

Fixes the race condition outlined in #1120.

## Solution

`Worker` now uses a two-stage shutdown approach. The first shutdown signal is sent through the main message channel from `WorkerGuard` to the `Worker` when the guard is dropped. Then `WorkerGuard` sends a second signal on a second, zero-capacity channel: a `send()` on it only succeeds once a `recv()` is called on the other end. This guarantees that the `Worker` has flushed all its messages before the `WorkerGuard` can continue with its drop. With this solution I'm no longer able to reproduce the race using the provided code sample from #1120.

Co-authored-by: Zeki Sherif <zekshi@amazon.com>
I believe #1125 fixed this.
## Motivation

Can be thought of as a continuation to #1120 and #1125. Example with problematic racy behavior:

```rust
use std::io::Write;

struct TestDrop<T: Write>(T);

impl<T: Write> Drop for TestDrop<T> {
    fn drop(&mut self) {
        println!("Dropped");
    }
}

impl<T: Write> Write for TestDrop<T> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.write(buf)
    }

    fn flush(&mut self) -> std::io::Result<()> {
        self.0.flush()
    }
}

fn main() {
    let writer = TestDrop(std::io::stdout());
    let (non_blocking, _guard) = tracing_appender::non_blocking(writer);
    tracing_subscriber::fmt().with_writer(non_blocking).init();
}
```

Running this test case in a loop with `while ./test | grep Dropped; do :; done`, it can be seen that sometimes the writer (`TestDrop`) is not dropped and the message is not printed. Proper destruction of the non-blocking writer should also destroy the underlying writer.

## Solution

The solution joins the `Worker` thread (which owns the writer) after waiting for it to almost finish, avoiding the potential deadlock (see #1120 (comment)).
Bug Report
Version
tokio-rs/tracing commit 302d4a9 (which is the latest commit on the `master` branch at the time of writing). The tracing crates involved (all from the above commit) are:
- tracing
- tracing-appender
- tracing-subscriber
Platform
Crates
tracing-appender
Description
The documentation for `tracing_appender::non_blocking` indicates that the `WorkerGuard` drop guard guarantees that "logs will be flushed during a program's termination, in a panic or otherwise", as long as the guard has not been dropped prematurely.

The behavior I'm seeing with the below program is that the logs are not reliably flushed when the program exits, in this case when returning successfully (`Ok(())`) from `main()`.

Because the `guard` variable is held until the end of `main()`, the behavior I would expect from the program is that all events would be emitted to the log file prior to exiting. The full expected output is a "P: Starting" message on stdout, a handful of event messages in the configured log file, and then a "P: Completed" message on stdout. What I'm seeing instead is that the messages to the log file are sometimes never written.
Some runs of the program emit all of the event messages in the log. Other runs of the program result in an empty log. For this particular program, it seems to be an all-or-nothing scenario (I'm guessing due to the small number of events, avoiding the first trigger to flush them).
Whether it works as expected is influenced by the log level. It is easier to make it fail when the level is `WARN` or `ERROR`. To see the program sometimes fail and sometimes succeed within five attempts, `WARN` seems to be the sweet spot on my machine.

I first noticed the behavior when creating a default ("lossy") `NonBlocking`, but I see the same behavior using a non-lossy `NonBlocking` as well.

Workaround (partial)
Interestingly enough, sleeping for even 1 millisecond on the main thread seems to be enough to "jiggle the handle" and shake all of the event messages out to the log file. At least in my testing thus far, I have not seen it fail when that is the case. So that's my partial workaround for the moment. I'm concerned with what might be lost during a panic, though.
Example output
I run the below program like this:
$ rm -f /tmp/reg-log-with-level.log ; cargo run --manifest-path ./Cargo.toml && cat /tmp/reg-log-with-level.log
With a successful run, I'll see the following in my terminal (stdout messages followed by the content from the log file):
With an unsuccessful run, I'll see the following (no content from log file):
Also, after an unsuccessful run the log file will exist, but will be zero bytes in size:
Files
File: 'Cargo.toml':
File: 'src/main.rs':
Possibly related
I see that the non-blocking appender originates with:
and tests were added with:
There is discussion in PR #701 about a fixed race related to `WorkerGuard` (which was found as part of the work for #678 above).