appender: WorkerGuard flush guarantee breach #1120

Closed
salewski opened this issue Nov 24, 2020 · 5 comments

@salewski
Contributor

salewski commented Nov 24, 2020

Bug Report

Version

tokio-rs/tracing commit 302d4a9 (which is the latest commit on the master branch at the time of writing).

The tracing crates involved (all from the above commit) are:

  • tracing
  • tracing-appender
  • tracing-subscriber

Platform

    $ uname -srvom
    Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Crates

  • tracing-appender

Description

The documentation for tracing_appender::non_blocking indicates that the WorkerGuard drop guard guarantees that "logs will be flushed during a program's termination, in a panic or otherwise", as long as the guard has not been dropped prematurely.

The behavior I'm seeing with the below program is that the logs are not reliably flushed when the program exits, in this case when returning successfully (Ok(())) from main().

Because the guard variable is held until the end of `main()`, the behavior I would expect from the program is that all events would be emitted to the log file prior to exiting. The full expected output is a "P: Starting" message to stdout, a handful of event messages in the configured log file, and then a "P: Completed" message on stdout.

What I'm seeing instead is that the messages to the log file are sometimes never written.

Some runs of the program emit all of the event messages in the log. Other runs of the program result in an empty log. For this particular program, it seems to be an all-or-nothing scenario (I'm guessing because the small number of events never reaches the first flush trigger).

Whether it works as expected is influenced by the log level. It is easier to make it fail when the level is WARN or ERROR. To see the program sometimes fail and sometimes succeed within five attempts, using WARN seems to be the sweet spot on my machine.

I first noticed the behavior when creating a default ("lossy") NonBlocking, but I see the same behavior using a non-lossy NonBlocking, as well.

Workaround (partial)

Interestingly enough, sleeping for even 1 millisecond on the main thread seems to be enough to "jiggle the handle" and shake all of the event messages out to the log file. At least in my testing thus far, I have not seen it fail when that is the case. So that's my partial workaround for the moment. I'm concerned about what might be lost during a panic, though.

Example output

I run the program below like this:

    $ rm -f /tmp/reg-log-with-level.log ; cargo run --manifest-path ./Cargo.toml && cat /tmp/reg-log-with-level.log

With a successful run, I'll see the following in my terminal (stdout messages followed by the content from the log file):

        Finished dev [unoptimized + debuginfo] target(s) in 0.02s
         Running `target/debug/reg-log-with-level`
    P: Starting.
    P: Completed.
    2020-11-24T20:10:33.465909Z  WARN reg_log_with_level: event warn inside my_fn_with_events!
    2020-11-24T20:10:33.465954Z ERROR reg_log_with_level: event error inside my_fn_with_events!
    2020-11-24T20:10:33.465963Z ERROR reg_log_with_level: E: Completed.

With an unsuccessful run, I'll see the following (no content from log file):

        Finished dev [unoptimized + debuginfo] target(s) in 0.01s
         Running `target/debug/reg-log-with-level`
    P: Starting.
    P: Completed.

Also, after an unsuccessful run the log file will exist, but will be zero bytes in size:

    $ ls -l /tmp/reg-log-with-level.log
    -rw-r--r-- 1 ads ads 0  2020-11-24 15:11:51 /tmp/reg-log-with-level.log

Files

File: 'Cargo.toml':

[workspace]

[package]
name = 'reg-log-with-level'
description = 'Bug? Timing based non-emit of messages to RollingFileAppender'
version = '0.1.0'
authors = ['Alan D. Salewski <ads@salewski.email>']
edition = '2018'

[dependencies]
# b1baa6c2ef4877bd7f11de5991234fc81a025947
# 302d4a92dfee951736bd1a3663224694e2cd73cd
tracing          = { git = 'ssh://git@github.com/tokio-rs/tracing.git', rev = '302d4a92dfee951736bd1a3663224694e2cd73cd' }
tracing-appender = { git = 'ssh://git@github.com/tokio-rs/tracing.git', rev = '302d4a92dfee951736bd1a3663224694e2cd73cd' }

[dependencies.tracing-subscriber]
git = 'ssh://git@github.com/tokio-rs/tracing.git'
rev = '302d4a92dfee951736bd1a3663224694e2cd73cd'
default-features = false
features = [
    'fmt',
]

File: 'src/main.rs':

//! reg-log-with-level: shows events sometimes not emitted to log upon shutdown
//!
//! Run with:
//!     $ rm -f /tmp/reg-log-with-level.log ; cargo run --manifest-path ./Cargo.toml && cat /tmp/reg-log-with-level.log
//!
use std::error::Error;

use tracing::{
    self,
    event,
    Level,
};

// self only used here when creating a lossy NonBlocking
#[allow(unused_imports)]
use tracing_appender::non_blocking::{
    self,
    NonBlockingBuilder,
};

use tracing_subscriber::{
    self,
    filter::LevelFilter,
    subscribe::CollectExt,  // for with(...)
};

fn my_fn_with_events( _my_arg: usize) {
    event!( Level::TRACE, "event trace inside my_fn_with_events!" );
    event!( Level::DEBUG, "event debug inside my_fn_with_events!" );
    event!( Level::INFO,  "event info inside my_fn_with_events!"  );
    event!( Level::WARN,  "event warn inside my_fn_with_events!"  );
    event!( Level::ERROR, "event error inside my_fn_with_events!" );
}

fn main() -> Result<(), Box<dyn Error>> {

    println!("P: Starting.");

    let log_file_appender = {
        let log_dir = "/tmp";
        let log_fname = "reg-log-with-level.log";
        tracing_appender::rolling::never( log_dir, log_fname )  // *.log (no rotation freq. suffix)
    };

    // // lossy
    // let ( nb_log_file_appender, _guard ) = tracing_appender::non_blocking( log_file_appender );

    // non-lossy, yet not all events are flushed when shutting down
    #[allow(unused_variables)]  // Do not complain about guard, even though it is neither '_' nor '_guard'
    let ( nb_log_file_appender, guard ) = NonBlockingBuilder::default()
        .lossy( false )
        .finish( log_file_appender );

    let subscriber = tracing_subscriber::fmt::Subscriber::default().with_writer( nb_log_file_appender );

    let collector = tracing_subscriber::registry()
        // .with( LevelFilter::TRACE )
        // .with( LevelFilter::DEBUG )
        // .with( LevelFilter::INFO  )
        .with( LevelFilter::WARN  )
        // .with( LevelFilter::ERROR )
        .with( subscriber );

    tracing::collect::set_global_default( collector )
        .expect("Unable to set a global collector");

    my_fn_with_events( 42 );

    event!( Level::ERROR, "E: Completed." );
    println!("P: Completed.");

    // Jiggle the handle to shake out any remaining log messages that have
    // been buffered.
    //
    // AFAICT, any non-zero sleep time (including just 1 millisecond) is
    // enough to have the intended effect.

    // std::thread::sleep( std::time::Duration::from_millis( 1000 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  500 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  250 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  125 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  100 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   50 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   25 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   10 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    5 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    3 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    1 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    0 ));  // often fails

    Ok(())
}

Possibly related

I see that the non-blocking appender originates with:

  • PR #673: "Introduce a non blocking file appender"

and tests were added with:

  • PR #678: "Tracing appender tests"

There is discussion in PR #701 about a fixed race related to WorkerGuard (which was found as part of the work for #678 above):

  • PR #701: "Appender: Use channel to signal shutdown of worker thread"
@hawkw
Member

hawkw commented Nov 24, 2020

cc @zekisherif

@zekisherif
Contributor

Was able to reproduce the problem on my local machine. Looking into fixing it now.

@zekisherif
Contributor

The problem appears to be that the worker_thread that was spawned is sometimes shutting down before it receives Msg::Shutdown.

Calling .join() within WorkerGuard::drop() fixes the problem, but it would introduce a possible deadlock: if the worker thread never receives the shutdown signal, we won't be able to complete the Drop of the guard.

Adding a sleep within Drop would fix the problem in most scenarios and wouldn't risk creating a deadlock. It would have been nice to have a JoinHandle::try_join(timeout: Duration) (rust-lang/rfcs#1404).

@hawkw Do you think it's fine just putting a sleep here? Every other approach I can think of, such as join() or the worker thread informing the guard that it received the shutdown, would introduce a possible deadlock.

@zekisherif
Contributor

Oh, it looks like in crossbeam, if you use a zero-capacity channel, the sender will wait until a recv is called on the other end. If we add a dedicated channel for shutdown handling, this could be a better option than a sleep.
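
For illustration, a minimal standalone sketch of the zero-capacity rendezvous behavior described in that comment, assuming the `crossbeam-channel` crate as a dependency (this is not the appender's actual code):

```
use std::{thread, time::Duration};

use crossbeam_channel::bounded;

fn main() {
    // A zero-capacity channel has no buffer: send() blocks until the
    // other side calls recv(), so the two threads rendezvous.
    let (ack_tx, ack_rx) = bounded::<()>(0);

    let worker = thread::spawn(move || {
        // Pretend the worker is still draining its message queue before
        // it is ready to acknowledge shutdown.
        thread::sleep(Duration::from_millis(50));
        ack_rx.recv().expect("guard hung up");
        println!("worker: shutdown acknowledged");
    });

    // Blocks here until the worker reaches recv() above.
    ack_tx.send(()).expect("worker hung up");
    println!("guard: worker has acknowledged; safe to finish dropping");

    worker.join().unwrap();
}
```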

hawkw pushed a commit that referenced this issue Dec 11, 2020
## Motivation

Fixes the race condition outlined in #1120.

## Solution

`Worker` now uses a two-stage shutdown approach. The first shutdown signal
is sent through the main message channel to the `Worker` from
`WorkerGuard` when it is dropped. Then `WorkerGuard` sends a second
signal on a second, zero-capacity channel, meaning a `send()` will only
succeed once a `recv()` is called on the other end. This guarantees that
the `Worker` has flushed all its messages before the `WorkerGuard` can
continue with its drop.

With this solution I'm not able to reproduce the race anymore using the
provided code sample from #1120.

Co-authored-by: Zeki Sherif <zekshi@amazon.com>
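
A rough sketch of the two-stage shutdown pattern that commit message describes; the `Msg` enum and `WorkerGuard` fields here are illustrative assumptions, not the crate's actual internals:

```
use crossbeam_channel::Sender;

// Illustrative message type; the crate's real internals may differ.
#[allow(dead_code)]
enum Msg {
    Line(Vec<u8>),
    Shutdown,
}

// Hypothetical guard holding the two channels described above.
struct WorkerGuard {
    sender: Sender<Msg>,  // main message channel to the worker
    shutdown: Sender<()>, // zero-capacity channel used as a rendezvous
}

impl Drop for WorkerGuard {
    fn drop(&mut self) {
        // Stage 1: ask the worker to stop via the normal message channel,
        // queued behind any log lines still waiting to be written.
        let _ = self.sender.send(Msg::Shutdown);

        // Stage 2: rendezvous on the zero-capacity channel. send() only
        // completes once the worker calls recv(), i.e. after it has
        // drained and flushed everything ahead of the shutdown message.
        let _ = self.shutdown.send(());
    }
}
```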
@hawkw
Member

hawkw commented Dec 14, 2020

I believe #1125 fixed this.

@hawkw hawkw closed this as completed Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 28, 2020
hawkw pushed a commit that referenced this issue Nov 13, 2021
## Motivation

Can be thought of as a continuation of #1120 and #1125.

Example with problematic racy behavior:
```
use std::io::Write;

struct TestDrop<T: Write>(T);

impl<T: Write> Drop for TestDrop<T> {
    fn drop(&mut self) {
        println!("Dropped");
    }
}

impl<T: Write> Write for TestDrop<T> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.write(buf)
    }
    fn flush(&mut self) -> std::io::Result<()> {
        self.0.flush()
    }
}

fn main() {
    let writer = TestDrop(std::io::stdout());
    let (non_blocking, _guard) = tracing_appender::non_blocking(writer);
    tracing_subscriber::fmt().with_writer(non_blocking).init();
}
```

Running this test case in a loop with `while ./test | grep Dropped; do
done`, it can be seen that sometimes the writer (`TestDrop`) is not dropped
and the message is not printed. I suppose that proper destruction of the
non-blocking writer should properly destroy the underlying writer.

## Solution

The solution involves joining the `Worker` thread (which owns the writer)
after waiting for it to almost finish, avoiding a potential deadlock (see
#1120 (comment))
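
A rough sketch of the shutdown sequence this commit describes, extending the earlier two-stage sketch; it assumes the same hypothetical `WorkerGuard` fields plus a `handle: Option<JoinHandle<()>>` for the spawned worker thread:

```
impl Drop for WorkerGuard {
    fn drop(&mut self) {
        // Stages 1 and 2 as before: request shutdown, then rendezvous on
        // the zero-capacity channel so the worker is known to be nearly
        // finished before we try to join it.
        let _ = self.sender.send(Msg::Shutdown);
        let _ = self.shutdown.send(());

        // Now join the worker so its writer is dropped (and flushed)
        // before the guard's drop returns; the rendezvous above keeps
        // this from blocking on a worker that never saw the shutdown.
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```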
davidbarsky pushed a commit that referenced this issue Nov 17, 2023