appender: WorkerGuard flush guarantee breach #1120

Closed
salewski opened this issue Nov 24, 2020 · 5 comments

@salewski
Contributor

salewski commented Nov 24, 2020

Bug Report

Version

tokio-rs/tracing commit 302d4a9 (which is the latest commit on the master branch at the time of writing).

The tracing crates involved (all from the above commit) are:

  • tracing
  • tracing-appender
  • tracing-subscriber

Platform

    $ uname -srvom
    Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Crates

  • tracing-appender

Description

The documentation for tracing_appender::non_blocking indicates that the WorkerGuard drop guard guarantees that "logs will be flushed during a program's termination, in a panic or otherwise", as long as the guard has not been dropped prematurely.

The behavior I'm seeing with the below program is that the logs are not reliably flushed when the program exits, in this case when returning successfully (Ok(())) from main().

Because the guard variable is held until the end of `main()`, the behavior I would expect from the program is that all events would be emitted to the log file prior to exiting. The full expected output is a "P: Starting" message to stdout, a handful of event messages in the configured log file, and then a "P: Completed" message on stdout.

What I'm seeing instead is that the messages to the log file are sometimes never written.

Some runs of the program emit all of the event messages in the log. Other runs of the program result in an empty log. For this particular program, it seems to be an all-or-nothing scenario (I'm guessing because the small number of events never reaches the first flush trigger).

Whether it works as expected is influenced by the log level. It is easier to make it fail when the level is WARN or ERROR. To see the program sometimes fail and sometimes succeed within five attempts, using WARN seems to be the sweet spot on my machine.

I first noticed the behavior when creating a default ("lossy") NonBlocking, but I see the same behavior using a non-lossy NonBlocking, as well.

Workaround (partial)

Interestingly enough, sleeping for even 1 millisecond on the main thread seems to be enough to "jiggle the handle" and shake all of the event messages out to the log file. At least in my testing thus far, I have not seen it fail when that is the case. So that's my partial workaround for the moment. I'm concerned about what might be lost during a panic, though.

Example output

I run the program below like this:

    $ rm -f /tmp/reg-log-with-level.log ; cargo run --manifest-path ./Cargo.toml && cat /tmp/reg-log-with-level.log

With a successful run, I'll see the following in my terminal (stdout messages followed by the content from the log file):

        Finished dev [unoptimized + debuginfo] target(s) in 0.02s
         Running `target/debug/reg-log-with-level`
    P: Starting.
    P: Completed.
    2020-11-24T20:10:33.465909Z  WARN reg_log_with_level: event warn inside my_fn_with_events!
    2020-11-24T20:10:33.465954Z ERROR reg_log_with_level: event error inside my_fn_with_events!
    2020-11-24T20:10:33.465963Z ERROR reg_log_with_level: E: Completed.

With an unsuccessful run, I'll see the following (no content from log file):

        Finished dev [unoptimized + debuginfo] target(s) in 0.01s
         Running `target/debug/reg-log-with-level`
    P: Starting.
    P: Completed.

Also, after an unsuccessful run the log file will exist, but will be zero bytes in size:

    $ ls -l /tmp/reg-log-with-level.log
    -rw-r--r-- 1 ads ads 0  2020-11-24 15:11:51 /tmp/reg-log-with-level.log

Files

File: 'Cargo.toml':

[workspace]

[package]
name = 'reg-log-with-level'
description = 'Bug? Timing based non-emit of messages to RollingFileAppender'
version = '0.1.0'
authors = ['Alan D. Salewski <ads@salewski.email>']
edition = '2018'

[dependencies]
# b1baa6c2ef4877bd7f11de5991234fc81a025947
# 302d4a92dfee951736bd1a3663224694e2cd73cd
tracing          = { git = 'ssh://git@github.com/tokio-rs/tracing.git', rev = '302d4a92dfee951736bd1a3663224694e2cd73cd' }
tracing-appender = { git = 'ssh://git@github.com/tokio-rs/tracing.git', rev = '302d4a92dfee951736bd1a3663224694e2cd73cd' }

[dependencies.tracing-subscriber]
git = 'ssh://git@github.com/tokio-rs/tracing.git'
rev = '302d4a92dfee951736bd1a3663224694e2cd73cd'
default-features = false
features = [
    'fmt',
]

File: 'src/main.rs':

//! reg-log-with-level: shows events sometimes not emitted to log upon shutdown
//!
//! Run with:
//!     $ rm -f /tmp/reg-log-with-level.log ; cargo run --manifest-path ./Cargo.toml && cat /tmp/reg-log-with-level.log
//!
use std::error::Error;

use tracing::{
    self,
    event,
    Level,
};

// self only used here when creating a lossy NonBlocking
#[allow(unused_imports)]
use tracing_appender::non_blocking::{
    self,
    NonBlockingBuilder,
};

use tracing_subscriber::{
    self,
    filter::LevelFilter,
    subscribe::CollectExt,  // for with(...)
};

fn my_fn_with_events( _my_arg: usize) {
    event!( Level::TRACE, "event trace inside my_fn_with_events!" );
    event!( Level::DEBUG, "event debug inside my_fn_with_events!" );
    event!( Level::INFO,  "event info inside my_fn_with_events!"  );
    event!( Level::WARN,  "event warn inside my_fn_with_events!"  );
    event!( Level::ERROR, "event error inside my_fn_with_events!" );
}

fn main() -> Result<(), Box<dyn Error>> {

    println!("P: Starting.");

    let log_file_appender = {
        let log_dir = "/tmp";
        let log_fname = "reg-log-with-level.log";
        tracing_appender::rolling::never( log_dir, log_fname )  // *.log (no rotation freq. suffix)
    };

    // // lossy
    // let ( nb_log_file_appender, _guard ) = tracing_appender::non_blocking( log_file_appender );

    // non-lossy, yet not all events are flushed when shutting down
    #[allow(unused_variables)]  // Do not complain about guard, even though it is neither '_' nor '_guard'
    let ( nb_log_file_appender, guard ) = NonBlockingBuilder::default()
        .lossy( false )
        .finish( log_file_appender );

    let subscriber = tracing_subscriber::fmt::Subscriber::default().with_writer( nb_log_file_appender );

    let collector = tracing_subscriber::registry()
        // .with( LevelFilter::TRACE )
        // .with( LevelFilter::DEBUG )
        // .with( LevelFilter::INFO  )
        .with( LevelFilter::WARN  )
        // .with( LevelFilter::ERROR )
        .with( subscriber );

    tracing::collect::set_global_default( collector )
        .expect("Unable to set a global collector");

    my_fn_with_events( 42 );

    event!( Level::ERROR, "E: Completed." );
    println!("P: Completed.");

    // Jiggle the handle to shake out any remaining log messages that have
    // been buffered.
    //
    // AFAICT, any non-zero sleep time (including just 1 millisecond) is
    // enough to have the intended effect.

    // std::thread::sleep( std::time::Duration::from_millis( 1000 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  500 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  250 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  125 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(  100 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   50 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   25 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(   10 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    5 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    3 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    1 ));  // seems to always work
    // std::thread::sleep( std::time::Duration::from_millis(    0 ));  // often fails

    Ok(())
}

Possibly related

I see that the non-blocking appender originates with:

  • PR #673: "Introduce a non blocking file appender"

and tests were added with:

  • PR #678: "Tracing appender tests"

There is discussion in PR #701 about a fixed race related to WorkerGuard (which was found as part of the work for #678 above):

  • PR #701: "Appender: Use channel to signal shutdown of worker thread"
@hawkw
Member

hawkw commented Nov 24, 2020

cc @zekisherif

@zekisherif
Contributor

Was able to reproduce the problem on my local machine. Looking into fixing it now.

@zekisherif
Contributor

The problem appears to be that the worker_thread that was spawned is sometimes shutting down before it receives Msg::Shutdown.

Calling .join() within WorkerGuard::drop() fixes the problem, but it would introduce a possible deadlock: if the worker thread never receives the shutdown signal, we won't be able to complete the Drop of the guard.

Adding a sleep within Drop would fix the problem in most scenarios and wouldn't risk creating a deadlock. It would have been nice to have a JoinHandle::try_join(timeout: Duration) (rust-lang/rfcs#1404).

@hawkw Do you think it's fine just putting a sleep here? Every other approach I can think of, such as join() or the worker thread informing the guard that it received the shutdown, would introduce a possible deadlock.

@zekisherif
Contributor

Oh, it looks like in crossbeam, if you use a zero-capacity channel, the sender will wait until a recv is called on the other end. If we add a dedicated channel for shutdown handling, this could be a better option than a sleep.
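
For illustration, a minimal standalone sketch of the zero-capacity rendezvous behavior described in that comment, assuming the `crossbeam-channel` crate as a dependency (this is not the appender's actual code):

```
use std::{thread, time::Duration};

use crossbeam_channel::bounded;

fn main() {
    // A zero-capacity channel has no buffer: send() blocks until the
    // other side calls recv(), so the two threads rendezvous.
    let (ack_tx, ack_rx) = bounded::<()>(0);

    let worker = thread::spawn(move || {
        // Pretend the worker is still draining its message queue before
        // it is ready to acknowledge shutdown.
        thread::sleep(Duration::from_millis(50));
        ack_rx.recv().expect("guard hung up");
        println!("worker: shutdown acknowledged");
    });

    // Blocks here until the worker reaches recv() above.
    ack_tx.send(()).expect("worker hung up");
    println!("guard: worker has acknowledged; safe to finish dropping");

    worker.join().unwrap();
}
```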

hawkw pushed a commit that referenced this issue Dec 11, 2020
## Motivation

Fixes the race condition outlined in #1120.

## Solution

`Worker` now uses a two-stage shutdown approach. The first shutdown signal
is sent through the main message channel to the `Worker` from
`WorkerGuard` when it is dropped. Then `WorkerGuard` sends a second
signal on a second, zero-capacity channel, meaning a `send()` will only
succeed once a `recv()` is called on the other end. This guarantees that
the `Worker` has flushed all its messages before the `WorkerGuard` can
continue with its drop.

With this solution I'm not able to reproduce the race anymore using the
provided code sample from #1120.

Co-authored-by: Zeki Sherif <zekshi@amazon.com>
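
A rough sketch of the two-stage shutdown pattern that commit message describes; the `Msg` enum and `WorkerGuard` fields here are illustrative assumptions, not the crate's actual internals:

```
use crossbeam_channel::Sender;

// Illustrative message type; the crate's real internals may differ.
#[allow(dead_code)]
enum Msg {
    Line(Vec<u8>),
    Shutdown,
}

// Hypothetical guard holding the two channels described above.
struct WorkerGuard {
    sender: Sender<Msg>,  // main message channel to the worker
    shutdown: Sender<()>, // zero-capacity channel used as a rendezvous
}

impl Drop for WorkerGuard {
    fn drop(&mut self) {
        // Stage 1: ask the worker to stop via the normal message channel,
        // queued behind any log lines still waiting to be written.
        let _ = self.sender.send(Msg::Shutdown);

        // Stage 2: rendezvous on the zero-capacity channel. send() only
        // completes once the worker calls recv(), i.e. after it has
        // drained and flushed everything ahead of the shutdown message.
        let _ = self.shutdown.send(());
    }
}
```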
@hawkw
Member

hawkw commented Dec 14, 2020

I believe #1125 fixed this.

@hawkw hawkw closed this as completed Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 14, 2020
hawkw pushed a commit that referenced this issue Dec 28, 2020
hawkw pushed a commit that referenced this issue Nov 13, 2021
## Motivation

Can be thought of as a continuation of #1120 and #1125.

Example with problematic racy behavior:
```
use std::io::Write;

struct TestDrop<T: Write>(T);

impl<T: Write> Drop for TestDrop<T> {
    fn drop(&mut self) {
        println!("Dropped");
    }
}

impl<T: Write> Write for TestDrop<T> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.write(buf)
    }
    fn flush(&mut self) -> std::io::Result<()> {
        self.0.flush()
    }
}

fn main() {
    let writer = TestDrop(std::io::stdout());
    let (non_blocking, _guard) = tracing_appender::non_blocking(writer);
    tracing_subscriber::fmt().with_writer(non_blocking).init();
}
```

Running this test case in a loop with `while ./test | grep Dropped; do
done`, it can be seen that sometimes the writer (`TestDrop`) is not dropped
and the message is not printed. I suppose that proper destruction of the
non-blocking writer should properly destroy the underlying writer.

## Solution

The solution involves joining the `Worker` thread (which owns the writer)
after waiting for it to almost finish, avoiding a potential deadlock (see
#1120 (comment))
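
A rough sketch of the shutdown sequence this commit describes, extending the earlier two-stage sketch; it assumes the same hypothetical `WorkerGuard` fields plus a `handle: Option<JoinHandle<()>>` for the spawned worker thread:

```
impl Drop for WorkerGuard {
    fn drop(&mut self) {
        // Stages 1 and 2 as before: request shutdown, then rendezvous on
        // the zero-capacity channel so the worker is known to be nearly
        // finished before we try to join it.
        let _ = self.sender.send(Msg::Shutdown);
        let _ = self.shutdown.send(());

        // Now join the worker so its writer is dropped (and flushed)
        // before the guard's drop returns; the rendezvous above keeps
        // this from blocking on a worker that never saw the shutdown.
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```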
davidbarsky pushed a commit that referenced this issue Nov 17, 2023