Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

appender: fix WorkerGuard not waiting for writer destruction #1713

Merged
merged 7 commits into from
Nov 13, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions tracing-appender/src/non_blocking.rs
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ pub const DEFAULT_BUFFERED_LINES_LIMIT: usize = 128_000;
#[must_use]
#[derive(Debug)]
pub struct WorkerGuard {
_guard: Option<JoinHandle<()>>,
handle: Option<JoinHandle<()>>,
sender: Sender<Msg>,
shutdown: Sender<()>,
}
Expand Down Expand Up @@ -259,7 +259,7 @@ impl<'a> MakeWriter<'a> for NonBlocking {
impl WorkerGuard {
fn new(handle: JoinHandle<()>, sender: Sender<Msg>, shutdown: Sender<()>) -> Self {
WorkerGuard {
_guard: Some(handle),
handle: Some(handle),
sender,
shutdown,
}
Expand All @@ -277,7 +277,16 @@ impl Drop for WorkerGuard {
// when the `Worker` calls `recv()` on a zero-capacity channel. Use `send_timeout`
// so that drop is not blocked indefinitely.
// TODO: Make timeout configurable.
let _ = self.shutdown.send_timeout((), Duration::from_millis(1000));
match self.shutdown.send_timeout((), Duration::from_millis(1000)) {
Err(SendTimeoutError::Disconnected(_)) => (),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we get the Disconnected error, this means that the Receiver side of the channel has already been dropped...but that doesn't necessarily mean that the worker thread has terminated yet. should we still call join on the JoinHandle in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we get the Disconnected error, this means that the Receiver side of the channel has already been dropped...but that doesn't necessarily mean that the worker thread has terminated yet.

right, this can be racy also

If Disconnected can occur only during worker thread (which owns Receiver) destruction, I think it is safe to use join here.
Went ahead and merged Disconnected with Ok case.

Err(SendTimeoutError::Timeout(e)) => {
println!("Failed to wait for logging worker shutdown. Error: {:?}", e);
trtt marked this conversation as resolved.
Show resolved Hide resolved
}
Ok(_) => {
// At this point it is safe to wait for `Worker` destruction without blocking
self.handle.take().map(std::thread::JoinHandle::join);
hawkw marked this conversation as resolved.
Show resolved Hide resolved
}
}
}
Err(SendTimeoutError::Disconnected(_)) => (),
Err(SendTimeoutError::Timeout(e)) => println!(
Expand Down
5 changes: 1 addition & 4 deletions tracing-appender/src/worker.rs
Original file line number Diff line number Diff line change
Expand Up @@ -74,16 +74,13 @@ impl<T: Write + Send + Sync + 'static> Worker<T> {
Ok(WorkerState::Continue) | Ok(WorkerState::Empty) => {}
Ok(WorkerState::Shutdown) | Ok(WorkerState::Disconnected) => {
let _ = self.shutdown.recv();
break;
return;
}
Err(_) => {
// TODO: Expose a metric for IO Errors, or print to stderr
}
}
}
if let Err(e) = self.writer.flush() {
eprintln!("Failed to flush. Error: {}", e);
}
Comment on lines -84 to -86
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm assuming the flush here was removed because we will already always flush on shutdown in Worker::work()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's why I felt safe to remove it

but the actual reason was that it could block (or introduce undefined delay)(#1125 (comment)) when joining the worker. To be honest, I would've just joined the thread, but the IO argument does make sense and I just followed with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised that within this logic it might be a good idea to drop writer as well: 45da512. There could be some IO there in destructor.

})
}
}