Fix detection of closed connections in cluster connection pools #146
Conversation
Still need to properly understand the new error handling/shutdown process, but here's the feedback so far.
I have not investigated this thoroughly, but could we implement it like this:
pub fn spawn_from_stream<C: Codec + 'static>(
    codec: &C,
    stream: TcpStream,
) -> UnboundedSender<Request> {
    let (read, write) = stream.into_split();
    let (out_tx, out_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();
    let (return_tx, return_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();

    let codec_clone1 = codec.clone();
    let codec_clone2 = codec.clone();

    tokio::spawn(async move {
        tokio::select! {
            result = tx_process(write, out_rx, return_tx, codec_clone1) => if let Err(e) = result {
                trace!("connection write-closed with error: {:?}", e);
            } else {
                trace!("connection write-closed gracefully");
            },
            result = rx_process(read, return_rx, codec_clone2) => if let Err(e) = result {
                trace!("connection read-closed with error: {:?}", e);
            } else {
                trace!("connection read-closed gracefully");
            }
        }
    });

    out_tx
}
Or do we need each in a separate task for performance reasons or something?
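For reference, tokio::select! drops whichever branch loses, so when either process future completes the other is cancelled along with it. A minimal standalone sketch of that cancellation behaviour, with arbitrary sleeps standing in for tx_process/rx_process:

use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    tokio::select! {
        _ = sleep(Duration::from_millis(10)) => println!("fast branch won"),
        _ = sleep(Duration::from_secs(60)) => println!("slow branch won"),
    }
    // The losing branch (the 60-second sleep) is dropped as soon as the
    // winner completes, so it never runs to completion.
}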
let (read, write) = stream.into_split();
let (out_tx, out_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();
let (return_tx, return_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();
let (shutdown_tx, shutdown_rx) = tokio::sync::oneshot::channel();
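As a sketch of what the new shutdown_tx/shutdown_rx pair enables (hypothetical types and names, not the PR's actual tx_process), the write loop can race the request channel against the signal and drop its receiver when told to stop:

use tokio::sync::{mpsc, oneshot};

// Hypothetical stand-in for tx_process: drain requests until the channel is
// exhausted or the reader signals that the remote closed the connection.
async fn tx_process_sketch(
    mut out_rx: mpsc::UnboundedReceiver<String>,
    mut shutdown_rx: oneshot::Receiver<()>,
) {
    loop {
        tokio::select! {
            // The reader saw EOF or an error and fired the signal: stop writing.
            _ = &mut shutdown_rx => break,
            maybe_req = out_rx.recv() => match maybe_req {
                Some(_req) => { /* encode the request and write it to the socket */ }
                None => break, // all senders were dropped
            },
        }
    }
    // Returning drops out_rx, so later sends on the sender half fail fast.
}

#[tokio::main]
async fn main() {
    let (out_tx, out_rx) = mpsc::unbounded_channel::<String>();
    let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>();
    let writer = tokio::spawn(tx_process_sketch(out_rx, shutdown_rx));

    shutdown_tx.send(()).unwrap(); // pretend the reader detected the close
    writer.await.unwrap();
    assert!(out_tx.send("ping".into()).is_err()); // the 'connection' is closed
}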
Let's name it remote_connection_closed_tx, remote_connection_closed_rx.
I renamed it as closed_tx/rx instead, because it's not necessarily a "remote" thing? i.e. we don't know if the rx_process finished gracefully or with errors. 🤔
My understanding is it's for performance reasons: to allow duplex - send & receive on separate threads at the same time. I haven't checked whether duplex is actually possible or implemented correctly, but it would seem like a good reason to keep them separate?
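If it helps, the distinction can be sketched like this (placeholder bodies, not the PR's code): on a multi-threaded runtime, two spawned tasks may run on different worker threads at the same time, whereas a single task using select! only interleaves its branches on one thread:

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    // Two separate tasks: the scheduler is free to run these on different
    // worker threads simultaneously - true full-duplex read/write.
    let rx = tokio::spawn(async { /* rx_process: read frames from the socket */ });
    let tx = tokio::spawn(async { /* tx_process: write frames to the socket */ });
    let _ = tokio::join!(rx, tx);
}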
ah!
Seems not, because then we would be blocking in spawn_from_stream. My next question is: do we need to handle shutting down the reader? Or is that already handled by some other mechanism?
When the transform closes the connection.
ah, so my understanding is that shutdown of the rx_process task works like this: … Is this correct?
It turns out I'd misread the comment I quoted above (which is still a bit confusing to read), but yes, you are correct.
Dug through the kernel to confirm how it's implemented: https://github.com/torvalds/linux/blob/6e764bcd1cf72a2846c0e53d3975a09b242c04c9/net/socket.c#L2231 - a FIN gets sent when a TCP socket is shut down on the write side.
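This is easy to observe in a self-contained sketch (loopback sockets, not from the PR): shutting down the write half sends a FIN, which the peer sees as a read of zero bytes (EOF):

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0").await?;
    let addr = listener.local_addr()?;

    let server = tokio::spawn(async move {
        let (mut socket, _) = listener.accept().await.unwrap();
        let mut buf = [0u8; 1];
        // read() returning Ok(0) is EOF: the peer's FIN arrived.
        let n = socket.read(&mut buf).await.unwrap();
        assert_eq!(n, 0, "expected EOF after the client's write-side shutdown");
    });

    let mut client = TcpStream::connect(addr).await?;
    // Shut down the write half: the kernel sends a FIN, although this end
    // could still read anything the server had in flight.
    client.shutdown().await?;

    server.await.unwrap();
    Ok(())
}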
Wow, that's a thorough investigation, it's a lot clearer now, thanks!
While investigating the problem I found this Abortable type provided by the futures crate, which provides slightly more type safety at the cost of flexibility. We can use it as follows:
pub fn spawn_from_stream<C: Codec + 'static>(
    codec: &C,
    stream: TcpStream,
) -> UnboundedSender<Request> {
    let (read, write) = stream.into_split();
    let (out_tx, out_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();
    let (return_tx, return_rx) = tokio::sync::mpsc::unbounded_channel::<Request>();

    use futures::future::{AbortHandle, Abortable};
    let (handle, registration) = AbortHandle::new_pair();

    let codec_clone = codec.clone();
    tokio::spawn(Abortable::new(
        async move {
            if let Err(e) = tx_process(write, out_rx, return_tx, codec_clone).await {
                trace!("connection write-closed with error: {:?}", e);
            } else {
                trace!("connection write-closed gracefully");
            }
        },
        registration,
    ));

    let codec_clone = codec.clone();
    tokio::spawn(async move {
        if let Err(e) = rx_process(read, return_rx, codec_clone).await {
            trace!("connection read-closed with error: {:?}", e);
        } else {
            trace!("connection read-closed gracefully");
        }
        // Signal the writer to also exit, which then closes `out_tx` - what we consider 'the connection'.
        handle.abort();
    });

    out_tx
}
However, I don't really see much value in using it; tokio::select! is more flexible if our needs change.
Feel free to use it if you want, but I think we should just proceed as is, and maybe delete the TODO comment.
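For completeness, the abort semantics discussed above can be checked in isolation (arbitrary sleep, not the PR's code): aborting resolves the wrapped future to Err(Aborted) instead of letting it finish:

use futures::future::{AbortHandle, Abortable, Aborted};
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (handle, registration) = AbortHandle::new_pair();
    let task = tokio::spawn(Abortable::new(
        async { sleep(Duration::from_secs(60)).await },
        registration,
    ));

    // abort() wakes the wrapped future, which then resolves to Err(Aborted)
    // rather than running the 60-second sleep to completion.
    handle.abort();
    assert!(matches!(task.await.unwrap(), Err(Aborted)));
}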
LGTM
When the remote peer closes a connection, this causes the rx task to stop, but the tx side keeps running until it is used to send a message, which is guaranteed to fail. The fix is to add a 'closed' signal to tell the tx task to stop after the rx task finishes.
Force-pushed from 32870e7 to 6821cd4.
So to confirm, we split into two different tasks for perf reasons.
I also spent some time looking through this and figured I needed a test / some code to better understand the behaviour, and I don't know if it's quite doing what we think it will. With the test below I got a trace which shows that on graceful shutdown the write side of the socket gets a logged error, because tx_process returns an error (from the forward) rather than getting the shutdown from the oneshot (this may be a timing thing, and higher-latency connections may behave as expected). I also would have expected the other end of the channel returned from spawn_from_stream to be closed.

@XA21X could you investigate a little further and maybe include a test to show that shutdown is more or less doing what we expect, and that we can't send on the channel if the socket backing it has been dropped / shut down.

mod test {
    use crate::message::Message;
    use crate::protocols::redis_codec::RedisCodec;
    use crate::protocols::RawFrame;
    use crate::transforms::util::cluster_connection_pool::{spawn_from_stream, ConnectionPool};
    use crate::transforms::util::Request;
    use anyhow::anyhow;
    use std::net::SocketAddr;
    use tokio::io::AsyncReadExt;
    use tokio::net::TcpListener;
    use tokio::net::TcpStream;

    #[tokio::test(flavor = "multi_thread", worker_threads = 2)]
    async fn test_shutdown() {
        let (non_blocking, guard) = tracing_appender::non_blocking(std::io::stdout());
        let builder = tracing_subscriber::fmt()
            .with_writer(non_blocking)
            .with_env_filter("TRACE")
            .with_filter_reloading();
        let handle = builder.reload_handle();
        // To avoid unit tests that run in the same executable from blowing up when they try to
        // reinitialize tracing, we ignore the result returned by try_init. Currently the
        // implementation of try_init will only fail when it is called multiple times.
        builder.try_init().ok();

        let listener = TcpListener::bind("127.0.0.1:2222".parse::<SocketAddr>().unwrap())
            .await
            .unwrap();

        let (closed_tx, closed_rx) = tokio::sync::oneshot::channel();
        let remote = tokio::spawn(async move {
            let (mut socket, _) = listener.accept().await.unwrap();
            // Any completed read (data or EOF) counts as a failure here; only
            // the close signal ends this task cleanly.
            tokio::select! {
                result = socket.read_u8() => if let Err(_) = result {
                    return Err(anyhow!("uh oh"))
                } else {
                    return Err(anyhow!("uh oh"))
                },
                _ = closed_rx => {
                    return Ok(())
                },
            }
        });

        let stream = TcpStream::connect("127.0.0.1:2222").await.unwrap();
        let codec = RedisCodec::new(true, 3);
        let sender = spawn_from_stream(&codec, stream);

        closed_tx.send(1).unwrap();
        assert!(remote.await.is_ok());

        assert!(
            sender
                .send(Request {
                    messages: Message::new_bypass(RawFrame::None),
                    return_chan: None,
                    message_id: None
                })
                .is_err(),
            "channel still up"
        );
    }
}
I was able to reproduce the error with @benbromhead's test, so I've added a test for signalling in each direction. When the remote end closed the connection, the attempted send could still succeed, because the read process had not yet noticed the closure.

Adding a 10 millisecond delay before the send was sufficient to allow the read process to detect the closed connection, and tell the write process to shut down, closing the other side of the sender, before the test attempts a send. It turns out we can block on that condition instead of relying on a fixed delay.

Test Output - Remote shutdown closes local sender:
Test Output - Closing local sender causes remote shutdown:
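A sketch of blocking on that condition with plain tokio channels (simulated write task, not the PR's code): UnboundedSender::closed() resolves once the receiving half is dropped, so no fixed delay is needed:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (sender, receiver) = mpsc::unbounded_channel::<&str>();

    // Simulate the write task exiting and dropping its half of the channel.
    tokio::spawn(async move { drop(receiver) });

    // closed() resolves once the receiver is gone - no sleep required.
    sender.closed().await;
    assert!(sender.send("anything").is_err(), "sends must fail once closed");
}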
Changed logging level in tests to INFO
LGTM - I did a quick ninja fix to change the log level in the tests to INFO. I'll merge once the test rerun is green.
Add a shutdown signal so the Rx process can tell the Tx process to stop when the remote side has closed the connection.