-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for sending error codes on session close #121
base: master
Are you sure you want to change the base?
Conversation
9d54c73
to
f480b10
Compare
f480b10
to
efa52bc
Compare
af1ac71
to
18a75f1
Compare
Is there an issue describing how this will be used? I usually want to send errors when closing a stream, less when closing a connection. Is the plan to use this in the connection manager? |
Apologies! The specs are here: libp2p/specs#623 In go-libp2p we will mostly use Connection Close error codes from the connection manager. Applications can define their error codes. Stream error codes will be introduced in a separate PR. |
18a75f1
to
4b262c0
Compare
4aed194
to
8adb9a8
Compare
Oh, I see. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it probably works, but blocking on connection close is "new" (as far as I know) so we need to make sure it's not going to cause issues with other parts of the code.
// The GoAway may not actually be sent depending on the semantics of the underlying net.Conn. | ||
// For TCP connections, it may be dropped depending on LINGER value or if there's unread data in the kernel | ||
// receive buffer. | ||
func (s *Session) CloseWithError(errCode uint32) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we updated the connection manager to be able to deal with potential blocking here? Also, we should probably document it.
Blocking on a connection is unfortunate. The "correct" way here would be to send the error code with the RST packet. The latest TCP RFC also recommends this: https://www.rfc-editor.org/rfc/rfc9293.html#name-reset-processing
But no implementation provides this API at the moment. |
We can consider running this Async in a different goroutine. That'll use more memory and mess up the resource manager accounting for a short duration, but it'll cause less issues with existing code. The current implementation wouldn't block in most of the cases. If there's receive window available at the remote end, it wont block. |
So, we usually only close connections from the connection manager, right? IMO, we should consider:
On the other hand.... spawning a goroutine isn't terrible. The resource consumption is fairly minimal in modern go and blocking will tie up resources just the same. |
Actually... we already have a goroutine. Can we reuse it? Are we not accounting for that one in the resource manager? |
A common case will also be the resource manager closing connections after seeing some limit has been reached. |
We can but that will not account for the resources held up in the background goroutine. The way it works currently from go-libp2p is:
The alternative, clean up the resources and mark the scope done in a background goroutine, feels worse to me. If you're closing a connection because you're short of resources, this doesn't help at all. If you're not closing the connection because you're short of resources, you can afford to wait, or do the whole thing in a separate goroutine without blocking the routine calling |
181bc90
to
ede18a5
Compare
It does help because it prevents you from using resources you don't have. |
My understanding is that this has the same behavior as before where
Right. We should not be telling the resource manager we are done until after we are done. Otherwise we are open ourselves to attacks the resource manager is meant to prevent. Maybe when we use this in go-libp2p we should set some threshold of conns we are willing to play nice for and send errors, but above that threshold we just hard reset with no error in order as to not stall at reclaiming resources. This should solve the common case of informing well-behaved peers about your error state, as well as avoiding the scenario where a malicious peer hogs up resources by being slow to receive the error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good. We should test these on a libp2p node before cutting a release here. But we can do that after we merge if you like.
session.go
Outdated
@@ -46,9 +46,9 @@ var nullMemoryManager = &nullMemoryManagerImpl{} | |||
type Session struct { | |||
rtt int64 // to be accessed atomically, in nanoseconds | |||
|
|||
// remoteGoAway indicates the remote side does | |||
// remoteGoAwayNormal indicates the remote side does | |||
// not want futher connections. Must be first for alignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What depends on this alignment? Apparently nothing as this has not been first for years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing. I don't need the field either. I have removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the present semantics are, if you call session.GoAway()
, existing streams keep working but it prevents you from creating new streams. These two flags remoteGoAway
and localGoAway
implement this.
So far as I know, we don't use it anywhere in go-libp2p. Since this will be a major version upgrade, we can change the semantics to make GoAway reset all the streams as a session.Close
does.
const.go
Outdated
// ErrRemoteGoAway is used when we get a go away from the other side | ||
ErrRemoteGoAway = &Error{msg: "remote end is not accepting connections"} | ||
// ErrRemoteGoAwayNormal is used when we get a go away from the other side | ||
ErrRemoteGoAwayNormal = &GoAwayError{Remote: true, ErrorCode: goAwayNormal} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some text to explain what "normal" is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means a goAway frame was received with error code goAwayNormal
. I changed the name back to the original to avoid changing a Public name.
s.exitErr(err) | ||
// Prefer the recvLoop error over the sendLoop error. The receive loop might have the error code | ||
// received in a GoAway frame received just before the TCP RST that closed the sendLoop | ||
s.shutdownLock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment as to why you are holding the shutdownLock around this section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comment, can you review once more?
af8e895
to
b523bdd
Compare
Sure, but that's not what I'm arguing. We have two design choices. From a users perspective, who cares about libp2p connections:
In terms of accounting both the approaches are correct they only mark resources as freed when they're actually freed. In the Non Blocking implementation, how will a user who closes the connection because it wants to use the resource somewhere else be notified that the resources have been freed? This will require extending more API surface to signal this cleanup completion. The inverse situation for the blocking approach is how will a user not block when closing a connection with error codes? This is much simpler. You simply close in a goroutine, and if you need to, signal on completion with a channel. One way we can use this is, as @MarcoPolo suggests:
To be clear, the slowness is for our pending writes and peer has no control over it. We are only blocked by our in progress writes. The blocking for just writing the GoAwayFrame is much lower(100 ms). |
Let's merge this along with libp2p/go-libp2p#2927. This will not be used without libp2p/go-libp2p#2927, and this is not an actively worked on repo so there's not much rebasing work to do. |
Correct. |
b523bdd
to
3eaea39
Compare
The peer controls what has been ack'd. Or am I missing something? |
Why would they want this information? Users generally aren't managing connections that way, that's the job of the connection manager. I.e.:
Ideally with some wiggle room to prevent blocking. Note: "waiting until the connection closes" isn't actually useful from a resource management perspective anyways because other connections could have been created in the meantime. |
I completely agree, we should just close in a goroutine. It's informational anyways. My point is just that:
Your point about setting some form of limit seems reasonable, although I figured we could just use the resource limiter for that. Really, an alternative solution is to just be best-effort: instead of trying to re-use the goroutine, attempt to (non-blocking) allocate new resources with the resource manager to cover the new goroutine to send the error code. If that fails, just don't send the error code because we're clearly really low on resources and should just walk away ASAP. |
I've implemented something like this:
In go-libp2p here: libp2p/go-libp2p@3c53bf6#diff-99b3819033614bf0bec2ee2aa8f71c99a9ebc98cbfd887a6242aac406904fe7d See the changes to net/upgrader/conn.go When not resource constrained, we will send the error code. When we are resource constrained, we will terminate immediately. |
Gah. That works but it's kind of nasty. Taking a fresh look at this... blocking on I'm also going to bow out of this conversation because I feel that I might be causing more confusion than anything. |
Your comments have been really helpful Steven! Thanks for the help here. |
Thanks! I'm just not paying enough attention to this conversation so I don't want to keep coming back once every few months with conflicting opinions, adding confusion to the discussion. |
This adds support for sending error codes on closing a session.
The error isn't guaranteed to be sent to remote. It depends on the LINGER value for the TCP socket and also on whether there's any unread data in the receive buffer when close was called. In both these situations, the GoAway frame might be dropped.
To reliably send error codes, we'd have to send a TCP FIN packet and wait for remote its half of the connection. This would also require sending everything that's pending in the kernel write buffer. To not introduce this 1RTT delay of closing, I've opted to make this a best effort implementation.