No longer actively open legacy substreams #7076
Switching to non-draft, as the point is to run the CI.
I've now realized that, since we no longer open legacy substreams, all connections between peers would always close after 60 seconds. At the moment, the objective of this PR is to prove that #7075 works well. I'll fix that after #7075 is approved or merged.
@tomaka let me know once you would like another review on this pull request.
Ready for review. I've removed the timeout system from the legacy substream entirely. With notification protocols, the listening side pro-actively tries to open substreams; if these get refused, the keep-alive system closes the connection. The existing timeout system for the legacy substream was there for the situation where the dialer doesn't support notification substreams and we don't know whether it intends to open a legacy substream.
Needs a burnin, but after the release of 0.8.24.
ProtocolState::Disabled { .. } | ProtocolState::Poisoned |
ProtocolState::KillAsap => KeepAlive::No,
ProtocolState::Init { .. } | ProtocolState::Normal { .. } => KeepAlive::Yes,
ProtocolState::Opening { .. } | ProtocolState::Disabled { .. } | |
`Opening` is now `No` because of the removal of the timeout.
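The mapping this review comment refers to can be sketched as follows. This is a minimal, self-contained illustration rather than the actual handler code: the `KeepAlive` and `ProtocolState` types below are simplified stand-ins (the real variants carry data), the function name `connection_keep_alive` is hypothetical, and the tail of the `No` arm is assumed from the old side of the diff.

```rust
/// Simplified stand-in for the handler's keep-alive decision.
#[derive(Debug, PartialEq, Eq)]
enum KeepAlive {
    Yes,
    No,
}

/// Simplified stand-in for the legacy handler's state.
/// The real variants carry fields, which are omitted here.
#[allow(dead_code)]
enum ProtocolState {
    Init,
    Opening,
    Normal,
    Disabled,
    Poisoned,
    KillAsap,
}

/// Hypothetical helper mirroring the new side of the diff:
/// only `Init` and `Normal` keep the connection alive.
/// `Opening` no longer does, since no timeout remains to bound it.
fn connection_keep_alive(state: &ProtocolState) -> KeepAlive {
    match state {
        ProtocolState::Init | ProtocolState::Normal => KeepAlive::Yes,
        ProtocolState::Opening
        | ProtocolState::Disabled
        | ProtocolState::Poisoned
        | ProtocolState::KillAsap => KeepAlive::No,
    }
}
```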
> Needs a burnin, but after the release of 0.8.24.
👍
num: Option<usize>,
err: ProtocolsHandlerUpgrErr<EitherError<NotificationsHandshakeError, io::Error>>
num: usize,
err: ProtocolsHandlerUpgrErr<NotificationsHandshakeError>
) {
match (err, num) {
I don't think it still makes sense to match on `num`, right?
Let's wait until Monday to start the burnin, so that more nodes have upgraded.
Burnin' report:
Force-closing a connection in general is almost always caused by the dialing side not opening a legacy substream in time when the listening side has reserved a slot (see #7074). As expected, the node with this PR consequently almost never force-closes connections anymore. I don't really have a strong explanation for the dialing errors caused by an invalid PeerId, other than that, since the outgoing slots aren't full, the local node repeatedly tries to connect to older nodes, which it wouldn't normally have to do. In other words, so far the observation is consistent with what is expected. It's disappointing that only ~25% of the network (roughly guessing from looking at the telemetry) seems to be using 0.8.24, which is not enough to even fill all the slots of that single burnin node.
Filled slots now at 85%, which probably follows the nodes upgrading to 0.8.24.
The node looks like it is behaving normally.
I've been informed that the increase in |
As far as I understand this pull request requires #7075 to be widely deployed. #7075 is only part of Polkadot v0.8.24. When we merge this pull-request now it will be part of Polkadot v0.8.25. Are we sure right now that v0.8.24 will be widely deployed once v0.8.25 is released?
We had a couple of announcements asking validators to upgrade to at least 0.8.24.
bot merge
Trying merge.
Tackles bullet number 2 in this comment.
Based on top of #7075
Shouldn't be merged now, as we need to publish a version between #7075 and this.
The intention in this PR is to check whether CI is green to make sure that #7075 is working properly. If we merge a broken version of #7075, then we will have to wait again for a bugfix release.
This PR changes `legacy.rs` to no longer pro-actively open a legacy substream.
A consequence of this change is that we can no longer establish outgoing connections to nodes that don't have #7075 (hence the need for a release). However we can still receive incoming connections from nodes that don't have #7075.
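The compatibility consequence described above can be written down as a small truth table. The sketch below is only an illustration of that paragraph: `Direction` and `can_establish` are hypothetical names that do not exist in the code base, and the `remote_has_7075` flag stands for whether the remote node includes the changes from #7075.

```rust
/// Direction of a connection, from the point of view of a node running this PR.
#[derive(Clone, Copy)]
enum Direction {
    Outgoing,
    Incoming,
}

/// Hypothetical predicate: can a node running this PR end up with a usable
/// connection to a remote, depending on whether the remote has #7075?
fn can_establish(direction: Direction, remote_has_7075: bool) -> bool {
    match direction {
        // We no longer open the legacy substream ourselves, so an outgoing
        // connection only works if the remote has #7075.
        Direction::Outgoing => remote_has_7075,
        // Incoming connections keep working regardless of the remote's version.
        Direction::Incoming => true,
    }
}
```

Only the outgoing case against a node without #7075 is lost, which is why a release containing #7075 has to be deployed first.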
The diff is quite large because of all the side clean-ups, but the core part of the changes is that `legacy.rs` no longer emits `OutboundSubstreamRequest`.
I've opted to keep the timeout system on the listening side as long as #7074 isn't resolved. After 60 seconds of inactivity on the legacy substream, the connection is force-closed, thereby freeing the peerset slot.
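The listening-side timeout described here could look roughly like the sketch below. It is a simplified illustration using only the standard library, not the handler's actual implementation: `ListenerSlot`, `LEGACY_SUBSTREAM_TIMEOUT`, and `should_force_close` are hypothetical names, and the real code presumably drives this with an asynchronous timer rather than explicit `Instant` comparisons.

```rust
use std::time::{Duration, Instant};

/// Assumed constant: the 60 seconds of inactivity mentioned in the description.
const LEGACY_SUBSTREAM_TIMEOUT: Duration = Duration::from_secs(60);

/// Hypothetical bookkeeping for a peerset slot reserved by the listening side.
struct ListenerSlot {
    /// Last time anything happened on the legacy substream.
    last_activity: Instant,
}

impl ListenerSlot {
    /// Records activity, which resets the inactivity window.
    fn on_activity(&mut self, now: Instant) {
        self.last_activity = now;
    }

    /// Returns `true` once the inactivity window has elapsed, meaning the
    /// connection should be force-closed so that the slot is freed.
    fn should_force_close(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_activity) >= LEGACY_SUBSTREAM_TIMEOUT
    }
}
```

Note that a later comment in this conversation reports that the timeout system was eventually removed from the legacy substream entirely, so this sketch reflects the description as originally written.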