Fix gossipsub race condition for heartbeat #188
Merged
While using gossipsub on a system with a number of ephemeral peers, we noticed that messages would occasionally fail to route to their intended targets even though a subscribe had been received.
As a basic example, suppose 2 peers are connected, PeerA and PeerB. For TopicA, PeerA is not in the mesh, but it has previously seen the topic and gossiped it to PeerC, which is now disconnected and no longer subscribed.
When a Subscribe request for TopicA is issued from PeerB, then PeerA adds that peer onto the PubSub.topics map:
go-libp2p-pubsub/pubsub.go
Lines 566 to 573 in 49274b0
However, it's not that map that is used for publishing messages back out from PeerA when it's not in the mesh:
go-libp2p-pubsub/gossipsub.go
Lines 233 to 242 in 49274b0
Line #233 there uses GossipSubRouter.fanout as its map, which is updated during the heartbeat process:
go-libp2p-pubsub/gossipsub.go
Lines 441 to 448 in 49274b0
Within the 1 second between heartbeats, these two maps can be out of date with each other, which is to be expected. However, on
go-libp2p-pubsub/gossipsub.go
Line 234 in 49274b0
the !ok check only works as a fallback for the initial iteration. If the key exists but its map is empty because all other peers (PeerC) have unsubscribed, it doesn't fall back to getPeers.
This PR changes that conditional to also check for the empty-map case so that the fallback still happens. The other option is to set the key for the topic back to nil, instead of leaving an empty map, inside the heartbeat process.
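To make the difference concrete, here is a minimal, self-contained sketch of the two conditionals. It is not the library code: peerSet, selectBeforeFix, selectAfterFix, and the fallback function are hypothetical names standing in for the fanout map lookup and the getPeers fallback described above.

```go
package main

import "fmt"

// peerSet is a hypothetical stand-in for a per-topic set of peers.
type peerSet map[string]struct{}

// selectBeforeFix falls back only when the topic key is missing.
// An existing key with an empty set is returned as-is.
func selectBeforeFix(fanout map[string]peerSet, topic string, fallback func(string) peerSet) peerSet {
	peers, ok := fanout[topic]
	if !ok {
		peers = fallback(topic)
	}
	return peers
}

// selectAfterFix also falls back when the key exists but the set is empty.
func selectAfterFix(fanout map[string]peerSet, topic string, fallback func(string) peerSet) peerSet {
	peers, ok := fanout[topic]
	if !ok || len(peers) == 0 {
		peers = fallback(topic)
	}
	return peers
}

func main() {
	// PeerC has been removed, but the TopicA key still holds an empty set.
	fanout := map[string]peerSet{"TopicA": {}}

	// Assumed fallback: returns the currently subscribed peer, PeerB.
	fallback := func(topic string) peerSet {
		return peerSet{"PeerB": {}}
	}

	fmt.Println(len(selectBeforeFix(fanout, "TopicA", fallback))) // no peers selected
	fmt.Println(len(selectAfterFix(fanout, "TopicA", fallback)))  // PeerB selected
}
```

With the first conditional the message is published to nobody until the next heartbeat repopulates the set. With the second, PeerB is picked up immediately.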