p2p: peer state init too late and pex message too soon #3361
Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    64.29%    64.48%    +0.19%
===========================================
  Files          213       213
  Lines        17447     17462       +15
===========================================
+ Hits         11217     11260       +43
+ Misses        5308      5290       -18
+ Partials       922       912       -10
Why does this have to happen before peer is started?
I see. The peer is started and thus might receive a message before AddPeer is called.
What if we let Receive call AddPeer, and add some atomic integer so AddPeer only gets called once?
If we do it like this, what happens when the reactor wants to send a message to the peer first? Does it have to wait until it receives some message?
No, we'd still have AddPeer called by the switch, but if it turns out a message is received before AddPeer, Receive could make sure AddPeer gets called first. Then the other AddPeer call made by the Switch should have no effect.
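For reference, a minimal standalone sketch of this "Receive triggers AddPeer once" idea, using hypothetical types rather than the actual tendermint reactor API; a mutex-guarded map plays the role of the suggested atomic integer:

```go
package main

import (
	"fmt"
	"sync"
)

// onceReactor: whichever of Receive or the Switch's AddPeer reaches a peer
// first performs the one-time setup; the other call becomes a no-op.
type onceReactor struct {
	mtx   sync.Mutex
	added map[string]bool
}

func (r *onceReactor) addPeerOnce(peerID string) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.added[peerID] {
		return // setup already done for this peer
	}
	r.added[peerID] = true
	fmt.Println("per-peer setup for", peerID)
}

// AddPeer is what the Switch would call.
func (r *onceReactor) AddPeer(peerID string) { r.addPeerOnce(peerID) }

// Receive guards against messages arriving before the Switch's AddPeer.
func (r *onceReactor) Receive(peerID string, msg []byte) {
	r.addPeerOnce(peerID) // no-op if the Switch already added the peer
	fmt.Printf("handling %d bytes from %s\n", len(msg), peerID)
}

func main() {
	r := &onceReactor{added: map[string]bool{}}
	r.Receive("peer1", []byte("pex request")) // arrives before AddPeer
	r.AddPeer("peer1")                        // setup already done, no effect
}
```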
Maybe. I will submit another commit and let's see if it makes sense.
I can try and rewrite the code to use
I agree with @guagualvcha
Can you
updated
First of all, sorry for the late reply; I've recently been catching up with the backlog. Thanks for preparing this change and addressing #3338. It would be helpful if you could provide a flow chart or some event ordering which highlights what sequence triggers the issue.
For the change itself: we introduce a new method on the already large reactor, which adds more peer lifecycle management responsibilities outside of the p2p package that only seem to affect consensus and pex. I don't consider this a major blocker for the change, but it is something I find worth pointing out. Maybe there is a more isolated fix.
@@ -681,6 +681,9 @@ func (sw *Switch) addPeer(p Peer) error {
	}
	sw.metrics.Peers.Add(float64(1))

	for _, reactor := range sw.reactors {
		p = reactor.InitPeer(p)
	}
We call this directly before the AddPeer loop; what is the upside of this? Why can't it happen when AddPeer is performed?
The key point is: there is no handshake that tells the other peer I am ready to receive messages, so we do some init work before the mconnection starts. @xla
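A simplified, self-contained sketch of the ordering being aimed for here (stand-in types only, not the PR's current Switch code): allocate per-peer state, then start the connection, then announce the peer.

```go
package main

import "fmt"

// Stand-in types; the real p2p package has richer Peer and Reactor interfaces.
type Peer struct{ ID string }

type Reactor interface {
	InitPeer(p Peer) Peer
	AddPeer(p Peer)
}

// addPeerSketch shows the intended order: because there is no "I am ready"
// handshake between peers, per-peer state must exist before the connection
// starts delivering messages.
func addPeerSketch(p Peer, reactors []Reactor, start func(Peer) error) error {
	for _, r := range reactors {
		p = r.InitPeer(p) // 1. allocate state while no messages can arrive yet
	}
	if err := start(p); err != nil { // 2. start the mconnection; Receive may now fire
		return err
	}
	for _, r := range reactors {
		r.AddPeer(p) // 3. announce the started peer
	}
	return nil
}

type logReactor struct{}

func (logReactor) InitPeer(p Peer) Peer { fmt.Println("InitPeer", p.ID); return p }
func (logReactor) AddPeer(p Peer)       { fmt.Println("AddPeer", p.ID) }

func main() {
	_ = addPeerSketch(Peer{ID: "peer1"}, []Reactor{logReactor{}}, func(Peer) error {
		fmt.Println("mconnection started")
		return nil
	})
}
```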
If I understand correctly, the peer's mconnection is started too early, but the introduction of InitPeer won't help with that, as p.Start() is called before it.
The only sensible way forward is to find a way to write a failing test for this scenario and then verify that any proposed solution solves it.
I tend to agree with @xla here; it's not clear the fix will work. Curious: have you tried to actually delay the peer reconnecting? (e.g. in switch.go:reconnectToPeer(), move the sw.randomSleep() call to the very beginning of the first for loop.)
My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up. I think it would be OK to delay the reconnection of a peer that was stopped for an error.
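A standalone sketch of this alternative (hypothetical helper, not the actual switch.go code), where the random sleep happens before the first dial attempt so that RemovePeer has time to finish its cleanup:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// reconnectLoop sleeps *before* every attempt, including the first one, so a
// peer that was just stopped for an error is not redialed immediately.
func reconnectLoop(attempts int, baseInterval time.Duration, dial func() error) error {
	for i := 0; i < attempts; i++ {
		// the randomized sleep is moved to the top of the loop
		time.Sleep(baseInterval + time.Duration(rand.Int63n(int64(baseInterval))))
		if err := dial(); err == nil {
			return nil
		}
	}
	return errors.New("failed to reconnect")
}

func main() {
	err := reconnectLoop(3, 100*time.Millisecond, func() error {
		fmt.Println("dialing...")
		return nil // pretend the dial succeeds
	})
	fmt.Println("done, err =", err)
}
```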
> My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up.

The core issue here is that reactor.Receive might be called before reactor.AddPeer, which will lead to the reactor panicking if it depends on data from reactor.AddPeer. The proposed solution is to move data initialization to a new reactor.InitPeer function, which is guaranteed to be executed before reactor.Receive is called.
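To make that contract concrete, here is a hedged, self-contained sketch (simplified stand-in types, not the consensus or pex reactor code) of a reactor whose Receive depends on state that InitPeer allocates:

```go
package main

import (
	"fmt"
	"sync"
)

// Peer is a stand-in for the p2p peer type.
type Peer struct{ ID string }

type peerState struct{ msgs int }

// statefulReactor's Receive depends on peerStates[p.ID], which the proposed
// contract guarantees InitPeer has created before any Receive call.
type statefulReactor struct {
	mtx        sync.Mutex
	peerStates map[string]*peerState
}

// InitPeer allocates the per-peer state before any message can be delivered.
func (r *statefulReactor) InitPeer(p Peer) Peer {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.peerStates[p.ID] = &peerState{}
	return p
}

// Receive relies on the state existing; without the InitPeer guarantee this
// is exactly the place that could panic or error out.
func (r *statefulReactor) Receive(p Peer, msg []byte) error {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	ps, ok := r.peerStates[p.ID]
	if !ok {
		return fmt.Errorf("no state for peer %s", p.ID)
	}
	ps.msgs++
	return nil
}

func main() {
	r := &statefulReactor{peerStates: map[string]*peerState{}}
	p := r.InitPeer(Peer{ID: "peer1"})           // done by the switch first
	fmt.Println(r.Receive(p, []byte("pex msg"))) // safe even before AddPeer
}
```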
Mostly agree with this. Shouldn't necessarily block making this fix, but we should at least open a new issue to track this problem and see if there's a better solution that doesn't require the new method.
@guagualvcha is it conceivable to write a test for this?
Sure, I will keep working on this.
@guagualvcha Thanks for taking the time and following up on this. Currently the test-cases don't fail when the
Definitely right. Added in the latest commit. @xla
force-pushed from b4043fc to 410aec5
@guagualvcha Please don't force push as it makes it hard to review the latest changes. We have a policy of squash merging, so long commit histories are not a problem.
Given that the test-cases only fail sometimes without InitPeer, and sometimes fail even with the new InitPeer, I believe the issue is not addressed properly.
We already have quite a few cases of non-determinism in our tests and I'd like to avoid adding more flakiness.
@@ -35,6 +37,77 @@ func TestPEXReactorBasic(t *testing.T) {
	assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
This test doesn't always fail when InitPeer is removed.
It sometimes failed with InitPeer.
The issue I fixed also does not always happen; it depends on how the goroutines are scheduled, as I described. It happens only sometimes, which is why the test case runs 10000 times to make it happen. I could delete this case later, but it can prove that without InitPeer, things go wrong.
Another reason is that I not only added InitPeer, but also added the following code, which deletes the entry from lastReceivedRequests before stopping the peer:

if now.Sub(lastReceived) < minInterval {
	r.lastReceivedRequests.Delete(id)
	return fmt.Errorf("Peer (%v) sent next PEX request too soon. lastReceived: %v, now: %v, minInterval: %v. Disconnecting",
		src.ID(),
		lastReceived,
		now, minInterval)
}
If the success of the fix is sensitive to the inner workings of the runtime, we need to find a way to never fail, no matter how the goroutines are scheduled.
	assert.NotEqual(t, stopForError, 0)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Same for this test, it only fails sometimes.
So does this case.
@@ -40,6 +40,24 @@ func NewPeer(ip net.IP) *Peer {
	return mp
}

func NewFixIdPeer(ip net.IP, id p2p.ID) *Peer {
Can't we use NewPeer? Why do we need to provide an external ID?
Can't we overwrite ID in any case?
peer = mock.NewPeer(nil)
peer.ID = nodeID
It is an option, but we should expose ID.
Let's do that. It's a mock peer anyway.
Peer does not have a state yet; we set it in AddPeer. We need a new interface before the mconnection is started.
Fix reconnection PEX send-too-fast; the error is caused by lastReceivedRequests still not being deleted when a peer reconnects.
Although I agree with the concerns raised by previous reviewers, this PR seems to at least improve the situation! My suggestion is to merge it as is and capture the remaining problems in a follow-up issue (linking to the discussions here).
	for i := 0; i < 100; i++ {
		peer := mock.NewFixIdPeer(nil, nodeId)
		err := p2p.AddPeerToSwitch(sw, peer)
We don't need to initialize a switch and add the peer to it in order to test that the reactor does not panic anymore. We can just call reactor.InitPeer() here and write a comment (CONTRACT) saying that if InitPeer is not called, the Reactor will panic!
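A rough sketch of what that simplification could look like (the reactor-building helper name is an assumption, not the exact pex test code):

```go
func TestPEXReactorReceiveAfterInitPeerDoesNotPanic(t *testing.T) {
	r := createDefaultReactor(t) // assumed test helper that builds a PEX reactor
	peer := mock.NewPeer(nil)

	// CONTRACT: the switch always calls InitPeer before Receive.
	// If InitPeer is not called, the Reactor is allowed to panic.
	r.InitPeer(peer)

	msg := cdc.MustMarshalBinaryBare(&pexRequestMessage{})
	require.NotPanics(t, func() {
		r.Receive(PexChannel, peer, msg)
	})
}
```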
I can do that
@@ -35,6 +37,73 @@ func TestPEXReactorBasic(t *testing.T) {
	assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
Can we rename this to something which reflects the test's purpose? I don't understand what the test is testing right now.
	assert.Equal(t, stopForError, 99)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Can we rename this to something which reflects the test's purpose? I don't understand what the test is testing right now.
I am going to take over this.
Closing in favor of #3634.
…its logic to Lock (tendermint#3361): This PR partially reverts the backport of tendermint#3314 into the recently released `v0.38.8` (and `v0.37.7`). With this change the `Mempool` interface is the same as in previous versions. The reason is that we do not want to break the public API. We still keep in the code the feature that tendermint#3314 introduced by moving it inside the existing `Lock` method. We also keep the `RecheckFull bool` field that we added to `ErrMempoolIsFull`.
Fixes two issues: #3346 and #3338.
1. Fix the reconnection PEX send-too-fast error, which is caused by lastReceivedRequests still not being deleted when a peer reconnects.
2. The peer does not have a state yet; we set it in AddPeer. We need a new interface before the mconnection is started.