
p2p: peer state init too late and pex message too soon #3361

Closed
wants to merge 7 commits

Conversation

unclezoro
Contributor

@unclezoro unclezoro commented Feb 28, 2019

Fixes two issues: #3346 and #3338.

1. Fix the PEX reactor sending requests too fast on reconnection; the error is caused by lastReceivedRequests still not being deleted when a peer reconnects.

2. The peer does not have a state yet, because we only set it in AddPeer. We need a new interface that runs before the mconnection is started.
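For clarity, here is a minimal sketch of the idea behind point 2, with simplified stand-in types (not the exact code in this PR): the switch lets every reactor initialize its per-peer state via a new InitPeer hook before the peer's mconnection is started, so Receive can never observe a peer without state.

// Sketch only: simplified p2p types to show the intended call ordering.
package p2psketch

type Peer interface {
	Start() error // starting the peer starts its mconnection
}

type Reactor interface {
	InitPeer(p Peer) Peer                       // new hook: set up per-peer state before Start
	AddPeer(p Peer)                             // called after the peer has been started
	Receive(chID byte, p Peer, msgBytes []byte) // may fire as soon as the mconnection runs
}

type Switch struct {
	reactors []Reactor
}

func (sw *Switch) addPeer(p Peer) error {
	// 1. Let every reactor initialize its per-peer state first.
	for _, r := range sw.reactors {
		p = r.InitPeer(p)
	}
	// 2. Start the peer; from here on Receive may be called concurrently.
	if err := p.Start(); err != nil {
		return err
	}
	// 3. Finally notify reactors that the peer is up.
	for _, r := range sw.reactors {
		r.AddPeer(p)
	}
	return nil
}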

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@unclezoro
Contributor Author

unclezoro commented Feb 28, 2019

@ebuchman @melekes
#3338 and #3346 are fixed here.

@codecov-io

codecov-io commented Feb 28, 2019

Codecov Report

Merging #3361 into develop will increase coverage by 0.19%.
The diff coverage is 63.15%.

@@             Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    64.29%   64.48%   +0.19%     
===========================================
  Files          213      213              
  Lines        17447    17462      +15     
===========================================
+ Hits         11217    11260      +43     
+ Misses        5308     5290      -18     
+ Partials       922      912      -10
Impacted Files Coverage Δ
p2p/test_util.go 62.14% <0%> (-1.83%) ⬇️
p2p/base_reactor.go 63.63% <100%> (+3.63%) ⬆️
p2p/pex/pex_reactor.go 83.09% <100%> (+0.68%) ⬆️
p2p/switch.go 64.28% <100%> (+0.22%) ⬆️
consensus/reactor.go 72.3% <71.42%> (+1.46%) ⬆️
libs/db/remotedb/grpcdb/server.go 0% <0%> (ø) ⬆️
p2p/pex/addrbook.go 68% <0%> (+0.5%) ⬆️
blockchain/reactor.go 72.42% <0%> (+0.93%) ⬆️
blockchain/pool.go 81.25% <0%> (+0.98%) ⬆️
... and 3 more

@ebuchman ebuchman left a comment

Why does this have to happen before peer is started?

p2p/pex/pex_reactor.go
@ebuchman
Contributor

ebuchman commented Mar 2, 2019

I see. The peer is started and thus might receive a message before AddPeer is called

@ebuchman
Contributor

ebuchman commented Mar 2, 2019

What if we let Receive call AddPeer, and add some atomic integer so AddPeer only gets called once?

@unclezoro
Contributor Author

What if we let Receive call AddPeer, and add some atomic integer so AddPeer only gets called once?

If we do it like this, when the reactor wants to send a message to the peer first, does it have to wait until it receives some message?

@ebuchman
Contributor

ebuchman commented Mar 4, 2019

If we do it like this, when the reactor wants to send a message to the peer first, does it have to wait until it receives some message?

No, we'd still have AddPeer called by the switch, but if a message happens to be received before AddPeer, Receive could make sure AddPeer gets called first. Then the other AddPeer call made by the Switch would have no effect.

@unclezoro
Contributor Author

If we do it like this, when the reactor wants to send a message to the peer first, does it have to wait until it receives some message?

No, we'd still have AddPeer called by the switch, but if a message happens to be received before AddPeer, Receive could make sure AddPeer gets called first. Then the other AddPeer call made by the Switch would have no effect.

Maybe. I will submit another commit, and let's see if it makes sense.

@melekes
Contributor

melekes commented Mar 11, 2019

I can try and rewrite the code to use sync.Once so we can compare two versions.
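For reference, a rough sketch of what a sync.Once variant could look like (hypothetical, simplified names; not code from this PR): whichever of AddPeer or Receive reaches the peer first performs the per-peer setup exactly once.

// Hypothetical sketch of the sync.Once alternative (not part of this PR).
package reactorsketch

import "sync"

type PeerState struct{ /* per-peer data the reactor needs */ }

type peerEntry struct {
	once  sync.Once
	state *PeerState
}

type Reactor struct {
	mtx   sync.Mutex
	peers map[string]*peerEntry
}

func NewReactor() *Reactor {
	return &Reactor{peers: make(map[string]*peerEntry)}
}

// ensurePeer initializes the peer's state exactly once, regardless of
// whether AddPeer or Receive gets to it first.
func (r *Reactor) ensurePeer(id string) *peerEntry {
	r.mtx.Lock()
	e, ok := r.peers[id]
	if !ok {
		e = &peerEntry{}
		r.peers[id] = e
	}
	r.mtx.Unlock()
	e.once.Do(func() { e.state = &PeerState{} })
	return e
}

func (r *Reactor) AddPeer(id string) {
	r.ensurePeer(id)
}

func (r *Reactor) Receive(id string, msgBytes []byte) {
	e := r.ensurePeer(id) // safe even if AddPeer has not run yet
	_ = e.state           // ... handle msgBytes using the peer state ...
}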

@unclezoro
Contributor Author

@melekes @ebuchman any news on this PR?
You may object that I added a new interface, InitAddPeer, to the Reactor. In my opinion, this implementation is simple, easy to understand, and it works.

consensus/reactor.go
p2p/base_reactor.go
@melekes
Contributor

melekes commented Mar 23, 2019

I agree with @guagualvcha

@melekes
Contributor

melekes commented Mar 24, 2019

Can you

1. add a changelog entry in CHANGELOG_PENDING.md?

   BUGFIXES:
   - [p2p] \#3346 Init data structures for peer before starting it (previously was done in AddPeer) (fixes #3346 and #3338; @guagualvcha)

2. reformat the code with gofmt -s -w .?

@unclezoro
Contributor Author

Can you

1. add a changelog entry in CHANGELOG_PENDING.md?

   BUGFIXES:
   - [p2p] \#3346 Init data structures for peer before starting it (previously was done in AddPeer) (fixes #3346 and #3338; @guagualvcha)

2. reformat the code with gofmt -s -w .?

updated

@unclezoro
Contributor Author

@ebuchman @xla would you like to review this?

@xla xla left a comment

First of all, sorry for the late reply; I have recently been catching up with the backlog. Thanks for preparing this change and addressing #3338. It would be helpful if you could provide a flow chart or some event ordering which highlights what sequence triggers the issue.

For the change itself, we introduce a new method on the already large reactor, which adds more peer lifecycle management responsibilities outside of the p2p package, which only seem to affect consensus and pex. It is conceivable to not treat this as a major blocker for the change, but it is something I find worth pointing out. Maybe there is a more isolated fix.

consensus/reactor.go
@@ -681,6 +681,9 @@ func (sw *Switch) addPeer(p Peer) error {
	}
	sw.metrics.Peers.Add(float64(1))

	for _, reactor := range sw.reactors {
		p = reactor.InitPeer(p)
	}
Contributor

We call this directly before the AddPeer loop, what is the upside of this? Why can't it happen when AddPeer is performed?

Contributor Author

image

Contributor Author
@unclezoro unclezoro Apr 1, 2019

The key point is: there is no handshake that tells the other peer "I am ready to receive messages". So we do some init work before the mconnection starts. @xla

Contributor

If I understand correctly, the peer's mconnection is started too early, but the introduction of InitPeer won't help with that, as p.Start() is called before it.

The only sensible way forward is to find a way to write a failing test for this scenario and then verify that any proposed solution solves it.

Contributor

I tend to agree with @xla here; it's not clear the fix will work. Curious, have you actually tried to delay the peer reconnecting? (e.g. in switch.go:reconnectToPeer(), move the sw.randomSleep() to the very beginning of the first for loop.)
My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up. I think it would be OK to delay the reconnection of a peer that was stopped for an error.
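A self-contained sketch of that suggestion, with simplified stand-in types (the real switch.go:reconnectToPeer() carries more bookkeeping, so take this only as an illustration of the ordering):

// Sketch: sleep before the first dial attempt so RemovePeer has time to
// clean up reactor state (e.g. lastReceivedRequests) before we reconnect.
package switchsketch

import (
	"math/rand"
	"time"
)

const (
	reconnectAttempts = 20
	reconnectInterval = 5 * time.Second
)

type Switch struct {
	dial func() error // stands in for dialing the peer's address
}

// randomSleep mirrors the jittered sleep the switch already uses between attempts.
func (sw *Switch) randomSleep(interval time.Duration) {
	jitter := time.Duration(rand.Int63n(3000)) * time.Millisecond
	time.Sleep(jitter + interval)
}

func (sw *Switch) reconnectToPeer() {
	for i := 0; i < reconnectAttempts; i++ {
		// Moving the sleep to the top of the loop delays even the first attempt.
		sw.randomSleep(reconnectInterval)
		if err := sw.dial(); err == nil {
			return // reconnected
		}
	}
}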

Contributor

My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up.

The core issue here is that reactor.Receive might be called before reactor.AddPeer, which will lead to the reactor panicking if it depends on data from reactor.AddPeer. The proposed solution is to move data initialization to a new reactor.InitPeer function, which is guaranteed to be executed before reactor.Receive is called.
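To make the contract concrete, a simplified sketch of the reactor side (illustrative only, not the actual consensus reactor code): Receive dereferences per-peer state that, before this change, was only created in AddPeer.

// Sketch: the per-peer state Receive relies on is created in InitPeer,
// which the switch guarantees to call before Receive can fire.
package conssketch

const peerStateKey = "ConsensusReactor.peerState"

type PeerState struct{ /* per-peer consensus bookkeeping */ }

// Peer is a stand-in for the p2p peer with its key/value store.
type Peer struct {
	data map[string]interface{}
}

func NewPeer() *Peer { return &Peer{data: make(map[string]interface{})} }

func (p *Peer) Get(key string) interface{}    { return p.data[key] }
func (p *Peer) Set(key string, v interface{}) { p.data[key] = v }

type Reactor struct{}

// InitPeer is called before the peer is started.
func (r *Reactor) InitPeer(p *Peer) *Peer {
	p.Set(peerStateKey, &PeerState{})
	return p
}

// Receive may be called as soon as the peer's mconnection is running.
func (r *Reactor) Receive(p *Peer, msgBytes []byte) {
	ps, ok := p.Get(peerStateKey).(*PeerState)
	if !ok {
		// Without InitPeer this is the failure mode described above:
		// a message arrives before AddPeer has created the state.
		panic("peer has no state")
	}
	_ = ps // ... handle msgBytes using ps ...
}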

@ebuchman
Contributor

ebuchman commented Apr 8, 2019

For the change itself, we introduce a new method on the already large reactor, which adds more peer lifecycle management responsibilities outside of the p2p package, which only seem to affect consensus and pex. It is conceivable to not treat this as a major blocker for the change, but it is something I find worth pointing out. Maybe there is a more isolated fix.

Mostly agree with this. Shouldn't necessarily block making this fix, but we should at least open a new issue to track this problem and see if there's a better solution that doesn't require the new method.

The only sensible way forward is to find a way to write a failing test for this scenario and then verify that any proposed solution solves it.

@guagualvcha is it conceivable to write a test for this?

@unclezoro
Contributor Author

For the change itself, we introduce a new method on the already large reactor, which adds more peer lifecycle management responsibilities outside of the p2p package, which only seem to affect consensus and pex. It is conceivable to not treat this as a major blocker for the change, but it is something I find worth pointing out. Maybe there is a more isolated fix.

Mostly agree with this. Shouldn't necessarily block making this fix, but we should at least open a new issue to track this problem and see if there's a better solution that doesn't require the new method.

The only sensible way forward is to find a way to write a failing test for this scenario and then verify that any proposed solution solves it.

@guagualvcha is it conceivable to write a test for this?

Sure, I will keep working on this.

@xla xla changed the title from "peer state init too late and pex message too soon" to "p2p: peer state init too late and pex message too soon" on Apr 10, 2019
consensus/reactor_test.go
consensus/reactor_test.go
p2p/pex/pex_reactor_test.go
@unclezoro
Contributor Author

@xla @ebuchman I have finished the test case work. Please have a review.

@xla
Contributor

xla commented Apr 19, 2019

@guagualvcha Thanks for taking the time and following up on this. Currently the test cases don't fail when the InitPeer changes are removed, which leads me to believe that either they are not proving the change is required or the tests don't fail in the right way. So we still need to find a way to show that, without this change, messages can be sent too soon.

@unclezoro
Contributor Author

unclezoro commented Apr 19, 2019

@guagualvcha Thanks for taking the time and following up on this. Currently the test cases don't fail when the InitPeer changes are removed, which leads me to believe that either they are not proving the change is required or the tests don't fail in the right way. So we still need to find a way to show that, without this change, messages can be sent too soon.

Definitely right. Added in the latest commit. @xla

@unclezoro unclezoro force-pushed the p2perror branch 2 times, most recently from b4043fc to 410aec5 on April 19, 2019 10:46
@xla
Contributor

xla commented Apr 20, 2019

@guagualvcha Please don't force push, as it makes it hard to review the latest changes. We have a policy of squash merging, so long commit histories are not a problem.

@xla xla left a comment

Given that the test cases only fail sometimes without InitPeer, and sometimes fail even with the new InitPeer, I believe the issue is not addressed properly.

We already have quite a few cases of non-determinism in our tests, and I would like to avoid adding more flakiness.

@@ -35,6 +37,77 @@ func TestPEXReactorBasic(t *testing.T) {
assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
Contributor

This test doesn't always fail when InitPeer is removed.

Contributor

It sometimes failed with InitPeer.

Contributor Author
@unclezoro unclezoro Apr 20, 2019

The issue I fixed also does not always happen; it depends on goroutine scheduling, as I described:
image
It only happens sometimes. That is why the test case runs 10000 times to make it happen. I could delete this case later, but it proves that without InitPeer things go wrong.
Another reason is that I not only added InitPeer but also added code that deletes the entry from lastReceivedRequests before stopping the peer:

if now.Sub(lastReceived) < minInterval {
	r.lastReceivedRequests.Delete(id)
	return fmt.Errorf("Peer (%v) sent next PEX request too soon. lastReceived: %v, now: %v, minInterval: %v. Disconnecting",
		src.ID(),
		lastReceived,

Contributor

If the success of the fix is sensitive to the inner workings of the runtime, we need to find a way to never fail, no matter how the routines are scheduled.

assert.NotEqual(t, stopForError, 0)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Contributor

Same for this test, it only fails sometimes.

Contributor Author

So does this case.

consensus/reactor.go
consensus/reactor_test.go
@@ -40,6 +40,24 @@ func NewPeer(ip net.IP) *Peer {
return mp
}

func NewFixIdPeer(ip net.IP, id p2p.ID) *Peer {
Contributor

Can't we use NewPeer? Why do we need to provide an external ID?

Contributor

Can't we overwrite ID in any case?

peer = mock.NewPeer(nil)
peer.ID = nodeID

Contributor Author

It is an option, but we should expose ID.

Contributor

Let's do that. It's a mock peer anyway.

p2p/pex/pex_reactor.go
p2p/pex/pex_reactor_test.go
p2p/pex/pex_reactor_test.go
p2p/pex/pex_reactor_test.go
Peer does not have a state yet; we set it in AddPeer. We need a new interface before the mconnection is started.
Fix reconnection PEX send too fast; the error is caused by lastReceivedRequests still not being deleted when a peer reconnects.
@unclezoro
Contributor Author

unclezoro commented Apr 26, 2019

@melekes @xla Sorry for the late reply, I was busy with other work. I have 1. deleted the potentially infinite loop and 2. made the test cases stable. Please review my latest commit.

@liamsi liamsi left a comment

Although I agree with the concerns raised by previous reviewers, this PR seems to at least improve the situation! My suggestion is to merge it as is and capture the remaining problems in a follow-up issue (with links to the discussions here).

consensus/reactor.go
consensus/reactor_test.go
@@ -40,6 +40,24 @@ func NewPeer(ip net.IP) *Peer {
return mp
}

func NewFixIdPeer(ip net.IP, id p2p.ID) *Peer {
Contributor

Let's do that. It's a mock peer anyway.


for i := 0; i < 100; i++ {
	peer := mock.NewFixIdPeer(nil, nodeId)
	err := p2p.AddPeerToSwitch(sw, peer)
Contributor

We don't need to initialize a switch and add the peer to it in order to test that the reactor does not panic anymore. We can just call reactor.InitPeer() here and write a comment (CONTRACT) saying that if InitPeer is not called, the Reactor will panic!

Contributor

I can do that

p2p/pex/pex_reactor.go
@@ -35,6 +37,73 @@ func TestPEXReactorBasic(t *testing.T) {
assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
Contributor

can we rename this to something which reflects the test's purpose?

I don't understand what the test is testing right now.

assert.Equal(t, stopForError, 99)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Contributor

can we rename this to something which reflects the test's purpose?

I don't understand what the test is testing right now.

melekes and others added 2 commits May 6, 2019 14:25
Co-Authored-By: guagualvcha <baifudong@lancai.cn>
Co-Authored-By: guagualvcha <baifudong@lancai.cn>
@melekes
Contributor

melekes commented May 7, 2019

I am going to take over this.

@melekes
Contributor

melekes commented May 7, 2019

Closing in favor of #3634.

@melekes melekes closed this May 7, 2019
iKapitonau pushed a commit to scrtlabs/tendermint that referenced this pull request Jul 16, 2024
…its logic to Lock (tendermint#3361)

This PR partially reverts the backport of tendermint#3314 into the recently
released `v0.38.8` (and `v0.37.7`). With this change the `Mempool`
interface is the same as in previous versions.

The reason is that we do not want to break the public API. We still keep
in the code the feature that tendermint#3314 introduced by moving it inside the
existing `Lock` method. We also keep the `RecheckFull bool` field that
we added to `ErrMempoolIsFull`.
