p2p: peer state init too late and pex message too soon #3361
Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    64.29%    64.48%    +0.19%
===========================================
  Files          213       213
  Lines        17447     17462       +15
===========================================
+ Hits         11217     11260       +43
+ Misses        5308      5290       -18
+ Partials       922       912       -10
Why does this have to happen before peer is started?
I see. The peer is started and thus might receive a message before AddPeer is called.
What if we let Receive call AddPeer, and add some atomic integer so AddPeer only gets called once?
If we do it like this, what happens when the reactor wants to send a message to the peer first? Does it have to wait until it receives some message?
No, we'd still have AddPeer called by the switch, but if it turns out a message is received before AddPeer, Receive could make sure AddPeer gets called first. Then the other AddPeer call made by the Switch should have no effect.
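For reference, a minimal standalone sketch of this "Receive triggers AddPeer once" idea, using hypothetical types rather than the actual tendermint reactor API; a mutex-guarded map plays the role of the suggested atomic integer:

```go
package main

import (
	"fmt"
	"sync"
)

// onceReactor: whichever of Receive or the Switch's AddPeer reaches a peer
// first performs the one-time setup; the other call becomes a no-op.
type onceReactor struct {
	mtx   sync.Mutex
	added map[string]bool
}

func (r *onceReactor) addPeerOnce(peerID string) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.added[peerID] {
		return // setup already done for this peer
	}
	r.added[peerID] = true
	fmt.Println("per-peer setup for", peerID)
}

// AddPeer is what the Switch would call.
func (r *onceReactor) AddPeer(peerID string) { r.addPeerOnce(peerID) }

// Receive guards against messages arriving before the Switch's AddPeer.
func (r *onceReactor) Receive(peerID string, msg []byte) {
	r.addPeerOnce(peerID) // no-op if the Switch already added the peer
	fmt.Printf("handling %d bytes from %s\n", len(msg), peerID)
}

func main() {
	r := &onceReactor{added: map[string]bool{}}
	r.Receive("peer1", []byte("pex request")) // arrives before AddPeer
	r.AddPeer("peer1")                        // setup already done, no effect
}
```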
Maybe. I will submit another commit and let's see if it makes sense.
I can try and rewrite the code to use
I agree with @guagualvcha
Can you
updated
First of all, sorry for the late reply; I've recently been catching up with the backlog. Thanks for preparing this change and addressing #3338. It would be helpful if you could provide a flow chart or some event ordering which highlights what sequence triggers the issue.
For the change itself: we introduce a new method on the already large reactor, which adds more peer lifecycle management responsibilities outside of the p2p package that only seem to affect consensus and pex. I don't consider this a major blocker for the change, but it is something I find worth pointing out. Maybe there is a more isolated fix.
@@ -681,6 +681,9 @@ func (sw *Switch) addPeer(p Peer) error {
	}
	sw.metrics.Peers.Add(float64(1))

	for _, reactor := range sw.reactors {
		p = reactor.InitPeer(p)
	}
We call this directly before the AddPeer loop; what is the upside of this? Why can't it happen when AddPeer is performed?
The key point is: there is no handshake that tells the other peer I am ready to receive messages, so we do some init work before the mconnection starts. @xla
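A simplified, self-contained sketch of the ordering being aimed for here (stand-in types only, not the PR's current Switch code): allocate per-peer state, then start the connection, then announce the peer.

```go
package main

import "fmt"

// Stand-in types; the real p2p package has richer Peer and Reactor interfaces.
type Peer struct{ ID string }

type Reactor interface {
	InitPeer(p Peer) Peer
	AddPeer(p Peer)
}

// addPeerSketch shows the intended order: because there is no "I am ready"
// handshake between peers, per-peer state must exist before the connection
// starts delivering messages.
func addPeerSketch(p Peer, reactors []Reactor, start func(Peer) error) error {
	for _, r := range reactors {
		p = r.InitPeer(p) // 1. allocate state while no messages can arrive yet
	}
	if err := start(p); err != nil { // 2. start the mconnection; Receive may now fire
		return err
	}
	for _, r := range reactors {
		r.AddPeer(p) // 3. announce the started peer
	}
	return nil
}

type logReactor struct{}

func (logReactor) InitPeer(p Peer) Peer { fmt.Println("InitPeer", p.ID); return p }
func (logReactor) AddPeer(p Peer)       { fmt.Println("AddPeer", p.ID) }

func main() {
	_ = addPeerSketch(Peer{ID: "peer1"}, []Reactor{logReactor{}}, func(Peer) error {
		fmt.Println("mconnection started")
		return nil
	})
}
```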
If I understand correctly, the peer's mconnection is started too early, but the introduction of InitPeer won't help with that, as p.Start() is called before it.
The only sensible way forward is to find a way to write a failing test for this scenario and then verify that any proposed solution solves it.
I tend to agree with @xla here; it's not clear the fix will work. Curious: have you tried to actually delay the peer reconnecting? (e.g. in switch.go:reconnectToPeer(), move the sw.randomSleep() call to the very beginning of the first for loop.)
My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up. I think it would be OK to delay the reconnection of a peer that was stopped for an error.
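A standalone sketch of this alternative (hypothetical helper, not the actual switch.go code), where the random sleep happens before the first dial attempt so that RemovePeer has time to finish its cleanup:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// reconnectLoop sleeps *before* every attempt, including the first one, so a
// peer that was just stopped for an error is not redialed immediately.
func reconnectLoop(attempts int, baseInterval time.Duration, dial func() error) error {
	for i := 0; i < attempts; i++ {
		// the randomized sleep is moved to the top of the loop
		time.Sleep(baseInterval + time.Duration(rand.Int63n(int64(baseInterval))))
		if err := dial(); err == nil {
			return nil
		}
	}
	return errors.New("failed to reconnect")
}

func main() {
	err := reconnectLoop(3, 100*time.Millisecond, func() error {
		fmt.Println("dialing...")
		return nil // pretend the dial succeeds
	})
	fmt.Println("done, err =", err)
}
```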
> My understanding is that the root cause of the issue is the peer reconnecting too fast, before RemovePeer() has a chance to clean up.

The core issue here is that reactor.Receive might be called before reactor.AddPeer, which will lead to the reactor panicking if it depends on data from reactor.AddPeer. The proposed solution is to move data initialization to a new reactor.InitPeer function, which is guaranteed to be executed before reactor.Receive is called.
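To make that contract concrete, here is a hedged, self-contained sketch (simplified stand-in types, not the consensus or pex reactor code) of a reactor whose Receive depends on state that InitPeer allocates:

```go
package main

import (
	"fmt"
	"sync"
)

// Peer is a stand-in for the p2p peer type.
type Peer struct{ ID string }

type peerState struct{ msgs int }

// statefulReactor's Receive depends on peerStates[p.ID], which the proposed
// contract guarantees InitPeer has created before any Receive call.
type statefulReactor struct {
	mtx        sync.Mutex
	peerStates map[string]*peerState
}

// InitPeer allocates the per-peer state before any message can be delivered.
func (r *statefulReactor) InitPeer(p Peer) Peer {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.peerStates[p.ID] = &peerState{}
	return p
}

// Receive relies on the state existing; without the InitPeer guarantee this
// is exactly the place that could panic or error out.
func (r *statefulReactor) Receive(p Peer, msg []byte) error {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	ps, ok := r.peerStates[p.ID]
	if !ok {
		return fmt.Errorf("no state for peer %s", p.ID)
	}
	ps.msgs++
	return nil
}

func main() {
	r := &statefulReactor{peerStates: map[string]*peerState{}}
	p := r.InitPeer(Peer{ID: "peer1"})           // done by the switch first
	fmt.Println(r.Receive(p, []byte("pex msg"))) // safe even before AddPeer
}
```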
Mostly agree with this. Shouldn't necessarily block making this fix, but we should at least open a new issue to track this problem and see if there's a better solution that doesn't require the new method.
@guagualvcha is it conceivable to write a test for this?
Sure, I will keep working on this.
@guagualvcha Thanks for taking the time and following up on this. Currently the test-cases don't fail when the
Definitely right. Added in the latest commit. @xla
force-pushed from b4043fc to 410aec5
@guagualvcha Please don't force push as it makes it hard to review the latest changes. We have a policy of squash merging, so long commit histories are not a problem.
Given that the test-cases only fail sometimes without InitPeer, and sometimes fail even with the new InitPeer, I believe the issue is not addressed properly.
We already have quite a few cases of non-determinism in our tests and I'd like to avoid adding more flakiness.
@@ -35,6 +37,77 @@ func TestPEXReactorBasic(t *testing.T) {
	assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
This test doesn't always fail when InitPeer is removed.
It sometimes failed with InitPeer.
The issue I fixed also does not always happen; it depends on how the goroutines are scheduled, as I described. It happens only sometimes, which is why the test case runs 10000 times to make it happen. I could delete this case later, but it can prove that without InitPeer, things go wrong.
Another reason is that I not only added InitPeer, but also added the following code, which deletes the entry from lastReceivedRequests before stopping the peer:

if now.Sub(lastReceived) < minInterval {
	r.lastReceivedRequests.Delete(id)
	return fmt.Errorf("Peer (%v) sent next PEX request too soon. lastReceived: %v, now: %v, minInterval: %v. Disconnecting",
		src.ID(),
		lastReceived,
		now, minInterval)
}
If the success of the fix is sensitive to the inner workings of the runtime, we need to find a way to never fail, no matter how the goroutines are scheduled.
	assert.NotEqual(t, stopForError, 0)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Same for this test, it only fails sometimes.
So does this case.
@@ -40,6 +40,24 @@ func NewPeer(ip net.IP) *Peer {
	return mp
}

func NewFixIdPeer(ip net.IP, id p2p.ID) *Peer {
Can't we use NewPeer? Why do we need to provide an external ID?
Can't we overwrite ID in any case?
peer = mock.NewPeer(nil)
peer.ID = nodeID
It is an option, but we should expose ID.
Let's do that. It's a mock peer anyway.
Peer does not have a state yet; we set it in AddPeer. We need a new interface before the mconnection is started.
Fix reconnection PEX send-too-fast; the error is caused by lastReceivedRequests still not being deleted when a peer reconnects.
Although I agree with the concerns raised by previous reviewers, this PR seems to at least improve the situation! My suggestion is to merge it as is and capture the remaining problems in a follow-up issue (linking to the discussions here).
	for i := 0; i < 100; i++ {
		peer := mock.NewFixIdPeer(nil, nodeId)
		err := p2p.AddPeerToSwitch(sw, peer)
We don't need to initialize a switch and add the peer to it in order to test that the reactor does not panic anymore. We can just call reactor.InitPeer() here and write a comment (CONTRACT) saying that if InitPeer is not called, the Reactor will panic!
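A rough sketch of what that simplification could look like (the reactor-building helper name is an assumption, not the exact pex test code):

```go
func TestPEXReactorReceiveAfterInitPeerDoesNotPanic(t *testing.T) {
	r := createDefaultReactor(t) // assumed test helper that builds a PEX reactor
	peer := mock.NewPeer(nil)

	// CONTRACT: the switch always calls InitPeer before Receive.
	// If InitPeer is not called, the Reactor is allowed to panic.
	r.InitPeer(peer)

	msg := cdc.MustMarshalBinaryBare(&pexRequestMessage{})
	require.NotPanics(t, func() {
		r.Receive(PexChannel, peer, msg)
	})
}
```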
I can do that
@@ -35,6 +37,73 @@ func TestPEXReactorBasic(t *testing.T) {
	assert.NotEmpty(t, r.GetChannels())
}

func TestPEXReactorStopPeerWithOutInitPeer(t *testing.T) {
Can we rename this to something which reflects the test's purpose? I don't understand what the test is testing right now.
	assert.Equal(t, stopForError, 99)
}

func TestPEXReactorDoNotStopReconnectionPeer(t *testing.T) {
Can we rename this to something which reflects the test's purpose? I don't understand what the test is testing right now.
I am going to take over this.
Closing in favor of #3634.
…its logic to Lock (tendermint#3361): This PR partially reverts the backport of tendermint#3314 into the recently released `v0.38.8` (and `v0.37.7`). With this change the `Mempool` interface is the same as in previous versions. The reason is that we do not want to break the public API. We still keep in the code the feature that tendermint#3314 introduced by moving it inside the existing `Lock` method. We also keep the `RecheckFull bool` field that we added to `ErrMempoolIsFull`.
Fixes two issues: #3346 and #3338.
1. Fix the reconnection PEX send-too-fast error, which is caused by lastReceivedRequests still not being deleted when a peer reconnects.
2. The peer does not have a state yet; we set it in AddPeer. We need a new interface before the mconnection is started.