Fix/Improve syncing (2x): Adding ping mechanism to TaskManager for replacing StartHeight and PendingKnownHashes strategy #1024
Conversation
…e max start height
Could you make some unit tests for this?
@yongjiema, it looks like a reasonable change in my opinion, since nodes are not going to request blocks all the time, but based on the maximum `StartHeight`.
I think that this will spam the network a little bit, don't you think?
Could you set a variable with the percentage of nodes (sessions) that will be asked for `getblocks` in such a case?
Otherwise, I am thinking that the node will start to request `getblocks` from all its sessions until it reconnects to some other session at a higher `StartHeight`.
Maybe a random number: something like a 20% chance of requesting `getblocks` from a given session if such a case happens:

```csharp
|| (Blockchain.Singleton.Height < Blockchain.Singleton.HeaderHeight && Blockchain.Singleton.Height >= maxStartHeight)
   && Random.Rand() <= maximumPercentageForRequestingBlocksToAGivenSessionInCaseWeAreOnTop
```
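Read as a sketch, the suggestion amounts to something like the following (the method, the `maxStartHeight` argument, and the 20% constant are illustrative assumptions, not code from this PR):

```csharp
using System;

// Hypothetical sketch of the suggested probabilistic gating.
class BlockRequestGate
{
    static readonly Random rand = new Random();
    const double RequestProbabilityWhenOnTop = 0.2; // assumed 20% threshold

    // heightLocal/headerHeight stand in for Blockchain.Singleton.Height
    // and Blockchain.Singleton.HeaderHeight in the real node.
    public static bool ShouldRequestBlocks(uint heightLocal, uint headerHeight, uint maxStartHeight)
    {
        // Clearly behind the best advertised StartHeight: always request.
        if (heightLocal < maxStartHeight)
            return true;

        // Headers say more chain exists, but every session advertises a
        // StartHeight at or below ours: ask only ~20% of sessions, so the
        // node does not spam getblocks to all of them at once.
        if (heightLocal < headerHeight)
            return rand.NextDouble() <= RequestProbabilityWhenOnTop;

        return false;
    }
}
```

The point of the random gate is that, once the node is already at the best advertised `StartHeight`, only a fraction of sessions receive a `getblocks` request in each round instead of all of them.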
On the other hand, @yongjiema, if we have a disconnection mechanism, in which:
@yongjiema, ah, there is also another possibility.
In such a case, you ping nodes!
The ping will return the `LastBlockIndex`, which will, consequently, trigger the block requests naturally, right?
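As a rough illustration of that idea (type and member names here are assumptions for the sketch, not necessarily the PR's exact API): the ping carries the sender's best block index, so peers learn each other's height continuously instead of relying on the `StartHeight` sent once in the version handshake.

```csharp
// Illustrative payload: a ping that advertises the sender's chain height.
public class PingPayload
{
    public uint LastBlockIndex; // sender's current best block height
    public uint Timestamp;
    public uint Nonce;
}

public class RemoteSession
{
    public uint LastBlockIndex; // last height advertised by this peer
}

public static class SyncHandler
{
    // On receiving a ping (or the pong reply), record the peer's height;
    // the TaskManager can then request blocks from any session whose
    // LastBlockIndex exceeds the local chain height.
    public static void OnPing(RemoteSession session, PingPayload payload, uint localHeight)
    {
        session.LastBlockIndex = payload.LastBlockIndex;
        if (payload.LastBlockIndex > localHeight)
            RequestBlocks(session, localHeight + 1); // hypothetical trigger
    }

    static void RequestBlocks(RemoteSession session, uint fromHeight)
    {
        /* send getblocks starting at fromHeight (placeholder) */
    }
}
```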
@yongjiema, I feel good improvements coming. Nice job.
Can you make some UTs? How did you test this?
Tested with a local build, as we don't have basic UTs for this.
Mannnn, @yongjiema, I just took a look at it right now, and this new design looks good.
You have added several good ideas since the last time I saw the code. Great job.
What do you think, @erikzhang?
I think that I agree with this; let me check with more time, a couple more days. But the changes look precise.
During this time, @lock9, I will also try to think about UTs, but I am not sure we are going to push for them here for NEO2. It would be nice, but the change may reach a point where it is clear enough for us.
But I agree that, at least for NEO3, we should try to test basic cases of messages arriving.
@erikzhang Could you have a look at this? I think this could help fix the consensus getting stuck.
There are two "requested changes".
@yongjiema I can still get blocks stuck at a lower height with this fix. The following is the environment I set up: 1 seed node, 4 consensus nodes, sending NEP-5 transactions to the seed node continuously, about 3 txs per second.
[Log screenshots attached for the consensus node, node2, and node3.]
Here is part of the consensus log, showing that every block contains 21 txs.
@cloud8little Can you try sending 180 txs per minute? Ideally the interval is greater than 30 secs. https://github.com/neo-project/neo/blob/master/neo/Network/P2P/TaskManager.cs#L23
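For readers without the link at hand: the referenced line is a timing constant in `TaskManager`. A guess at its shape, based only on the "30 secs" remark above (the name and value are assumptions, not verified against the linked revision):

```csharp
using System;

class TaskManagerTimingSketch
{
    // Assumed shape of the constant at TaskManager.cs#L23: the task
    // manager wakes on this interval to time out stale tasks and, with
    // this PR, to ping sessions for a fresh LastBlockIndex.
    static readonly TimeSpan TimerInterval = TimeSpan.FromSeconds(30);
}
```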
180 txs per minute can be handled by this fix. To correct my numbers a bit: the TPS yesterday should have been around 12 tx/sec. I tried again today and it works well; at 9.6~12 txs per second, it can recover to the best height in several minutes, with the mempool full with 50 thousand txs. Good to merge this PR.
Not applicable for now. However, we still need to double-check whether someone can spam the node with `pendingKnownHashes`; ping @yongjiema.
@yongjiema, what do you think about some node just sending random hashes that will enter the `pendingKnownHashes`?
These will be removed later, once the items are persisted to the blockchain, instead of directly on insertion (neo/neo/Network/P2P/ProtocolHandler.cs, line 42 in 35e1d17).
I see, @yongjiema, but isn't this a problem?
Currently we cannot stop someone who wants to spam: they can generate different transactions that pass verification and send them to the network, and the previous solution could run out of memory.
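This memory concern is the motivation for bounding `pendingKnownHashes`. A minimal sketch of a capped, FIFO-evicting set, assuming that is the chosen mitigation (the class name and eviction policy are illustrative, not the PR's actual types):

```csharp
using System.Collections.Generic;

// Capped set: once full, the oldest entry is evicted, so a spammer can
// churn the cache but never grow the node's memory unboundedly.
public class BoundedHashSet<T>
{
    private readonly int maxCount;
    private readonly LinkedList<T> order = new LinkedList<T>();
    private readonly HashSet<T> set = new HashSet<T>();

    public BoundedHashSet(int maxCount) { this.maxCount = maxCount; }

    public bool Add(T item)
    {
        if (!set.Add(item)) return false;   // already pending: ignore
        order.AddLast(item);
        if (set.Count > maxCount)           // full: evict the oldest hash
        {
            set.Remove(order.First.Value);
            order.RemoveFirst();
        }
        return true;
    }

    public bool Contains(T item) => set.Contains(item);
}
```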
@yongjiema, I am not sure. I mean:
Could you split the PR into two different ones ASAP?
I am going to approve part I, related to the ping and the use of `LastBlockIndex`. We need to move forward, and this ping is surely a good thing. The `StartHeight` looks bad to me.
Then we discuss the other PR, with just the `PendingKnownHashesCollection`, namely part II.
Could you do that for both master2x and 3x?
Maybe you can just remove the `PendingKnownHashesCollection` and open the new PR with it.
Leave the opened PRs with part I.
@vncoelho Can you think more? It will not resolve the sync-stuck issue without it.
Perfect, @yongjiema, I am currently testing it under some scenarios. It is now available online here: https://neocompiler.io/#!/ecolab/cnnodesinfo
@lock9, UTs for this PR will not be possible right now. I now almost completely understand the change and am in favor of it. Currently, we are performing some tests to be more sure about its possible improvements.
Here we go, @yongjiema.
The syncing process is considerably improved.
It is like turning water into wine. Congratulations!
UTs are not possible right now; we would need to first design them for the P2P classes that are being modified. Even for NEO3, perhaps, the UTs will come later.
@superboyiii @cloud8little, as far as I can see, you both also agree with the change. I dismissed @lock9's requested change because the UTs are not feasible for now, even for master3x. @erikzhang, from my tests here, I have never seen the network like this! Let's merge this PR. I should apologize for the long time we took to see this improvement (but, perhaps, we also improved it a little bit since the idea draft), @yongjiema. You were very precise about the cause.
omg, this PR is magic, @yongjiema. If @jsolman takes a look, he will probably be proud of it.
Could you add some comments explaining the meaning of `pendingKnownHashes`?
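Based on how the mechanism is described earlier in this thread, the requested comment might read roughly like this (a sketch, not the PR's actual documentation; the field type is assumed):

```csharp
// pendingKnownHashes: inventory hashes this node has already requested
// from a peer but has not yet seen persisted. Tracking them prevents the
// TaskManager from re-requesting the same item from several sessions;
// entries are removed once the item is persisted to the blockchain,
// rather than immediately on insertion.
private readonly HashSet<UInt256> pendingKnownHashes = new HashSet<UInt256>();
```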
Hi @vncoelho, I can't approve it because I currently don't know how to properly test this. If you can guide me on how to test it, I may be able to approve it.
@lock9, I am not going to guide you through those steps right now. It took many months and a great part of the energy we try to put into NEO to motivate the kind of initiative that resulted in this PR and this evolution of knowledge. It is taking us time and effort to understand it and test it, and surely a similar process happened for @yongjiema. Right now, I need to focus on other aspects that I and those around me consider important. Meanwhile, feel free to design the UTs for this PR, or leave it open for as long as you think it should be.