
Cache up-to-date consensus payloads #949

Closed
wants to merge 104 commits

Conversation

vncoelho
Member

@vncoelho vncoelho commented Jul 23, 2019

closes #788

Check this out, @erikzhang, @shargon, @jsolman and @igormcoelho. An interesting speed up for consensus payloads.

The mechanism was an idea given by SPCC guys during their GO implementation (@fabwa, Anatoly Bogatyrev and Evgeniy Stratonikov).

@vncoelho vncoelho changed the title First draft for speeding up consensus with future payloads Speeding up consensus with future payloads Jul 23, 2019
@vncoelho vncoelho changed the title Speeding up consensus with future payloads Speeding up consensus with up-to-date payloads Jul 23, 2019
@codecov-io

codecov-io commented Jul 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@0a7cba7). Click here to learn what that means.
The diff coverage is 24.52%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #949   +/-   ##
=========================================
  Coverage          ?   64.83%           
=========================================
  Files             ?      199           
  Lines             ?    13700           
  Branches          ?        0           
=========================================
  Hits              ?     8883           
  Misses            ?     4817           
  Partials          ?        0
| Impacted Files | Coverage Δ |
| --- | --- |
| neo/Consensus/ConsensusService.cs | 13.74% <4.22%> (ø) |
| neo/Consensus/ConsensusContext.cs | 63.77% <65.71%> (ø) |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a7cba7...fb27884. Read the comment docs.

@erikzhang erikzhang added the Low-Priority Issues with lower priority label Jul 24, 2019
@vncoelho vncoelho marked this pull request as ready for review July 26, 2019 15:04
@vncoelho
Member Author

vncoelho commented Jul 26, 2019

@erikzhang, even marked as Low Priority, I still believe it is important.
@neo-project/core, take your time to review it when you have an extra time.

The more efficiently the NEO3 consensus operates, the better the experience we are going to have.

This PR adds a simple mechanism that caches payloads that were previously discarded. They are still useful, however, and can surely improve consensus performance.
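To make the idea concrete, here is a minimal sketch of such a caching mechanism, assuming illustrative names (`Payload`, `PayloadCache`, `on_payload`, `on_height_changed` are hypothetical, not the actual neo `ConsensusService` API): payloads arriving for a future block height are kept instead of discarded, and replayed once the node reaches that height.

```python
# Hypothetical sketch of the caching idea (illustrative names, not the
# actual neo ConsensusService API): consensus payloads that arrive for a
# future block height are cached rather than discarded, then replayed
# once the node advances to that height.
from dataclasses import dataclass


@dataclass
class Payload:
    height: int
    data: str = ""


class PayloadCache:
    def __init__(self):
        # height -> payloads received early for that height
        self.cached = {}

    def on_payload(self, payload, current_height, process):
        if payload.height == current_height:
            process(payload)            # up to date: handle immediately
        elif payload.height > current_height:
            # future payload: keep it for later instead of discarding it
            self.cached.setdefault(payload.height, []).append(payload)
        # payloads for past heights are stale and simply dropped

    def on_height_changed(self, new_height, process):
        # replay everything that arrived early for the new height
        for payload in self.cached.pop(new_height, []):
            process(payload)
        # forget anything at or below the new height
        self.cached = {h: p for h, p in self.cached.items() if h > new_height}
```

The trade-off is purely local memory: a payload that would otherwise be re-requested (or lost) over P2P is already available when the node catches up.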

@vncoelho
Member Author

Improved the mechanism so that it only tries to load payloads from the current height.

TODO: With this last change it becomes better to just set each used payload to null.

@shargon, how do you set a foreach variable to null inside the loop?
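For context on the question above: in C# the `foreach` iteration variable is read-only, so a consumed payload cannot be nulled out through it directly; the usual workaround is to iterate by index (or over a snapshot of keys) and clear the underlying collection slot. A minimal Python sketch of the same pattern, with illustrative data:

```python
# Clearing entries while iterating: iterate by index rather than by
# element, since rebinding the loop variable would not modify the list.
payloads = ["prep_request", "prep_response", "commit"]
processed = []

for i in range(len(payloads)):
    consumed = payloads[i]
    processed.append(consumed)   # stand-in for "process this payload"
    payloads[i] = None           # release the reference for this slot
```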

Contributor

@igormcoelho igormcoelho left a comment


This is a very good idea. Brother, I think that Reset's second parameter could default to true, and only H+1 future messages should be collected. Beyond one height ahead I don't see any advantage, right?

@vncoelho
Member Author

In the first implementation I set the default to true, but in the end I preferred to make it more explicit, since we only have 2 calls.

H+1 Preparations, H+1 Commits? Something like that?
That also makes sense, brother.
However, we are only dealing with local RAM allocation; I believe it is better to keep everything we can for now, no?
Since the payload has already arrived.

lock9
lock9 previously requested changes Aug 4, 2019
Contributor

@lock9 lock9 left a comment


Hi @vncoelho, we need UTs to ensure this is working properly (we can't approve PRs without testing). Any chance you can add a few?
Also, it would be good to have some evidence that this improves consensus performance. What exactly is this change going to improve?

@vncoelho
Member Author

vncoelho commented Aug 8, 2019

@lock9, I will not be able to add UTs to this right now, and it is not my priority at the moment. But I think it is a good thing to be done, and I would surely revise and review any PR related to this with pleasure.

@vncoelho
Member Author

vncoelho commented Jan 6, 2020

@shargon @erikzhang, I think that this is ready to merge.
In fact, @cloud8little's experiments show slightly better performance for 15 s and 5 s.

However, I believe that, statistically speaking, we are only going to see real gains when there are network delays, which is the case on the mainnet and testnet.

The changes are straightforward.
The trade-off in local processing power is minor compared to the benefits for mainnet operation, even if the performance gain is small (which I believe it will not be once we scale the system and account for delays).

@erikzhang
Member

How can we get a result of 14.9140625 seconds? I think there must be something wrong with the test.

@vncoelho
Member Author

vncoelho commented Jan 8, 2020

It can happen, @erikzhang, in particular, if you consider PR #1345.

@erikzhang
Member

How can it happen? I don't understand.

@vncoelho vncoelho changed the title Speeding up consensus with up-to-date payloads Cache up-to-date consensus payloads Jan 8, 2020
@vncoelho
Member Author

vncoelho commented Jan 8, 2020

I believe that the following cases may happen:

  • If any RecoveryMessage is received by the Primary before it has sent its PrepareRequest, the PrepareRequest will be triggered even without a proper timeout:

        if (!context.RequestSentOrReceived)
        {
            ConsensusPayload prepareRequestPayload = message.GetPrepareRequestPayload(context, payload);
            if (prepareRequestPayload != null)
            {
                totalPrepReq = 1;
                if (ReverifyAndProcessPayload(prepareRequestPayload)) validPrepReq++;
            }
            else if (context.IsPrimary)
                SendPrepareRequest();
        }

  • We are talking about an average advance of 0.185 s = 185 ms per block; over a total of 128 blocks that is 23680 ms ≈ 23.7 s of total advance in a ~32-minute run. If we are on different machines I believe this is expected; however, if the test was run on a privnet, maybe there was an error as you suspected, @erikzhang. Since this PR helps consensus cache the next payloads, maybe the behavior improved a little and we are now seeing this effect. I am not sure as well.
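The arithmetic above can be checked directly (a sanity check on the numbers in this thread, not part of the PR itself):

```python
# Sanity check: an average advance of 185 ms per block, accumulated over
# 128 blocks, gives the total advance quoted above.
blocks = 128
advance_ms_per_block = 185

total_advance_ms = blocks * advance_ms_per_block
total_advance_s = total_advance_ms / 1000

print(total_advance_ms, "ms =", total_advance_s, "s")  # 23680 ms = 23.68 s
```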

Anyway, the focus of this PR is just to cache the next payloads that have already been sent via P2P. I see no reason to discard them. Perhaps the name of the PR should be changed; it does not strictly need to be a speed-up.

@erikzhang
Member

Anyway, the focus of this PR is just to cache the next payloads that were already sent via P2P. I do not see reason to discard them.

@vncoelho You are right. But I think this PR must solve a problem. So, what problem does this PR solve? In my opinion, it provides a caching mechanism that reduces the probability of packet loss. Then, if we can find evidence in the consensus log that our current consensus mechanism often switches views due to packet loss, I will think this PR is useful. But now, I don't think it's worth merging.

@vncoelho
Member Author

vncoelho commented Jan 8, 2020

@erikzhang, in fact, we had an internal discussion yesterday about closing (and not opening new) PRs (features) related to Consensus and dBFT 2.0.

The idea is to merge a whole package for dBFT 3.0; we are on our way to writing the paper and plan to start the implementation by the middle of this year.

There are some requirements for the safety and properties of pBFT, and there are other important things we need to solve to fix "workarounds", such as the use of FailedNodes flags.

This PR, in particular, is very simple; there is no bad trade-off in merging it, since it only involves local processing power without any possible scalability issue.

@cloud8little
Contributor

@vncoelho I can help run more tests comparing different TPS settings on different machines. Let's see how it behaves.

@vncoelho
Member Author

Sounds great, @cloud8little!

@cloud8little
Contributor

cloud8little commented Jan 14, 2020

@vncoelho I've retested on 4 consensus nodes located in different countries; here are the detailed results. There is no difference for 15000 ms, a slight improvement for 5000/800 ms, and an obvious improvement for 400 ms. The experiment ran with no transactions, since due to issue #1410 I can't send massive numbers of txs at the moment.

| Node | OS | Location | CPU | Memory | Bandwidth | Disk |
| --- | --- | --- | --- | --- | --- | --- |
| n1 | Ubuntu 18.04 | Tokyo | 4 | 8G | 5Mbps | 20G |
| n2 | CentOS 7.4 | America | 2 | 8G | 6Mbps | 20G |
| n3 | Win Server 2016 | Beijing | 2 | 8G | 7Mbps | 20G |
| n4 | Ubuntu 18.04 | India | 2 | 8G | 8Mbps | 20G |

neo-cli: master 51cd29fbe21abb9e1f17f64e5c6d21bc7decbbb9
neo: master ab4830c
neo-vm: master be2ac36bf35a3033d828e0ba0630d390599c487d

baseline

| MillisecondsPerBlock | StartTime | EndTime | Blocks | Duration | Secs/block | Avg secs/block |
| --- | --- | --- | --- | --- | --- | --- |
| 15000 | 12:07:48 | 12:38:20 | 116 | 0:30:32 | 15.64655172 | 16.08189655 |
| 15000 | 12:06:44 | 12:38:20 | 116 | 0:31:36 | 16.34482759 | |
| 15000 | 12:07:38 | 12:38:22 | 116 | 0:30:45 | 15.90517241 | |
| 15000 | 12:06:34 | 12:38:20 | 116 | 0:31:46 | 16.43103448 | |
| 5000 | 14:32:57 | 15:03:12 | 300 | 0:30:16 | 6.053333333 | 6.0925 |
| 5000 | 14:32:36 | 15:03:12 | 300 | 0:30:37 | 6.123333333 | |
| 5000 | 14:32:52 | 15:03:12 | 300 | 0:30:20 | 6.066666667 | |
| 5000 | 14:32:34 | 15:03:12 | 300 | 0:30:38 | 6.126666667 | |
| 800 | 15:50:00 | 16:20:26 | 1562 | 0:30:26 | 1.169654289 | 1.169494238 |
| 800 | 15:50:00 | 16:20:26 | 1562 | 0:30:26 | 1.169014085 | |
| 800 | 15:50:00 | 16:20:27 | 1562 | 0:30:27 | 1.169654289 | |
| 800 | 15:50:00 | 16:20:26 | 1562 | 0:30:27 | 1.169654289 | |
| 400 | 23:29:00 | 23:59:59 | 1988 | 0:30:58 | 0.930080483 | 0.93221831 |
| 400 | 23:29:01 | 23:59:59 | 1988 | 0:30:58 | 0.930080483 | |
| 400 | 23:29:01 | 23:59:59 | 1988 | 0:30:58 | 0.934607646 | |
| 400 | 23:29:01 | 23:59:58 | 1988 | 0:30:57 | 0.934104628 | |

pr949

| MillisecondsPerBlock | StartTime | EndTime | Blocks | Duration | Secs/block | Avg secs/block |
| --- | --- | --- | --- | --- | --- | --- |
| 15000 | 10:45:29 | 11:17:10 | 115 | 0:31:41 | 16.53043478 | 16.40652174 |
| 15000 | 10:45:28 | 11:17:10 | 115 | 0:31:43 | 16.54782609 | |
| 15000 | 10:46:31 | 11:17:11 | 115 | 0:30:39 | 15.99130435 | |
| 15000 | 10:45:28 | 11:17:12 | 115 | 0:31:44 | 16.55652174 | |
| 5000 | 0:08:11 | 0:39:04 | 331 | 0:30:53 | 5.598187311 | 5.592900302 |
| 5000 | 0:08:14 | 0:39:04 | 331 | 0:30:50 | 5.589123867 | |
| 5000 | 0:08:08 | 0:39:04 | 331 | 0:30:56 | 5.607250755 | |
| 5000 | 0:08:18 | 0:39:04 | 331 | 0:30:46 | 5.577039275 | |
| 800 | 14:18:33 | 14:49:00 | 1573 | 0:30:27 | 1.161474889 | 1.158773045 |
| 800 | 14:18:36 | 14:49:00 | 1573 | 0:30:24 | 1.159567705 | |
| 800 | 14:18:42 | 14:49:00 | 1573 | 0:30:18 | 1.155753338 | |
| 800 | 14:18:38 | 14:49:00 | 1573 | 0:30:22 | 1.158296249 | |
| 400 | 11:46:53 | 12:17:01 | 2413 | 0:30:08 | 0.749274762 | 0.749274762 |
| 400 | 11:46:53 | 12:17:01 | 2413 | 0:30:08 | 0.749274762 | |
| 400 | 11:46:53 | 12:17:01 | 2413 | 0:30:08 | 0.749274762 | |
| 400 | 11:46:53 | 12:17:01 | 2413 | 0:30:08 | 0.749274762 | |
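The per-run averages in the tables above follow directly from the start/end times and block count. A small sketch of the computation (illustrative, not the actual test harness used by @cloud8little):

```python
from datetime import datetime


def secs_per_block(start, end, blocks):
    """Average seconds per block over one run (times as HH:MM:SS strings,
    assumed to fall within the same day)."""
    fmt = "%H:%M:%S"
    duration = (datetime.strptime(end, fmt)
                - datetime.strptime(start, fmt)).total_seconds()
    return duration / blocks


# One pr949 run at 400 ms per block: 2413 blocks in 0:30:08.
avg = secs_per_block("11:46:53", "12:17:01", 2413)
print(round(avg, 9))  # matches the table: ~0.749274762
```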

@vncoelho
Member Author

Great experiments, @cloud8little.

I believe there is no statistically significant difference, but it is very good to see these results.
It is incredible to see how 400 ms worked even in a scenario with nodes in different locations and potentially large delays.

Perhaps if the PrepareRequest carried more txs, or were larger, we could detect more gains from avoiding the loss of this payload.
In addition, on the real mainnet or testnet we do not have direct communication with the CNs; payload packages thus take longer routes through the network graph, with more uncertainty, which would surely reinforce the benefits of this PR.

@vncoelho
Member Author

@cloud8little, @superboyiii, @shargon, this is not a big change, but now that we have more features for testing with txs, could you test whether this change improves performance when the network is under high load? Perhaps the bigger the PrepareRequest, the more efficient it will be.

@erikzhang
Member

erikzhang commented May 18, 2022

Since the consensus module has been moved to neo-modules, I will close this first.

@erikzhang erikzhang closed this May 18, 2022
@erikzhang erikzhang deleted the speed-up-consensus-with-future-payloads branch May 18, 2022 22:45
Labels
Consensus Module - Changes that affect the consensus protocol or internal verification logic Enhancement Type - Changes that may affect performance, usability or add new features to existing modules.
Development

Successfully merging this pull request may close these issues.

Consensus optimization for next block - Caching payloads
9 participants