
Test Versi Scaling Limit #5962

Closed
Tracked by #26 ...
eskimor opened this issue Sep 2, 2022 · 21 comments

eskimor (Member) commented Sep 2, 2022

To get an idea what could be feasible on Kusama.

Test:

  • 250 para validators + 50 parachains.
  • 300 para validators + 60 parachains.

See where block times start to suffer, and also check what difference it makes when the validator set is scaled up to Kusama size as well. If block times for 250 para validators are still good, but not for 300, check whether 250 is also good if there are 900 authorities in total.

Depending on the result, other experiments might be worthwhile as well:

300 para validators with only 50 parachains, for example: this makes sense to test with non-trivial PoVs, as a higher number of para validators with the same number of parachains reduces load on approval checkers - this will only be noticeable if there actually is any load (candidates perform some computation).

eskimor added the T4-parachains_engineering label (This PR/Issue is related to Parachains performance, stability, maintenance) on Sep 2, 2022
ggwpez (Member) commented Sep 2, 2022

How will you test this? With zombienet?

Are you just testing the network size or also performance (transactions per second)?

eskimor (Member, Author) commented Sep 2, 2022

On Versi - our test network. The most important performance metric we will be monitoring is the block rate of parachains. Especially for the second part, we might also look into transactions per second.

sandreim (Contributor) commented Sep 13, 2022

I think we can easily deploy polkadot-introspector for block time monitoring in the CI test, but it would only work with cumulus collators, as it requires us to connect to the collators via RPC and get the inherent timestamp (which only works for cumulus). However, adding a Prometheus mode for parachain commander would solve that, as we would compute the para block times from the relay-chain data.
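
For illustration, a minimal sketch of that relay-chain-based approach (hypothetical types, not the introspector implementation): for each para, measure the gap between the timestamps of consecutive relay-chain blocks that included one of its candidates.

```rust
// Hypothetical sketch of computing para block times from relay-chain data only:
// for each para, measure the time between consecutive relay-chain blocks that
// included one of its candidates. Types and names here are illustrative.
// Assumes `relay_blocks` are ordered by block height.
use std::collections::HashMap;

type ParaId = u32;

/// One relay-chain block: its timestamp (ms) and the paras with a candidate included in it.
struct RelayBlockInfo {
    timestamp_ms: u64,
    included_paras: Vec<ParaId>,
}

/// Observed block times (ms) per para, derived purely from relay-chain data.
fn para_block_times(relay_blocks: &[RelayBlockInfo]) -> HashMap<ParaId, Vec<u64>> {
    let mut last_seen: HashMap<ParaId, u64> = HashMap::new();
    let mut times: HashMap<ParaId, Vec<u64>> = HashMap::new();

    for block in relay_blocks {
        for para in &block.included_paras {
            // Record the gap since this para's previous inclusion, if any.
            if let Some(prev) = last_seen.insert(*para, block.timestamp_ms) {
                times.entry(*para).or_default().push(block.timestamp_ms.saturating_sub(prev));
            }
        }
    }
    times
}
```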

sandreim (Contributor) commented:

This is also overlapping with paritytech/polkadot-sdk#874.

eskimor (Member, Author) commented Oct 20, 2022

@tdimitrov could you have a look into this one as well, as soon as Versi is operational again please?

ggwpez (Member) commented Oct 20, 2022

How does this relate to https://github.com/paritytech/polkadot-stps/ ?
Do you want to produce some official sTPS numbers or just get an idea?

sandreim (Contributor) commented:

In the context of this issue we want to identify the performance bottlenecks, so we have a high-level idea of where to dig further and iterate on optimizing node performance with respect to parachain consensus.

eskimor (Member, Author) commented Oct 20, 2022

How does this relate to https://github.com/paritytech/polkadot-stps/ ? Do you want to produce some official sTPS numbers or just get an idea?

Thanks for the pointer to polkadot-stps. I am not interested in official numbers at this point; mostly I want to know what we are able to do right now. So if someone comes (and someone will come) and asks whether we can scale up Kusama some more in the number of para validators and parachains, I can tell them whether we can or not.

tdimitrov self-assigned this Oct 20, 2022
tdimitrov (Contributor) commented Oct 28, 2022

I ran 250 validators with 50 parachains on Versi.

Block times on relay chain and parachains looked generally okay. There were occasional 12/18/24 sec block times on some parachains.
On the next day one parachain stalled, but there was an issue with its database, so I assume this was unrelated.

The new parachains were onboarded between 16:30 and 17:30 (timestamps on the graphs below).

I noticed approval-distribution channel size increased slightly:
Screenshot from 2022-10-27 19-19-37
There were also some peaks in the approval-voting and bitfield-distribution subsystems' channel sizes:
Screenshot from 2022-10-27 19-19-44
Screenshot from 2022-10-27 19-20-00

ToFs for approval-distribution were increased:
Screenshot from 2022-10-27 19-20-19
ToFs of approval-voting and bitfield-distribution increased a little too:
Screenshot from 2022-10-27 19-20-39
Screenshot from 2022-10-27 19-21-11
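
For context, ToF here is the subsystem channel "time of flight": the time a message spends queued between being sent to a subsystem and the subsystem pulling it off its channel. A generic illustration of what these histograms measure (not the overseer code):

```rust
// Generic illustration of a time-of-flight measurement: stamp a message when it
// is enqueued and observe the elapsed time when the receiving task dequeues it.
use std::time::Instant;

struct Timed<M> {
    enqueued_at: Instant,
    msg: M,
}

/// Called by the sender just before enqueueing.
fn stamp<M>(msg: M) -> Timed<M> {
    Timed { enqueued_at: Instant::now(), msg }
}

/// Called by the receiver; `observe` would feed a Prometheus histogram in practice.
fn record_tof<M>(timed: Timed<M>, observe: impl Fn(f64)) -> M {
    observe(timed.enqueued_at.elapsed().as_secs_f64());
    timed.msg
}
```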

eskimor (Member, Author) commented Oct 28, 2022

Nice, thank you Tsveto! So it looks like we are already pushing limits here. I would be very interested in how this behaves when extending this to more validators. I have seen in the past that approval-distribution does not seem to like an increased validator set; it would be good to confirm or disprove this. So in particular:

Increase the validator set even further, let's say 100 more nodes - see how ToFs and channel sizes behave.

rphmeier (Contributor) commented Oct 31, 2022

I find the apparent bimodal distribution of Approval-Distribution ToFs quite curious; that is, the gap between messages that arrive instantly and those which are delayed. It makes me wonder if it's blocking somewhere it shouldn't be, for instance on approval-voting DB writes.

The heatmap is also clearly on an exponential scale, so the orange/red message volumes are much higher than the blue/green. It still looks as though a majority of messages are handled almost instantaneously.

With the exponential scale in mind, it doesn't seem like approval-voting or bitfield-distribution is bottlenecking at all. While there are a few outliers that take a long time, for the most part all messages are processed instantaneously.

paritytech/polkadot-sdk#841 was a hypothesis in the past, but @sandreim investigated back then and didn't see convincing evidence that it was a cause.

If it is true that approval-distribution is not waiting on the approval-voting subsystem, then it may just be waiting on the network-bridge. Network-bridge ToFs would be interesting to look at as well. Reputation handling clogging up the queue is a potential concern.

sandreim (Contributor) commented Nov 1, 2022

Looks like approval-voting will block quite a lot when sending messages to approval-distribution, as seen in this chart. Other than this one, there isn't anything blocking on the same scale that I can see using this metric.

Screenshot 2022-11-01 at 17 02 41

https://grafana.parity-mgmt.parity.io/goto/73iMs9N4z?orgId=1

rphmeier (Contributor) commented Nov 2, 2022

questions on my mind:

  • how are the slow ToFs distributed across machines? i.e. do we see roughly the same ToF graph on all machines or are a few of them slower than others?
  • what is the bottleneck in approval-distribution? are messages not being processed because of work being done in the approval-distribution subsystem or because it's waiting on something else, which is slow?
  • since it only communicates with the network bridge and approval-voting, is network-bridge the culprit?
  • if so, why?
  • How does the behavior change as we scale Versi?

eskimor (Member, Author) commented Nov 2, 2022

Where does approval-voting actually block on approval-distribution? I only see it sending messages via unbounded channels to approval-distribution.

eskimor (Member, Author) commented Nov 2, 2022

questions on my mind:

* how are the slow ToFs distributed across machines? i.e. do we see roughly the same ToF graph on all machines or are a few of them slower than others?

* what is the bottleneck in approval-distribution? are messages not being processed because of work being done in the approval-distribution subsystem or because it's waiting on something else, which is slow?

* since it only communicates with the network bridge and approval-voting, is network-bridge the culprit?

* if so, why?

* How does the behavior change as we scale Versi?

We identified a bottleneck. I would expect fixing that to bump up performance at least threefold.

sandreim (Contributor) commented Nov 2, 2022

questions on my mind:

  • how are the slow ToFs distributed across machines? i.e. do we see roughly the same ToF graph on all machines or are a few of them slower than others?

Things look similar across machines.

  • what is the bottleneck in approval-distribution? are messages not being processed because of work being done in the approval-distribution subsystem or because it's waiting on something else, which is slow?

The issue we suspect is causing most of the pain is the importing of assignments and approvals, which is serialized: both wait for the approval-voting subsystem checks before doing bookkeeping. I'm working on some changes to wait for the approval-voting checks in parallel while still serialising work per peer, so as not to break deduplication (see the sketch at the end of this comment). We would expect approval-voting to have more work to do in this scenario and for it not to block when sending messages to approval-distribution, as we'll clear the queue faster there.

  • since it only communicates with the network bridge and approval-voting, is network-bridge the culprit?

network-bridge looks fine, very few instances where we'd block on its queue being full.

  • if so, why?
  • How does the behavior change as we scale Versi?

Currently we are blocked by a deployment/networking issue. We can't even get to 200 validators because of low connectivity (authority discovery failures).
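
A rough sketch of the per-peer approach described above (hypothetical types and plain tokio/futures primitives, not the actual subsystem change): checks for different peers run concurrently, while messages from a single peer stay in order so deduplication is unaffected.

```rust
// Hypothetical sketch, not the actual approval-distribution change: each peer gets
// its own worker task and FIFO queue, so approval-voting checks for different peers
// overlap, while messages from one peer are still processed strictly in order.
use std::collections::HashMap;
use futures::{channel::mpsc, SinkExt, StreamExt};

type PeerId = u64; // stand-in for the real network PeerId

#[derive(Debug)]
struct NetMessage; // stand-in for an assignment or approval

async fn check_with_approval_voting(_m: &NetMessage) -> bool {
    // stand-in for the (potentially slow) approval-voting check
    true
}

#[derive(Default)]
struct PerPeerQueues {
    queues: HashMap<PeerId, mpsc::Sender<NetMessage>>,
}

impl PerPeerQueues {
    async fn dispatch(&mut self, peer: PeerId, msg: NetMessage) {
        let tx = self.queues.entry(peer).or_insert_with(|| {
            let (tx, mut rx) = mpsc::channel::<NetMessage>(64);
            tokio::spawn(async move {
                while let Some(m) = rx.next().await {
                    if check_with_approval_voting(&m).await {
                        // bookkeeping / gossip to other peers would happen here
                    }
                }
            });
            tx
        });
        // Back-pressure applies per peer only; other peers are unaffected.
        let _ = tx.send(msg).await;
    }
}
```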

sandreim (Contributor) commented Nov 2, 2022

Where does approval-voting actually block on approval-distribution? I only see it sending messages via unbounded channels to approval-distribution.

ctx.send_messages(messages.into_iter()).await;

eskimor (Member, Author) commented Nov 2, 2022

Darn 🥇 ... yep, I did not see that one 🙈

rphmeier (Contributor) commented Nov 3, 2022

ctx.send_messages(messages.into_iter()).await;

It's worth noting that the BecomeActive logic only gets triggered once during a node's lifetime - that is, when it first gets into sync. It should be unbounded but this isn't going to impact long-running performance.
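
As a generic illustration of the distinction here (plain futures channels, not the subsystem API): a bounded sender awaits when the receiver's queue is full, which is where back-pressure can stall the sending side, while an unbounded sender never blocks, at the cost of an unbounded queue.

```rust
// Generic illustration, not the overseer API: bounded sends apply back-pressure
// (the await can stall when the queue is full), unbounded sends never block.
use futures::{channel::mpsc, SinkExt};

async fn demo() {
    let (mut bounded_tx, _bounded_rx) = mpsc::channel::<u32>(1);
    let (unbounded_tx, _unbounded_rx) = mpsc::unbounded::<u32>();

    // Can stall once the small buffer fills up and the receiver is slow.
    let _ = bounded_tx.send(1).await;

    // Never blocks the sender; the receiver's queue simply grows.
    let _ = unbounded_tx.unbounded_send(2);
}
```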

The issue we suspect is causing most of the pain is the importing of assignments and approvals, which is serialized: both wait for the approval-voting subsystem checks before doing bookkeeping. I'm working on some changes to wait for the approval-voting checks in parallel while still serialising work per peer, so as not to break deduplication. We would expect approval-voting to have more work to do in this scenario and for it not to block when sending messages to approval-distribution, as we'll clear the queue faster there.

Ok, I hope it is indeed the bottleneck. My understanding was that we only send assignments or approvals over to approval-voting the first time we receive them, so only a minority of incoming messages actually trigger that code path. It would explain the bimodal distribution of ToFs we see, but my expectation was that anything waiting on approval-voting would be bottlenecked on the DB write, and we disproved that in the past, didn't we? Otherwise, the only work that approval-voting does is verify a signature (not insignificant, but would be a surprising bottleneck at these message volumes), update some in-memory state, and do some DB reads (which should be cached by RocksDB, no?)

sandreim (Contributor) commented Nov 3, 2022

but my expectation was that anything waiting on approval-voting would be bottlenecked on the DB write, and we disproved that in the past, didn't we? Otherwise, the only work that approval-voting does is verify a signature (not insignificant, but would be a surprising bottleneck at these message volumes), update some in-memory state, and do some DB reads (which should be cached by RocksDB, no?)

Yes, we disproved the DB as being the bottleneck in the past.

sandreim (Contributor) commented:

We concluded the last round of testing at 350 para validators and 60 parachains with PR #6530.

Finality lag:
Screenshot 2023-01-23 at 14 52 27

Parachain block times:
Screenshot 2023-01-23 at 14 52 51

Board https://github.com/orgs/paritytech/projects/63 for tracking.
