
add elastic scaling MVP guide #4663

Merged · merged 7 commits into master on Jul 17, 2024

Conversation

alindima (Contributor):

Resolves #4468

Gives parachain teams instructions on how to enable the elastic scaling MVP.

Still a draft because it depends on further changes we make to the slot-based collator: #4097

Parachains cannot use this yet because the collator has not been released and no relay chain network has been configured for elastic scaling yet.

@alindima alindima added R0-silent Changes should not be mentioned in any release notes T11-documentation This PR/Issue is related to documentation. labels May 31, 2024
@alindima alindima marked this pull request as draft May 31, 2024 14:44
Comment on lines 24 to 27
```rust
//! 1. **A parachain can use at most 3 cores at a time.** This limitation stems from the fact that
//! every parablock has an execution timeout of 2 seconds and the relay chain block authoring
//! takes 6 seconds. Therefore, assuming parablock authoring is sequential, a collator only has
//! enough time to build 3 candidates in a relay chain slot.
```
Contributor:

This assumes that using the full 2s of execution is the only use case; it is also possible to use little computation but reach the PoV limit.

alindima (Contributor, Author):

Yeah, I created this guide assuming that parachains wanting to use multiple cores would do so to achieve higher throughput, but it can also be used to achieve lower latency (at least to inclusion in a candidate). I'll rephrase.

Member:

higher throughput,

can also mean more data.

//! 1. Increase the `BLOCK_PROCESSING_VELOCITY` to the desired value. In this example, 3.
//!
//! ```rust
//! const BLOCK_PROCESSING_VELOCITY: u32 = 3;
Contributor:

Suggested change:

```diff
-//! const BLOCK_PROCESSING_VELOCITY: u32 = 3;
+//! const BLOCK_PROCESSING_VELOCITY: u32 = (RELAY_CHAIN_SLOT_TIME / MIN_SLOT_DURATION);
```
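For context, a minimal sketch of how these constants could relate. `RELAY_CHAIN_SLOT_TIME` and `MIN_SLOT_DURATION` are illustrative names taken from the suggestion above, not existing template constants, and the values are examples only:

```rust
// Illustrative values; the names come from the review suggestion, not the template.
const RELAY_CHAIN_SLOT_TIME: u32 = 6000; // relay chain slot duration, in milliseconds
const MIN_SLOT_DURATION: u32 = 2000;     // minimum parachain block time, in milliseconds

// With 6s relay chain slots and 2s parachain blocks, the parachain can
// author up to 3 blocks per relay chain block.
const BLOCK_PROCESSING_VELOCITY: u32 = RELAY_CHAIN_SLOT_TIME / MIN_SLOT_DURATION;
```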

Contributor:

use docify please :)

//! 2. Decrease the `MILLISECS_PER_BLOCK` to the desired value. In this example, 2000.
//!
//! ```rust
//! const MILLISECS_PER_BLOCK: u32 = 2000;
Contributor:

Suggested change:

```diff
-//! const MILLISECS_PER_BLOCK: u32 = 2000;
+//! const MILLISECS_PER_BLOCK: u32 = MIN_SLOT_DURATION;
```
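Tying the two suggestions together, a sketch of the related timing constants after the change. Deriving the slot duration from the block time mirrors the parachain template's convention; types follow the quoted snippet and values are illustrative:

```rust
// Target one parachain block every 2 seconds (illustrative value).
const MILLISECS_PER_BLOCK: u32 = 2000;

// The parachain template derives the slot duration from the block time,
// so lowering MILLISECS_PER_BLOCK shortens the slots accordingly.
const SLOT_DURATION: u32 = MILLISECS_PER_BLOCK;
```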

```rust
//!
//! **This guide assumes full familiarity with Asynchronous Backing and its terminology, as defined
//! in <https://wiki.polkadot.network/docs/maintain-guides-async-backing>.
//! Furthermore, the parachain should have already been upgraded according to the guide.**
```
kianenigma (Contributor), Jun 4, 2024:

You can also link to #4363 once it is merged.

Moreover, I think you can also benefit from suggestions similar to #4363 (comment).

PTAL: https://paritytech.github.io/polkadot-sdk/master/polkadot_sdk_docs/meta_contributing/index.html#why-rust-docs

```rust
//! still [work in progress](https://github.com/paritytech/polkadot-sdk/issues/1829).
//! The current limitations of the MVP are described below:
//!
//! 1. **Limited core count**. Parachain block authoring is sequential, so the second block will
```
kianenigma (Contributor):
How do we know that these 3 para-blocks are still valid when imported into 3 parallel cores?

For example, suppose there are 2 transactions in each parablock. The collator proposes [t1, t2, t3, t4, t5, t6] and they are all valid, but the validity of t6 depends on the execution of t1. When imported into 3 separate cores, t1 and t6 are no longer present together.

In general, I would assume all of this is handled in the cumulus block building code. My question is: does it?

sandreim (Contributor):

These 3 blocks are expected to form a chain; the ones that don't will not be included.

alindima (Contributor, Author):

These 3 blocks are expected to form a chain; the ones that don't will not be included.

Yes. Also, a candidate will not be included until all of its ancestors are included. If one ancestor is not included (times out availability) or is concluded invalid via a dispute, all of its descendants will also be evicted from the cores. So we only deal with candidate chains.

kianenigma (Contributor):

Sorry, I still don't get this.

@sandreim if they form a chain, and part of the chain is executed in one core and part of it in another core, how does either of the cores check that the whole thing is a chain?

In my example of [t1, t2, t3, t4, t5, t6], [t1, t2, t3] goes into one core and [t4, t5, t6] into another. The whole [t1 -> t6] indeed forms a chain, and the execution of t5 depends on the execution of t2.

Perhaps what you mean to say is that the transactions that go into different cores must in fact be independent of one another?

sandreim (Contributor), Jul 22, 2024:

The transactions are not independent. We achieve parallel execution even in that case, and we still check that they form a chain by passing in the appropriate validation inputs (`PersistedValidationData`). We can validate t2 because we already have the parent head data of t1 from the collator of t2. So we can correctly construct the inputs, and the PoV contains the right data (t2 was built after t1 by the collator).
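For reference, the struct mentioned above, as defined in polkadot-primitives (field set as of this writing; doc comments abridged):

```rust
/// Validation inputs the relay chain provides when checking a candidate,
/// letting a core validate a parablock against its parent's output state.
pub struct PersistedValidationData<H = Hash, N = BlockNumber> {
    /// The parent head-data.
    pub parent_head: HeadData,
    /// The relay chain block number this is in the context of.
    pub relay_parent_number: N,
    /// The relay chain block storage root this is in the context of.
    pub relay_parent_storage_root: H,
    /// The maximum legal size of a PoV block, in bytes.
    pub max_pov_size: u32,
}
```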

kianenigma (Contributor):

Is this the answer?

[t1, t2, t3] goes into one core and [t4, t5, t6] into another, but the PoV of the latter contains the full execution of the former?

I think this is fine, but truthfully, to scale up, I think different transactions going into different cores must be independent, or else the system can only scale as much as you can jack up one collator.

ordian (Member), Jul 22, 2024:

but the PoV of the latter contains the full execution of the former?

PoV of [t4, t5, t6] would refer to the state post [t1, t2, t3] execution.

I think different transactions going into different cores must be independent, or else the system can only scale as much as you can jack up one collator

One way of achieving that without jacking up one collator would be to have a DAG instead of a blockchain (two blocks having the same parent state). But then you'd need to somehow ensure they are truly independent. This could be done by, e.g., specifying dependencies in the transactions themselves (à la Solana or Ethereum access lists).

Another way would be to rely on the multiple CPU cores of a collator and implement execution on the collator side differently, with optimistic concurrency control (à la Monad). This only requires modification on the collator side and does not affect the transaction format.

kianenigma (Contributor):

Okay, thanks @ordian.

I totally agree with all of your directions as well. I am not sure if you have seen it or not, but my MSc thesis was on the same topic 🙈 https://github.com/kianenigma/SonicChain. I think what I have done there is similar to access lists, and it should be quite easy to add to FRAME and Substrate: each transaction declares, via its author, which storage keys it "thinks" it will access. Then the collators can easily agree among themselves to collate non-conflicting transactions.

This is a problem that is best solved on the collator side, and once there is a lot of demand. Polkadot is already doing what it should do, and should not do any "magic" to handle this.

Once there is more demand:

  1. Either collators just jack up, as they are kind of expected to do now. This won't scale a lot, but it will for a bit.
  2. I think the access-list approach is super cool and will scale (see the sketch after this list).
  3. OCC is fancy but similarly doesn't scale, because there are only so many CPU cores, and you are still bound to one collator somehow filling up 8 Polkadot cores. Option 2 is much more powerful, because you can enable 8 collators to fill 8 blocks simultaneously.
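A minimal sketch of the access-list idea discussed above; nothing like this exists in FRAME today, and all names are hypothetical:

```rust
// Hypothetical sketch of the access-list idea; none of these types exist in FRAME.

/// A transaction declares, up front, the storage keys it expects to touch.
struct AccessList {
    reads: Vec<Vec<u8>>,  // storage keys the transaction reads
    writes: Vec<Vec<u8>>, // storage keys the transaction writes
}

/// Two transactions can go into blocks on different cores only if neither
/// writes a key that the other reads or writes.
fn independent(a: &AccessList, b: &AccessList) -> bool {
    let conflicts = |writes: &[Vec<u8>], other: &AccessList| {
        writes
            .iter()
            .any(|k| other.reads.contains(k) || other.writes.contains(k))
    };
    !conflicts(&a.writes, b) && !conflicts(&b.writes, a)
}
```

Collators could then partition the transaction pool into non-conflicting sets, one per core, without having to execute anything first.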

ordian (Member):

OCC is fancy but similarly doesn't scale, because there are only so many CPU cores, and you are still bound to one collator somehow filling up 8 Polkadot cores. Option 2 is much more powerful, because you can enable 8 collators to fill 8 blocks simultaneously.

I agree here only partially. First, you can't produce (para)blocks at a rate faster than collators/full-nodes can import them, unless they are not checking everything themselves. But even if they are not checking, this assumes that the bottleneck will be CPU and not storage/IO, which is not currently the case. Even with NOMT and other future optimizations, you can't accept transactions faster than you can modify the state. You need to know the latest state in order to check transactions, unless we're talking about sharding the parachain's state itself.

Another argument is that single-threaded performance is going to reach a plateau eventually (whether due to Moore's law or physics), and nowadays even smartphones have 8 cores, so why not utilize them all instead of doing everything single-threaded?

That being said, I think options 2 and 3 are composable, you can do both.

sandreim (Contributor), Jul 22, 2024:

The current status quo is that we rely on 1 (beefy collators). 2 can certainly scale well, but it seems complicated and is not really compatible with the relay chain, which expects chains, not a DAG. #4696 (comment) shows the limitations of what is possible with reference hardware and 5 collators.

We had a nice brainstorming session with @skunert and @eskimor on the subject some time ago. We think the best way forward is to implement a transaction streaming mechanism: at the beginning of each slot, the block author sends transactions to the next block author as it pushes them into the current block. By the time it announces the block, the next author should already have all state changes applied, doesn't need to wait to import the block, and can immediately start building its own. And so on.

If that is not enough, the next block author can start to speculatively build its next block, updating the transactions and state as it learns what the current author is putting in its block.
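A rough sketch of the streaming hand-off described above; the types and the channel-based shape are assumptions for illustration, not an actual cumulus API:

```rust
// Hypothetical illustration of transaction streaming between consecutive
// block authors; not an actual cumulus API.
use std::sync::mpsc;

struct Tx(Vec<u8>); // opaque encoded transaction

/// The current author streams each transaction to the next author at the
/// moment it is pushed into the block under construction.
fn author_and_stream(pool: Vec<Tx>, to_next_author: mpsc::Sender<Tx>) {
    for tx in pool {
        // ... apply `tx` to the block being built ...
        let _ = to_next_author.send(tx); // forward immediately, not at announcement
    }
    // By announcement time, the next author has already applied these state
    // changes and can start building its own block without a full import.
}
```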

kianenigma (Contributor) left a comment:

It seems best to me for you to first coordinate with #4363, possibly push it to completion if Radha is not available, and then build this on top of it.

sandreim (Contributor) commented Jun 4, 2024:

It seems best to me for you to first coordinate with #4363, possibly push it to completion if Radha is not available, and then build this on top of it.

That's a good suggestion, I think we should align on better terminology and minimize the amount of changes required to enable these features.

DrW3RK (Contributor) commented Jun 10, 2024:

Applied changes to #4363. Needs one more approval for merge.

kianenigma (Contributor):

Applied changes to #4363. Needs one more approval for merge.

Almost, but not yet :)

@skunert skunert self-requested a review June 25, 2024 15:23
kianenigma (Contributor):

Would be happy to review this once it is updated based on the previous guides.

alindima (Contributor, Author):

Would be happy to review this once it is updated based on the previous guides.

Thanks, I'll let you know when this is ready. I've put this on hold a bit, waiting for #4097 to be merged (and it has been) and to get more specific performance numbers here: #4696.

alindima (Contributor, Author):

OK, I thought about using docify here. But how can I, considering that the parachain template is not updated for elastic scaling yet? (And we don't plan to update it yet, as it's still an experimental MVP.)
AFAICT I can only reference existing code, which is not really helpful. We can use docify only after we modify the template, by which time the async backing guide will no longer be able to use docify. @kianenigma

alindima (Contributor, Author) commented Jul 12, 2024:

Another problem I see with docify is that it only works on types: you cannot annotate and import blocks of code (unless you define artificial functions for them).
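For context, the docify pattern under discussion looks roughly like this; the path and item names are illustrative, and the wrapper function is exactly the "artificial function" workaround mentioned above:

```rust
// Illustrative use of the docify crate; path and names are made up.

// Embed the exported item's source into this doc module's rendered docs:
#![doc = docify::embed!("./src/guides/enable_elastic_scaling_mvp.rs", velocity_example)]

// Mark a compiled item so docify can extract its source verbatim. Since docify
// annotates items, a wrapper function is needed to export a block of code.
#[docify::export]
fn velocity_example() {
    const BLOCK_PROCESSING_VELOCITY: u32 = 3;
    let _ = BLOCK_PROCESSING_VELOCITY;
}
```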

@alindima alindima removed the R0-silent Changes should not be mentioned in any release notes label Jul 12, 2024
@alindima alindima marked this pull request as ready for review July 12, 2024 08:35
@alindima
Copy link
Contributor Author

I tried using docify as much as it made sense. Please review.

@sandreim sandreim added this pull request to the merge queue Jul 17, 2024
Merged via the queue into master with commit 0db5092 Jul 17, 2024
157 of 162 checks passed
@sandreim sandreim deleted the alindima/elastic-scaling-mvp-guide branch July 17, 2024 09:53
ordian added a commit that referenced this pull request Jul 17, 2024
* master:
  add elastic scaling MVP guide (#4663)
  Send PeerViewChange with high priority (#4755)
  [ci] Update forklift in CI image (#5032)
  Adjust base value for statement-distribution regression tests (#5028)
  [pallet_contracts] Add support for transient storage in contracts host functions (#4566)
  [1 / 5] Optimize logic for gossiping assignments (#4848)
  Remove `pallet-getter` usage from pallet-session (#4972)
  command-action: added scoped permissions to the github tokens (#5016)
  net/litep2p: Propagate ValuePut events to the network backend (#5018)
  rpc: add back rpc logger (#4952)
  Updated substrate-relay version for tests (#5017)
  Remove most all usage of `sp-std` (#5010)
  Use sp_runtime::traits::BadOrigin (#5011)
paritytech-ci pushed a commit that referenced this pull request Jul 17, 2024
jpserrat pushed a commit to jpserrat/polkadot-sdk that referenced this pull request Jul 18, 2024
ordian added a commit that referenced this pull request Jul 18, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
ordian added a commit that referenced this pull request Aug 6, 2024
Labels
T11-documentation This PR/Issue is related to documentation.
Development

Successfully merging this pull request may close these issues.

Elastic scaling: document how to enable elastic scaling in parachain runtime
6 participants