# Idea: Core Groups #607
I like this idea and also think that more batching is the way to go as it minimises the overhead. I will expand this some more soon.
### Core Time

It makes sense to introduce a core group. The parachains are assigned to cores. For fractions they can also be assigned to the same core, then grouped in a core group. This is for sure useful for more exotic scheduling as described elsewhere.

### Backpressure

When dealing with large core groups it could be possible that the validator has more PVF work to do than what is possible in a given time frame. Execution priority is disputes first, approval voting next, and backing last, to enforce faster finality while also slowing down parachain block production by backing fewer candidates in a group.
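A minimal sketch of that execution-priority ordering, assuming a hypothetical `PvfExecutionPriority` enum (not the actual node's queue types):

```rust
// Hypothetical priority ordering for queued PVF executions: disputes first,
// then approval voting, then backing. Shedding backing work first slows
// parachain block production instead of delaying finality.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PvfExecutionPriority {
    Dispute,
    Approval,
    Backing,
}

fn main() {
    let mut queue = vec![
        PvfExecutionPriority::Backing,
        PvfExecutionPriority::Approval,
        PvfExecutionPriority::Dispute,
    ];
    // Derived `Ord` follows declaration order, so sorting processes disputes
    // before approvals before backing.
    queue.sort();
    assert_eq!(queue.first(), Some(&PvfExecutionPriority::Dispute));
}
```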
We should do this of course. We should remain aware that workload variance per relay chain block increases somewhat by doing this, but maybe not too much, and total variance even less so. We'll still merge paritytech/polkadot#6782 or some flavor of it, of course. We'd ideally still sign approvals for multiple core groups à la #732 too, but it becomes less valuable now.
Yeah, I also don't think #701 makes that much sense if we move to core groups.
We need to figure out a lightweight algorithm for validators assigned to different cores to choose the subset(s) of possible core group candidates that they want to back, accounting for the fact that each parachain can produce as few as zero candidates or an unbounded number of valid candidates.
I thought a bunch more about core groups, and I don't think it's quite the right solution, nor does it really align with the scheduling infrastructure proposed in polkadot-fellows/RFCs#3. That's still a pending RFC so I won't take it as a given, but I haven't seen any other solid proposals for dynamic scheduling. To be clear, this SHOULD be an RFC, and I'm treating this as an initial discussion where nothing is set in stone.

### Reframing Goals

The original issue stated these goals:
I would reframe this way
### Why not core groups?

The main issue with core groups is that it hides a large amount of complexity in the validators of a backing group choosing which parachain candidates to back together. Any solution there would probably require running a consensus protocol among backers, and it leaves out the possibility for candidates to be backed at different points in time, i.e. for backing to be a continuous and asynchronous process that feeds into the relay chain. It also doesn't overlap nicely with the probabilistic scheduling approach in RFC-3, where backers don't know exactly which chains are scheduled on which cores at which times, but just roughly what could be scheduled. We are able to reuse backing statements from previous rounds to include stuff that didn't quite make it in. We should apply this same approach to core groups: what we want to do is bundle many small state transitions onto a single core. We don't know exactly which chains they're going to be from, and we do want backing statements about them to be reusable in later relay-chain blocks.

### Proposed Alternative

I'm still in favor of doing some kind of bundling. Goals (1) and (2) are pretty obviously useful at this point. So here are all the things that stay the same as in core groups:
Here's what's different:
### Integrating into Scheduling

To be honest, the only thing we need for cores is to specify their limits:

```rust
// Per-relay-chain-block core specs
struct CoreResourceLimits {
    // Total across all candidates backed in the same relay-chain block.
    total_execution_time: Duration,
    // Total across all candidates backed in the same relay-chain block.
    total_data_size: u32,
    // The minimum amount of execution time a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    execution_time_step: Duration,
    // The minimum amount of data a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    data_size_step: u32,
    // Maybe some overhead budgeting constants, i.e. each additional candidate has an expected overhead.
}
```

If we update the `CandidateCommitments` as well:

```rust
struct CandidateCommitments {
    // ...
    execution_time: ...,
    pov_size: ...,
}
```

then we get this really nice property that the number of candidates, as well as the distribution of resources across candidates going into a core, can vary from block to block. This is not just core groups, but dynamic market-based core groups, where we can regulate the number of candidates parachains produce with the core's resource limit parameters.

I'm thinking of modifying RFC-3 so that regions just specify the general resource consumption rate of a parachain and then chains get to decide how many blocks they make, how big they are, etc., without being attached to any particular size and rate.
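As a rough sketch of how a block author (or the runtime) could check a set of candidates against such limits, with each candidate's declared usage rounded up to the step sizes as described above. The struct shapes mirror the ones above; the helper itself and the concrete numbers are illustrative, not an existing API:

```rust
use std::time::Duration;

struct CoreResourceLimits {
    total_execution_time: Duration,
    total_data_size: u32,
    execution_time_step: Duration,
    data_size_step: u32,
}

// What a candidate declares (via its commitments) that it consumes.
struct CandidateUsage {
    execution_time: Duration,
    pov_size: u32,
}

// Round `value` up to the nearest multiple of `step` (assumes `step > 0`).
fn round_up(value: u128, step: u128) -> u128 {
    ((value + step - 1) / step) * step
}

/// Could this set of candidates be backed on the core in one relay-chain block?
fn fits_on_core(limits: &CoreResourceLimits, candidates: &[CandidateUsage]) -> bool {
    let time_step = limits.execution_time_step.as_nanos();
    let total_time: u128 = candidates
        .iter()
        .map(|c| round_up(c.execution_time.as_nanos(), time_step))
        .sum();
    let total_data: u128 = candidates
        .iter()
        .map(|c| round_up(c.pov_size as u128, limits.data_size_step as u128))
        .sum();

    total_time <= limits.total_execution_time.as_nanos()
        && total_data <= limits.total_data_size as u128
}

fn main() {
    // e.g. an 8s / 5MiB core with 500ms and 256KiB steps fits four candidates
    // of ~1.9s / 1MiB each (each rounded up to 2s), but not a fifth.
    let limits = CoreResourceLimits {
        total_execution_time: Duration::from_secs(8),
        total_data_size: 5 * 1024 * 1024,
        execution_time_step: Duration::from_millis(500),
        data_size_step: 256 * 1024,
    };
    let candidate = |ms, kib| CandidateUsage {
        execution_time: Duration::from_millis(ms),
        pov_size: kib * 1024,
    };
    let four: Vec<_> = (0..4).map(|_| candidate(1900, 1024)).collect();
    let five: Vec<_> = (0..5).map(|_| candidate(1900, 1024)).collect();
    assert!(fits_on_core(&limits, &four));
    assert!(!fits_on_core(&limits, &five));
}
```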
Yeah, this sounds like the essence anyways. I like it. Interestingly, we might even solve the "weights are too aggressive" problem here, but without touching the weights, although this becomes fraught elsewhere. cc @Sophia-Gold
Yeah this seems fine. It's possible to censor blocks you never backed, but so what.
Importantly, we fetch exactly the same availability chunk indices for each parablock in the availability core, so we only do the setup phase for one index set. It's thus no different from doing one big decoding, except maybe a couple cache misses. As an aside, we do not technically need availability and approvals cores to be the same either, but not doing so adds complexity, performs slightly worse if you must rerun the setup phase, and afaik yields no benefits. It'd look like this:
Yet reconstruction now fetches chunks with three optimization preference levels:
In essence, your proposal does the grouping in 2, but you could do the grouping in 4, if you accept having bigger bitfields and this nasty mess of reconstruction optimization preference levels. I doubt this improves anything.
It's possible to delay availability on blocks you never backed, but if you never backed them they've still been backed by enough other validators in the same group. Seems just as censorable as the current system - you know who backed it when it goes on-chain and can DoS them. It's faster as a validator to get your chunks in a single request but as an example:
OK, this is good info. Yes - glad this amortizes nicely, though we would need some new APIs to do that in practice and it's not a blocker for an initial implementation.
I also considered doing the grouping in 4 but it seems like it's always best to do the grouping as early as possible and then move it later when that presents challenges.
Yes, and in RFC-3 I've proposed introducing some "hypercores" for making scheduling probabilistic (i.e. relay chain block authors can temporarily "widen" cores and bitfields up to some per-block limit to allow making up for missed opportunities to post stuff on-chain). As another aside, it's worth noting that RFC-5 just specifies the proportion of resources each parachain should have on the core. That interface would work for approaches where we have one block per core per relay-chain block, but could also adapt to core-sharing under the hood, e.g. if a parachain wants to, they could just build 2x as many blocks, half the size, and some of them will go in earlier, some of them will go in later. Obviously this comes with some additional cumulative PoV usage, so it's not something you can do for free, but it may still be useful. RFC-3 also now doesn't phrase things in terms of "blocks" but rather core resource consumption rates, so it should be totally forward-compatible with these changes.
Assume backing groups of 4 or 5 validators with a backing threshold of 2. If X is backed by 1 & 2 and Y is backed by 3 & 4, but not 1 & 2, then X, as part of the bundle X+Y, cannot become available if 3 & 4 withhold the chunks of Y. Imho, this is not really an immediate problem, and these backing groups sound too large anyways, but it'll maybe bother @AlistairStewart or SR Labs, or complicate billing. I suppose grouping in inclusion fixes this, but likely not worth those headaches. We could have logic that mitigates this under your grouping-in-backing model too. It's easiest if backing is 2 out of 3, but other options:

Example 1. If you make the relay chain block after availability fails then you could retry with the chunks you received, but not with the chunks you did not receive. This is bad in Aurababe, since you're not anonymous 3/4 of the time, but you're anonymous in Sassafras, so then it'll work reasonably well.

Example 2. If availability is really slow for some availability core, then we dynamically add more fine-grained votes to the bitfield, before abandoning the whole availability run.

As a bike shed, I'd think names like "hypercores" or even "core groups" confuse people, so I'd suggest something more inspired by the protocol.
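A toy model of the X / Y withholding scenario above (purely illustrative types, not the actual availability logic): because the bundle is treated as one availability unit, X's inclusion depends on Y's data being obtainable.

```rust
use std::collections::{HashMap, HashSet};

// Toy model: a bundle can only become available if the data of *every*
// candidate in it is obtainable, and a candidate's PoV starts out known
// only to the validators that backed it.
fn bundle_can_become_available(
    backers_of: &HashMap<&str, HashSet<u32>>, // candidate -> its backers
    withholding: &HashSet<u32>,               // validators refusing to serve data
) -> bool {
    backers_of
        .values()
        .all(|backers| backers.iter().any(|v| !withholding.contains(v)))
}

fn main() {
    // Backing group of 4 with threshold 2: X backed by 1 & 2, Y by 3 & 4.
    let backers_of: HashMap<&str, HashSet<u32>> = HashMap::from([
        ("X", HashSet::from([1, 2])),
        ("Y", HashSet::from([3, 4])),
    ]);

    // If 3 & 4 withhold Y's data, the bundle X+Y cannot become available,
    // so X is blocked even though its own backers are honest.
    let withholding = HashSet::from([3, 4]);
    assert!(!bundle_can_become_available(&backers_of, &withholding));
}
```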
Also: if cores contain multiple blocks then we'll likely validate them in parallel. We discussed flavors of SPREE being truly a separate block, but this sounded excessive to me, and is not how Basti envisions SPREE. We might anyways have truly separate-ish blocks from the same parachain being validated in parallel here, so one could imagine multiple blocks from the same parachain having merged PoV witnesses.
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/altering-polkadots-fork-choice-to-reduce-da-load/3389/13
Any idea how hard @rphmeier's flavor of core groups is going to be?
Essentially this requires changes across the entire stack, so it will not be easy at all. I'm going to start working on a PoC after I have a very good understanding of the details.
Ahh, I suppose the initial idea changed little after backing, plus whatever Cumulus does, but the improved version by Rob demands deep changes in availability too.
I've drafted a PoC plan based on the discussion here. Nothing here is set in stone, so feel free to comment. I expect the PoC to be very well documented and incrementally built. @eskimor @rphmeier @burdges If this sounds good, I will start to create individual tickets for the work.

### Goal

Minimizing the constant overheads in approval checking and availability recovery. Compatibility with RFC-1, RFC-3, RFC-5.

### High level changes
### New configuration parameters
### Target Numbers for testing

- 6s parablocks
- core_groups: 75

From a bird's-eye view it becomes a necessity to bump CPUs from 8 to 10 or more to accommodate this scale and be able to absorb any surge in block space utilization (longer PVF runs). The testing will use a mix of PVF execution times and PoV sizes using Glutton, which should amount to a target of 50% utilization of block space and time. We can easily adjust the Glutton configuration to increase/decrease that target.

### Runtime

At least one new API will be needed to expose the core group assignments.

### Backing
### Availability

As we fetch the same availability chunk indices for each candidate in the core group, we only need to do the recovery setup phase for one index set. We'd still prefer to fetch the PoV from the backing group up to some limit (currently 128KB), but if that fails we'll do a normal chunk recovery. There are additional optimizations left on the table which can be implemented later, such as:
### Approval checking

For the PoC it would be best if we could reuse CandidateIndex and CoreIndex from assignment certificates and make them point to core groups.

### Execution

The target numbers assume the worst-case scenario, where all cores in a group use 100% of their maximum execution time. This means we need to bump PVF workers to 5, which is 50% of total node CPU bandwidth. The other 50% must be enough for the next two top CPU consumers: networking and availability recovery. The erasure recovery and network optimizations could possibly give back 1 CPU that can be used in validation.
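Back-of-the-envelope numbers for that worst case, using the 10-CPU node from the target numbers (illustrative only, not a measured profile):

```rust
fn main() {
    // Assumed node size from the target numbers above.
    let node_cpus = 10u32;
    // Worst case: 5 PVF workers fully busy.
    let pvf_workers = 5u32;
    println!("PVF execution: {}%", 100 * pvf_workers / node_cpus); // 50%

    // The other 50% has to cover the next two biggest consumers, networking
    // and availability recovery; the erasure/network optimizations could
    // hand roughly 1 of those CPUs back to validation.
    println!("networking + availability budget: {} CPUs", node_cpus - pvf_workers);
}
```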
Important: it's relay chain block producers who allocate candidates to cores, but it's still backers on the hook for the execution time. We thus need backers to place the resource consumption, in time and memory, into their backing statements. In principle, the backers' statements about memory usage determine if nodes are running parachain blocks in parallel.
I imagined this could go in the candidate receipt, where collators actually determine it and backers just check that the resource requirements hold. PoV size should be committed to in the candidate receipt anyway, for a bunch of reasons, but committing to execution time accurately is probably a bit harder, so it'd tend to be an overestimate.
I was imagining we'd still keep it as a single "core", but just have multiple things go onto it. If we stick with 1 core per candidate, it'll lead to the number of cores per relay-chain block fluctuating based on what gets posted. That seems odd. My mental model is based around regions, so I'm just imagining that backers work on all the regions that are assigned to the core, however many of them there may be. For the PoC we have to do something different, so maybe that's what this is. But we could achieve the same thing by just having the
Wouldn't we rather have the resource constraints specified on a per-core rather than a per-group basis? Or is an
Yeah, this sounds right
We could still call these cores. Not sure we need a new term, just that we need to change the semantic meaning of a core. Everything else looks right here.
Yep
We should be thinking in regions, not cores. e.g. we should support including multiple candidates from the same process in a single bundle. I'm not sure why a backer would need to be assigned to many cores at once. At least in light of RFC-3, we will already have many processes multiplexed on the same core, so a single backing group can work on all those regions at once anyway. Maybe fine for a PoC, but I expect this'd have to change a lot for production.
Shouldn't it be the case that the sum of all candidates on a single core uses e.g. 2 seconds of execution time, or whatever the limit per core is? I suppose we could push higher when taking parallelism into account. However, at that point we are starting to break compatibility with RFC-1/3, as we assume that one parachain is even capable of occupying 100% of a core. Of course, with block authoring/import parallelism on the Cumulus side we can probably achieve this. However, I'd still push for the approach where a single "core" can just handle larger amounts of throughput per block, but we expect that many chains are simultaneously multiplexed across them, e.g. every core can handle 8s of execution per 6s, but any individual state transition put on that core is at most 2s of execution. This'd let us lean into the amortization without introducing any new concepts, except for the resource allocation schedule of the core.
My assumption here was that we do not want to change backing or implement parts of RFC-3. But yeah, the PoC should still in principle be as close as possible to production. Something like this should be workable
Why should a parachain using 100% of the core break RFC-1/3? That would basically be what we have now, and it means it will be the only thing in an
Yes, that is doable, but I would like to keep it simpler. We can just consider 1 PVF worker per core per relay chain block -> so we're limited by 6s of execution. But, since these
Can you explain the I am just thinking that a
It doesn't, but it sounds like you're talking about cores (or groups?) which can process more than 2s of execution time. Once we pass a certain threshold it's pretty hard for a parachain to keep up, which I think is OK.
Yes, its
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/1
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/3
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/8
The idea evolved to push the bundling in the other direction: to builders/collators. Candidate Descriptor changes are already being worked on in the course of #1829.
Although less dynamic, that's less relay chain complexity, and probably the right choice.
Backers are no longer responsible for cores, but for core groups. A core group
consists of e.g. 4 cores. The backers will then try to back 4 cores at once and
bundle them up into a single bundle they vote on. There will be a single
backing vote per backer per bundle. If a PoV turns out invalid the backer will
simply not put it into the bundle.
Cores will then only matter for scheduling and for backers in the sense that
they are bundling PoVs up. Likely backers will communicate about individual PoVs
with each other to build up those bundles efficiently.
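A minimal sketch of that per-bundle backing flow, with hypothetical types and a stubbed-out validation step (the real statement format, signatures, and networking are omitted):

```rust
// A candidate occupying one of the cores in the backer's core group.
struct Candidate {
    core: u32,
    pov: Vec<u8>,
}

// One backing vote per backer per bundle; in reality this would carry a
// signature over everything included in the bundle.
struct BundleStatement {
    backed_cores: Vec<u32>,
}

// Stand-in for running the PVF against the candidate's PoV.
fn validate(candidate: &Candidate) -> bool {
    !candidate.pov.is_empty()
}

/// Try to back all candidates assigned to the group's cores at once,
/// simply leaving any invalid PoV out of the bundle.
fn back_bundle(candidates: &[Candidate]) -> BundleStatement {
    BundleStatement {
        backed_cores: candidates
            .iter()
            .filter(|c| validate(c))
            .map(|c| c.core)
            .collect(),
    }
}

fn main() {
    // e.g. a core group of 4 cores, where core 2's PoV turns out invalid.
    let candidates: Vec<Candidate> = (0..4)
        .map(|core| Candidate {
            core,
            pov: if core == 2 { vec![] } else { vec![0u8; 16] },
        })
        .collect();
    let statement = back_bundle(&candidates);
    assert_eq!(statement.backed_cores, vec![0, 1, 3]);
}
```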
From then on we have statements per bundle, bitfields where bits represent those
bundles and approvals/assignments that operate on bundles. Basically with a
bundle size of 2, we should be as good with 6s block times as we were with 12s,
with a bundle size of 4 we are even twice as good and should be able to support
twice as many parachains as we can support now. This optimization does optimize
the whole stack:
more meaningful.
The nice thing is, we can scale bundles with machines becoming more powerful - we
are no longer bottlenecking on network and constant overhead.
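Rough arithmetic behind that claim (illustrative core count; the point is that per-item constant overhead is paid per bundle, not per candidate):

```rust
fn main() {
    // Say 48 occupied cores' worth of candidates per relay-chain block.
    let cores = 48u32;

    // 12s blocks without bundling: 48 statements/bitfield bits/assignments
    // every 12 seconds, i.e. 4 per second.
    let unbundled_per_sec = cores / 12;

    // 6s blocks with bundles of size 2: 24 items every 6 seconds, still 4
    // per second, so 6s block times cost what 12s used to.
    let bundle2_per_sec = (cores / 2) / 6;
    assert_eq!(unbundled_per_sec, bundle2_per_sec);

    // Bundles of size 4 halve that again: room for roughly twice as many
    // parachains at the same constant overhead.
    let bundle4_per_sec = (cores / 4) / 6;
    assert_eq!(bundle4_per_sec * 2, bundle2_per_sec);
}
```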
Roadmap:
of half with bundles of size 4!
### Further Scaling Extensions
We can have big and little cores. A big core actually represents a physical core, meaning PoVs of two big cores will be executed in parallel. But we can
also have smaller cores, with half the execution time for example. Then groups would have
cores of different sizes. E.g. a core group with 4 big/full cores, another could
have 8 little cores (each offering half the execution time). Parachains can
order big/little/tiny cores as they see fit.
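A sketch of how mixed core sizes inside a group could be represented (hypothetical enum; the 2s full-core budget is just the example figure used earlier in the thread):

```rust
use std::time::Duration;

// A big core represents a full physical core; a little core offers half
// the execution time (hypothetical representation).
#[derive(Clone, Copy)]
enum CoreSize {
    Big,
    Little,
}

impl CoreSize {
    fn execution_time(self) -> Duration {
        match self {
            // e.g. a 2s PVF execution budget for a big/full core...
            CoreSize::Big => Duration::from_secs(2),
            // ...and half of that for a little core.
            CoreSize::Little => Duration::from_secs(1),
        }
    }
}

fn group_budget(cores: &[CoreSize]) -> Duration {
    cores.iter().map(|c| c.execution_time()).sum()
}

fn main() {
    // A group of 4 big/full cores offers the same total execution time as
    // a group of 8 little cores, matching the example above.
    assert_eq!(
        group_budget(&[CoreSize::Big; 4]),
        group_budget(&[CoreSize::Little; 8]),
    );
}
```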
Path for a parachain: a parachain with little blockspace demand can take a small core; if demand grows it can scale up to a big core; if that is still not sufficient it can use boost, taking two cores for parallel execution of two PoVs at once.