Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relay chain coretime assigner does not support more assignments than fit in a single XCM message (currently 28) #6102

Closed
seadanda opened this issue Oct 17, 2024 · 7 comments · Fixed by #6384
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. C2-good-first-issue A task for a first time contributor to become familiar with the Polkadot-SDK. D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. I2-bug The node fails to follow expected behavior. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Comments

@seadanda
Copy link
Contributor

The system allows interlacing right down to the single block level (80 assignments per timeslice, each with a CoreMask with one bit set)
The problem is that it creates a call that doesn't actually fit in an XCM message (we can fit max 28 assignments in a single XCM)

We can easily chunk that on the Coretime Chain side and send it over as four messages, however with the current design that means we need to call assign_core multiple times on the relay for a given timeslice which is disallowed by design due to some assumptions made by the scheduler.

Mitigation in the mean time:
28 assignments is the limit, but 27 assignments that don't add up to a complete mask will be rejected due to the requirement for a full mask on the relay. Therefore we take the first 27 and append an Idle assignment, taking it to 28.
This will make anybody who interlaces more than 27 times lose some assignments, but it's better than the current system, which just drops the entire core's assignments because the message is too big. Once this is missed, it's gone from the workplan and is a total mess. Far preferable to truncate and assign everything we can until we can drop some assumptions in the scheduler on the relay.

Mitigation for the Polkadot launch: polkadot-fellows/runtimes#434
Testnets mitigation: #6022

@seadanda seadanda added I2-bug The node fails to follow expected behavior. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Oct 17, 2024
@seadanda
Copy link
Contributor Author

Just copying the initial idea here that I had when this first popped up:
The likelihood of somebody interlacing down to 27 assignments is very low, so maybe something like assigning each chunk one block later than the previous could be a fix that maintains some of the assumptions in the implementation, with a potential short outage for the workloads who get the second or third chunk, but by timeslice 2 of the region they're all running as intended. Since the first 27 assignments are already on the relay, it should be possible to achieve this without any downtime.

assign_core has the signature

	pub fn assign_core(
		core_idx: CoreIndex,
		begin: BlockNumberFor<T>,
		assignments: Vec<(CoreAssignment, PartsOf57600)>,
		end_hint: Option<BlockNumberFor<T>>,
	) -> Result<(), DispatchError> {

and we just need to never call it more than once for the same core and begin combination.
so 80 assignments:
0..27 assigned on the begin
27..55 assigned on begin+1
55..80 assigned on begin+2

But we still need to change the relay logic to drop the requirement for each assign_core call to contain a fully scheduled core. As part of that we'd need to add logic in there to pad an underscheduled core with Idle, then when a further underscheduled assignment comes in within a timeslice (for example) it should try to remove the Idle padding, "append" the new parts and recompute the padding again.

@bkchr bkchr added D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. C2-good-first-issue A task for a first time contributor to become familiar with the Polkadot-SDK. labels Oct 17, 2024
@seadanda seadanda added the C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. label Oct 17, 2024
@spanow
Copy link

spanow commented Nov 4, 2024

Hello,
May I work on this issue ?

@seadanda
Copy link
Contributor Author

seadanda commented Nov 5, 2024

Hi, yes this is free to take on

@eskimor eskimor mentioned this issue Nov 6, 2024
@eskimor
Copy link
Member

eskimor commented Nov 6, 2024

@spanow have you started already? I hope not, release is getting cut today and we have the other fixes ready, so I want to get this in today: Fixed the issue here by relaxing the append requirement, it now is relaxed to be fine also if begin == last. The mask also no longer needs to be full.

@spanow
Copy link

spanow commented Nov 6, 2024

Totally understandable
It's alright, thank you for notifying me @eskimor

@eskimor
Copy link
Member

eskimor commented Nov 6, 2024

@spanow @seadanda suggested a unit test to ensure non-exhaustive (and overfull) assignments don't cause any issues. If you are interested in an even more beginner friendly task, that would still be left to do (and is not time-critical).

github-merge-queue bot pushed a commit that referenced this issue Nov 6, 2024
Relax requirements for `assign_core` so that it accepts updates for the
last scheduled entry.

Fixes #6102

---------

Co-authored-by: eskimor <eskimor@no-such-url.com>
Co-authored-by: GitHub Action <action@github.com>
@seadanda
Copy link
Contributor Author

seadanda commented Nov 6, 2024

#6397 is the issue if you're interested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. C2-good-first-issue A task for a first time contributor to become familiar with the Polkadot-SDK. D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. I2-bug The node fails to follow expected behavior. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants