
Move candidate-validation on blocking tasks #3122

Merged

Conversation

alexggh
Contributor

@alexggh alexggh commented Jan 30, 2024

Candidate validation performs a number of CPU-bound operations on its main loop, such as:

validation_code.hash()

sp_maybe_compressed_blob::decompress(
    &validation_code.0,
    VALIDATION_CODE_BOMB_LIMIT,
)

sp_maybe_compressed_blob::decompress(&pov.block_data.0, POV_BOMB_LIMIT)

let code_hash = sp_crypto_hashing::blake2_256(&code).into();

Added together, for a large PoV and validation code these operations take on the order of tens of milliseconds, and because they are CPU-bound they hog the executor thread and negatively affect the other subsystems around it. It is therefore better to move the subsystem onto the blocking pool to make sure such unexpected behaviour is avoided.
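For illustration, here is a minimal sketch (assuming a tokio multi-threaded runtime; `cpu_bound_work` is a hypothetical stand-in for the decompression and hashing above, not the actual subsystem code) of how inline CPU-bound work ties up an async worker thread and how the blocking pool avoids that:

```rust
use tokio::task;

// Hypothetical stand-in for the CPU-bound part of handling one request
// (decompressing the validation code / PoV and hashing it).
fn cpu_bound_work(blob: Vec<u8>) -> [u8; 32] {
    // Imagine tens of milliseconds of decompression + hashing for a large blob.
    let mut digest = [0u8; 32];
    for (i, byte) in blob.iter().enumerate() {
        digest[i % 32] ^= *byte;
    }
    digest
}

#[tokio::main]
async fn main() {
    let blob = vec![1u8; 1024 * 1024];

    // Running this inline occupies one executor worker thread for the whole
    // duration, starving every other future scheduled on that thread.
    let _inline = cpu_bound_work(blob.clone());

    // Handing it to the blocking pool keeps the async workers free; moving the
    // whole subsystem to a blocking task achieves the same effect wholesale.
    let _offloaded = task::spawn_blocking(move || cpu_bound_work(blob))
        .await
        .expect("blocking task panicked");
}
```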

Note! In practice this subsystem does not handle a high volume of work, so the impact is probably very low, but better safe than sorry.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh added R0-silent Changes should not be mentioned in any release notes T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jan 30, 2024
Contributor

@alindima alindima left a comment

I think this'll close: #599

@alexggh
Contributor Author

alexggh commented Jan 30, 2024

I think this'll close: #599

yes, it will.

@sandreim sandreim added this pull request to the merge queue Jan 30, 2024
@alindima alindima linked an issue Jan 30, 2024 that may be closed by this pull request
@bkchr
Member

bkchr commented Jan 30, 2024

If I understand this correctly, we move the entire subsystem to a blocking task, instead of doing it correctly and only putting the blocking operations into a blocking task? So the subsystem can block and bring down the entire node again, because it thinks that the subsystem isn't answering requests?

Merged via the queue into master with commit ff2e7db Jan 30, 2024
129 of 131 checks passed
@sandreim sandreim deleted the alexaggh/feature/make_candidate_validation_blocking branch January 30, 2024 09:47
@alexggh
Contributor Author

alexggh commented Jan 30, 2024

If I understand this correctly, we move the entire subsystem to a blocking task, instead of doing it correctly and only putting the blocking operations into a blocking task? So the subsystem can block and bring down the entire node again, because it thinks that the subsystem isn't answering requests?

The subsystem doesn't do much else; it basically does all the CPU-intensive work and then calls into the pvf-worker.
Calling spawn_blocking for each message wouldn't bring us much value, and there is also a big downside: the number of blocking threads is limited, so you need to be careful with how many you spawn before you hit this limit and block: https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads. It is doable, but it is not a panacea.
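For context, a minimal sketch (assuming a plain tokio runtime; the thread counts are only illustrative, not the node's actual configuration) of the cap referenced above:

```rust
use tokio::runtime::Builder;

fn main() {
    // The blocking pool is bounded (512 threads by default); once the cap is
    // reached, additional spawn_blocking calls queue up behind running tasks.
    let rt = Builder::new_multi_thread()
        .worker_threads(4)
        .max_blocking_threads(64) // illustrative cap
        .enable_all()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        let handle = tokio::task::spawn_blocking(|| {
            // CPU-bound work goes here; if all 64 blocking threads are busy,
            // this closure waits for one of them to free up.
            42u32
        });
        assert_eq!(handle.await.expect("blocking task panicked"), 42);
    });
}
```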

So the subsystem can block and bring down the entire node again, because it thinks that the subsystem isn't answering requests?

If by "again" you are referring to the problem on Rococo yesterday, that was caused by https://github.com/paritytech/orchestra/pull/71 and is not related to this PR at all; it is just something I noticed while investigating it.

Now, if you think this subsystem could time out because it processes its messages too slowly on a single blocking thread, that is not really a concern for me: the subsystem sees a volume of around 6-7 messages per block and has a bounded queue of around 4096 messages, the blocking work takes at most tens of milliseconds (it cannot take longer because of the MAX_POV and MAX_CODE sizes), and the work is also multi-tasked internally in a FuturesUnordered queue, so we don't fully process each message serially. So I would say this is very unlikely to happen for this subsystem.
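A minimal sketch (assuming the futures crate; `validate` is a hypothetical stand-in for driving a single validation request, not the subsystem's real code) of the FuturesUnordered pattern mentioned above:

```rust
use futures::executor::block_on;
use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical stand-in for handling one request end to end
// (decompress the blobs, then await the PVF worker's answer).
async fn validate(candidate: u32) -> u32 {
    candidate * 2
}

fn main() {
    block_on(async {
        // All in-flight requests live in a single FuturesUnordered, so the one
        // (blocking) task polls them concurrently instead of finishing one
        // message before starting the next.
        let mut in_flight: FuturesUnordered<_> = (0u32..6).map(validate).collect();

        while let Some(result) = in_flight.next().await {
            println!("validation finished: {result}");
        }
    });
}
```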

@bkchr
Member

bkchr commented Jan 30, 2024

If by "again" you are referring to the problem on Rococo yesterday

I refer to this: #1730

I have just seen the new discussion that started around removing the timeout stuff.

@alexggh
Contributor Author

alexggh commented Jan 30, 2024

If by "again" you are referring to the problem on Rococo yesterday

I refer to this: #1730

I have just seen the new discussion that started around removing the timeout stuff.

Yeah, that is definitely different from what this addresses and should be fixed by removing the timeout.

Development

Successfully merging this pull request may close these issues.

Move ZSTD PVF decompression to a blocking task
6 participants