Offload PVF code decompression to a separate task on the blocking pool #5071
Comments
@sandreim @alexggh @alindima I need a bit of advice here. The idiomatic approach to run a lot of CPU-bound tasks is to do that on the blocking pool.
As @bkchr says, just spawn a blocking task. I did the same in availability recovery; you can use it as an example.
Okay, I went through the code once again, and it doesn't make much sense to me 🫠 To my mind, it should be refactored. The PVF decompression should be performed by the preparation worker, and the PoV decompression should go to the execution worker. The PVF host should return the observed decompressed sizes to the candidate validation subsys (at the end of the day, it doesn't matter whether we observe the sizes before the validation or after it). As I see it, the only drawback of this approach is that decompression times will be included in the preparation and execution timeouts. But they should be negligible anyway? Do we maybe have benchmarks of how fast zstd decompression is on the reference hardware? Any objections to this approach?
Not sure why you think it is not used; the code shows that the decompressed blob is being passed in.
Exactly. But my point is that everything that is currently done with the compressed data in the candidate validation subsys may be done in the PVF preparation/execution workers as well, and nothing will change. The candidate validation subsys doesn't need that data decompressed for itself; it decompresses it for the PVF host.
I see, this is a nice micro-optimization. Its impact depends on the number of pending validation requests in the queue and on how long decompression takes. I think the next step should be to measure whether decompression is significantly stalling the candidate validation loop.
After we merged #4791, we decompress the same PVF one more time when we prepare the node for the active set, which looks suboptimal. This can be solved with a slight refactoring of the message arguments. But does that mean we no longer need the subsystem on the blocking pool?
I'm currently working on a PR that moves all the decompression to PVF host workers, stay tuned :)
Closes #5071. This PR aims to:

* Move all the blocking decompression from the candidate validation subsystem to the PVF host workers;
* Run the candidate validation subsystem on the non-blocking pool again.

Upsides: no blocking operations in the subsystem's main loop. PVF throughput is not limited by the subsystem's ability to decompress a lot of stuff. Correctness and homogeneity improve: the artifact used to be identified by the hash of the decompressed code, and now it is identified by the hash of the compressed code, which coincides with the on-chain `ValidationCodeHash`.

Downsides: the PVF code decompression is now accounted for in the PVF preparation timeout (be it pre-checking or actual preparation). Taking into account that the decompression duration is on the order of milliseconds and the preparation timeout is on the order of seconds, I believe it is negligible.
In #3122, we moved the whole candidate validation subsystem to the blocking pool, as it performs PVF code decompression, which is a blocking task, although a small one. That wasn't a perfect solution, but in the context of the problem space as it looked at that point, it was acceptable. Now, with agile coretime and #5012 around the corner, we must be ready to prepare many more PVFs than before, and the problem of blocking candidate validation with decompressions is much more concerning.
We should offload that work to a separate task to allow candidate validation subsys to await on it asynchronously.