
PVF: consider adding a checksum for artifacts #5441

Closed
sandreim opened this issue Aug 22, 2024 · 10 comments
Labels: I5-enhancement (an additional feature request)

@sandreim (Contributor) commented Aug 22, 2024

... related to #5413 (comment)

The checksum should only be stored after successful validation of a candidate. It should then be checked before the PVF artifact is used to validate a candidate. If it differs, we recompile the artifact and then validate the candidate. If validation still fails after recompilation, we emit an error and stop validating with that artifact.
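
A minimal sketch of that flow (an editor's illustration, not actual polkadot-sdk code: `recompile_artifact` and `execute_candidate` are hypothetical stand-ins, and the checksum function is just a placeholder):

```rust
use std::{fs, io, path::Path};

/// Placeholder checksum; which hash to use is exactly the open question here.
fn checksum(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Stand-in for the real PVF preparation pipeline.
fn recompile_artifact(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Stand-in for executing a candidate against the artifact.
fn execute_candidate(_artifact: &[u8]) -> bool {
    true
}

/// Check the stored checksum before using the artifact; on mismatch,
/// recompile once and retry, and give up on the artifact if that also fails.
fn validate_candidate(path: &Path, stored: u64) -> io::Result<bool> {
    let artifact = fs::read(path)?;
    if checksum(&artifact) == stored {
        return Ok(execute_candidate(&artifact));
    }
    // Checksum mismatch: the artifact is corrupted; recompile and retry once.
    let artifact = recompile_artifact(path)?;
    if execute_candidate(&artifact) {
        // Per the proposal, the new checksum would be stored only now,
        // after successful validation.
        Ok(true)
    } else {
        // Still failing after recompilation: stop using this artifact.
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "validation still fails after recompilation",
        ))
    }
}
```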

Any thoughts about this, @s0me0ne-unkn0wn @alexggh?

@sandreim added the I5-enhancement label on Aug 22, 2024
@alexggh (Contributor) commented Aug 22, 2024

> Any thoughts about this, @s0me0ne-unkn0wn @alexggh?

My first thought is that checksumming the entire PVF every time could prove expensive. However, I don't see any reason we can't do it periodically and clean up corrupted artifacts: that way the validator recovers quickly if we hit such a condition, and we don't pay the price of checksumming all the time.
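
As an illustration of the periodic variant (an editor's sketch; the artifact directory layout and `.checksum` sidecar files are invented for the example, and the checksum is a placeholder):

```rust
use std::{
    collections::hash_map::DefaultHasher,
    fs,
    hash::{Hash, Hasher},
    path::Path,
    thread,
    time::Duration,
};

/// Placeholder checksum.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Remove any artifact whose bytes no longer match its recorded checksum,
/// so it gets recompiled on next use instead of producing bogus results.
fn sweep_artifacts(dir: &Path) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.extension().map_or(false, |ext| ext == "pvf") {
            let sidecar = path.with_extension("checksum");
            let (Ok(bytes), Ok(stored)) =
                (fs::read(&path), fs::read_to_string(&sidecar))
            else {
                continue;
            };
            if stored.trim() != checksum(&bytes).to_string() {
                let _ = fs::remove_file(&path);
                let _ = fs::remove_file(&sidecar);
            }
        }
    }
}

fn main() {
    loop {
        sweep_artifacts(Path::new("/tmp/pvf-artifacts"));
        thread::sleep(Duration::from_secs(600)); // sweep every ten minutes
    }
}
```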

@sandreim (Contributor Author)

Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB (relay chain block), it shouldn’t be much overhead IMO.

@alexggh (Contributor) commented Aug 22, 2024

> Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB (relay chain block), it shouldn’t be much overhead IMO.

The largest Kusama PVF is around 50 MiB (the smallest is 20 MiB). SHA-1 over it seems to take around 50 ms on reference hardware; given that most PVF executions on Kusama are below 500 ms, that could be around 10% overhead.
For 10 validations per block that's an extra 500 ms.

I wouldn't want to pay this price all the time just to fix this edge case; maybe we could check it only for PVFs that fail validation, as a way to recover the node as fast as possible.

@sandreim (Contributor Author)

SHA-1 is quite expensive; wouldn't good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.

@eskimor (Member) commented Aug 22, 2024

I think we already had an issue for this, and the idea for not paying the overhead on the happy path was (see the sketch after this list):

  1. Just run it; if it fails, only raise a dispute after checking the checksum.
  2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.
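
A rough sketch of that two-step flow (an editor's illustration; `execute_pvf` and the checksum are placeholders, and `Outcome` is an invented type, not a polkadot-sdk API):

```rust
/// Invented result type for the sketch.
enum Outcome {
    Valid,
    Invalid,         // genuine failure with an intact artifact: safe to dispute
    CorruptArtifact, // checksum mismatch: clean up and recompile, don't dispute
}

/// Stand-in for executing a candidate against the artifact.
fn execute_pvf(_artifact: &[u8]) -> bool {
    true
}

/// Placeholder checksum.
fn checksum(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn validate(artifact: &[u8], stored: u64) -> Outcome {
    if execute_pvf(artifact) {
        // Happy path: no hashing cost at all.
        return Outcome::Valid;
    }
    // Only on failure do we pay for the checksum.
    if checksum(artifact) == stored {
        Outcome::Invalid
    } else {
        eprintln!("WARNING: PVF artifact corrupted on disk; cleaning up");
        Outcome::CorruptArtifact
    }
}
```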

@alexggh (Contributor) commented Aug 22, 2024

> SHA-1 is quite expensive; wouldn't good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.

I checked the performance of https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_BZIP2.html and https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_CKSUM.html

I'm a bit surprised, but on this 50 MiB file it actually performs worse than SHA-1: around 100 ms.
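
For reference, this kind of measurement can be reproduced along these lines (an editor's sketch using the `sha1` and `crc` crates and a synthetic 50 MiB buffer; a single cold run, not a rigorous benchmark):

```rust
use std::time::Instant;

use crc::{Crc, CRC_32_BZIP2};
use sha1::{Digest, Sha1};

fn main() {
    // Synthetic stand-in for a 50 MiB PVF artifact.
    let data: Vec<u8> = (0..50 * 1024 * 1024).map(|i| i as u8).collect();

    let t = Instant::now();
    let digest = Sha1::digest(&data);
    println!("sha1:  {:?} ({:x})", t.elapsed(), digest);

    let crc = Crc::<u32>::new(&CRC_32_BZIP2);
    let t = Instant::now();
    let sum = crc.checksum(&data);
    println!("crc32: {:?} ({:08x})", t.elapsed(), sum);
}
```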

@sandreim (Contributor Author)

> I think we already had an issue for this, and the idea for not paying the overhead on the happy path was:
>
>   1. Just run it; if it fails, only raise a dispute after checking the checksum.
>   2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.

Yeah, this is more efficient. However, I am surprised by the CRC32 results.

@burdges commented Aug 22, 2024

> I am surprised by the CRC32 results.

I've seen remarks that CRC32 winds up slow in practice.

> Just run it; if it fails, only raise a dispute after checking the checksum.

Yes, this makes sense.

We're likely happy to lower latency here even if it means all CPU cores work hard on it, given we're only running the check once validation fails, right?

I'd think Blake3 checks the boxes well enough: it's extremely fast thanks to being a Merkle tree, at the cost of using all available CPU cores. We don't need a cryptographic hash for disk corruption, but who knows, maybe something stranger becomes possible with compiler toolchains.
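
To put numbers on this alongside the SHA-1/CRC32 measurements above, a quick timing sketch (editor's illustration; assumes the `blake3` crate, with multi-threaded hashing via `update_rayon` behind its `rayon` feature):

```rust
use std::time::Instant;

fn main() {
    // Synthetic stand-in for a 50 MiB PVF artifact.
    let data: Vec<u8> = (0..50 * 1024 * 1024).map(|i| i as u8).collect();

    let t = Instant::now();
    let hash = blake3::hash(&data); // single-threaded
    println!("blake3 (1 thread):  {:?} ({})", t.elapsed(), hash);

    let t = Instant::now();
    let hash = blake3::Hasher::new().update_rayon(&data).finalize(); // all cores
    println!("blake3 (rayon):     {:?} ({})", t.elapsed(), hash);
}
```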

@s0me0ne-unkn0wn (Contributor)

There was a closely related discussion in #3139. I remember Jan saying that the blake3 hasher's throughput should be more than enough for any practical purpose in our case.
However, the "execute, and if it fails, check the checksum" approach makes perfect sense to me.

@sandreim (Contributor Author)

Closing in favor of #677.
