
PVF: consider adding a checksum for artifacts #5441

Closed
sandreim opened this issue Aug 22, 2024 · 10 comments
Labels: I5-enhancement (an additional feature request)

@sandreim (Contributor) commented Aug 22, 2024

... related to #5413 (comment)

The checksum should only be stored after successful validation of a candidate. It should then be checked before the PVF artifact is used to validate a candidate. If it differs, we recompile the artifact and then validate the candidate. If validation still fails after recompilation, we emit an error and stop validating with that artifact.
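
A minimal sketch of that flow (an editor's illustration, not actual polkadot-sdk code: `recompile_artifact` and `execute_candidate` are hypothetical stand-ins, and the checksum function is just a placeholder):

```rust
use std::{fs, io, path::Path};

/// Placeholder checksum; which hash to use is exactly the open question here.
fn checksum(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Stand-in for the real PVF preparation pipeline.
fn recompile_artifact(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Stand-in for executing a candidate against the artifact.
fn execute_candidate(_artifact: &[u8]) -> bool {
    true
}

/// Check the stored checksum before using the artifact; on mismatch,
/// recompile once and retry, and give up on the artifact if that also fails.
fn validate_candidate(path: &Path, stored: u64) -> io::Result<bool> {
    let artifact = fs::read(path)?;
    if checksum(&artifact) == stored {
        return Ok(execute_candidate(&artifact));
    }
    // Checksum mismatch: the artifact is corrupted; recompile and retry once.
    let artifact = recompile_artifact(path)?;
    if execute_candidate(&artifact) {
        // Per the proposal, the new checksum would be stored only now,
        // after successful validation.
        Ok(true)
    } else {
        // Still failing after recompilation: stop using this artifact.
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "validation still fails after recompilation",
        ))
    }
}
```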

Any thoughts about this, @s0me0ne-unkn0wn @alexggh?

@sandreim added the I5-enhancement label on Aug 22, 2024
@alexggh (Contributor) commented Aug 22, 2024

> Any thoughts about this, @s0me0ne-unkn0wn @alexggh?

My first thought is that checksumming the entire PVF every time could prove expensive. However, I don't see any reason we can't do it periodically and clean up corrupted artifacts: that way the validator recovers quickly if we hit such a condition, and we don't pay the price of checksumming all the time.
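
As an illustration of the periodic variant (an editor's sketch; the artifact directory layout and `.checksum` sidecar files are invented for the example, and the checksum is a placeholder):

```rust
use std::{
    collections::hash_map::DefaultHasher,
    fs,
    hash::{Hash, Hasher},
    path::Path,
    thread,
    time::Duration,
};

/// Placeholder checksum.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Remove any artifact whose bytes no longer match its recorded checksum,
/// so it gets recompiled on next use instead of producing bogus results.
fn sweep_artifacts(dir: &Path) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.extension().map_or(false, |ext| ext == "pvf") {
            let sidecar = path.with_extension("checksum");
            let (Ok(bytes), Ok(stored)) =
                (fs::read(&path), fs::read_to_string(&sidecar))
            else {
                continue;
            };
            if stored.trim() != checksum(&bytes).to_string() {
                let _ = fs::remove_file(&path);
                let _ = fs::remove_file(&sidecar);
            }
        }
    }
}

fn main() {
    loop {
        sweep_artifacts(Path::new("/tmp/pvf-artifacts"));
        thread::sleep(Duration::from_secs(600)); // sweep every ten minutes
    }
}
```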

@sandreim (Contributor Author)

Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB (relay chain block), it shouldn’t be much overhead IMO.

@alexggh (Contributor) commented Aug 22, 2024

> Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB (relay chain block), it shouldn’t be much overhead IMO.

The largest Kusama PVF is around 50 MiB (the smallest is 20 MiB). SHA-1 over it seems to take around 50 ms on reference hardware; given that most PVF executions on Kusama are below 500 ms, that could be around 10% overhead.
For 10 validations per block that's an extra 500 ms.

I wouldn't want to pay this price all the time just to fix this edge case; maybe we could check it only for PVFs that fail validation, as a way to recover the node as fast as possible.

@sandreim (Contributor Author)

SHA-1 is quite expensive; wouldn't good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.

@eskimor (Member) commented Aug 22, 2024

I think we already had an issue for this, and the idea for not paying the overhead on the happy path was (see the sketch after this list):

  1. Just run it; if it fails, only raise a dispute after checking the checksum.
  2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.
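
A rough sketch of that two-step flow (an editor's illustration; `execute_pvf` and the checksum are placeholders, and `Outcome` is an invented type, not a polkadot-sdk API):

```rust
/// Invented result type for the sketch.
enum Outcome {
    Valid,
    Invalid,         // genuine failure with an intact artifact: safe to dispute
    CorruptArtifact, // checksum mismatch: clean up and recompile, don't dispute
}

/// Stand-in for executing a candidate against the artifact.
fn execute_pvf(_artifact: &[u8]) -> bool {
    true
}

/// Placeholder checksum.
fn checksum(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn validate(artifact: &[u8], stored: u64) -> Outcome {
    if execute_pvf(artifact) {
        // Happy path: no hashing cost at all.
        return Outcome::Valid;
    }
    // Only on failure do we pay for the checksum.
    if checksum(artifact) == stored {
        Outcome::Invalid
    } else {
        eprintln!("WARNING: PVF artifact corrupted on disk; cleaning up");
        Outcome::CorruptArtifact
    }
}
```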

@alexggh (Contributor) commented Aug 22, 2024

> SHA-1 is quite expensive; wouldn't good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.

I checked the performance of https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_BZIP2.html and https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_CKSUM.html

I'm a bit surprised, but on this 50 MiB file it actually performs worse than SHA-1: around 100 ms.
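
For reference, this kind of measurement can be reproduced along these lines (an editor's sketch using the `sha1` and `crc` crates and a synthetic 50 MiB buffer; a single cold run, not a rigorous benchmark):

```rust
use std::time::Instant;

use crc::{Crc, CRC_32_BZIP2};
use sha1::{Digest, Sha1};

fn main() {
    // Synthetic stand-in for a 50 MiB PVF artifact.
    let data: Vec<u8> = (0..50 * 1024 * 1024).map(|i| i as u8).collect();

    let t = Instant::now();
    let digest = Sha1::digest(&data);
    println!("sha1:  {:?} ({:x})", t.elapsed(), digest);

    let crc = Crc::<u32>::new(&CRC_32_BZIP2);
    let t = Instant::now();
    let sum = crc.checksum(&data);
    println!("crc32: {:?} ({:08x})", t.elapsed(), sum);
}
```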

@sandreim (Contributor Author)

> I think we already had an issue for this, and the idea for not paying the overhead on the happy path was:
>
>   1. Just run it; if it fails, only raise a dispute after checking the checksum.
>   2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.

Yeah, this is more efficient. However, I am surprised by the CRC32 results.

@burdges commented Aug 22, 2024

> I am surprised by the CRC32 results.

I've seen remarks that CRC32 winds up slow in practice.

> Just run it; if it fails, only raise a dispute after checking the checksum.

Yes, this makes sense.

We're likely happy to lower latency here even if it means all CPU cores work hard on it, given we're only running the check once validation fails, right?

I'd think Blake3 checks the boxes well enough: it's extremely fast thanks to being a Merkle tree, at the cost of using all available CPU cores. We don't need a cryptographic hash for disk corruption, but who knows, maybe something stranger becomes possible with compiler toolchains.
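
To put numbers on this alongside the SHA-1/CRC32 measurements above, a quick timing sketch (editor's illustration; assumes the `blake3` crate, with multi-threaded hashing via `update_rayon` behind its `rayon` feature):

```rust
use std::time::Instant;

fn main() {
    // Synthetic stand-in for a 50 MiB PVF artifact.
    let data: Vec<u8> = (0..50 * 1024 * 1024).map(|i| i as u8).collect();

    let t = Instant::now();
    let hash = blake3::hash(&data); // single-threaded
    println!("blake3 (1 thread):  {:?} ({})", t.elapsed(), hash);

    let t = Instant::now();
    let hash = blake3::Hasher::new().update_rayon(&data).finalize(); // all cores
    println!("blake3 (rayon):     {:?} ({})", t.elapsed(), hash);
}
```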

@s0me0ne-unkn0wn (Contributor)

There was a closely related discussion in #3139. I remember Jan saying that the blake3 hasher's throughput should be more than enough for any practical purpose in our case.
However, the "execute, and if it fails, check the checksum" approach makes perfect sense to me.

@sandreim (Contributor Author)

Closing in favor of #677.
