Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polkadot node raising disputes because of Execution aborted due to trap: wasm trap: wasm unreachable` #5413

Closed
alexggh opened this issue Aug 20, 2024 · 7 comments

Comments

@alexggh
Copy link
Contributor

alexggh commented Aug 20, 2024

The following node was raising a lot of disputes on parachain 2032 because of the bellow error, full logs attached:

Aug 16 01:21:42 n36bc68 polkadot[2629030]: WASM backtrace:
Aug 16 01:21:42 n36bc68 polkadot[2629030]: error while executing at wasm backtrace:
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     0: 0x75b2ad - <unknown>!<wasm function 9243>
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     5: 0x4f605a - <unknown>!validate_block
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     6: 0x7582d3 - <unknown>!<wasm function 8889> artifact_id=ArtifactId { code_hash: 0xff0fd1ba1d31b1f05c66ffff10b83d6f6af6c28c006b2ce79dfad246c92ea0f5, executor_params_prep_hash: 0x50b16a713c0f8774ba3a0757c722b48c6eb4c1972f6e58c54fd0851f21a08633 } worker=Worker(1v1) worker_rip=false
Aug 16 01:21:42 n36bc68 polkadot[2629030]: 2024-08-16 01:21:42 Failed to validate candidate para_id=Id(2032) error=Invalid(WorkerReportedInvalid("execute: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed\nWASM backtrace:\nerror while executing at wasm backtrace:\n    0: 0x75b2ad - <unknown>!<wasm function 9243>\n    1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847\n    2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc\n    3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9\n    4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7\n    5: 0x4f605a - <unknown>!validate_block\n    6: 0x7582d3 - <unknown>!<wasm function 8889>"))
Aug 16 01:21:42 n36bc68 polkadot[2629030]: 2024-08-16 01:21:42 Detected invalid candidate as an approval checker. reason=ExecutionError("execute: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed\nWASM backtrace:\nerror while executing at wasm backtrace:\n    0: 0x75b2ad - <unknown>!<wasm function 9243>\n    1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847\n    2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc\n    3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9\n    4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7\n    5: 0x4f605a - <unknown>!validate_block\n    6: 0x7582d3 - <unknown>!<wasm function 8889>") candidate_hash=0xb745185b6df2efc515ad8ee0b5551afe2f449592a3a110584eb862e7ebc92b98 para_id=Id(2032) traceID=243607485733499342685943594883350862590
Aug 16 01:21:44 n36bc68 polkadot[2629030]: 2024-08-16 01:21:44 💤 Idle (40 peers), best: #22109111 (0xa153…0270), finalized #22109108 (0x415f…ae66), ⬇ 1.1MiB/s ⬆ 866.4kiB/s

carpediem.txt

Point of contact: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17240809840JLZsS:parity.io?via=parity.io&via=corepaper.org&via=matrix.org.

Related to: https://github.com/paritytech/project-mythical/issues/213

@bkchr
Copy link
Member

bkchr commented Aug 20, 2024

Looks like the PoV is invalid?

@alexggh
Copy link
Contributor Author

alexggh commented Aug 20, 2024

These are included candidates and the node raising the dispute because of this reason was the only one voting invalid, so I'm not sure how the PoV can be invalid.

@Jsdiem
Copy link

Jsdiem commented Aug 20, 2024

Hello, here are some more data that you requested:
Version at that time was 1.15.0, right now it is 1.15.1.
Hardware specs are dedicated server with 6c/12t CPU, 64 GB of RAM and 2 x 1 TB NVME. Bare metal with only Polkadot running on it, didn't have any issues before.
Command line I'm using to start a node is: ExecStart=/home/admin/polkadot-sdk/target/release/polkadot --validator --name "Carpediem" --chain=polkadot --database=paritydb --sync=warp --pruning=1000 --telemetry-url 'wss://telemetry-backend.w3f.community/submit/ 1'
I have a script running on the node that restarts polkadot every 60 hours, but it seems that it didn't help and that node came back to normal itself. You can check the attached log.
out.zip

@alexggh
Copy link
Contributor Author

alexggh commented Aug 21, 2024

@alexggh
Copy link
Contributor Author

alexggh commented Aug 22, 2024

Look a bit more closer on the provided logs, some timeline:

  1. Node starts out of the sudden to fail validation for all block on parachain 2032
Aug 16 00:43:07 n36bc68 polkadot[2629030]: 2024-08-16 00:43:06 Failed to validate candidate para_id=Id(2032) error=Invalid(WorkerReportedInvalid(
  1. Last validation failure:
Aug 16 11:45:48 n36bc68 polkadot[2629030]: 2024-08-16 11:45:48 Failed to validate candidate para_id=Id(2032) 
  1. Node gets restarted because of this:
Aug 16 11:49:20 n36bc68 polkadot[2629030]: 2024-08-16 11:49:20 🥩 Error: RequestsReceiverStreamClosed. Terminating.
Aug 16 11:49:21 n36bc68 polkadot[2629030]: 2024-08-16 11:49:21 Essential task `beefy-gadget` failed. Shutting down service.
Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 Parity Polkadot
Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 ✌️  version 1.15.0-743dc632fd6
  1. Node recovered and does not fail to validate parachain 2032

Possible root-cause

The node actually recovered because of a restart and given that after restart all PVFs get recompiled I tend to think the problem might have been caused by a corruption of the PVF artefact for parachain 2032 which then lead it to fail validation on all candidates for that parachain.

@alexggh
Copy link
Contributor Author

alexggh commented Aug 22, 2024

With a zombienet I managed to get the validator into a state where it continuously disputes candidates for a given parachain by manually corrupting its PVF artefact, so from the data we got I'm inclined to think that's what happened here.

Also, I'm not sure there is any new data that we can obtained at this moment, because once the node restarts the artefacts are gone, so the next steps here would be:

  1. Mitigate and address this corruption possibility with PVF: consider adding a checksum for artifacts #5441 or something similar.

  2. @Jsdiem keep an eye on your node, if you seen this error coming back, although if the theory is right it is unlikely it will, please archive parachains and pvf-artifacts folders from the database path and share them with us.

@alexggh
Copy link
Contributor Author

alexggh commented Aug 27, 2024

Closing this mitigation should be implemented with: #5441.

@alexggh alexggh closed this as completed Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants