Retry timeout #2438

eagr · 2023-11-22T05:31:45Z

closes #660

Is the retry delay being used (200ms) good enough for the retry for job timeout?

Btw, how serious are you with the SchrodingersCandidate? I got zero problem with the name. :)) @mrcnski @s0me0ne-unkn0wn

paritytech-cicd-pr · 2023-11-22T06:01:04Z

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 2/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4429721

mrcnski · 2023-11-22T08:05:58Z

Btw, how serious are you with the SchrodingersCandidate? I got zero problem with the name. :)) @mrcnski @s0me0ne-unkn0wn

I love that you asked this. 😂 I vote "yes", only question is, should we include the umlaut?

mrcnski · 2023-11-22T08:11:19Z

polkadot/node/core/pvf/src/execute/queue.rs

+		// Either the worker or the job timed out. Kill the worker in either case. Treated as
+		// definitely-invalid, because if we timed out, there's no time left for a retry.


Could be reworded as it's potentially confusing. It is only definitely-invalid in backing where we don't retry the possibly-invalid candidates. In backing there's no time for a retry (and it's stricter).

mrcnski · 2023-11-22T08:51:57Z

Hmm, actually I'm not sure if we should retry, at least with the current timeout which is DEFAULT_APPROVAL_EXECUTION_TIMEOUT = 12. That's already a long time given for execution.

One idea is to allow a retry but cut that timeout in half. The idea is that if execution is taking 6 seconds (which is already a long time) then something went wrong (i.e. maybe the worker/job communication broke down).

And speaking of communication, the host/worker comms (note, not worker/job comm) have a really long timeout (like factor of 4 of the default timeout). We shouldn't retry if that fails because there'd just be no time left. But that long timeout needs to remain long, because it's wall clock time, whereas in the job we timeout based on CPU clock time. (That probably sounds really confusing, so see paritytech/polkadot#4217.)

But getting back to cutting the approval timeout in half, I wonder if opens up some attacks, like candidates getting through backing but timing out on some validators with slower machines, leading to disputes. 6 seconds would already be 3x the backing timeout, but I can see it happening. cc @eskimor

eagr · 2023-11-22T09:37:10Z

You have a change of mind now or did I misunderstood your comment?

12s is a really long time. How often does a timeout happen on a valid candidate in practice? If it's extremely rare, I suppose we shouldn't. In that case, I'm happy to close this as it no longer makes sense to switch the category.

mrcnski · 2023-11-22T09:45:44Z

You understood correctly, I just forgot that we had bumped the approval timeout to 12s.

How often does a timeout happen on a valid candidate in practice?

Not sure, I suppose my concerns are more theoretical in nature. Like if there is a validator that is 3x slower than the rest, it can potentially be exploited to create disputes. I think in practice it's more likely that an approval timeout happens due to some local conditions, like some deadlock, comm bug, etc... though still not very likely.

Maybe some compromise is possible, like 8s + a retry? Not sure, @eskimor would have a better idea.

eskimor · 2023-11-22T11:28:19Z

Definitely no to shortening the timeout. We have measured fluctuations on the same machine up to a factor of 3. An additional factor of 2 does not seem to be too conservative, given that validators don't run identical machines.

Whether a retry on timeout makes sense is an interesting question:

We would be too slow to vote and thus approval voting would escalate because of our no-show. This is an issue, as without the retry we would have raised a dispute - which is a full escalation, but in that case at least someone will pay the bill.
Retrying increases the chances of us voting valid on a candidate. If a super majority finds this borderline candidate to be invalid, we just increased our chances of getting slashed significantly. Although if everybody does this, then the chances of a super majority finding the candidate invalid also gets smaller. On top of this with Slash approval voters on approving invalid blocks - dynamically #635 in place, collectively it does not matter much if 1 or 10 validators voted valid on an invalid candidate.

2 is not much of a reason, but 1 means it is free to trigger an escalation (and break sharding). So retry on timeout is not an option ... in fact if we had the time, I don't see why we would ever retry instead of just increasing the timeout to begin with. What would be the value of retrying, instead of increasing the timeout? If you restart - any work already done (even if little) would go to waste. While with a longer timeout you keep it and have a better success chance, given the same time constraints.

eskimor · 2023-11-22T11:32:18Z

12s is a really long time. How often does a timeout happen on a valid candidate in practice?

It is.

If it's extremely rare

We at least hope so. That's the whole point of that long timeout. We actually should increase the noshow timeout because of above reasoning, as execution time alone is not the full story.

For handling dishonest backers, we need time disputes.

mrcnski · 2023-11-22T13:23:48Z

We have measured fluctuations on the same machine up to a factor of 3.

Wow, a 3x fluctuation on the same machine? This must be wall clock time? The timeouts in question are CPU time. But still, I agree that we should have a large margin for differently-performing validators.

in fact if we had the time, I don't see why we would ever retry instead of just increasing the timeout to begin with. What would be the value of retrying, instead of increasing the timeout?

If execution is taking that long, then it could be due to some deadlock etc. Waiting a second and then retrying could avoid some local nondeterminism. Then again, I can't think of any specific internal scenarios that would cause this to occur, except for high load on the machine, in which case retrying could make things worse.

Maybe, there is some way for the node to query whether the execution job is at least making progress as opposed to spinning in a loop. But I might be getting into the weeds here.

eagr · 2023-11-22T13:34:57Z

Thanks for the explanation and pointers! @eskimor Those are so informative. Curious, what difference does it make if we no-show for longer time, if any? Say we double the timeout.

You could just close this if there's no reason to proceed. :) @mrcnski

burdges · 2023-11-22T13:51:37Z

I've forgotten our last discussion, but I think..

We designed time disputes aka overtime billing so that candidate validation timeouts could be increased to 30s or 1 minute, which reduces our risks from execution times differing. We've maybe deprioritized actually doing this until polkavm reaches some maturity. If polkavm becomes twice as fast then maybe we'll have deterministic metering via the jan's floating point overflow trick. It's possible jan's floating point overflow trick work for metering wasmtime too. We'll restart the overtime billing plan though if the floating point trick becomes dubious.

mrcnski · 2023-11-22T14:07:50Z

If polkavm becomes twice as fast

Looks like it just happened.

mrcnski · 2023-11-22T18:13:34Z

Closing. Thanks for the hustle @eagr and apologies for the spurious suggestion! Will also close the related issue as there's nothing left to do.

mrcnski · 2023-11-23T07:50:39Z

Maybe, there is some way for the node to query whether the execution job is at least making progress as opposed to spinning in a loop.

I suppose it can be done in PVM by checking if gas is going up. Maybe overly paranoid. I wouldn't do it unless timeouts become a source of issues.

* Asynchronous backing PR (#2300) * Update substrate & polkadot * min changes to make async backing compile * (async backing) parachain-system: track limitations for unincluded blocks (#2438) * unincluded segment draft * read para head from storage proof * read_para_head -> read_included_para_head * Provide pub interface * add errors * fix unincluded segment update * BlockTracker -> Ancestor * add a dmp limit * Read para head depending on the storage switch * doc comments * storage items docs * add a sanity check on block initialize * Check watermark * append to the segment on block finalize * Move segment update into set_validation_data * Resolve para head todo * option watermark * fix comment * Drop dmq check * fix weight * doc-comments on inherent invariant * Remove TODO * add todo * primitives tests * pallet tests * doc comments * refactor unincluded segment length into a ConsensusHook (#2501) * refactor unincluded segment length into a ConsensusHook * add docs * refactor bandwidth_out calculation Co-authored-by: Chris Sosnin <48099298+slumber@users.noreply.github.com> * test for limits from impl * fmt * make tests compile * update comment * uncomment test * fix collator test by adding parent to state proof * patch HRMP watermark rules for unincluded segment * get consensus-common tests to pass, using unincluded segment * fix unincluded segment tests * get all tests passing * fmt * rustdoc CI * aura-ext: limit the number of authored blocks per slot (#2551) * aura_ext consensus hook * reverse dependency * include weight into hook * fix tests * remove stray println Co-authored-by: Chris Sosnin <48099298+slumber@users.noreply.github.com> * fix test warning * fix doc link --------- Co-authored-by: Chris Sosnin <48099298+slumber@users.noreply.github.com> Co-authored-by: Chris Sosnin <chris125_@live.com> * parachain-system: ignore go ahead signal once upgrade is processed (#2594) * handle goahead signal for unincluded segment * doc comment * add test * parachain-system: drop processed messages from inherent data (#2590) * implement `drop_processed_messages` * drop messages based on relay parent number * adjust tests * drop changes to mqc * fix comment * drop test * drop more dead code * clippy * aura-ext: check slot in consensus hook and remove all `CheckInherents` logic (#2658) * aura-ext: check slot in consensus hook * convert relay chain slot * Make relay chain slot duration generic * use fixed velocity hook for pallets with aura * purge timestamp inherent * fix warning * adjust runtime tests * fix slots in tests * Make `xcm-emulator` test pass for new consensus hook (#2722) * add pallets on_initialize * tests pass * add AuraExt on_init * ".git/.scripts/commands/fmt/fmt.sh" --------- Co-authored-by: command-bot <> --------- Co-authored-by: Ignacio Palacios <ignacio.palacios.santos@gmail.com> * update polkadot git refs * CollationGenerationConfig closure is now optional (#2772) * CollationGenerationConfig closure is now optional * fix test * propagate network-protocol-staging feature (#2899) * Feature Flagging Consensus Hook Type Parameter (#2911) * First pass * fmt * Added as default feature in tomls * Changed to direct dependency feature * Dealing with clippy error * Update pallets/parachain-system/src/lib.rs Co-authored-by: asynchronous rob <rphmeier@gmail.com> --------- Co-authored-by: asynchronous rob <rphmeier@gmail.com> * fmt * bump deps and remove warning * parachain-system: update RelevantMessagingState according to the unincluded segment (#2948) * mostly address 2471 with a bug introduced * adjust relevant messaging state after computing total * fmt * max -> min * fix test implementation of xcmp source * add test * fix test message sending logic * fix + test * add more to unincluded segment test * fmt --------- Co-authored-by: Chris Sosnin <chris125_@live.com> * Integrate new Aura / Parachain Consensus Logic in Parachain-Template / Polkadot-Parachain (#2864) * add a comment * refactor client/service utilities * deprecate start_collator * update parachain-template * update test-service in the same way * update polkadot-parachain crate * fmt * wire up new SubmitCollation message * some runtime utilities for implementing unincluded segment runtime APIs * allow parachains to configure their level of sybil-resistance when starting the network * make aura-ext compile * update to specify sybil resistance levels * fmt * specify relay chain slot duration in milliseconds * update Aura to explicitly produce Send futures also, make relay_chain_slot_duration a Duration * add authoring duration to basic collator and document params * integrate new basic collator into parachain-template * remove assert_send used for testing * basic-aura: only author when parent included * update polkadot-parachain-bin * fmt * some fixes * fixes * add a RelayNumberMonotonicallyIncreases * add a utility function for initializing subsystems * some logging for timestamp adjustment * fmt * some fixes for lookahead collator * add a log * update `find_potential_parents` to account for sessions * bound the loop * restore & deprecate old start_collator and start_full_node functions. * remove unnecessary await calls * fix warning * clippy * more clippy * remove unneeded logic * ci * update comment Co-authored-by: Marcin S. <marcin@bytedude.com> * (async backing) restore `CheckInherents` for backwards-compatibility (#2977) * bring back timestamp * Restore CheckInherents * revert to empty CheckInherents * make CheckInherents optional * attempt * properly end system blocks * add some more comments * ignore failing system parachain tests * update refs after main feature branch merge * comment out the offending tests because CI runs ignored tests * fix warnings * fmt * revert to polkadot master * cargo update -p polkadot-primitives -p sp-io --------- Co-authored-by: asynchronous rob <rphmeier@gmail.com> Co-authored-by: Ignacio Palacios <ignacio.palacios.santos@gmail.com> Co-authored-by: Bradley Olson <34992650+BradleyOlson64@users.noreply.github.com> Co-authored-by: Marcin S. <marcin@bytedude.com> Co-authored-by: eskimor <eskimor@users.noreply.github.com> Co-authored-by: Andronik <write@reusable.software> (cherry picked from commit 6ef1117) * Companion: restructure macro related exports (#3015) * restructure macro related exports * restructure macro related exports * wip * wip * update cargo lock * refactor RuntimeDebug on unincluded segment * fmt * Companion: restructure `benchmarking` macro related exports (#3039) * wip * wip * restructure benchmarking macro related exports * add cargo lock --------- Co-authored-by: parity-processbot <> (cherry picked from commit 8349c8d) * Add missing workspace members (#3056) * Add dependencies Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Add missing workspace members Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Fix more Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit 44499cf) * Add CI for monorepo (#1145) * Add CI for monorepo * fix frame tests * Format features Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * add note for skipping tests and disable test-linux-stable-all * Fix tests and compile issues (#1152) * Fix feature dependant import Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Bump test timeout Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Remove feature gate Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Add resolver 2 Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Remove old lockfile Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Format features Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Fix check-dependency-rules Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * rm test-runtime Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Actually fix script Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * enable cargo-check-each-crate-macos * Run check-each-crate on 6 machines (#1163) --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit e494934) * Fix features (#1194) * Manually fix conflicting ? Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Remove duplicates Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Autofix feature propagation zepter lint propagate-feature --feature try-runtime --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="try-runtime:frame-try-runtime" zepter lint propagate-feature --feature runtime-benchmarks --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="runtime-benchmarks:frame-benchmarking" zepter lint propagate-feature --feature std --left-side-feature-missing=ignore --workspace --fix zepter f f Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Bump zepter Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Add some duplicates Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Revert "Add some duplicates" This reverts commit c6ce627. * Remove default enabled features Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Bump Zepter Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Bump in correct location 🤦 Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * DNM: Add some mistakes Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * DNM: Add some mistakes Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Revert "DNM: Add some mistakes" This reverts commit d469b3f. * Revert "DNM: Add some mistakes" This reverts commit d892a73. --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit 0400ed9) * Fix build profiles (#1229) * Fix build profiles Closes #1155 Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Manually set version to 1.0.0 Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Use workspace repo Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * 'Authors and Edition from workspace Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit dcda0e5) * Set test crates to nopublish (#1240) * Set test crates to nopublish * Don't publish more crates * Set even more crates to nopublish --------- Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit 1c7ef1f) * Add missing licenses and tune the scanning workflow (#1288) * Add missing Cumulus licenses * Typo * Add missing Substrate licenses * Single job checking the sub-repos in steps * Remove dates * Remove dates * Add missing (C) * Update FRAME UI tests Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> * Update more UI tests Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> --------- Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> (cherry picked from commit bfb241d) * Restructure `dispatch` macro related exports (#1162) * restructure dispatch macro related exports * moved Dispatchable to lib.rs * fix .gitignore final newline * ".git/.scripts/commands/fmt/fmt.sh" * fix rustdocs * wip --------- Co-authored-by: Liam Aharon <liam.aharon@hotmail.com> Co-authored-by: command-bot <> Co-authored-by: ordian <write@reusable.software> (cherry picked from commit bdbe982) * Fixes * Fix clippy --------- Co-authored-by: Chris Sosnin <48099298+slumber@users.noreply.github.com> Co-authored-by: Juan <juangirini@gmail.com> Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io> Co-authored-by: Alexander Samusev <41779041+alvicsam@users.noreply.github.com> Co-authored-by: Lulu <morgan@parity.io> Co-authored-by: Przemek Rzad <roopert7@gmail.com>

eagr added 2 commits November 22, 2023 11:35

InvalidCandidate::HardTimeout -> PossiblyInvalidError::HardTimeout

3f41297

retry on job timeout

7f7d8a3

rm unused imports

ed52f94

mrcnski reviewed Nov 22, 2023

View reviewed changes

mrcnski closed this Nov 22, 2023

mrcnski mentioned this pull request Nov 22, 2023

PVF: Add a new class of "possibly invalid" errors #660

Closed

eagr deleted the retry-timeout branch September 19, 2024 10:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry timeout #2438

Retry timeout #2438

eagr commented Nov 22, 2023 •

edited

Loading

paritytech-cicd-pr commented Nov 22, 2023

mrcnski commented Nov 22, 2023

mrcnski Nov 22, 2023

mrcnski commented Nov 22, 2023

eagr commented Nov 22, 2023

mrcnski commented Nov 22, 2023 •

edited

Loading

eskimor commented Nov 22, 2023

eskimor commented Nov 22, 2023

mrcnski commented Nov 22, 2023

eagr commented Nov 22, 2023

burdges commented Nov 22, 2023 •

edited

Loading

mrcnski commented Nov 22, 2023

mrcnski commented Nov 22, 2023

mrcnski commented Nov 23, 2023

		// Either the worker or the job timed out. Kill the worker in either case. Treated as
		// definitely-invalid, because if we timed out, there's no time left for a retry.

Retry timeout #2438

Retry timeout #2438

Conversation

eagr commented Nov 22, 2023 • edited Loading

paritytech-cicd-pr commented Nov 22, 2023

mrcnski commented Nov 22, 2023

mrcnski Nov 22, 2023

Choose a reason for hiding this comment

mrcnski commented Nov 22, 2023

eagr commented Nov 22, 2023

mrcnski commented Nov 22, 2023 • edited Loading

eskimor commented Nov 22, 2023

eskimor commented Nov 22, 2023

mrcnski commented Nov 22, 2023

eagr commented Nov 22, 2023

burdges commented Nov 22, 2023 • edited Loading

mrcnski commented Nov 22, 2023

mrcnski commented Nov 22, 2023

mrcnski commented Nov 23, 2023

eagr commented Nov 22, 2023 •

edited

Loading

mrcnski commented Nov 22, 2023 •

edited

Loading

burdges commented Nov 22, 2023 •

edited

Loading