Mitigate timer job submission failures #4852
Conversation
so it is not necessary to prune them!
I think you'll need to explain this to me...

```motoko
if (failed == 0) @timers := @prune @timers;
failed += 1;
@timers := ?(switch @timers {
  case (?{ id = 0; pre; post; job = j; expire; delay })
```
why does this match on `id = 0`?
Only already reinserted nodes (special `id = 0`) should be delegated into the `pre` part of the PQ.
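As a toy illustration of that routing (a Python stand-in for the idea only; the `Node` shape and `insert` helper are hypothetical, not the actual Motoko runtime code): a reinserted job carries the special id 0, and insertion delegates such nodes into the `pre` side so they run ahead of the regular entries.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    id: int       # 0 marks a reinserted node
    expire: int
    pre: Optional["Node"] = None   # earlier-expiring subtree
    post: Optional["Node"] = None  # later-expiring subtree

def insert(n: Optional[Node], new: Node) -> Optional[Node]:
    if n is None:
        return new
    if new.id == 0 or new.expire < n.expire:
        # reinserted (id == 0) nodes are always delegated to the `pre` side
        n.pre = insert(n.pre, new)
    else:
        n.post = insert(n.post, new)
    return n
```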
This tree rotation maintains the original expiration order.
```diff
@@ -544,7 +544,7 @@ func @prune(n : ?@Node) : ?@Node = switch n {
     if (n.expire[0] == 0) {
       @prune(n.post) // by corollary
     } else {
-      ?{ n with pre = @prune(n.pre); post = @prune(n.post) }
+      ?{ n with pre = @prune(n.pre) }
```
rationale: the current node is not expunged, so the `post` portion won't contain any expunged nodes either
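The argument can be modeled in a few lines of Python (a toy stand-in, assuming expired nodes carry `expire == 0` and cluster toward the `pre` side; the `Node` type here is hypothetical, not the runtime's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    expire: int                    # 0 means expired
    pre: Optional["Node"] = None   # earlier-expiring subtree
    post: Optional["Node"] = None  # later-expiring subtree

def prune(n: Optional[Node]) -> Optional[Node]:
    if n is None:
        return None
    if n.expire == 0:
        # expired: drop this node (and its even earlier `pre` side),
        # continue pruning in `post`
        return prune(n.post)
    # unexpired: the node is kept, and by the corollary its `post` side
    # holds no expired nodes, so only `pre` needs pruning
    n.pre = prune(n.pre)
    return n
```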
review feedback
```motoko
func reinsert(job : () -> async ()) {
  if (failed == 0) {
    @timers := @prune @timers;
    ignore (prim "global_timer_set" : Nat64 -> Nat64) 1
```
At this point there are no `id = 0` nodes in the PQ. They were all in the front and have been gathered, expunged and pruned.
LGTM - thanks for the offline explanation
This fixes a disappearing-timer situation described by Timo in https://dfinity.slack.com/archives/CPL67E7MX/p1736347600078339. It turns out that under high message load the `async` timer-servicing routine cannot be run. The fix is simple: check whether the self-call succeeded (a failure already causes a `throw`), and if not, set a very near global timer to retry ASAP (in the top-level `catch`).

TODO:
- [x] `catch` send errors for user workers (and mitigate) — see #4852
- [ ] document that the user thunk may be called more than once, and thus should have no side effects other than submitting the self-call — see dfinity/motoko-base#682
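The control flow of that mitigation can be sketched as follows (a Python stand-in for illustration only; `send_self_call` and `set_global_timer` are hypothetical names, not the Motoko runtime API):

```python
def service_timers(send_self_call, set_global_timer):
    """Try the async servicing self-call; if it cannot be enqueued
    (modeled here as RuntimeError), set a very near global timer to retry."""
    try:
        send_self_call()
        return True
    except RuntimeError:
        set_global_timer(1)  # retry ASAP on the next timer tick
        return False
```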
This deals with the (unlikely) possibility that the send queue is not full when the timer servicing action is submitted, but becomes full while submitting the user jobs. Now we catch the failure and re-add (single-expiration) jobs to the start of the priority queue.
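A minimal Python model of that re-add path (hypothetical names, not the actual Motoko code): on the first submission failure the queue is pruned once, and every failed single-expiration job is reinserted at the front, tagged with the special id 0.

```python
def submit_expired(jobs, send, queue, prune):
    """Submit expired user jobs; re-add failures to the queue front."""
    failed = 0
    for job in jobs:
        try:
            send(job)
        except RuntimeError:          # models the send queue becoming full
            if failed == 0:
                prune(queue)          # prune only once, on the first failure
            failed += 1
            queue.insert(0, (0, job)) # id 0 marks a reinserted job
    return failed
```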
This is the missing piece to #4846.
This is an incremental change, so that we don't have to touch the happy path. A rewrite would be justified to collapse gathering and self-sends.
There is an optimisation realised in `@prune`; the same could be done in `gatherExpired(n.post)` with a slight restructuring of the conditions: