Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate timer job submission failures #4852

Merged
merged 6 commits into from
Jan 17, 2025
Merged

Conversation

ggreif
Copy link
Contributor

@ggreif ggreif commented Jan 16, 2025

This deals with the (unlikely) possibility that the send queue is not full when the timer servicing action is submitted, but becomes full while submitting the user jobs. Now we catch the failure and re-add (single-expiration) jobs to the start of the priority queue.

This is the missing piece to #4846.

This is an incremental change, so that we don't have to touch the happy path. A rewrite would be justified to collapse gathering and self-sends.

There is an optimisation realised in @prune, the same could be done in gatherExpired(n.post) with a slight restructuring of the conditions:

      if (n.expire[0] <= now and gathered < thunks.size()) {
        if (n.expire[0] > 0) {
          ...
        };
        gatherExpired(n.post)
      }

Copy link

github-actions bot commented Jan 16, 2025

Comparing from 980cd23 to 321786a:
In terms of gas, no changes are observed in 5 tests.
In terms of size, no changes are observed in 5 tests.

@ggreif ggreif requested a review from crusso January 16, 2025 16:12
@ggreif ggreif marked this pull request as ready for review January 16, 2025 17:32
@ggreif ggreif requested a review from a team as a code owner January 16, 2025 17:32
@crusso
Copy link
Contributor

crusso commented Jan 16, 2025

I think you'll need to explain this to me...

src/prelude/internals.mo Outdated Show resolved Hide resolved
if (failed == 0) @timers := @prune @timers;
failed += 1;
@timers := ?(switch @timers {
case (?{ id = 0; pre; post; job = j; expire; delay })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this match on id = 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only already reinsert-ed nodes (special id = 0) should be delegated into the pre part of the PQ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tree rotation maintains the original expiration order.

@@ -544,7 +544,7 @@ func @prune(n : ?@Node) : ?@Node = switch n {
if (n.expire[0] == 0) {
@prune(n.post) // by corollary
} else {
?{ n with pre = @prune(n.pre); post = @prune(n.post) }
?{ n with pre = @prune(n.pre) }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rationale: current node is not expunged, so the post portion won't contain any either

func reinsert(job : () -> async ()) {
if (failed == 0) {
@timers := @prune @timers;
ignore (prim "global_timer_set" : Nat64 -> Nat64) 1
Copy link
Contributor Author

@ggreif ggreif Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point there are no id = 0 nodes in the PQ. They were all in the front and have been gathered, expunged and pruned.

@ggreif ggreif requested a review from crusso January 17, 2025 13:29
@ggreif ggreif self-assigned this Jan 17, 2025
@ggreif ggreif added the opportunity More optimisation opportunities inside label Jan 17, 2025
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for the offline explanation

@ggreif ggreif added automerge-squash When ready, merge (using squash) and removed automerge-squash When ready, merge (using squash) labels Jan 17, 2025
@ggreif ggreif merged commit af743f3 into gabor/timer-check Jan 17, 2025
5 checks passed
@ggreif ggreif deleted the gabor/timer-mitigate branch January 17, 2025 14:50
mergify bot pushed a commit that referenced this pull request Jan 17, 2025
This fixes a disappearing timer situation described by Timo in https://dfinity.slack.com/archives/CPL67E7MX/p1736347600078339.

It turns out that under high message load the `async` timer servicing routine cannot be run. The fix is simple, check if the self-call succeeded (causes a `throw` already), and if not, set a very near global timer to retry ASAP (in the top-level `catch`).

TODO:
- [x] `catch` send errors for user workers (and mitigate) — see #4852
- [ ] document that the user thunk may be called more than once, and thus should have no side effects other than submitting the self-call — see dfinity/motoko-base#682
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
opportunity More optimisation opportunities inside
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants