Fix queued snapshot status handler #1963

nicktrn · 2025-04-22T07:43:35Z

Runs can be requeued, e.g. when runners fail to come up fast enough. Previously we didn't handle this well and the runs would fail. We now exit the current execution without failing the run and wait for another run as usual.

This PR also adds a resource monitor helper to the references and improves runtime manager debug logs in case of resume failures.

Summary by CodeRabbit

New Features
- Introduced a Resource Monitor that periodically logs system and process resource usage, including disk, memory, and process metrics.
- Added a new task to demonstrate the Resource Monitor, allowing users to start monitoring and view resource snapshots.
Improvements
- Enhanced logging for task execution status changes, including better handling and messaging for "QUEUED" status.
- Improved runtime status logging with recurring updates and more detailed context in log messages.

changeset-bot · 2025-04-22T07:43:39Z

🦋 Changeset detected

Latest commit: 1c8f8a8

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

coderabbitai · 2025-04-22T07:43:42Z

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a new ResourceMonitor class for collecting and logging system and process resource metrics, along with a new task (resourceMonitorTest) to demonstrate its usage. Additionally, it enhances logging and status handling in the managed execution and runtime management modules. Specifically, the execution logic now properly handles the "QUEUED" status by suspending runs instead of failing them, and the runtime manager periodically logs its status with more detailed context. No changes were made to exported method signatures, except for the addition of new types and classes related to resource monitoring.

Changes

| File(s) | Change Summary |
| --- | --- |
| `references/hello-world/src/resourceMonitor.ts` | Added a new `ResourceMonitor` class and supporting types for monitoring and logging disk, memory, and process metrics. Provides public methods to start/stop monitoring and log snapshots. |
| `references/hello-world/src/trigger/example.ts` | Added a new exported task `resourceMonitorTest` that demonstrates the use of `ResourceMonitor` by logging resource metrics before and after a delay. |
| `packages/cli-v3/src/entryPoints/managed/execution.ts` | Updated handling of "QUEUED" execution status to suspend runs and log re-queue events, removing previous invalid status change handling for "QUEUED". |
| `packages/core/src/v3/runtime/managedRuntimeManager.ts` | Enhanced logging in the runtime manager: now logs status every 5 minutes and includes more detailed context in logs for missing waitpoints/resolvers. Added a private `status` getter. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ExampleTask
    participant ResourceMonitor

    User->>ExampleTask: Trigger resourceMonitorTest
    ExampleTask->>ResourceMonitor: Create instance (dirName, processName)
    ExampleTask->>ResourceMonitor: startMonitoring(1000ms)
    ExampleTask->>ResourceMonitor: logResourceSnapshot("initial")
    Note right of ResourceMonitor: Collects disk, memory, process metrics
    ExampleTask->>ExampleTask: Wait 5 seconds
    ExampleTask->>ResourceMonitor: logResourceSnapshot("after 5s")
    ExampleTask->>ResourceMonitor: stopMonitoring()
    ExampleTask-->>User: Return message
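
The flow in the diagram can be approximated with a much smaller monitor. This is a minimal sketch, not the PR's `ResourceMonitor`: the class name `MiniResourceMonitor`, the `MemorySnapshot` shape, and the injected `log` callback are assumptions made for illustration, and it only reads memory figures from Node's `os` module and `process.memoryUsage()` (no disk or per-process metrics).

```typescript
import * as os from "node:os";

// Hypothetical snapshot shape; the real ResourceMonitor collects more (disk, processes).
interface MemorySnapshot {
  label: string;
  totalBytes: number;
  freeBytes: number;
  rssBytes: number;
  heapUsedBytes: number;
}

// Hypothetical, simplified stand-in for the PR's ResourceMonitor.
class MiniResourceMonitor {
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(
    private readonly log: (snapshot: MemorySnapshot) => void = console.log
  ) {}

  // Collect system and process memory figures at this instant.
  snapshot(label: string): MemorySnapshot {
    const usage = process.memoryUsage();
    return {
      label,
      totalBytes: os.totalmem(),
      freeBytes: os.freemem(),
      rssBytes: usage.rss,
      heapUsedBytes: usage.heapUsed,
    };
  }

  // Log a single labelled snapshot, e.g. "initial" or "after 5s".
  logResourceSnapshot(label: string): void {
    this.log(this.snapshot(label));
  }

  // Start periodic logging; calling it twice does not create a second timer.
  startMonitoring(intervalMs: number): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(
      () => this.logResourceSnapshot("periodic"),
      intervalMs
    );
  }

  // Stop periodic logging so the process can exit cleanly.
  stopMonitoring(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

A task following the diagram would call `startMonitoring(1000)`, log an "initial" snapshot, wait, log a second snapshot, and then call `stopMonitoring()` before returning.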

Possibly related PRs

Fix v4 restore race condition #1870: Both PRs modify the handling of execution snapshot status changes, specifically improving the control flow when encountering "QUEUED" or "FINISHED" statuses in the managed run lifecycle.
Retry heartbeat timeouts by putting back in the queue #1689: Both PRs avoid failing a run by putting it back in the queue. This PR handles the "QUEUED" execution status by suspending the re-queued run without failure, while #1689 retries heartbeat timeouts by re-queuing the run.

Suggested reviewers

matt-aitken
ericallam

Poem

In burrows deep, a monitor wakes,
Watching memory, disk, and CPU stakes.
With every tick, it logs and spies,
On Node and friends as time goes by.
Now "QUEUED" runs rest, not doomed to fail—
The logs grow richer, with every trail.
🐇✨ System health, in every tale!

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8ad30e and 1c8f8a8.

📒 Files selected for processing (1)

.changeset/wet-deers-think.md (1 hunks)

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

references/hello-world/src/resourceMonitor.ts (2)
99-110: Possible overlapping calls when the snapshot takes longer than the interval.

setInterval keeps firing even if logResourceSnapshot is still running, which can cause overlapping snapshots and CPU spikes.
Consider setTimeout recursion or a running-flag.
-this.logInterval = setInterval(this.logResources.bind(this), intervalMs);
+const tick = async () => {
+  await this.logResources();
+  this.logInterval = setTimeout(tick, intervalMs);
+};
+tick();
266-269: continue inside the catch is redundant.

Control already flows to the next iteration; removing it silences the linter warning.
-} catch {
-  // Ignore errors reading individual process info
-  continue;
-}
+} catch {
+  // Ignore errors reading individual process info
+}
🧰 Tools

🪛 Biome (1.9.4)

[error] 268-268: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)
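
The `setTimeout` recursion suggested in the first nitpick above can be written as a standalone helper. This is a sketch under assumptions: `startNonOverlapping` is a hypothetical name that does not appear in the PR, and it adds a stop function and error handling that the suggested diff does not show.

```typescript
// Hypothetical helper, not from the PR: the next run is scheduled only after
// the previous one has settled, so two runs can never overlap.
function startNonOverlapping(
  task: () => Promise<void>,
  intervalMs: number
): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = async (): Promise<void> => {
    if (stopped) return;
    try {
      await task();
    } catch (error) {
      // A failing run is logged but does not break the schedule.
      console.error("scheduled task failed", error);
    }
    if (!stopped) {
      // intervalMs is the gap between the end of one run and the start of the next.
      timer = setTimeout(tick, intervalMs);
    }
  };

  // The first run starts immediately.
  void tick();

  // The returned function cancels any pending run.
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Unlike `setInterval`, the period here is measured from the end of one run, so a slow snapshot delays the next one instead of stacking on top of it.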

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c19693 and 19ebee6.

📒 Files selected for processing (4)

packages/cli-v3/src/entryPoints/managed/execution.ts (2 hunks)
packages/core/src/v3/runtime/managedRuntimeManager.ts (3 hunks)
references/hello-world/src/resourceMonitor.ts (1 hunks)
references/hello-world/src/trigger/example.ts (2 hunks)

🧰 Additional context used

⏰ Context from checks skipped due to timeout of 90000ms (7)

GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: units / 🧪 Unit Tests
GitHub Check: typecheck / typecheck
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (8)

packages/core/src/v3/runtime/managedRuntimeManager.ts (4)

30-33: Added regular status logging to help debug stuck executions.

This is a valuable improvement for observability. Periodic logging of the runtime status will help identify and debug stuck executions by providing insights into the internal state every 5 minutes.

179-179: Enhanced error logging with detailed context.

Including the full runtime status in the error log when a waitId is missing provides significantly more context for debugging, making troubleshooting easier.

186-186: Improved error logging with runtime state context.

Enriching the log with the current status when a resolver is missing provides better diagnostic information for tracking down issues with waitpoints and resolvers.

229-234: Well-structured status getter implementation.

The new private getter efficiently encapsulates the internal state of the runtime manager, providing a consistent view of active waitIds and waitpoints that's used throughout the logging improvements.
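
The logging pattern described in these comments could look roughly like the following. It is a reduced model, assuming a manager that tracks resolvers by wait ID and wait IDs by waitpoint. `MiniRuntimeManager`, its fields, its method names, and the log messages are all hypothetical and are not taken from `managedRuntimeManager.ts`.

```typescript
type Resolver = (value: unknown) => void;
type LogFn = (message: string, context: Record<string, unknown>) => void;

// Hypothetical, reduced model of a runtime manager for illustration only.
class MiniRuntimeManager {
  private readonly resolversByWaitId = new Map<string, Resolver>();
  private readonly waitIdsByWaitpoint = new Map<string, string>();
  private statusTimer: ReturnType<typeof setInterval> | undefined;

  constructor(private readonly log: LogFn) {}

  // Log the current status on a fixed period (5 minutes by default).
  startStatusLogging(intervalMs: number = 5 * 60 * 1000): void {
    if (this.statusTimer !== undefined) return;
    this.statusTimer = setInterval(
      () => this.log("Runtime status", this.status),
      intervalMs
    );
  }

  stopStatusLogging(): void {
    if (this.statusTimer !== undefined) {
      clearInterval(this.statusTimer);
      this.statusTimer = undefined;
    }
  }

  registerWait(waitId: string, waitpointId: string, resolver: Resolver): void {
    this.resolversByWaitId.set(waitId, resolver);
    this.waitIdsByWaitpoint.set(waitpointId, waitId);
  }

  // Returns false and logs the full status when the lookup fails.
  resolveWaitpoint(waitpointId: string, value: unknown): boolean {
    const waitId = this.waitIdsByWaitpoint.get(waitpointId);
    if (waitId === undefined) {
      this.log("No waitId found for waitpoint", { waitpointId, ...this.status });
      return false;
    }
    const resolver = this.resolversByWaitId.get(waitId);
    if (resolver === undefined) {
      this.log("No resolver found for waitId", { waitId, ...this.status });
      return false;
    }
    resolver(value);
    this.resolversByWaitId.delete(waitId);
    this.waitIdsByWaitpoint.delete(waitpointId);
    return true;
  }

  // One place that defines what "status" means for every log line.
  private get status(): { waitIds: string[]; waitpoints: string[] } {
    return {
      waitIds: Array.from(this.resolversByWaitId.keys()),
      waitpoints: Array.from(this.waitIdsByWaitpoint.keys()),
    };
  }
}
```

Routing both the periodic log and the error logs through the same getter keeps the reported fields identical, which makes a stuck run easier to correlate across log lines.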

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

272-278: Fixed handling of requeued runs.

This change properly handles the "QUEUED" status by suspending the run instead of failing it. The implementation correctly follows the pattern established for the "FINISHED" case by killing the process without marking the run as failed, allowing the system to naturally wait for another run.

412-412: Removed QUEUED from invalid status cases.

This change is a necessary companion to the new QUEUED status handler above. Since QUEUED is now properly handled with its own case, it's correctly removed from the invalid status changes section.
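
The control flow described in the two comments above can be sketched as a pure decision function. This is an illustration under assumptions, not the code in `execution.ts`: `decideAction` and the `Action` shape are hypothetical, and the status names other than "QUEUED" and "FINISHED" are placeholders.

```typescript
// Only "QUEUED" and "FINISHED" come from the PR discussion; the other two
// status names are placeholders for "keep running" and "invalid change".
type SnapshotStatus = "EXECUTING" | "QUEUED" | "FINISHED" | "RUN_CREATED";

// Hypothetical result type describing what the controller should do next.
interface Action {
  kind: "continue" | "exit" | "abort";
  failRun: boolean;
  reason: string;
}

function decideAction(status: SnapshotStatus): Action {
  switch (status) {
    case "EXECUTING":
      return { kind: "continue", failRun: false, reason: "run is executing" };
    case "QUEUED":
      // The run was re-queued: leave this execution, do not fail the run,
      // and go back to waiting for the next run.
      return { kind: "exit", failRun: false, reason: "run was re-queued" };
    case "FINISHED":
      // Same exit path as QUEUED: stop the process without failing the run.
      return { kind: "exit", failRun: false, reason: "run already finished" };
    case "RUN_CREATED":
      // Placeholder for a status that should never arrive mid-execution.
      return { kind: "abort", failRun: true, reason: "invalid status change" };
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a status is unhandled.
      const unreachable: never = status;
      throw new Error(`Unhandled status: ${String(unreachable)}`);
    }
  }
}
```

Giving "QUEUED" its own case is what allows it to be dropped from the invalid-status branch without falling through to a failure.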
references/hello-world/src/resourceMonitor.ts (1)

284-292: machine.memory may be 0, leading to divide‑by‑zero in nodeMemoryPercent.

Add a minimum value or fallback to totalMemory.
-const machineMemoryBytes = this.ctx.machine
-  ? this.ctx.machine.memory * 1024 * 1024 * 1024
-  : totalMemory;
+const machineMemoryBytes =
+  this.ctx.machine?.memory && this.ctx.machine.memory > 0
+    ? this.ctx.machine.memory * 1024 * 1024 * 1024
+    : totalMemory;
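
In JavaScript, dividing by zero yields `Infinity` or `NaN` rather than throwing, so an unguarded percentage would silently report a meaningless value. The guard above can be isolated into two small functions. The function names are hypothetical, and the unit (GiB) is an assumption inferred from the `1024 * 1024 * 1024` multiplication in the diff.

```typescript
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Use the machine's configured memory when it is a positive number,
// otherwise fall back to the total memory reported by the host.
function machineMemoryBytes(
  machineMemoryGiB: number | undefined,
  totalMemoryBytes: number
): number {
  return machineMemoryGiB !== undefined && machineMemoryGiB > 0
    ? machineMemoryGiB * BYTES_PER_GIB
    : totalMemoryBytes;
}

// Percentage of the limit in use; reports 0 when the limit is not positive.
function memoryPercent(usedBytes: number, limitBytes: number): number {
  return limitBytes > 0 ? (usedBytes / limitBytes) * 100 : 0;
}
```

Because `NaN > 0` is false, a `NaN` memory value also takes the fallback branch.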
references/hello-world/src/trigger/example.ts (1)

3-3: Import uses a .js extension in a TypeScript file.

If the project compiles with a module resolution setting that does not resolve .js specifiers to .ts sources, this may break. Prefer path without extension (or .ts) and let the bundler resolve it.
-import { ResourceMonitor } from "../resourceMonitor.js";
+import { ResourceMonitor } from "../resourceMonitor";

nicktrn added 4 commits April 22, 2025 08:11

handle queued status change gracefully

2d32e09

add resource monitor to references

3aeb86b

add resource monitor example

13afaa2

improve runtime manager debug logs

19ebee6

coderabbitai bot reviewed Apr 22, 2025

fix resource monitor example

f8ad30e

matt-aitken approved these changes Apr 22, 2025

add changeset

1c8f8a8

nicktrn merged commit cedd932 into main Apr 22, 2025
10 of 12 checks passed

nicktrn deleted the fix/queued-snapshot-handler branch April 22, 2025 14:32

github-actions bot mentioned this pull request Apr 22, 2025

chore: Update version for release (v4-beta) #1954

Merged

This was referenced Apr 28, 2025

Fix managed run controller edge cases #1987

Merged

Fix controller waitpoint resolution, suspendable state, and snapshot race conditions #2006

Merged
