Fix queued snapshot status handler #1963

Merged
nicktrn merged 6 commits into main from fix/queued-snapshot-handler on Apr 22, 2025

Conversation

nicktrn (Collaborator) commented Apr 22, 2025

Runs can be requeued, e.g. when runners fail to come up fast enough. Previously we handled this poorly and the runs would fail. We now handle it gracefully: the current execution exits without failing the run, and the runner waits for another run as usual.

This PR also adds a resource monitor helper to the references and improves runtime manager debug logs in case of resume failures.

Summary by CodeRabbit

  • New Features

    • Introduced a Resource Monitor that periodically logs system and process resource usage, including disk, memory, and process metrics.
    • Added a new task to demonstrate the Resource Monitor, allowing users to start monitoring and view resource snapshots.
  • Improvements

    • Enhanced logging for task execution status changes, including better handling and messaging for "QUEUED" status.
    • Improved runtime status logging with recurring updates and more detailed context in log messages.

changeset-bot bot commented Apr 22, 2025

🦋 Changeset detected

Latest commit: 1c8f8a8

The changes in this PR will be included in the next version bump.

coderabbitai bot (Contributor) commented Apr 22, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a new ResourceMonitor class for collecting and logging system and process resource metrics, along with a new task (resourceMonitorTest) to demonstrate its usage. Additionally, it enhances logging and status handling in the managed execution and runtime management modules. Specifically, the execution logic now properly handles the "QUEUED" status by suspending runs instead of failing them, and the runtime manager periodically logs its status with more detailed context. No changes were made to exported method signatures, except for the addition of new types and classes related to resource monitoring.

Changes

File(s) Change Summary
references/hello-world/src/resourceMonitor.ts Added a new ResourceMonitor class and supporting types for monitoring and logging disk, memory, and process metrics. Provides public methods to start/stop monitoring and log snapshots.
references/hello-world/src/trigger/example.ts Added a new exported task resourceMonitorTest that demonstrates the use of ResourceMonitor by logging resource metrics before and after a delay.
packages/cli-v3/src/entryPoints/managed/execution.ts Updated handling of "QUEUED" execution status to suspend runs and log re-queue events, removing previous invalid status change handling for "QUEUED".
packages/core/src/v3/runtime/managedRuntimeManager.ts Enhanced logging in the runtime manager: now logs status every 5 minutes and includes more detailed context in logs for missing waitpoints/resolvers. Added a private status getter.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ExampleTask
    participant ResourceMonitor

    User->>ExampleTask: Trigger resourceMonitorTest
    ExampleTask->>ResourceMonitor: Create instance (dirName, processName)
    ExampleTask->>ResourceMonitor: startMonitoring(1000ms)
    ExampleTask->>ResourceMonitor: logResourceSnapshot("initial")
    Note right of ResourceMonitor: Collects disk, memory, process metrics
    ExampleTask->>ExampleTask: Wait 5 seconds
    ExampleTask->>ResourceMonitor: logResourceSnapshot("after 5s")
    ExampleTask->>ResourceMonitor: stopMonitoring()
    ExampleTask-->>User: Return message

Possibly related PRs

  • Fix v4 restore race condition #1870: Both PRs modify the handling of execution snapshot status changes, specifically improving the control flow when encountering "QUEUED" or "FINISHED" statuses in the managed run lifecycle.
  • Retry heartbeat timeouts by putting back in the queue #1689: The main PR adds handling for a "QUEUED" execution status by re-queuing and suspending runs without failure, which conceptually aligns with the retrieved PR's logic of retrying heartbeat timeouts by putting runs back in the queue to avoid failure, indicating related changes in managing run re-queuing and retry behavior.

Suggested reviewers

  • matt-aitken
  • ericallam

Poem

In burrows deep, a monitor wakes,
Watching memory, disk, and CPU stakes.
With every tick, it logs and spies,
On Node and friends as time goes by.
Now "QUEUED" runs rest, not doomed to fail—
The logs grow richer, with every trail.
🐇✨ System health, in every tale!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8ad30e and 1c8f8a8.

📒 Files selected for processing (1)
  • .changeset/wet-deers-think.md (1 hunks)

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
references/hello-world/src/resourceMonitor.ts (2)

99-110: Possible overlapping calls when the snapshot takes longer than the interval.

setInterval keeps firing even if the previous logResourceSnapshot call is still running, so calls can overlap and cause race conditions and CPU spikes.
Consider a recursive setTimeout or a running flag.

-this.logInterval = setInterval(this.logResources.bind(this), intervalMs);
+const tick = async () => {
+  await this.logResources();
+  this.logInterval = setTimeout(tick, intervalMs);
+};
+tick();
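
The suggestion above leaves a few details open: how the loop is stopped, what happens if the work throws, and what happens if monitoring is stopped while a tick is still in flight. A minimal sketch of a non-overlapping loop that covers those cases could look like the following. SerialTicker and all of its members are hypothetical names used for illustration, not part of the PR's ResourceMonitor.

```typescript
// Hypothetical helper, not the PR's ResourceMonitor: runs `work` repeatedly
// and never lets two runs overlap.
class SerialTicker {
  private readonly work: () => Promise<void>;
  private readonly intervalMs: number;
  private timer: ReturnType<typeof setTimeout> | undefined;
  private active = false;
  // Bumped on every start/stop so a loop left over from an earlier start()
  // cannot reschedule itself.
  private generation = 0;

  constructor(work: () => Promise<void>, intervalMs: number) {
    this.work = work;
    this.intervalMs = intervalMs;
  }

  start(): void {
    if (this.active) return; // repeated start() calls are ignored
    this.active = true;
    this.generation += 1;
    void this.tick(this.generation);
  }

  stop(): void {
    this.active = false;
    this.generation += 1;
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }

  private async tick(generation: number): Promise<void> {
    try {
      await this.work();
    } catch (error) {
      // A failing tick is logged and does not end the loop.
      console.error("tick failed:", error);
    }
    // Schedule the next run only after this one has finished, and only if
    // stop() has not been called in the meantime.
    if (generation === this.generation) {
      this.timer = setTimeout(() => void this.tick(generation), this.intervalMs);
    }
  }
}
```

Unlike setInterval, this measures the interval from the end of one run to the start of the next, so a slow snapshot delays later ticks instead of stacking them up. The first run also starts immediately on start() rather than after one interval.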

266-269: continue inside the catch is redundant.

Control already flows to the next iteration; removing it silences the linter warning.

-} catch {
-  // Ignore errors reading individual process info
-  continue;
-}
+} catch {
+  // Ignore errors reading individual process info
+}
🧰 Tools
🪛 Biome (1.9.4)

[error] 268-268: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c19693 and 19ebee6.

📒 Files selected for processing (4)
  • packages/cli-v3/src/entryPoints/managed/execution.ts (2 hunks)
  • packages/core/src/v3/runtime/managedRuntimeManager.ts (3 hunks)
  • references/hello-world/src/resourceMonitor.ts (1 hunks)
  • references/hello-world/src/trigger/example.ts (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (8)
packages/core/src/v3/runtime/managedRuntimeManager.ts (4)

30-33: Added regular status logging to help debug stuck executions.

This is a valuable improvement for observability. Periodic logging of the runtime status will help identify and debug stuck executions by providing insights into the internal state every 5 minutes.


179-179: Enhanced error logging with detailed context.

Including the full runtime status in the error log when a waitId is missing provides significantly more context for debugging, making troubleshooting easier.


186-186: Improved error logging with runtime state context.

Enriching the log with the current status when a resolver is missing provides better diagnostic information for tracking down issues with waitpoints and resolvers.


229-234: Well-structured status getter implementation.

The new private getter efficiently encapsulates the internal state of the runtime manager, providing a consistent view of active waitIds and waitpoints that's used throughout the logging improvements.
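
The logging changes described above (a recurring status log, plus the status attached to "missing resolver" errors) can be sketched roughly as follows. This is an illustration under assumptions, not the code in managedRuntimeManager.ts: WaitRegistry, Logger, pendingWaitIds, and the method names are all hypothetical, and only the 5 minute interval and the private status getter come from the review.

```typescript
// Hypothetical logger signature used only for this sketch.
type Logger = (message: string, context: Record<string, unknown>) => void;

class WaitRegistry {
  private readonly resolvers = new Map<string, (value: unknown) => void>();
  private readonly log: Logger;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(log: Logger) {
    this.log = log;
  }

  // Log the internal state on a fixed schedule (5 minutes by default).
  startStatusLogging(intervalMs: number = 5 * 60 * 1000): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => this.log("runtime status", this.status), intervalMs);
  }

  stopStatusLogging(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Register a wait; the returned promise settles when resolve() is called.
  waitFor(waitId: string): Promise<unknown> {
    return new Promise<unknown>((resolve) => {
      this.resolvers.set(waitId, resolve);
    });
  }

  resolve(waitId: string, value: unknown): boolean {
    const resolver = this.resolvers.get(waitId);
    if (!resolver) {
      // Attach the full status so the log shows what was actually pending.
      this.log("no resolver found for wait", { waitId, ...this.status });
      return false;
    }
    this.resolvers.delete(waitId);
    resolver(value);
    return true;
  }

  // Single source of truth for what gets logged.
  private get status(): { pendingWaitIds: string[] } {
    return { pendingWaitIds: Array.from(this.resolvers.keys()) };
  }
}
```

Routing both the periodic log and the error logs through one getter keeps the reported fields consistent between the two.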

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

272-278: Fixed handling of requeued runs.

This change properly handles the "QUEUED" status by suspending the run instead of failing it. The implementation correctly follows the pattern established for the "FINISHED" case by killing the process without marking the run as failed, allowing the system to naturally wait for another run.


412-412: Removed QUEUED from invalid status cases.

This change is a necessary companion to the new QUEUED status handler above. Since QUEUED is now properly handled with its own case, it's correctly removed from the invalid status changes section.
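
As a rough sketch of the control flow these two comments describe, the decision can be reduced to a pure function. Everything here is hypothetical except the "QUEUED" and "FINISHED" status names: the real handler in execution.ts operates on snapshot objects and kills the run process, and the actual list of invalid statuses is not shown on this page, so a placeholder stands in for it.

```typescript
type Outcome = "exit-without-failing" | "fail-run" | "keep-going";

// Placeholder for the statuses still treated as invalid transitions.
// After this PR, "QUEUED" is no longer one of them.
const INVALID_STATUSES: ReadonlySet<string> = new Set(["HYPOTHETICAL_INVALID_STATUS"]);

function outcomeForStatus(status: string): Outcome {
  switch (status) {
    case "QUEUED": // run was re-queued: stop this attempt, wait for the next run
    case "FINISHED": // run is done: stop this attempt
      return "exit-without-failing";
    default:
      return INVALID_STATUSES.has(status) ? "fail-run" : "keep-going";
  }
}
```

For example, outcomeForStatus("QUEUED") returns "exit-without-failing", whereas before the change a re-queued run would have taken the "fail-run" path.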

references/hello-world/src/resourceMonitor.ts (1)

284-292: machine.memory may be 0, which makes nodeMemoryPercent divide by zero and yield Infinity or NaN.

Add a minimum value or fallback to totalMemory.

-const machineMemoryBytes = this.ctx.machine
-  ? this.ctx.machine.memory * 1024 * 1024 * 1024
-  : totalMemory;
+const machineMemoryBytes =
+  this.ctx.machine?.memory && this.ctx.machine.memory > 0
+    ? this.ctx.machine.memory * 1024 * 1024 * 1024
+    : totalMemory;
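
To make the effect of this guard concrete, here is a small standalone sketch. Note that division by zero in JavaScript does not throw; it produces Infinity or NaN, which would then show up in the logged percentages. The function and parameter names are hypothetical, and the sketch assumes, as the diff does, that machine.memory is expressed in GB.

```typescript
const BYTES_PER_GB = 1024 * 1024 * 1024;

// Use the machine preset's memory when it is a positive number,
// otherwise fall back to the total memory reported by the OS.
function resolveMemoryLimitBytes(
  machineMemoryGb: number | undefined,
  totalMemoryBytes: number
): number {
  return machineMemoryGb !== undefined && machineMemoryGb > 0
    ? machineMemoryGb * BYTES_PER_GB
    : totalMemoryBytes;
}

// Defensive percentage: reports 0 instead of Infinity or NaN when the limit is unusable.
function memoryPercent(usedBytes: number, limitBytes: number): number {
  return limitBytes > 0 ? (usedBytes / limitBytes) * 100 : 0;
}
```

With a total of 8 GB, a machine.memory of 0 or undefined resolves to the 8 GB total, while a machine.memory of 2 resolves to 2 GB, so 1 GB of usage is reported as 50%.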
references/hello-world/src/trigger/example.ts (1)

3-3: Import uses a .js extension in a TypeScript file.

If the project compiles with moduleResolution: node (classic), this may break. Prefer path without extension (or .ts) and let the bundler resolve it.

-import { ResourceMonitor } from "../resourceMonitor.js";
+import { ResourceMonitor } from "../resourceMonitor";

@nicktrn nicktrn merged commit cedd932 into main Apr 22, 2025
10 of 12 checks passed
@nicktrn nicktrn deleted the fix/queued-snapshot-handler branch April 22, 2025 14:32