Skip to content

Fixes for v4 waits and restores #1868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 2, 2025
Merged

Fixes for v4 waits and restores #1868

merged 12 commits into from
Apr 2, 2025

Conversation

nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Apr 2, 2025

Fixes:

  • runs shouldn't exit after 5m
  • clock reset after restore
  • env overrides after restore
  • exit codes for cleanup

Summary by CodeRabbit

  • New Features

    • Introduced new environment configurations allowing for metadata URL customization and configurable exit codes for improved process termination.
    • Enhanced workload management with support for a graceful shutdown status and immediate cleanup of terminated pods.
  • Tests

    • Added comprehensive tests validating graceful shutdown handling and immediate pod deletion, ensuring reliable performance.
  • Chores

    • Updated development scripts and dependencies for improved testing performance and smoother runtime operations.
    • Refined request handling and timing synchronization steps for enhanced overall system stability.

Copy link

changeset-bot bot commented Apr 2, 2025

⚠️ No Changeset found

Latest commit: 07a76e2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Contributor

coderabbitai bot commented Apr 2, 2025

Walkthrough

This update spans multiple parts of the codebase. In the supervisor app, script adjustments and dependency updates were made alongside the addition of a new metadata URL option via environment variables. The pod processing logic now distinguishes graceful shutdown scenarios with new status types and exit codes, and the related tests were updated accordingly. In addition, configurable exit codes have been introduced in the CLI module, while schema changes and clock resets were applied in core modules to support updated heartbeat and runtime operations.

Changes

File(s) Change Summary
apps/supervisor/package.json Updated test scripts by replacing "test:watch" with a version including --no-file-parallelism and added a new "test:run" script with the --run flag; added vitest@^1.4.0 dependency.
apps/supervisor/src/env.ts, apps/supervisor/src/index.ts, apps/supervisor/src/workloadManager/{docker.ts, kubernetes.ts}, apps/supervisor/src/workloadManager/types.ts Introduced support for the metadata URL via an optional TRIGGER_METADATA_URL environment variable and integrated it into workload manager options.
apps/supervisor/src/services/failedPodHandler.ts, apps/supervisor/src/services/failedPodHandler.test.ts Added a new "GracefulShutdown" pod status and GRACEFUL_SHUTDOWN_EXIT_CODE (200); updated logic in processFailedPod to handle graceful shutdown, and modified test cases and function signatures to support flexible exit code handling and immediate pod deletion.
apps/supervisor/src/services/podCleaner.test.ts Modified the cleanup function to pass a gracePeriodSeconds parameter with value 0 for immediate pod deletion.
packages/cli-v3/src/entryPoints/managed-run-controller.ts Added new environment variables TRIGGER_SUCCESS_EXIT_CODE and TRIGGER_FAILURE_EXIT_CODE along with corresponding properties in ManagedRunController; updated process exit logic to use configurable exit codes.
packages/core/src/v3/runEngineWorker/{supervisor/schemas.ts, workload/http.ts} Changed cpu and memory in WorkerApiRunHeartbeatRequestBody to optional and updated heartbeatRun method to accept an optional request body.
packages/core/src/v3/runtime/managedRuntimeManager.ts Introduced a new import for clock and inserted a clock.reset() call in completeWaitpoint to ensure accurate time management during waitpoint resolution.

Sequence Diagram(s)

sequenceDiagram
    participant Pod as Pod
    participant F as FailedPodHandler
    participant M as Metrics
    participant L as Logger

    Pod->>F: Pod terminates (exit code 200)
    F->>F: Check against GRACEFUL_SHUTDOWN_EXIT_CODE
    F->>L: Log graceful shutdown detected
    F->>M: Increment "GracefulShutdown" metric
    F-->>Pod: Exit processing early
Loading
sequenceDiagram
    participant Env as Environment
    participant MRC as ManagedRunController
    participant Proc as Process

    Env->>MRC: Supply TRIGGER_SUCCESS_EXIT_CODE and TRIGGER_FAILURE_EXIT_CODE
    MRC->>MRC: Initialize exit codes from environment
    alt Success condition
        MRC->>Proc: exitProcess(successExitCode)
    else Failure condition
        MRC->>Proc: exitProcess(failureExitCode)
    end
Loading

Possibly related PRs

Suggested reviewers

  • ericallam
  • matt-aitken

Poem

I'm a coding rabbit, hopping with cheer,
Dancing through code changes, far and near.
With metadata leaps and graceful shutdowns in sight,
I bounce in delight at each update so bright.
A twitch of my nose seals this whimsical flight!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d819c6 and 07a76e2.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (13)
  • apps/supervisor/package.json (2 hunks)
  • apps/supervisor/src/env.ts (1 hunks)
  • apps/supervisor/src/index.ts (1 hunks)
  • apps/supervisor/src/services/failedPodHandler.test.ts (5 hunks)
  • apps/supervisor/src/services/failedPodHandler.ts (3 hunks)
  • apps/supervisor/src/services/podCleaner.test.ts (1 hunks)
  • apps/supervisor/src/workloadManager/docker.ts (1 hunks)
  • apps/supervisor/src/workloadManager/kubernetes.ts (1 hunks)
  • apps/supervisor/src/workloadManager/types.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed-run-controller.ts (10 hunks)
  • packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (1 hunks)
  • packages/core/src/v3/runEngineWorker/workload/http.ts (2 hunks)
  • packages/core/src/v3/runtime/managedRuntimeManager.ts (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
apps/supervisor/src/services/failedPodHandler.test.ts (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
  • FailedPodHandler (18-298)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
🔇 Additional comments (24)
apps/supervisor/src/workloadManager/types.ts (1)

8-8: Looks good: Adding metadataUrl property

The addition of the optional metadataUrl property to the WorkloadManagerOptions interface is well-implemented and follows the existing pattern for URL-based properties.

apps/supervisor/src/env.ts (1)

38-38: LGTM: New metadata URL environment variable

The addition of the optional TRIGGER_METADATA_URL environment variable is consistent with other optional string variables in the schema.

apps/supervisor/src/workloadManager/docker.ts (1)

48-50: Implementation looks good

The conditional check and environment variable addition for metadataUrl follows the same pattern as other similar properties like warmStartUrl.

packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (1)

76-77: Appropriate schema change for heartbeat flexibility

Making CPU and memory optional in the heartbeat request body adds flexibility that will support the clock reset feature after restore operations mentioned in the PR objectives.

apps/supervisor/src/workloadManager/kubernetes.ts (1)

137-139: LGTM: Added metadata URL support

The addition of the TRIGGER_METADATA_URL environment variable is properly implemented following the existing pattern for conditional environment variables.

apps/supervisor/src/services/podCleaner.test.ts (1)

32-32: Better test cleanup with graceful termination bypass

Setting gracePeriodSeconds: 0 ensures pods are deleted immediately during test cleanup without waiting for the standard termination grace period, improving test efficiency.

apps/supervisor/src/index.ts (1)

64-64: LGTM: Properly exposed metadata URL configuration

The addition of the metadataUrl property to the workload manager options correctly passes through the environment variable, supporting the PR objective of allowing environment variable overrides to take effect after a restore.

apps/supervisor/package.json (2)

11-12: LGTM: Improved test setup with sequential execution

Adding --no-file-parallelism ensures tests run sequentially across files, preventing potential race conditions or resource conflicts, especially important for Kubernetes integration tests.


28-29: LGTM: Added Vitest dependency

Proper addition of the Vitest testing framework as a dev dependency to support the updated test scripts.

packages/core/src/v3/runEngineWorker/workload/http.ts (1)

44-44: Improved flexibility in heartbeat function

The method signature has been updated to make the body parameter optional, allowing for more flexible usage of the heartbeat function. The body is now constructed with a fallback to an empty object when undefined, ensuring consistent behavior across calls.

Also applies to: 54-54

packages/core/src/v3/runtime/managedRuntimeManager.ts (2)

1-1: Added clock import for time management

This import enables time-related operations in the runtime manager, which is necessary for the added functionality.


194-196: Clock reset implemented before waitpoint resolution

This change ensures that the current time is accurate before resolving a waitpoint, which addresses the PR objective of implementing a clock reset feature after a restore operation. This is critical for time-sensitive operations that might be affected by system time changes during restores.

apps/supervisor/src/services/failedPodHandler.ts (3)

9-9: Added new pod status type for graceful shutdowns

The PodStatus type now includes "GracefulShutdown" as a possible state, which helps distinguish between different types of pod terminations and enables proper status tracking in metrics.


37-37: Added constant for graceful shutdown exit code

Using a static constant for the exit code provides a single source of truth that can be referenced throughout the codebase and in tests, making maintenance easier.


211-224: Special handling for gracefully shutdown pods

This logic allows the system to identify pods that terminated through the graceful shutdown process and handle them differently from other failed pods. This implementation:

  1. Checks for the special exit code in the main container
  2. Logs the detection at debug level
  3. Updates metrics with the new status type
  4. Returns early to skip standard failure processing

This aligns with the PR objective to adjust exit codes related to cleanup processes.

apps/supervisor/src/services/failedPodHandler.test.ts (4)

318-412: Added comprehensive test for graceful shutdown pods

This test verifies the new graceful shutdown functionality by:

  1. Testing processing of graceful shutdown pods that exist before the handler starts
  2. Testing processing of graceful shutdown pods created after the handler starts
  3. Verifying correct metric attribution (pods counted as GracefulShutdown not Failed)
  4. Confirming pods are still properly deleted

The test is structured to closely mirror the existing test patterns while validating the new behavior.


424-424: Enhanced test pod creation with configurable exit codes

Adding the exitCode parameter to the test helper function improves test flexibility by allowing for creation of pods with specific exit codes. The implementation appropriately handles the new parameter by overriding the container command when an exit code is specified.

Also applies to: 434-441


458-458: Container name updated to match production checks

Changing the test container name to "run-controller" ensures that the test environment properly mirrors the production environment, where the failedPodHandler specifically looks for this container name when checking exit codes.


576-576: Improved pod cleanup with zero grace period

Adding gracePeriodSeconds: 0 ensures immediate pod deletion during cleanup, which helps tests run more efficiently and reliably by avoiding potential delays between test runs.

packages/cli-v3/src/entryPoints/managed-run-controller.ts (5)

59-60: Support for configurable exit codes added.

These new environment variables allow configuring the exit codes used when terminating the process, enhancing flexibility for integration with different environments.


87-88: Properly extended Metadata type.

The exit code configuration options are correctly added to the Metadata type, maintaining consistency with the environment schema.


129-131: Exit code class properties properly initialized.

These private properties correctly initialize with the default environment values, setting up the controller to use configurable exit codes.


675-681: Environment override support for exit codes.

This implementation correctly applies exit code overrides from the metadata service, aligning with the PR objective to "allow for environment variable overrides to take effect after a restore."


835-835: Replaced hardcoded exit codes with configurable values.

The hardcoded exit codes have been replaced with the class properties, providing consistent behavior that respects configuration. This change addresses the PR objective to adjust exit codes related to cleanup processes.

Also applies to: 846-846, 874-874, 885-885, 898-898

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@nicktrn nicktrn merged commit dcacc3a into main Apr 2, 2025
14 checks passed
@nicktrn nicktrn deleted the fix/wait-with-suspend branch April 2, 2025 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants