Warm start and restore improvements #1793

Merged
merged 17 commits into main from fix/supervisor-misc
Mar 14, 2025

Conversation

nicktrn
Collaborator

@nicktrn nicktrn commented Mar 14, 2025

Many small fixes in this one. The biggest changes here are:

  • Warm start config now dynamic and returned on connect request
  • Optional metadata service to override env vars after restore
  • Changes to related services to make it possible to update config after init

Summary by CodeRabbit

  • New Features

    • Enhanced service initialization with improved restart procedures and dynamic environment configuration.
    • Introduced adjustable heartbeat and connection management for smoother operation.
  • Bug Fixes

    • Improved logging and error messaging to provide clearer feedback during request handling.
  • Refactor

    • Streamlined internal communication and identifier generation processes for a leaner, more robust system.

changeset-bot bot commented Mar 14, 2025

⚠️ No Changeset found

Latest commit: ebb205e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Contributor

coderabbitai bot commented Mar 14, 2025

Walkthrough

This update introduces robust logging, improved error handling, and changes to how unique identifiers are generated and imported. Several modules now validate the presence of essential services (like a checkpoint client) before proceeding. The warm start functionality has been refactored with a dedicated client and associated environment configuration, altering long polling and metadata processing. Additionally, custom API request handling code has been removed or centralized across both CLI and core modules. Several imports now point to a centralized package to standardize identifier creation.

Changes

| File(s) | Change Summary |
| --- | --- |
| apps/supervisor/src/index.ts | Added logging to indicate checkpoint enablement; enhanced error handling for checkpoint restoration and warm start (removed optional chaining after validation). |
| apps/supervisor/src/util.ts | Removed IdGenerator class and associated RunnerId instance. |
| apps/supervisor/src/workloadManager/docker.ts, apps/supervisor/src/workloadManager/kubernetes.ts | Updated RunnerId import from local utility to @trigger.dev/core/v3/isomorphic; in Kubernetes, changed container naming to a static string and removed run ID label. |
| apps/supervisor/src/workloadServer/index.ts | Modified the suspend request handler to first check for the existence of checkpointClient, adjusting error responses accordingly. |
| packages/cli-v3/src/apiClient.ts | Removed the ApiResult type and wrapZodFetch function, indicating a shift in API call handling. |
| packages/cli-v3/src/entryPoints/managed-run-controller.ts | Updated environment variable names; introduced metadataClient for fetching env overrides and warmStartClient for managing warm start connections; refactored heartbeat and snapshot polling logic. |
| packages/core/src/v3/apiClient/core.ts | Added a new ApiResult type and wrapZodFetch function with retry logic and comprehensive error handling. |
| packages/core/src/v3/isomorphic/friendlyId.ts | Introduced a new IdGenerator class along with an exported RunnerId instance configured with a custom alphabet, length, and prefix. |
| packages/core/src/v3/runEngineWorker/supervisor/http.ts | Removed the local wrapZodFetch function and associated ApiResult type, replacing them with a centralized implementation. |
| packages/core/src/v3/runEngineWorker/workload/http.ts | Changed the apiUrl property to mutable and added an updateApiUrl method to manage URL formatting. |
| packages/core/src/v3/serverOnly/httpServer.ts | Updated the RouteHandler return type from Promise<void> to Promise<any>. |
| packages/core/src/v3/utils/heartbeat.ts | Added an updateInterval method in the HeartbeatService to adjust the heartbeat interval dynamically. |
| packages/core/src/v3/workers/index.ts | Added new exports for WarmStartClient and its options type from the worker module. |
| packages/core/src/v3/workers/warmStartClient.ts | Introduced a new WarmStartClient class with methods for connecting to a warm start API, initiating long polling via a private longPoll method, and handling errors with exponential backoff. |
| packages/core/src/v3/schemas/index.ts | Added a new export for all entities from the warmStart.js module. |
| packages/core/src/v3/schemas/warmStart.ts | Introduced a new WarmStartConnectResponse schema and its associated type using Zod for validation. |

Sequence Diagram(s)

sequenceDiagram
    participant MRC as ManagedRunController
    participant MC as MetadataClient
    participant WH as WarmStartClient
    participant WS as Warm Start Service

    MRC->>MC: Fetch environment overrides
    MC-->>MRC: Return env overrides

    MRC->>WH: Initiate connection (GET /connect)
    WH->>WS: GET /connect
    WS-->>WH: Respond with WarmStartConnectResponse

    MRC->>WH: Start warm start process (long polling /warm-start)
    WH->>WS: Long poll /warm-start
    WS-->>WH: Return dequeued message or error
    WH-->>MRC: Deliver warm start response

Suggested reviewers

  • matt-aitken

Poem

I'm a little rabbit, hopping through code,
Logging each checkpoint on my winding road.
With warm starts and retries, my pace is light,
Fixing errors and paths by day and night.
Code carrots crunch, let the changes unfold,
Hoppy reviews in every line, brave and bold!
🐇💻

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

Scope: all 2 workspace projects
 ERR_PNPM_OPTIONAL_DEPS_REQUIRE_PROD_DEPS  Optional dependencies cannot be installed without production dependencies

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (10)
apps/supervisor/src/workloadManager/docker.ts (1)

10-10: Well-structured import for Docker utilities.

Importing getDockerHostDomain directly from ../util.js highlights a clear separation of concerns, keeping the Docker-specific logic encapsulated. Make sure that the utility handles network edge cases, such as missing or undefined env.DOCKER_NETWORK.

apps/supervisor/src/index.ts (1)

90-93: Consider validating the checkpoint URL.
While logging the existence of the checkpoint URL is helpful, it might be useful to confirm that it is a valid URL before proceeding, to avoid potential runtime errors.
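
A minimal sketch of such a check, assuming the checkpoint URL arrives as an optional string. The helper name `parseCheckpointUrl` and the HTTP(S)-only restriction are illustrative assumptions, not code from this PR:

```typescript
// Hypothetical helper, not part of the PR: returns a parsed URL, or null
// when the value is missing or malformed.
function parseCheckpointUrl(raw: string | undefined): URL | null {
  if (!raw) return null;
  try {
    const url = new URL(raw);
    // Accepting only HTTP(S) endpoints is an assumption made for this sketch.
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    // new URL() throws a TypeError on malformed input.
    return null;
  }
}
```

A caller could log and skip checkpoint setup when the helper returns null instead of failing later at request time.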

apps/supervisor/src/workloadManager/kubernetes.ts (1)

58-58: Container name is now static.
Using a single, invariant name "run-controller" for every container may cause naming collisions if multiple pods or containers run concurrently. Consider adding a unique suffix to avoid potential conflicts.

- name: "run-controller",
+ name: `run-controller-${runnerId}`,
packages/core/src/v3/workers/warmStartClient.ts (1)

150-159: Consider removing superfluous continue statements.
The static analysis suggests that these continue statements are unnecessary as there is no code after them within the loop body. Removing them clarifies flow control.

150         if (error instanceof Error && error.name === "AbortError") {
151           this.logger.log("Long poll request timed out, retrying...");
-152           continue;
153         } else {
154           this.logger.error("Error during fetch, retrying...", { error });
...
-159           continue;
160         }
🧰 Tools
🪛 Biome (1.9.4)

[error] 152-152: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)


[error] 159-159: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)

packages/cli-v3/src/entryPoints/managed-run-controller.ts (6)

78-85: Prevent drift by using partial environment schemas

Defining a parallel Metadata type can risk subtle mismatches with your original Env schema. Consider reusing a partial subset of Env (e.g., a new Zod schema) for consistency and easier maintenance.


87-103: Improve error handling and schema validation

MetadataClient fetches JSON from the metadata URL but doesn't validate the result. Consider using Zod for strict type checking and switch to logger instead of console.error to keep logging consistent.
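
As a rough illustration of validating the response before applying it, the sketch below uses a hand-written guard in place of Zod so that it stays dependency-free. The payload shape (a flat object of string-valued overrides) and the name `parseEnvOverrides` are assumptions; the actual metadata format is not shown in this review:

```typescript
// Assumed shape: a flat JSON object mapping env var names to string values.
type EnvOverrides = Record<string, string>;

// Hypothetical guard: keeps only string-valued entries and drops the rest.
function parseEnvOverrides(json: unknown): EnvOverrides {
  if (typeof json !== "object" || json === null || Array.isArray(json)) {
    return {};
  }
  const overrides: EnvOverrides = {};
  for (const [key, value] of Object.entries(json)) {
    if (typeof value === "string") {
      overrides[key] = value;
    }
  }
  return overrides;
}
```

With Zod, the same intent could be expressed as a schema and checked with `safeParse`, which would also give structured error details for logging.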


135-243: Large constructor - consider splitting setup

The constructor initializes many services, pollers, and signal handlers. You could refactor these into dedicated methods or helper classes to reduce complexity. Also note the mix of console.* and logger.* usage—adopting a single logging approach might simplify debugging.


819-822: Graceful fallback for warm start unavailability

If this.warmStartClient is missing, you call exitProcess(0). Consider allowing a cold-start fallback or introducing a retry mechanism if partial warm start usage is acceptable.


890-893: Use caution with immediate process termination

process.exit() might skip ongoing async tasks like log flushing. Consider a brief delay or a graceful shutdown routine to ensure logs and crucial I/O are fully completed first.


1115-1115: Consider making forced shutdown configurable

Exiting the process after 5 minutes (exitProcess(1)) can prevent debugging and final logging tasks from running. If this timeout is only for testing, you might want a configuration toggle for production.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5896165 and ebb205e.

📒 Files selected for processing (17)
  • apps/supervisor/src/index.ts (3 hunks)
  • apps/supervisor/src/util.ts (0 hunks)
  • apps/supervisor/src/workloadManager/docker.ts (1 hunks)
  • apps/supervisor/src/workloadManager/kubernetes.ts (2 hunks)
  • apps/supervisor/src/workloadServer/index.ts (1 hunks)
  • packages/cli-v3/src/apiClient.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed-run-controller.ts (11 hunks)
  • packages/core/src/v3/apiClient/core.ts (1 hunks)
  • packages/core/src/v3/isomorphic/friendlyId.ts (1 hunks)
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts (1 hunks)
  • packages/core/src/v3/runEngineWorker/workload/http.ts (2 hunks)
  • packages/core/src/v3/schemas/index.ts (1 hunks)
  • packages/core/src/v3/schemas/warmStart.ts (1 hunks)
  • packages/core/src/v3/serverOnly/httpServer.ts (1 hunks)
  • packages/core/src/v3/utils/heartbeat.ts (1 hunks)
  • packages/core/src/v3/workers/index.ts (1 hunks)
  • packages/core/src/v3/workers/warmStartClient.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • apps/supervisor/src/util.ts
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (32)
packages/core/src/v3/schemas/index.ts (1)

16-16: Added warm start functionality export

Good addition of the new export statement that makes the warm start functionality available through the index file. This aligns with the PR objective of making the warm start configuration dynamic.

packages/core/src/v3/utils/heartbeat.ts (1)

57-60: Good implementation of dynamic heartbeat interval update

This new method allows for updating the heartbeat interval after initialization, which aligns with the PR objective of making configurations updatable after the initialization process. The implementation correctly updates the private property and resets the current interval.
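
The pattern described here (store the new interval, then restart the timer if one is active) might look roughly like the following. The class `IntervalRunner` is a hypothetical stand-in, not the actual HeartbeatService:

```typescript
// Hypothetical stand-in for a heartbeat-style service; names are illustrative.
class IntervalRunner {
  private timer: ReturnType<typeof setInterval> | undefined;
  private intervalMs: number;
  private readonly tick: () => void;

  constructor(tick: () => void, intervalMs: number) {
    this.tick = tick;
    this.intervalMs = intervalMs;
  }

  start(): void {
    this.stop(); // avoid stacking timers on repeated start() calls
    this.timer = setInterval(this.tick, this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Store the new interval; restart the timer only if one is already running.
  updateInterval(intervalMs: number): void {
    this.intervalMs = intervalMs;
    if (this.timer !== undefined) {
      this.start();
    }
  }

  get currentIntervalMs(): number {
    return this.intervalMs;
  }

  get isRunning(): boolean {
    return this.timer !== undefined;
  }
}
```

Restarting only when a timer is active means that calling `updateInterval` before `start` does not begin ticking as a side effect.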

packages/core/src/v3/workers/index.ts (1)

22-22: Properly exported WarmStartClient

Good addition of the export for the WarmStartClient and its options type. This makes the warm start functionality accessible to other modules and supports the PR objective of implementing warm start improvements.

packages/core/src/v3/serverOnly/httpServer.ts (1)

19-19: Enhanced RouteHandler return type flexibility

❓ Verification inconclusive

Changing the return type from Promise<void> to Promise<any> makes the API more flexible by allowing route handlers to return values that may be used elsewhere. This change supports the PR objective of facilitating updates to configuration settings.

Consider verifying that existing route handlers don't rely on the return value being void.


🏁 Script executed:

#!/bin/bash
# Find places where RouteHandler is used and check if they expect void return type

# Look for files where RouteHandler is imported or used
rg -l "RouteHandler" --type=ts

# Look for route handler implementations that might be returning values
rg "handler: async \(\{.*\}\) => \{.*return" --type=ts

Length of output: 193


Enhanced RouteHandler Return Type Verification

Based on our search results:

  • The term RouteHandler appears only in:
    • packages/core/src/v3/serverOnly/httpServer.ts
    • references/nextjs-realtime/src/app/api/uploadthing/route.ts
  • Our search for asynchronous implementations returning values (using the pattern handler: async ({ ... }) => { ... return) produced no matches, indicating that none of the implementations explicitly leverage a return value beyond void.

The change from Promise<void> to Promise<any> appears safe. However, please manually verify in references/nextjs-realtime/src/app/api/uploadthing/route.ts (and any other potential caller not captured by our regex) that none of the route handlers depend on the previous void return type.

packages/core/src/v3/schemas/warmStart.ts (1)

3-8: Looks good!

The schema and inferred type naming pattern is a common approach with Zod and TypeScript; reusing the same name for the schema constant and the type alias is acceptable and keeps the code straightforward.

apps/supervisor/src/workloadManager/docker.ts (1)

2-2: Centralized import of RunnerId is consistent.

Shifting RunnerId to a dedicated package helps centralize ID generation logic and maintain consistency across modules. This is a beneficial change, especially for large codebases where ID generation must remain uniform.

packages/cli-v3/src/apiClient.ts (1)

35-35: Ensure consistency with newly introduced imports.

Importing ApiResult, wrapZodFetch, and zodfetchSSE from @trigger.dev/core/v3/zodfetch is aligning with the rest of the file's usage of these utilities. Given the file’s continued reliance on ApiResult for typed responses, this import seems appropriate. Keep an eye on version updates for zodfetch to maintain compatibility in the future.

packages/core/src/v3/isomorphic/friendlyId.ts (2)

94-108: Well-structured ID generation class

The IdGenerator class is well-designed with proper encapsulation through private fields and a clear generation method. This provides a consistent way to generate IDs throughout the codebase.


110-114: Good standardization for runner identifiers

Creating a standardized RunnerId generator with consistent prefix and alphabet is a good practice. This centralizes runner ID generation logic and ensures consistency across the codebase.
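
For readers unfamiliar with the pattern, a generator configured with an alphabet, length, and prefix could be sketched as below. This is an assumption-laden illustration: the alphabet, length, and prefix values are made up, and `Math.random` is used only for brevity (it is not suitable where unpredictability matters); the real class may be built differently:

```typescript
// Illustrative sketch only; not the IdGenerator from friendlyId.ts.
class ExampleIdGenerator {
  private readonly alphabet: string;
  private readonly length: number;
  private readonly prefix: string;

  constructor(options: { alphabet: string; length: number; prefix: string }) {
    this.alphabet = options.alphabet;
    this.length = options.length;
    this.prefix = options.prefix;
  }

  generate(): string {
    let id = "";
    for (let i = 0; i < this.length; i++) {
      // Math.random is an assumption for brevity, not a secure source.
      const index = Math.floor(Math.random() * this.alphabet.length);
      id += this.alphabet.charAt(index);
    }
    return `${this.prefix}${id}`;
  }
}

// Hypothetical configuration; the actual alphabet, length, and prefix are not shown in the review.
const ExampleRunnerId = new ExampleIdGenerator({
  alphabet: "123456789abcdefghijkmnopqrstuvwxyz",
  length: 20,
  prefix: "runner_",
});
```

Exporting a single configured instance, as the PR does with RunnerId, keeps every caller on the same format without repeating the configuration.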

apps/supervisor/src/workloadServer/index.ts (2)

203-213: Improved check for checkpoint service availability

Checking for the existence of checkpointClient before proceeding with the request is a good practice. It provides a clear error message when the service is unavailable.


215-231: Better structured error handling flow

Moving the runnerId check after the checkpointClient check improves the flow of the handler by validating prerequisites first. This provides more specific error messages for different failure scenarios.
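
The ordering can be illustrated with a small pure function. The status codes, error strings, and the name `checkSuspendPrerequisites` are assumptions chosen for the sketch and may differ from the handler in this PR:

```typescript
type SuspendCheck =
  | { ok: true; runnerId: string }
  | { ok: false; status: number; error: string };

// Hypothetical helper: validates service availability first, then request data.
function checkSuspendPrerequisites(
  checkpointClient: object | undefined,
  runnerId: string | undefined
): SuspendCheck {
  if (!checkpointClient) {
    // Assumed status code: the service is not configured, independent of the request.
    return { ok: false, status: 501, error: "Checkpoints disabled" };
  }
  if (!runnerId) {
    // Assumed status code: the request itself is incomplete.
    return { ok: false, status: 400, error: "Missing runner ID" };
  }
  return { ok: true, runnerId };
}
```

Checking the client first means a caller sees the "service unavailable" reason even when the request is also malformed, which is usually the more actionable message.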

packages/core/src/v3/apiClient/core.ts (2)

696-701: Well-defined API result type

The ApiResult type provides a clear discriminated union for API responses, making error handling more consistent throughout the codebase.


703-743: Centralized API request wrapper with robust error handling

The wrapZodFetch function centralizes API error handling logic and provides a consistent return format. The implementation handles different error types appropriately and includes sensible retry configurations.
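
To make the discriminated-union idea concrete, here is a simplified synchronous sketch. The field names (`success`, `data`, `error`) and the helpers `toApiResult` and `summarize` are assumptions for illustration; the actual wrapZodFetch is asynchronous and adds retries and error classification on top:

```typescript
// Assumed shape of the result type; the `success` flag is the discriminant.
type ApiResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Hypothetical wrapper: converts a thrown error into a failed result.
function toApiResult<T>(fn: () => T): ApiResult<T> {
  try {
    return { success: true, data: fn() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { success: false, error: message };
  }
}

// Narrowing on `success` gives type-safe access to `data` or `error`.
function summarize<T>(result: ApiResult<T>): string {
  if (result.success) {
    return `ok: ${JSON.stringify(result.data)}`;
  }
  return `failed: ${result.error}`;
}
```

Callers never need a try/catch of their own: they branch on `success`, and the compiler rejects access to `data` on the failure branch.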

packages/core/src/v3/runEngineWorker/workload/http.ts (3)

17-17: Good centralization of API wrapper function

Importing the centralized wrapZodFetch function reduces code duplication and ensures consistent error handling across the codebase.


22-22: API URL made modifiable to support dynamic configuration

Changing apiUrl from readonly to modifiable enables updating connection details at runtime, supporting the dynamic warm start configuration mentioned in the PR objectives.


40-42: Clean implementation of URL update method

The updateApiUrl method correctly handles trailing slashes for consistency with the constructor's implementation. This supports the PR objective of updating configuration settings after initialization.
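
A minimal sketch of the idea, assuming that "handling trailing slashes" means stripping them so paths can be appended uniformly. The class `ExampleHttpClient` and its `urlFor` method are hypothetical:

```typescript
// Hypothetical client; only illustrates sharing one normalization step.
class ExampleHttpClient {
  private apiUrl: string;

  constructor(apiUrl: string) {
    this.apiUrl = ExampleHttpClient.normalize(apiUrl);
  }

  // Reuses the constructor's normalization so both code paths agree.
  updateApiUrl(apiUrl: string): void {
    this.apiUrl = ExampleHttpClient.normalize(apiUrl);
  }

  urlFor(path: string): string {
    return `${this.apiUrl}/${path.replace(/^\/+/, "")}`;
  }

  // Assumption: trailing slashes are removed rather than added.
  private static normalize(url: string): string {
    return url.replace(/\/+$/, "");
  }
}
```

Routing both the constructor and the update method through one helper avoids URLs with doubled or missing slashes after a restore changes the endpoint.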

packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)

21-21: No issues with the new import statement.
The wrapZodFetch usage is consistent across the file, and there are no apparent integration concerns.

apps/supervisor/src/index.ts (4)

133-137: Good defensive check for missing checkpoint client.
Returning early is a safe approach to prevent further issues if checkpointClient is unavailable.


139-139: Ensure concurrency scenarios are properly handled.
While restoreRun is invoked here, consider verifying or testing concurrency scenarios to ensure that multiple concurrent restore attempts for the same run do not result in inconsistent states.

Would you like me to draft a script to scan for any concurrency checks around restoreRun usage in the codebase?


225-254: Robust warm start fetch logic.
The try block properly handles the response and logs errors. The fallback returning false ensures clean handling of failures. No issues detected.


255-258: Catch block ensures proper logging and fallback.
This block neatly captures any error during warm start attempts, aiding in debugging. The return false; is a safe fallback to maintain stability.

apps/supervisor/src/workloadManager/kubernetes.ts (1)

7-7: Centralizing RunnerId import.
Importing from @trigger.dev/core/v3/isomorphic helps standardize unique ID generation, improving consistency across the codebase.

packages/core/src/v3/workers/warmStartClient.ts (3)

1-14: Clean introduction of the warm start client.
The new imports and WarmStartClientOptions type create a clear foundation for managing warm starts.


33-53: Proper use of schema validation and retry logic.
The connect method effectively combines zod-based checks with exponential backoff to handle transient network issues.
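
For context, the delay calculation behind exponential backoff can be sketched as follows. The option names and values are illustrative assumptions, and jitter is omitted for clarity; the retry settings actually used by the client are not shown in this review:

```typescript
// Hypothetical option names; the real retry configuration may differ.
type BackoffOptions = {
  minDelayMs: number;
  maxDelayMs: number;
  factor: number;
};

// Delay before the given retry attempt (1-based), capped at maxDelayMs.
function backoffDelayMs(attempt: number, opts: BackoffOptions): number {
  const exponent = Math.max(1, attempt) - 1;
  const raw = opts.minDelayMs * Math.pow(opts.factor, exponent);
  return Math.min(opts.maxDelayMs, raw);
}

// Full schedule of delays for a bounded number of attempts.
function backoffSchedule(maxAttempts: number, opts: BackoffOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    delays.push(backoffDelayMs(attempt, opts));
  }
  return delays;
}
```

With `minDelayMs: 200`, `maxDelayMs: 2000`, and `factor: 2`, the first five delays are 200, 400, 800, 1600, and 2000 ms; the cap keeps later retries from waiting indefinitely long.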


55-97: Graceful fallback when a warm start is not available.
Logging failures and returning null keeps error handling contained and prevents unhandled rejections.

packages/cli-v3/src/entryPoints/managed-run-controller.ts (7)

16-16: No functional concerns

The new import of WarmStartClient looks good and is used consistently in the rest of the code.


50-50: Optional metadata URL well-integrated

TRIGGER_METADATA_URL as an optional environment variable introduces a flexible mechanism for retrieving overrides without breaking existing functionality.


52-58: Double-check environment variable defaults

Some fields like TRIGGER_WORKER_INSTANCE_NAME lack a default value. This is acceptable if you guarantee they're always set externally. Otherwise, consider providing safe defaults or making them optional with fallback logic.

Do you want me to verify references to TRIGGER_WORKER_INSTANCE_NAME to ensure it's consistently defined?


111-112: Optional injection of clients is handled well

Defining warmStartClient and metadataClient as optional is appropriate since they're conditionally created. Access guards in the code appear to handle undefined cases properly.


117-123: Ensure concurrency safety for runtime updates

heartbeatIntervalSeconds, snapshotPollIntervalSeconds, workerApiUrl, and workerInstanceName can all be changed while the controller runs. If they’re updated concurrently, consider using locks or ensuring safe usage to avoid race conditions.

Do you want concurrency tests or a verification script to confirm these properties won't be read/written simultaneously?


531-560: Confirm no sensitive data is logged

When calling this.httpClient.sendDebugLog with suspendResult.error, please ensure it doesn't contain sensitive or personal data. If there is any risk, sanitize or mask those fields.
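
One way to mask fields before logging is a recursive redaction pass over the payload. The helper `redactFields` and the key names used below are hypothetical; which fields count as sensitive in this codebase is not stated in the review:

```typescript
// Hypothetical helper: replaces values of sensitive keys at any depth.
// Keys are compared case-insensitively against a lowercase set.
function redactFields(value: unknown, sensitiveKeys: ReadonlySet<string>): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactFields(item, sensitiveKeys));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = sensitiveKeys.has(key.toLowerCase())
        ? "[REDACTED]"
        : redactFields(inner, sensitiveKeys);
    }
    return out;
  }
  // Primitives are returned unchanged.
  return value;
}
```

Note that `Object.entries` only sees enumerable properties, so an `Error` instance would need to be converted to a plain object (for example `{ name, message }`) before being passed through a helper like this.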


46-46: Verify removal of old environment variable references

✅ Verification successful

You renamed TRIGGER_WARM_START_TOTAL_DURATION_MS to TRIGGER_WARM_START_KEEPALIVE_MS. Please ensure no references to the old variable remain in other files.


🏁 Script executed:

#!/bin/bash
# Search for the old environment variable name in the codebase.
rg 'TRIGGER_WARM_START_TOTAL_DURATION_MS'

Length of output: 41


I'm generating an additional verification script to further check for any lingering occurrences of the old environment variable. Please run the script below to confirm that "TRIGGER_WARM_START_TOTAL_DURATION_MS" does not appear anywhere in the codebase.

If both commands produce no results, you can safely conclude the old variable reference has been completely removed.


🏁 Script executed:

#!/bin/bash
echo "Searching using ripgrep:"
rg -n "TRIGGER_WARM_START_TOTAL_DURATION_MS" .

echo -e "\nSearching using grep recursively:"
grep -R "TRIGGER_WARM_START_TOTAL_DURATION_MS" .

Length of output: 426


Verified Removal of Old Environment Variable Reference

The additional searches using both ripgrep and grep confirmed that there are no instances of TRIGGER_WARM_START_TOTAL_DURATION_MS remaining in the repository. The renaming to TRIGGER_WARM_START_KEEPALIVE_MS in packages/cli-v3/src/entryPoints/managed-run-controller.ts has been applied correctly across the codebase.

@nicktrn nicktrn merged commit 7842e9d into main Mar 14, 2025
12 checks passed
@nicktrn nicktrn deleted the fix/supervisor-misc branch March 14, 2025 18:37