-
-
Notifications
You must be signed in to change notification settings - Fork 867
Warm start and restore improvements #1793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
WalkthroughThis update introduces robust logging, improved error handling, and changes to how unique identifiers are generated and imported. Several modules now validate the presence of essential services (like a checkpoint client) before proceeding. The warm start functionality has been refactored with a dedicated client and associated environment configuration, altering long polling and metadata processing. Additionally, custom API request handling code has been removed or centralized across both CLI and core modules. Several imports now point to a centralized package to standardize identifier creation. Changes
Sequence Diagram(s)sequenceDiagram
participant MRC as ManagedRunController
participant MC as MetadataClient
participant WH as WarmStartClient
participant WS as Warm Start Service
MRC->>MC: Fetch environment overrides
MC-->>MRC: Return env overrides
MRC->>WH: Initiate connection (GET /connect)
WH->>WS: GET /connect
WS-->>WH: Respond with WarmStartConnectResponse
MRC->>WH: Start warm start process (long polling /warm-start)
WH->>WS: Long poll /warm-start
WS-->>WH: Return dequeued message or error
WH-->>MRC: Deliver warm start response
Possibly related PRs
Suggested reviewers
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure. 🔧 ESLint
Scope: all 2 workspace projects Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (10)
apps/supervisor/src/workloadManager/docker.ts (1)
10-10: Well-structured import for Docker utilities.Importing
getDockerHostDomaindirectly from../util.jshighlights a clear separation of concerns, keeping the Docker-specific logic encapsulated. Make sure that the utility handles network edge cases, such as missing or undefinedenv.DOCKER_NETWORK.apps/supervisor/src/index.ts (1)
90-93: Consider validating the checkpoint URL.
While logging the existence of the checkpoint URL is helpful, it might be useful to confirm that it is a valid URL before proceeding, to avoid potential runtime errors.apps/supervisor/src/workloadManager/kubernetes.ts (1)
58-58: Container name is now static.
Using a single, invariant name"run-controller"for every container may cause naming collisions if multiple pods or containers run concurrently. Consider adding a unique suffix to avoid potential conflicts.- name: "run-controller", + name: `run-controller-${runnerId}`,packages/core/src/v3/workers/warmStartClient.ts (1)
150-159: Consider removing superfluous continue statements.
The static analysis suggests that thesecontinuestatements are unnecessary as there is no code after them within the loop body. Removing them clarifies flow control.150 if (error instanceof Error && error.name === "AbortError") { 151 this.logger.log("Long poll request timed out, retrying..."); -152 continue; 153 } else { 154 this.logger.error("Error during fetch, retrying...", { error }); ... -159 continue; 160 }🧰 Tools
🪛 Biome (1.9.4)
[error] 152-152: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 159-159: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
packages/cli-v3/src/entryPoints/managed-run-controller.ts (6)
78-85: Prevent drift by using partial environment schemasDefining a parallel
Metadatatype can risk subtle mismatches with your originalEnvschema. Consider reusing a partial subset ofEnv(e.g., a new Zod schema) for consistency and easier maintenance.
87-103: Improve error handling and schema validation
MetadataClientfetches JSON from the metadata URL but doesn't validate the result. Consider using Zod for strict type checking and switch tologgerinstead ofconsole.errorto keep logging consistent.
135-243: Large constructor - consider splitting setupThe constructor initializes many services, pollers, and signal handlers. You could refactor these into dedicated methods or helper classes to reduce complexity. Also note the mix of
console.*andlogger.*usage—adopting a single logging approach might simplify debugging.
819-822: Graceful fallback for warm start unavailabilityIf
this.warmStartClientis missing, you callexitProcess(0). Consider allowing a cold-start fallback or introducing a retry mechanism if partial warm start usage is acceptable.
890-893: Use caution with immediate process termination
process.exit()might skip ongoing async tasks like log flushing. Consider a brief delay or a graceful shutdown routine to ensure logs and crucial I/O are fully completed first.
1115-1115: Consider making forced shutdown configurableExiting the process after 5 minutes (
exitProcess(1)) can prevent debugging and final logging tasks from running. If this timeout is only for testing, you might want a configuration toggle for production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
apps/supervisor/src/index.ts(3 hunks)apps/supervisor/src/util.ts(0 hunks)apps/supervisor/src/workloadManager/docker.ts(1 hunks)apps/supervisor/src/workloadManager/kubernetes.ts(2 hunks)apps/supervisor/src/workloadServer/index.ts(1 hunks)packages/cli-v3/src/apiClient.ts(1 hunks)packages/cli-v3/src/entryPoints/managed-run-controller.ts(11 hunks)packages/core/src/v3/apiClient/core.ts(1 hunks)packages/core/src/v3/isomorphic/friendlyId.ts(1 hunks)packages/core/src/v3/runEngineWorker/supervisor/http.ts(1 hunks)packages/core/src/v3/runEngineWorker/workload/http.ts(2 hunks)packages/core/src/v3/schemas/index.ts(1 hunks)packages/core/src/v3/schemas/warmStart.ts(1 hunks)packages/core/src/v3/serverOnly/httpServer.ts(1 hunks)packages/core/src/v3/utils/heartbeat.ts(1 hunks)packages/core/src/v3/workers/index.ts(1 hunks)packages/core/src/v3/workers/warmStartClient.ts(1 hunks)
💤 Files with no reviewable changes (1)
- apps/supervisor/src/util.ts
🧰 Additional context used
🪛 Biome (1.9.4)
packages/core/src/v3/workers/warmStartClient.ts
[error] 152-152: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 159-159: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (32)
packages/core/src/v3/schemas/index.ts (1)
16-16: Added warm start functionality exportGood addition of the new export statement that makes the warm start functionality available through the index file. This aligns with the PR objective of making the warm start configuration dynamic.
packages/core/src/v3/utils/heartbeat.ts (1)
57-60: Good implementation of dynamic heartbeat interval updateThis new method allows for updating the heartbeat interval after initialization, which aligns with the PR objective of making configurations updatable after the initialization process. The implementation correctly updates the private property and resets the current interval.
packages/core/src/v3/workers/index.ts (1)
22-22: Properly exported WarmStartClientGood addition of the export for the WarmStartClient and its options type. This makes the warm start functionality accessible to other modules and supports the PR objective of implementing warm start improvements.
packages/core/src/v3/serverOnly/httpServer.ts (1)
19-19:❓ Verification inconclusive
Enhanced RouteHandler return type flexibility
Changing the return type from
Promise<void>toPromise<any>makes the API more flexible by allowing route handlers to return values that may be used elsewhere. This change supports the PR objective of facilitating updates to configuration settings.Consider verifying that existing route handlers don't rely on the return value being void.
🏁 Script executed:
#!/bin/bash # Find places where RouteHandler is used and check if they expect void return type # Look for files where RouteHandler is imported or used rg -l "RouteHandler" --type=ts # Look for route handler implementations that might be returning values rg "handler: async \(\{.*\}\) => \{.*return" --type=tsLength of output: 193
Enhanced RouteHandler Return Type Verification
Based on our search results:
- The term
RouteHandlerappears only in:
packages/core/src/v3/serverOnly/httpServer.tsreferences/nextjs-realtime/src/app/api/uploadthing/route.ts- Our search for asynchronous implementations returning values (using the pattern
handler: async ({ ... }) => { ... return) produced no matches, indicating that none of the implementations explicitly leverage a return value beyondvoid.The change from
Promise<void>toPromise<any>appears safe. However, please manually verify inreferences/nextjs-realtime/src/app/api/uploadthing/route.ts(and any other potential caller not captured by our regex) that none of the route handlers depend on the previous void return type.packages/core/src/v3/schemas/warmStart.ts (1)
3-8: Looks good!The schema and inferred type naming pattern is a common approach with Zod and TypeScript; reusing the same name for the schema constant and the type alias is acceptable and keeps the code straightforward.
apps/supervisor/src/workloadManager/docker.ts (1)
2-2: Centralized import ofRunnerIdis consistent.Shifting
RunnerIdto a dedicated package helps centralize ID generation logic and maintain consistency across modules. This is a beneficial change, especially for large codebases where ID generation must remain uniform.packages/cli-v3/src/apiClient.ts (1)
35-35: Ensure consistency with newly introduced imports.Importing
ApiResult,wrapZodFetch, andzodfetchSSEfrom@trigger.dev/core/v3/zodfetchis aligning with the rest of the file's usage of these utilities. Given the file’s continued reliance onApiResultfor typed responses, this import seems appropriate. Keep an eye on version updates forzodfetchto maintain compatibility in the future.packages/core/src/v3/isomorphic/friendlyId.ts (2)
94-108: Well-structured ID generation classThe
IdGeneratorclass is well-designed with proper encapsulation through private fields and a clear generation method. This provides a consistent way to generate IDs throughout the codebase.
110-114: Good standardization for runner identifiersCreating a standardized
RunnerIdgenerator with consistent prefix and alphabet is a good practice. This centralizes runner ID generation logic and ensures consistency across the codebase.apps/supervisor/src/workloadServer/index.ts (2)
203-213: Improved check for checkpoint service availabilityChecking for the existence of
checkpointClientbefore proceeding with the request is a good practice. It provides a clear error message when the service is unavailable.
215-231: Better structured error handling flowMoving the
runnerIdcheck after thecheckpointClientcheck improves the flow of the handler by validating prerequisites first. This provides more specific error messages for different failure scenarios.packages/core/src/v3/apiClient/core.ts (2)
696-701: Well-defined API result typeThe
ApiResulttype provides a clear discriminated union for API responses, making error handling more consistent throughout the codebase.
703-743: Centralized API request wrapper with robust error handlingThe
wrapZodFetchfunction centralizes API error handling logic and provides a consistent return format. The implementation handles different error types appropriately and includes sensible retry configurations.packages/core/src/v3/runEngineWorker/workload/http.ts (3)
17-17: Good centralization of API wrapper functionImporting the centralized
wrapZodFetchfunction reduces code duplication and ensures consistent error handling across the codebase.
22-22: API URL made modifiable to support dynamic configurationChanging
apiUrlfromreadonlyto modifiable enables updating connection details at runtime, supporting the dynamic warm start configuration mentioned in the PR objectives.
40-42: Clean implementation of URL update methodThe
updateApiUrlmethod correctly handles trailing slashes for consistency with the constructor's implementation. This supports the PR objective of updating configuration settings after initialization.packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)
21-21: No issues with the new import statement.
ThewrapZodFetchusage is consistent across the file, and there are no apparent integration concerns.apps/supervisor/src/index.ts (4)
133-137: Good defensive check for missing checkpoint client.
Returning early is a safe approach to prevent further issues ifcheckpointClientis unavailable.
139-139: Ensure concurrency scenarios are properly handled.
WhilerestoreRunis invoked here, consider verifying or testing concurrency scenarios to ensure that multiple concurrent restore attempts for the same run do not result in inconsistent states.Would you like me to draft a script to scan for any concurrency checks around
restoreRunusage in the codebase?
225-254: Robust warm start fetch logic.
The try block properly handles the response and logs errors. The fallback returningfalseensures clean handling of failures. No issues detected.
255-258: Catch block ensures proper logging and fallback.
This block neatly captures any error during warm start attempts, aiding in debugging. Thereturn false;is a safe fallback to maintain stability.apps/supervisor/src/workloadManager/kubernetes.ts (1)
7-7: Centralizing RunnerId import.
Importing from@trigger.dev/core/v3/isomorphichelps standardize unique ID generation, improving consistency across the codebase.packages/core/src/v3/workers/warmStartClient.ts (3)
1-14: Clean introduction of the warm start client.
The new imports andWarmStartClientOptionstype create a clear foundation for managing warm starts.
33-53: Proper use of schema validation and retry logic.
Theconnectmethod effectively combineszod-based checks with exponential backoff to handle transient network issues.
55-97: Graceful fallback when a warm start is not available.
Logging failures and returningnullkeeps error handling contained and prevents unhandled rejections.packages/cli-v3/src/entryPoints/managed-run-controller.ts (7)
16-16: No functional concernsThe new import of
WarmStartClientlooks good and is used consistently in the rest of the code.
50-50: Optional metadata URL well-integrated
TRIGGER_METADATA_URLas an optional environment variable introduces a flexible mechanism for retrieving overrides without breaking existing functionality.
52-58: Double-check environment variable defaultsSome fields like
TRIGGER_WORKER_INSTANCE_NAMElack a default value. This is acceptable if you guarantee they're always set externally. Otherwise, consider providing safe defaults or making them optional with fallback logic.Do you want me to verify references to
TRIGGER_WORKER_INSTANCE_NAMEto ensure it's consistently defined?
111-112: Optional injection of clients is handled wellDefining
warmStartClientandmetadataClientas optional is appropriate since they're conditionally created. Access guards in the code appear to handleundefinedcases properly.
117-123: Ensure concurrency safety for runtime updates
heartbeatIntervalSeconds,snapshotPollIntervalSeconds,workerApiUrl, andworkerInstanceNamecan all be changed while the controller runs. If they’re updated concurrently, consider using locks or ensuring safe usage to avoid race conditions.Do you want concurrency tests or a verification script to confirm these properties won't be read/written simultaneously?
531-560: Confirm no sensitive data is loggedWhen calling
this.httpClient.sendDebugLogwithsuspendResult.error, please ensure it doesn't contain sensitive or personal data. If there is any risk, sanitize or mask those fields.
46-46:✅ Verification successful
Verify removal of old environment variable references
You renamed
TRIGGER_WARM_START_TOTAL_DURATION_MStoTRIGGER_WARM_START_KEEPALIVE_MS. Please ensure no references to the old variable remain in other files.
🏁 Script executed:
#!/bin/bash # Search for the old environment variable name in the codebase. rg 'TRIGGER_WARM_START_TOTAL_DURATION_MS'Length of output: 41
I'm generating an additional verification script to further check for any lingering occurrences of the old environment variable. Please run the script below to confirm that
"TRIGGER_WARM_START_TOTAL_DURATION_MS"does not appear anywhere in the codebase.If both commands produce no results, you can safely conclude the old variable reference has been completely removed.
🏁 Script executed:
#!/bin/bash echo "Searching using ripgrep:" rg -n "TRIGGER_WARM_START_TOTAL_DURATION_MS" . echo -e "\nSearching using grep recursively:" grep -R "TRIGGER_WARM_START_TOTAL_DURATION_MS" .Length of output: 426
Verified Removal of Old Environment Variable Reference
The additional searches using both ripgrep and grep confirmed that there are no instances of
TRIGGER_WARM_START_TOTAL_DURATION_MSremaining in the repository. The renaming toTRIGGER_WARM_START_KEEPALIVE_MSinpackages/cli-v3/src/entryPoints/managed-run-controller.tshas been applied correctly across the codebase.
Many small fixes in this one. The biggest changes here are:
Summary by CodeRabbit
New Features
Bug Fixes
Refactor