-
-
Notifications
You must be signed in to change notification settings - Fork 700
Warm start and restore improvements #1793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
WalkthroughThis update introduces robust logging, improved error handling, and changes to how unique identifiers are generated and imported. Several modules now validate the presence of essential services (like a checkpoint client) before proceeding. The warm start functionality has been refactored with a dedicated client and associated environment configuration, altering long polling and metadata processing. Additionally, custom API request handling code has been removed or centralized across both CLI and core modules. Several imports now point to a centralized package to standardize identifier creation. Changes
Sequence Diagram(s)sequenceDiagram
participant MRC as ManagedRunController
participant MC as MetadataClient
participant WH as WarmStartClient
participant WS as Warm Start Service
MRC->>MC: Fetch environment overrides
MC-->>MRC: Return env overrides
MRC->>WH: Initiate connection (GET /connect)
WH->>WS: GET /connect
WS-->>WH: Respond with WarmStartConnectResponse
MRC->>WH: Start warm start process (long polling /warm-start)
WH->>WS: Long poll /warm-start
WS-->>WH: Return dequeued message or error
WH-->>MRC: Deliver warm start response
Possibly related PRs
Suggested reviewers
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure. 🔧 ESLint
Scope: all 2 workspace projects Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (10)
apps/supervisor/src/workloadManager/docker.ts (1)
10-10
: Well-structured import for Docker utilities.Importing
getDockerHostDomain
directly from../util.js
highlights a clear separation of concerns, keeping the Docker-specific logic encapsulated. Make sure that the utility handles network edge cases, such as missing or undefinedenv.DOCKER_NETWORK
.apps/supervisor/src/index.ts (1)
90-93
: Consider validating the checkpoint URL.
While logging the existence of the checkpoint URL is helpful, it might be useful to confirm that it is a valid URL before proceeding, to avoid potential runtime errors.apps/supervisor/src/workloadManager/kubernetes.ts (1)
58-58
: Container name is now static.
Using a single, invariant name"run-controller"
for every container may cause naming collisions if multiple pods or containers run concurrently. Consider adding a unique suffix to avoid potential conflicts.- name: "run-controller", + name: `run-controller-${runnerId}`,packages/core/src/v3/workers/warmStartClient.ts (1)
150-159
: Consider removing superfluous continue statements.
The static analysis suggests that thesecontinue
statements are unnecessary as there is no code after them within the loop body. Removing them clarifies flow control.150 if (error instanceof Error && error.name === "AbortError") { 151 this.logger.log("Long poll request timed out, retrying..."); -152 continue; 153 } else { 154 this.logger.error("Error during fetch, retrying...", { error }); ... -159 continue; 160 }🧰 Tools
🪛 Biome (1.9.4)
[error] 152-152: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 159-159: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
packages/cli-v3/src/entryPoints/managed-run-controller.ts (6)
78-85
: Prevent drift by using partial environment schemasDefining a parallel
Metadata
type can risk subtle mismatches with your originalEnv
schema. Consider reusing a partial subset ofEnv
(e.g., a new Zod schema) for consistency and easier maintenance.
87-103
: Improve error handling and schema validation
MetadataClient
fetches JSON from the metadata URL but doesn't validate the result. Consider using Zod for strict type checking and switch tologger
instead ofconsole.error
to keep logging consistent.
135-243
: Large constructor - consider splitting setupThe constructor initializes many services, pollers, and signal handlers. You could refactor these into dedicated methods or helper classes to reduce complexity. Also note the mix of
console.*
andlogger.*
usage—adopting a single logging approach might simplify debugging.
819-822
: Graceful fallback for warm start unavailabilityIf
this.warmStartClient
is missing, you callexitProcess(0)
. Consider allowing a cold-start fallback or introducing a retry mechanism if partial warm start usage is acceptable.
890-893
: Use caution with immediate process termination
process.exit()
might skip ongoing async tasks like log flushing. Consider a brief delay or a graceful shutdown routine to ensure logs and crucial I/O are fully completed first.
1115-1115
: Consider making forced shutdown configurableExiting the process after 5 minutes (
exitProcess(1)
) can prevent debugging and final logging tasks from running. If this timeout is only for testing, you might want a configuration toggle for production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
apps/supervisor/src/index.ts
(3 hunks)apps/supervisor/src/util.ts
(0 hunks)apps/supervisor/src/workloadManager/docker.ts
(1 hunks)apps/supervisor/src/workloadManager/kubernetes.ts
(2 hunks)apps/supervisor/src/workloadServer/index.ts
(1 hunks)packages/cli-v3/src/apiClient.ts
(1 hunks)packages/cli-v3/src/entryPoints/managed-run-controller.ts
(11 hunks)packages/core/src/v3/apiClient/core.ts
(1 hunks)packages/core/src/v3/isomorphic/friendlyId.ts
(1 hunks)packages/core/src/v3/runEngineWorker/supervisor/http.ts
(1 hunks)packages/core/src/v3/runEngineWorker/workload/http.ts
(2 hunks)packages/core/src/v3/schemas/index.ts
(1 hunks)packages/core/src/v3/schemas/warmStart.ts
(1 hunks)packages/core/src/v3/serverOnly/httpServer.ts
(1 hunks)packages/core/src/v3/utils/heartbeat.ts
(1 hunks)packages/core/src/v3/workers/index.ts
(1 hunks)packages/core/src/v3/workers/warmStartClient.ts
(1 hunks)
💤 Files with no reviewable changes (1)
- apps/supervisor/src/util.ts
🧰 Additional context used
🪛 Biome (1.9.4)
packages/core/src/v3/workers/warmStartClient.ts
[error] 152-152: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 159-159: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (32)
packages/core/src/v3/schemas/index.ts (1)
16-16
: Added warm start functionality exportGood addition of the new export statement that makes the warm start functionality available through the index file. This aligns with the PR objective of making the warm start configuration dynamic.
packages/core/src/v3/utils/heartbeat.ts (1)
57-60
: Good implementation of dynamic heartbeat interval updateThis new method allows for updating the heartbeat interval after initialization, which aligns with the PR objective of making configurations updatable after the initialization process. The implementation correctly updates the private property and resets the current interval.
packages/core/src/v3/workers/index.ts (1)
22-22
: Properly exported WarmStartClientGood addition of the export for the WarmStartClient and its options type. This makes the warm start functionality accessible to other modules and supports the PR objective of implementing warm start improvements.
packages/core/src/v3/serverOnly/httpServer.ts (1)
19-19
:❓ Verification inconclusive
Enhanced RouteHandler return type flexibility
Changing the return type from
Promise<void>
toPromise<any>
makes the API more flexible by allowing route handlers to return values that may be used elsewhere. This change supports the PR objective of facilitating updates to configuration settings.Consider verifying that existing route handlers don't rely on the return value being void.
🏁 Script executed:
#!/bin/bash # Find places where RouteHandler is used and check if they expect void return type # Look for files where RouteHandler is imported or used rg -l "RouteHandler" --type=ts # Look for route handler implementations that might be returning values rg "handler: async \(\{.*\}\) => \{.*return" --type=tsLength of output: 193
Enhanced RouteHandler Return Type Verification
Based on our search results:
- The term
RouteHandler
appears only in:
packages/core/src/v3/serverOnly/httpServer.ts
references/nextjs-realtime/src/app/api/uploadthing/route.ts
- Our search for asynchronous implementations returning values (using the pattern
handler: async ({ ... }) => { ... return
) produced no matches, indicating that none of the implementations explicitly leverage a return value beyondvoid
.The change from
Promise<void>
toPromise<any>
appears safe. However, please manually verify inreferences/nextjs-realtime/src/app/api/uploadthing/route.ts
(and any other potential caller not captured by our regex) that none of the route handlers depend on the previous void return type.packages/core/src/v3/schemas/warmStart.ts (1)
3-8
: Looks good!The schema and inferred type naming pattern is a common approach with Zod and TypeScript; reusing the same name for the schema constant and the type alias is acceptable and keeps the code straightforward.
apps/supervisor/src/workloadManager/docker.ts (1)
2-2
: Centralized import ofRunnerId
is consistent.Shifting
RunnerId
to a dedicated package helps centralize ID generation logic and maintain consistency across modules. This is a beneficial change, especially for large codebases where ID generation must remain uniform.packages/cli-v3/src/apiClient.ts (1)
35-35
: Ensure consistency with newly introduced imports.Importing
ApiResult
,wrapZodFetch
, andzodfetchSSE
from@trigger.dev/core/v3/zodfetch
is aligning with the rest of the file's usage of these utilities. Given the file’s continued reliance onApiResult
for typed responses, this import seems appropriate. Keep an eye on version updates forzodfetch
to maintain compatibility in the future.packages/core/src/v3/isomorphic/friendlyId.ts (2)
94-108
: Well-structured ID generation classThe
IdGenerator
class is well-designed with proper encapsulation through private fields and a clear generation method. This provides a consistent way to generate IDs throughout the codebase.
110-114
: Good standardization for runner identifiersCreating a standardized
RunnerId
generator with consistent prefix and alphabet is a good practice. This centralizes runner ID generation logic and ensures consistency across the codebase.apps/supervisor/src/workloadServer/index.ts (2)
203-213
: Improved check for checkpoint service availabilityChecking for the existence of
checkpointClient
before proceeding with the request is a good practice. It provides a clear error message when the service is unavailable.
215-231
: Better structured error handling flowMoving the
runnerId
check after thecheckpointClient
check improves the flow of the handler by validating prerequisites first. This provides more specific error messages for different failure scenarios.packages/core/src/v3/apiClient/core.ts (2)
696-701
: Well-defined API result typeThe
ApiResult
type provides a clear discriminated union for API responses, making error handling more consistent throughout the codebase.
703-743
: Centralized API request wrapper with robust error handlingThe
wrapZodFetch
function centralizes API error handling logic and provides a consistent return format. The implementation handles different error types appropriately and includes sensible retry configurations.packages/core/src/v3/runEngineWorker/workload/http.ts (3)
17-17
: Good centralization of API wrapper functionImporting the centralized
wrapZodFetch
function reduces code duplication and ensures consistent error handling across the codebase.
22-22
: API URL made modifiable to support dynamic configurationChanging
apiUrl
fromreadonly
to modifiable enables updating connection details at runtime, supporting the dynamic warm start configuration mentioned in the PR objectives.
40-42
: Clean implementation of URL update methodThe
updateApiUrl
method correctly handles trailing slashes for consistency with the constructor's implementation. This supports the PR objective of updating configuration settings after initialization.packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)
21-21
: No issues with the new import statement.
ThewrapZodFetch
usage is consistent across the file, and there are no apparent integration concerns.apps/supervisor/src/index.ts (4)
133-137
: Good defensive check for missing checkpoint client.
Returning early is a safe approach to prevent further issues ifcheckpointClient
is unavailable.
139-139
: Ensure concurrency scenarios are properly handled.
WhilerestoreRun
is invoked here, consider verifying or testing concurrency scenarios to ensure that multiple concurrent restore attempts for the same run do not result in inconsistent states.Would you like me to draft a script to scan for any concurrency checks around
restoreRun
usage in the codebase?
225-254
: Robust warm start fetch logic.
The try block properly handles the response and logs errors. The fallback returningfalse
ensures clean handling of failures. No issues detected.
255-258
: Catch block ensures proper logging and fallback.
This block neatly captures any error during warm start attempts, aiding in debugging. Thereturn false;
is a safe fallback to maintain stability.apps/supervisor/src/workloadManager/kubernetes.ts (1)
7-7
: Centralizing RunnerId import.
Importing from@trigger.dev/core/v3/isomorphic
helps standardize unique ID generation, improving consistency across the codebase.packages/core/src/v3/workers/warmStartClient.ts (3)
1-14
: Clean introduction of the warm start client.
The new imports andWarmStartClientOptions
type create a clear foundation for managing warm starts.
33-53
: Proper use of schema validation and retry logic.
Theconnect
method effectively combineszod
-based checks with exponential backoff to handle transient network issues.
55-97
: Graceful fallback when a warm start is not available.
Logging failures and returningnull
keeps error handling contained and prevents unhandled rejections.packages/cli-v3/src/entryPoints/managed-run-controller.ts (7)
16-16
: No functional concernsThe new import of
WarmStartClient
looks good and is used consistently in the rest of the code.
50-50
: Optional metadata URL well-integrated
TRIGGER_METADATA_URL
as an optional environment variable introduces a flexible mechanism for retrieving overrides without breaking existing functionality.
52-58
: Double-check environment variable defaultsSome fields like
TRIGGER_WORKER_INSTANCE_NAME
lack a default value. This is acceptable if you guarantee they're always set externally. Otherwise, consider providing safe defaults or making them optional with fallback logic.Do you want me to verify references to
TRIGGER_WORKER_INSTANCE_NAME
to ensure it's consistently defined?
111-112
: Optional injection of clients is handled wellDefining
warmStartClient
andmetadataClient
as optional is appropriate since they're conditionally created. Access guards in the code appear to handleundefined
cases properly.
117-123
: Ensure concurrency safety for runtime updates
heartbeatIntervalSeconds
,snapshotPollIntervalSeconds
,workerApiUrl
, andworkerInstanceName
can all be changed while the controller runs. If they’re updated concurrently, consider using locks or ensuring safe usage to avoid race conditions.Do you want concurrency tests or a verification script to confirm these properties won't be read/written simultaneously?
531-560
: Confirm no sensitive data is loggedWhen calling
this.httpClient.sendDebugLog
withsuspendResult.error
, please ensure it doesn't contain sensitive or personal data. If there is any risk, sanitize or mask those fields.
46-46
:✅ Verification successful
Verify removal of old environment variable references
You renamed
TRIGGER_WARM_START_TOTAL_DURATION_MS
toTRIGGER_WARM_START_KEEPALIVE_MS
. Please ensure no references to the old variable remain in other files.
🏁 Script executed:
#!/bin/bash # Search for the old environment variable name in the codebase. rg 'TRIGGER_WARM_START_TOTAL_DURATION_MS'Length of output: 41
I'm generating an additional verification script to further check for any lingering occurrences of the old environment variable. Please run the script below to confirm that
"TRIGGER_WARM_START_TOTAL_DURATION_MS"
does not appear anywhere in the codebase.If both commands produce no results, you can safely conclude the old variable reference has been completely removed.
🏁 Script executed:
#!/bin/bash echo "Searching using ripgrep:" rg -n "TRIGGER_WARM_START_TOTAL_DURATION_MS" . echo -e "\nSearching using grep recursively:" grep -R "TRIGGER_WARM_START_TOTAL_DURATION_MS" .Length of output: 426
Verified Removal of Old Environment Variable Reference
The additional searches using both ripgrep and grep confirmed that there are no instances of
TRIGGER_WARM_START_TOTAL_DURATION_MS
remaining in the repository. The renaming toTRIGGER_WARM_START_KEEPALIVE_MS
inpackages/cli-v3/src/entryPoints/managed-run-controller.ts
has been applied correctly across the codebase.
Many small fixes in this one. The biggest changes here are:
Summary by CodeRabbit
New Features
Bug Fixes
Refactor