Skip to content

Managed run controller revamp #1927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
Apr 15, 2025
Merged

Managed run controller revamp #1927

merged 40 commits into from
Apr 15, 2025

Conversation

nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Apr 15, 2025

This is a major rework of the managed run controller to improve reliability and performance. It has been tested extensively in our test environment. Unit tests and porting relevant parts to the dev controller will follow.

Bonus:

  • strictPropertyInitialization: true
  • automatic removal of completed docker containers
  • heartbeat service -> interval service
  • small worker notification fix

Summary by CodeRabbit

  • New Features
    • Introduced a robust interval management service for asynchronous operations.
    • Added comprehensive run execution, environment configuration, and logging support for managed worker environments.
    • Enabled Docker container auto-removal configuration via a new environment variable.
    • Added metadata client for environment overrides and snapshot polling services.
  • Improvements
    • Enhanced managed run controller performance, reliability, and warm start optimizations.
    • Improved lifecycle management with refined socket handling and safer initialization.
    • Refined socket disconnection logic to avoid unnecessary disconnects.
  • Deprecations
    • Deprecated legacy heartbeat and dequeue methods in favor of the new interval-based service.
  • Refactor
    • Replaced heartbeat services with interval-based services across multiple modules.
    • Modularized managed run controller by extracting logic into dedicated classes.
    • Updated machine resource type usage for consistency.
  • Chores
    • Increased TypeScript strictness for property initialization.
    • Added release metadata and updated exports for interval utilities.

Copy link

changeset-bot bot commented Apr 15, 2025

🦋 Changeset detected

Latest commit: df845e1

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Contributor

coderabbitai bot commented Apr 15, 2025

Walkthrough

This update introduces a comprehensive refactor of the managed run controller system. The legacy implementation of the ManagedRunController is removed and replaced with a new modular structure, adding dedicated files for environment configuration, execution lifecycle management, heartbeat and polling services, logging, and metadata overrides. The new design leverages an IntervalService for periodic tasks, replaces deprecated heartbeat utilities, and standardizes machine resource typing. Additional changes refine Docker container management, socket connection logic, and TypeScript strictness. Several internal APIs are updated or deprecated, and new types and utilities are introduced to support the improved architecture.

Changes

File(s) Change Summary
packages/cli-v3/src/entryPoints/managed/controller.ts, env.ts, execution.ts, heartbeat.ts, logger.ts, overrides.ts, poller.ts Introduced new modular implementation for managed run controller, including environment config, execution management, heartbeat, polling, logging, and metadata override support.
packages/cli-v3/src/entryPoints/managed-run-controller.ts Removed legacy ManagedRunController implementation; now delegates to the new controller module.
packages/cli-v3/src/executions/taskRunProcess.ts, core/src/v3/machines/index.ts, core/src/v3/schemas/common.ts Standardized machine resource typing by introducing MachinePresetResources and updating all relevant function and type signatures.
packages/cli-v3/src/entryPoints/dev-run-controller.ts, webapp/app/v3/authenticatedSocketConnection.server.ts, core/src/v3/runEngineWorker/supervisor/session.ts, apps/supervisor/src/services/podCleaner.ts Replaced deprecated HeartbeatService with new IntervalService for periodic tasks.
packages/core/src/v3/utils/interval.ts Added new IntervalService class for robust, configurable asynchronous interval management.
packages/core/src/v3/utils/heartbeat.ts Marked HeartbeatService as deprecated in favor of IntervalService.
packages/core/src/v3/index.ts Exported interval utilities from central module index.
packages/cli-v3/e2e/utils.ts Consolidated machine resource definitions into a reusable constant.
apps/supervisor/src/env.ts, index.ts, workloadManager/docker.ts, workloadManager/types.ts Added RUNNER_DOCKER_AUTOREMOVE env var, propagated config, and enabled Docker --rm flag based on config.
apps/supervisor/src/workloadServer/index.ts Refined socket run ID disconnection logic to avoid unnecessary disconnects.
internal-packages/run-engine/src/engine/db/worker.ts Changed query method from findUnique to findFirst; added a fixme comment.
packages/cli-v3/src/dev/devSupervisor.ts Made config and socket properties optional, improved initialization and shutdown safety.
.changeset/tricky-houses-invite.md Added changeset for patch release metadata.
packages/core/src/utils.ts Updated tryCatch utility to accept undefined promises.
packages/core/src/v3/runEngineWorker/supervisor/http.ts, workload/http.ts Added deprecation JSDoc comments to unused methods.
configs/tsconfig.base.json Enabled TypeScript strictPropertyInitialization for stricter property checks.
internal-packages/run-engine/src/engine/systems/batchSystem.ts, runAttemptSystem.ts, ttlSystem.ts, waitpointSystem.ts Changed Prisma queries from findUnique to findFirst in several places to adjust query semantics.

Sequence Diagram(s)

sequenceDiagram
    participant Worker as Worker Process
    participant ManagedRunController
    participant Env as RunnerEnv
    participant Execution as RunExecution
    participant Supervisor as Supervisor Socket
    participant HTTP as HTTP Client
    participant WarmStart as WarmStart Client
    participant Metadata as MetadataClient

    Worker->>Env: Initialize environment configuration
    Worker->>ManagedRunController: Start controller with manifest and env
    ManagedRunController->>Supervisor: Connect WebSocket
    ManagedRunController->>WarmStart: (If enabled) Connect for warm start
    ManagedRunController->>HTTP: Request next run
    HTTP-->>ManagedRunController: Run execution data
    ManagedRunController->>Execution: Prepare and start run execution
    Execution->>HTTP: Start attempt, send heartbeat, poll snapshot
    Execution->>Supervisor: Subscribe to run notifications
    Supervisor-->>Execution: Run/snapshot updates
    Execution->>Metadata: (If needed) Fetch env overrides
    Execution->>HTTP: Complete run, send results
    Execution-->>ManagedRunController: Notify completion
    ManagedRunController->>Supervisor: Unsubscribe, wait for next run
Loading

Possibly related PRs

  • triggerdotdev/trigger.dev#1870: Refactors the ManagedRunController logic, which is now replaced and extended in this PR.
  • triggerdotdev/trigger.dev#1793: Adds dynamic warm start configuration and metadata service, directly related to the new managed controller's warm start and override logic.
  • triggerdotdev/trigger.dev#1868: Fixes waits, restores, and exit codes in the managed run controller, which overlaps with the refactor and improvements in this PR.

Suggested reviewers

  • ericallam

Poem

In the warren of code, we hop and we play,
Refactoring controllers the rabbit way.
Heartbeats now intervals, skipping along,
Docker cleans up as we sing this song.
With new modules and types, our garden’s in bloom,
Managed runs flourish, no more old gloom!
— 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d0bef9 and df845e1.

📒 Files selected for processing (5)
  • internal-packages/run-engine/src/engine/db/worker.ts (2 hunks)
  • internal-packages/run-engine/src/engine/systems/batchSystem.ts (1 hunks)
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (2 hunks)
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts (1 hunks)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal-packages/run-engine/src/engine/db/worker.ts
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
🔇 Additional comments (6)
internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (2)

142-146: Query method change from findUnique to findFirst

The change from prisma.findUnique to prisma.findFirst with separate filter conditions maintains the same lookup logic while possibly allowing more flexibility in query patterns.


1200-1212: Consistent query pattern applied to getAuthenticatedEnvironmentFromRun

This change aligns with the pattern applied throughout the codebase, replacing findUnique with findFirst while maintaining the same filtering logic by ID.

internal-packages/run-engine/src/engine/systems/ttlSystem.ts (1)

36-36: Query method standardization

Changing from findUnique to findFirst is consistent with the pattern applied across all system files in this PR.

internal-packages/run-engine/src/engine/systems/batchSystem.ts (1)

37-45: Standardized database query approach

The change from findUnique to findFirst maintains the same query logic while aligning with the consistent pattern applied across the codebase in this PR.

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2)

162-167: Consistent query pattern in createDateTimeWaitpoint

Changed from findUnique to findFirst with the same filtering logic, maintaining consistency with the pattern applied throughout the PR.


242-247: Consistent query pattern in createManualWaitpoint

This change aligns with the pattern of replacing findUnique with findFirst that's being systematically applied across the run engine systems.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (20)
packages/core/src/v3/utils/interval.ts (2)

8-25: Consider consistent naming convention for private members

The class mixes two different private member naming conventions: some members use the underscore prefix (_onInterval), while some methods use the private field syntax (#doInterval). Consider standardizing on one approach for better maintainability.

-  private _onInterval: () => Promise<void>;
-  private _onError?: (error: unknown) => Promise<void>;
-
-  private _intervalMs: number;
-  private _nextInterval: NodeJS.Timeout | undefined;
-  private _leadingEdge: boolean;
-  private _isEnabled: boolean;
+  #onInterval: () => Promise<void>;
+  #onError?: (error: unknown) => Promise<void>;
+
+  #intervalMs: number;
+  #nextInterval: NodeJS.Timeout | undefined;
+  #leadingEdge: boolean;
+  #isEnabled: boolean;

Then update all references accordingly throughout the class.


64-84: Consider more robust error handling strategy

The error handling in #doInterval catches and processes errors from the main callback, but only logs errors from the error handler itself. Consider defining a clearer error handling strategy for nested errors.

  try {
    await this._onInterval();
  } catch (error) {
    if (this._onError) {
      try {
        await this._onError(error);
-      } catch (error) {
-        console.error("Error during interval error handler", error);
+      } catch (handlerError) {
+        console.error("Error during interval error handler", {
+          originalError: error,
+          handlerError,
+        });
      }
    }
  }
internal-packages/run-engine/src/engine/db/worker.ts (1)

267-267: Make the TODO comment more descriptive

The generic "TODO: fixme" comment doesn't provide any context about what needs to be fixed or why. Consider adding more details to make it actionable for future developers.

-  // TODO: fixme
+  // TODO: Improve worker retrieval from promoted deployment - current method may not handle all edge cases
packages/core/src/v3/runEngineWorker/workload/http.ts (1)

168-168: Consider including replacement information in deprecation notice

While the deprecation notice is good, it would be more helpful if it also directed developers to the recommended alternative (if one exists).

-  /** @deprecated Not currently used */
+  /** @deprecated Not currently used. Use RunExecution and RunExecutionHeartbeat services instead. */
apps/supervisor/src/workloadServer/index.ts (1)

455-455: Optimized socket connection behavior

This change prevents unnecessary disconnections when the same run ID is being reconnected. Now, the socket only disconnects an existing run if the new run ID is different from the currently connected one.

packages/cli-v3/src/entryPoints/managed/heartbeat.ts (1)

81-83: Inconsistent interval units between methods.

The updateInterval method accepts intervalMs (milliseconds) while the constructor uses heartbeatIntervalSeconds (seconds). This inconsistency could lead to confusion when calling the update method.

-updateInterval(intervalMs: number) {
+updateInterval(intervalSeconds: number) {
+  const intervalMs = intervalSeconds * 1000;
   this.heartbeat.updateInterval(intervalMs);
 }
packages/cli-v3/src/entryPoints/managed/logger.ts (1)

1-52: Well-structured new RunLogger class

This new RunLogger class is well-organized and provides a centralized approach to debug logging for run executions. The implementation includes proper context enrichment with run identifiers.

Consider adding error handling for the HTTP request in the sendDebugLog method, as it's currently fire-and-forget:

 sendDebugLog({ runId, message, date, properties }: SendDebugLogOptions) {
   if (!runId) {
     runId = this.env.TRIGGER_RUN_ID;
   }

   if (!runId) {
     return;
   }

   const mergedProperties = {
     ...properties,
     runId,
     runnerId: this.env.TRIGGER_RUNNER_ID,
     workerName: this.env.TRIGGER_WORKER_INSTANCE_NAME,
   };

   console.log(message, mergedProperties);

-  this.httpClient.sendDebugLog(runId, {
+  this.httpClient.sendDebugLog(runId, {
     message,
     time: date ?? new Date(),
     properties: mergedProperties,
-  });
+  }).catch(error => {
+    console.error("Failed to send debug log", { error, runId, message });
+  });
 }
packages/cli-v3/src/entryPoints/managed-run-controller.ts (1)

9-12: Validate environment usage consistency with the new configuration class.

Currently, env: stdEnv is passed directly to ManagedRunController. Consider migrating to a stricter typed object (e.g., RunnerEnv) if you need thorough validation or override capabilities, ensuring parity with other modules that leverage the new environment handling.

packages/cli-v3/src/entryPoints/managed/env.ts (1)

168-208: Override logic is straightforward; consider a single object approach if expansions become frequent.

The current pattern individually updates each environment variable. If you foresee more fields in the future, consider a single-iteration approach or a dynamic patch mechanism for simpler code handling.

packages/cli-v3/src/entryPoints/managed/poller.ts (2)

42-99: Consider handling repeated failures with a fallback strategy.

If repeated polls fail for an extended period, you might implement an exponential backoff or a max retry limit to minimize excessive log noise and repeated overhead.


102-120: Add tests for these methods.

Methods like resetCurrentInterval(), updateSnapshotId(), updateInterval(), start(), and stop() are critical for system flow. Consider adding dedicated unit tests to verify their behavior and catch regressions.

packages/cli-v3/src/dev/devSupervisor.ts (2)

416-416: Consider lazy or offline-safe socket creation.

If the environment does not require a live socket at all times, implementing lazy or retry-based creation may improve resiliency.


478-478: Clarify return of newly created socket.

Consider documenting that #createSocket() always returns a fresh socket instance. If the method is re-invoked, ensure existing connections are gracefully handled.

packages/cli-v3/src/entryPoints/managed/controller.ts (3)

16-19: Add explicit documentation for controller options.

Clarify each field’s usage and behavior in ManagedRunControllerOptions, especially regarding environment assumptions, to ensure new contributors can onboard quickly.


369-379: Process exit may interrupt asynchronous tasks.

Exiting the process directly can halt tasks (e.g., logging or HTTP calls). Consider a graceful shutdown approach, such as emitting an event for cleanup and then calling process.exit.


509-529: Unify immediate run start vs. warm start logic.

You trigger startRunExecution if IDs exist, otherwise you waitForNextRun. Consider a single path with conditions to reduce branching and future maintenance overhead.

packages/cli-v3/src/entryPoints/managed/execution.ts (4)

269-283: Clarify concurrent snapshot updates.

The code places multiple checks for snapshot changes after cleanup and suspend stages. While this handles common race conditions, consider documenting precisely how multiple concurrent snapshot events are queued and processed. Verifying each snapshot event carefully ensures no unexpected skip or override happens partway through the queue.


325-326: Remove or rephrase humor in debug logs.

Logging messages like “Suspending, any day now 🚬” may be confusing in production or user-facing logs. Recommend using clearer, professional messaging to maintain consistent log style.

- this.sendDebugLog("Suspending, any day now 🚬", snapshotMetadata);
+ this.sendDebugLog("Suspending run; awaiting further instructions.", snapshotMetadata);

483-493: Use a more robust pattern to ensure services are stopped.

You repeatedly call this.stopServices() after both success and error scenarios. Consolidating service cleanup in a try/finally block or a single top-level handler can reduce duplication, ensure consistent invocation, and simplify troubleshooting.


49-923: Refactor the RunExecution class for maintainability.

At ~900 lines, this class handles various workflow stages, environment manipulations, concurrency controls, and run attempt logic. Splitting out concerns—e.g., snapshot handling, service control, or run attempt orchestration—into smaller modules will improve readability and make testing easier.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b2eb34 and 2d0bef9.

📒 Files selected for processing (31)
  • .changeset/tricky-houses-invite.md (1 hunks)
  • .configs/tsconfig.base.json (1 hunks)
  • apps/supervisor/src/env.ts (1 hunks)
  • apps/supervisor/src/index.ts (1 hunks)
  • apps/supervisor/src/services/podCleaner.ts (4 hunks)
  • apps/supervisor/src/workloadManager/docker.ts (1 hunks)
  • apps/supervisor/src/workloadManager/types.ts (1 hunks)
  • apps/supervisor/src/workloadServer/index.ts (1 hunks)
  • apps/webapp/app/v3/authenticatedSocketConnection.server.ts (3 hunks)
  • internal-packages/run-engine/src/engine/db/worker.ts (2 hunks)
  • packages/cli-v3/e2e/utils.ts (4 hunks)
  • packages/cli-v3/src/dev/devSupervisor.ts (8 hunks)
  • packages/cli-v3/src/entryPoints/dev-run-controller.ts (5 hunks)
  • packages/cli-v3/src/entryPoints/managed-run-controller.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/controller.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/env.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/execution.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/heartbeat.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/logger.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/overrides.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/poller.ts (1 hunks)
  • packages/cli-v3/src/executions/taskRunProcess.ts (3 hunks)
  • packages/core/src/utils.ts (1 hunks)
  • packages/core/src/v3/index.ts (1 hunks)
  • packages/core/src/v3/machines/index.ts (1 hunks)
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts (1 hunks)
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts (5 hunks)
  • packages/core/src/v3/runEngineWorker/workload/http.ts (1 hunks)
  • packages/core/src/v3/schemas/common.ts (1 hunks)
  • packages/core/src/v3/utils/heartbeat.ts (1 hunks)
  • packages/core/src/v3/utils/interval.ts (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (11)
apps/supervisor/src/env.ts (1)
apps/supervisor/src/envUtil.ts (1)
  • BoolEnv (3-9)
apps/supervisor/src/services/podCleaner.ts (1)
packages/core/src/v3/utils/interval.ts (1)
  • IntervalService (8-95)
packages/cli-v3/src/entryPoints/managed/heartbeat.ts (3)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
  • WorkloadHttpClient (21-181)
packages/cli-v3/src/entryPoints/managed/logger.ts (1)
  • RunLogger (19-52)
packages/core/src/v3/utils/interval.ts (1)
  • IntervalService (8-95)
packages/cli-v3/e2e/utils.ts (1)
packages/core/src/v3/schemas/common.ts (2)
  • MachinePreset (117-122)
  • MachinePreset (124-124)
packages/cli-v3/src/executions/taskRunProcess.ts (1)
packages/core/src/v3/schemas/common.ts (1)
  • MachinePresetResources (126-126)
packages/core/src/v3/machines/index.ts (1)
packages/core/src/v3/schemas/common.ts (1)
  • MachinePresetResources (126-126)
packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)
packages/core/src/v3/utils/interval.ts (1)
  • IntervalService (8-95)
packages/cli-v3/src/entryPoints/managed/env.ts (1)
packages/cli-v3/src/entryPoints/managed/overrides.ts (1)
  • Metadata (1-11)
packages/cli-v3/src/entryPoints/managed/poller.ts (3)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
  • WorkloadHttpClient (21-181)
packages/cli-v3/src/entryPoints/managed/logger.ts (1)
  • RunLogger (19-52)
packages/core/src/v3/utils/interval.ts (1)
  • IntervalService (8-95)
packages/cli-v3/src/entryPoints/managed/logger.ts (2)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
  • WorkloadHttpClient (21-181)
packages/cli-v3/src/entryPoints/managed/env.ts (1)
  • RunnerEnv (54-223)
apps/webapp/app/v3/authenticatedSocketConnection.server.ts (1)
packages/core/src/v3/utils/interval.ts (1)
  • IntervalService (8-95)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: units / 🧪 Unit Tests
🔇 Additional comments (52)
packages/core/src/v3/schemas/common.ts (1)

126-127: Great type refinement!

This new MachinePresetResources type creates a precise subset of MachinePreset with just the resource-related properties. This aligns well with the PR's strict typing improvements.

apps/supervisor/src/env.ts (1)

30-30: Good addition for Docker container cleanup!

Adding RUNNER_DOCKER_AUTOREMOVE with a default of true will help prevent resource leaks by automatically removing completed containers. This is a sensible default that aligns with Docker best practices.

packages/core/src/v3/utils/interval.ts (1)

1-63: Well-designed interval service with robust configuration options!

The new IntervalService class provides a flexible and robust replacement for the deprecated HeartbeatService. The design allows for customizable intervals, error handling, and execution timing.

.configs/tsconfig.base.json (1)

13-13: Great improvement for type safety!

Enabling strictPropertyInitialization ensures all class properties are explicitly initialized, preventing potential runtime errors. This aligns well with the overall goal of enhancing reliability in the managed run controller revamp.

apps/supervisor/src/workloadManager/docker.ts (1)

46-48: Good improvement for Docker resource management

Adding the --rm flag when dockerAutoremove is enabled implements the automatic removal of completed Docker containers mentioned in the PR objectives. This prevents resource leaks from accumulating unused containers and improves system reliability.

apps/supervisor/src/workloadManager/types.ts (1)

13-13: Clean type definition for Docker container cleanup

The addition of the optional dockerAutoremove property to the WorkloadManagerOptions interface provides a type-safe way to configure automatic container removal. This works in conjunction with the implementation in the Docker workload manager and maintains backward compatibility.

.changeset/tricky-houses-invite.md (1)

1-6: Changeset accurately reflects the scope of changes

This changeset correctly specifies patch updates for the affected packages, matching the non-breaking improvements made to the managed run controller. The message is concise but captures the essence of the extensive refactoring work.

internal-packages/run-engine/src/engine/db/worker.ts (1)

196-196:

❓ Verification inconclusive

Query method change might impact uniqueness assumptions

The change from findUnique to findFirst suggests that there might be multiple records with the same worker ID, or that ordering is important. Ensure this change is intentional and aligns with your database schema constraints.


🏁 Script executed:

#!/bin/bash
# Check if there are any unique constraints on backgroundWorker.id 
# and if there are any other usages of findUnique with this ID

# Check for unique constraint on id in model definitions
rg -A 5 -B 5 "model BackgroundWorker" --glob "*.prisma"

# Check for other findUnique calls with worker ID
rg "backgroundWorker.findUnique.*id: workerId" --type ts

Length of output: 2347


Attention: Verify Query Method Semantics for BackgroundWorker.lookup

The change from using findUnique to findFirst merits a manual re‐evaluation to ensure it aligns with the intended filtering logic. The Prisma schema for BackgroundWorker shows that the id field is uniquely constrained, which would normally justify using findUnique. If the query is filtering by this unique id (or any other unique field), consider reverting to findUnique for semantic clarity and to leverage built-in uniqueness guarantees. Conversely, if the filter intentionally targets a non-unique field or a composite condition, then using findFirst is appropriate—but please document the reasoning.

  • Verify that:
    • The filter condition in this query uses a unique field (such as the primary key id or the unique friendlyId) so that using findUnique would be more precise.
    • Or, if filtering by a non-unique field is intentional, confirm that returning the first matching entry is the desired behavior.
packages/core/src/v3/index.ts (1)

70-70: LGTM: Good addition of new interval service exports

This change correctly exposes the new IntervalService functionality, which replaces the deprecated HeartbeatService for managing periodic tasks.

apps/supervisor/src/index.ts (1)

69-69:

✅ Verification successful

Good improvement for Docker container management

Adding the dockerAutoremove option helps prevent resource leaks by automatically removing completed containers. This is a good practice for production environments.


🏁 Script executed:

#!/bin/bash
# Check how this option is used in the Docker workload manager

# Find where the dockerAutoremove flag is being used
rg "dockerAutoremove" --type ts

# Examine Docker run arguments construction
rg "docker run.*--rm" --type ts -A 5 -B 5

Length of output: 1217


Docker Auto-Removal Integration Confirmed

The dockerAutoremove flag is correctly integrated. In apps/supervisor/src/index.ts the flag is assigned from env.RUNNER_DOCKER_AUTOREMOVE, its type is declared in apps/supervisor/src/workloadManager/types.ts, and it is used in the Docker setup logic in apps/supervisor/src/workloadManager/docker.ts. Additionally, tests in packages/core show that Docker is invoked with the --rm flag, reinforcing that auto-removal is properly managed.

This is a good improvement for Docker container management and production environments.

packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)

84-84: Appropriate deprecation notice

Good addition of the deprecation notice for dequeueFromVersion. This aligns with the broader refactoring of the managed run controller system where this method is no longer being used.

packages/core/src/v3/utils/heartbeat.ts (1)

8-10: Clear deprecation with alternative specified

The deprecation notice correctly guides developers to use the new IntervalService instead of HeartbeatService. This is an essential part of a clean migration path.

apps/webapp/app/v3/authenticatedSocketConnection.server.ts (3)

2-4: Updated import for new service

Correctly updated the import to use IntervalService instead of the deprecated HeartbeatService.


22-22: Updated type declaration

Type declaration correctly updated to reflect the new service type.


78-79: Migrated service instantiation

Successfully migrated from HeartbeatService to IntervalService, properly changing the callback property from heartbeat to onInterval while maintaining the same functionality.

packages/cli-v3/src/executions/taskRunProcess.ts (3)

4-4: Type import update aligns with machine resource representation.

The import change from MachinePreset to MachinePresetResources reflects the refactoring to use a more focused type for machine resources.


53-53: Property type updated to use specialized resource type.

This change properly updates the TaskRunProcessOptions interface to use the more specific MachinePresetResources type, which contains only the necessary memory and cpu properties needed for this context.


128-128: Effective local variable renaming maintains backward compatibility.

Renaming the destructured property from machineResources to machine locally is a clean approach that preserves the existing method implementation while adapting to the new naming convention in the options interface.

packages/cli-v3/e2e/utils.ts (4)

11-11: Updated import to support machine resource changes.

Adding the MachinePreset import supports the machine resource definition changes.


299-305: Improved code through resource consolidation.

Creating a centralized machine constant reduces duplication and ensures consistent machine resource configuration throughout the execution flow. The use of the satisfies operator is a good practice that preserves literal type information while ensuring type compatibility.


325-325: Property name updated to align with TaskRunProcess changes.

This line correctly updates the property name from machine to machineResources to match the changes in the TaskRunProcess class.


378-378: Reused machine constant for execution configuration.

Using the same machine constant in the execution configuration ensures consistency and reduces the risk of configuration mismatches.

packages/cli-v3/src/entryPoints/managed/overrides.ts (1)

1-11: Well-structured Metadata type with comprehensive environment properties.

The Metadata type clearly defines the structure for environment-related properties with proper typing, making all fields optional which provides flexibility for partial metadata.

packages/cli-v3/src/entryPoints/managed/heartbeat.ts (5)

1-4: Good dependency imports for heartbeat functionality.

The imports are well-structured, bringing in the required dependencies for interval management, HTTP client functionality, and logging.


5-12: Clear options interface for RunExecutionHeartbeat.

The RunExecutionHeartbeatOptions type clearly defines the required parameters for the heartbeat service.


13-21: Well-structured class properties with clear types.

The class properties are well-defined with appropriate types, and the separation between configuration values and service instances is clear.


22-71: Constructor with comprehensive initialization and logging.

The constructor properly initializes all properties, sets up logging to record the heartbeat configuration, and configures the IntervalService with appropriate callbacks.


73-92: Well-designed public interface for heartbeat management.

The class provides a clean and comprehensive public API for managing the heartbeat functionality, including methods to reset, update, start, and stop the heartbeat service.

apps/supervisor/src/services/podCleaner.ts (1)

4-4: Good job replacing HeartbeatService with IntervalService

The replacement of HeartbeatService with IntervalService is a positive change that aligns with the broader refactoring effort. The IntervalService provides more robust interval management with error handling capabilities.

Also applies to: 22-22, 35-39, 60-60, 64-64

packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)

11-11: Good migration to IntervalService with proper error handling

The replacement of the heartbeat service with IntervalService includes proper error handling and maintains the same functionality. This aligns with the project's goal to improve reliability in the managed run controller.

Also applies to: 32-32, 53-69, 185-185, 199-199

packages/cli-v3/src/entryPoints/dev-run-controller.ts (1)

4-4: Good migration to IntervalService and improved type safety

The changes in this file show a consistent approach to replacing HeartbeatService with IntervalService for both run heartbeat and snapshot polling. The implementation includes proper error handling and maintains the same functionality.

The change from machine to machineResources in the TaskRunProcess constructor properly aligns with the standardized machine resource typing mentioned in the PR objectives.

Also applies to: 47-49, 81-151, 622-622

packages/cli-v3/src/entryPoints/managed-run-controller.ts (2)

3-4: Use of the refactored ManagedRunController import looks good.

Great job delegating the logic to the new module and keeping this file minimal and focused. This separation improves maintainability and clarity.


6-7: Consider adding error handling for JSON retrieval.

If index.json is missing or malformed, readJSONFile could throw an error. For production resilience, wrap this in a try/catch or check whether the file exists before parsing.

Here is an optional snippet:

+try {
  const manifest = await readJSONFile("./index.json");
  const workerManifest = WorkerManifest.parse(manifest);
+} catch (err) {
+  console.error("Failed to read or parse index.json", err);
+  process.exit(1);
+}
packages/core/src/v3/machines/index.ts (3)

1-7: Good move towards narrower MachinePresetResources.

Switching from MachinePreset to MachinePresetResources helps ensure only cpu and memory are passed to these functions. This enhances type safety and avoids handling irrelevant fields.


22-22: Ensure overhead usage is consistent across all references.

Both maxOldSpaceSizeFlag and maxOldSpaceSizeForMachine rely on the same overhead default and logic. Confirm there's no scenario in which overhead or memory is negative and might cause undesired flags.

Also applies to: 26-29


36-36: JSDoc updates reflect the narrower type usage.

Using @param {MachinePresetResources} clarifies that only CPU and memory resources matter here. This consistency across your codebase is beneficial.

Also applies to: 42-42

packages/cli-v3/src/entryPoints/managed/env.ts (3)

6-9: Confirm valid numeric strings for date conversions.

The transformation from string to date using parseInt might fail if the value is not purely numeric or empty. Ensure your environment or upstream code provides valid numeric strings for scheduling timestamps.


38-40: Check for potential initialization issues with date fields.

TRIGGER_POD_SCHEDULED_AT_MS and TRIGGER_DEQUEUED_AT_MS expect valid timestamps. If these are missing, Zod will throw an error. Ensure your deployment or run process always sets these values.


210-222: gatherProcessEnv is minimal; confirm additional env needs.

Only a handful of environment variables (like NODE_ENV and OTEL_EXPORTER_OTLP_ENDPOINT) are passed along. Validate if your tasks need other environment variables (e.g., TRIGGER_PROJECT_ID). If so, extend this helper to avoid missing configurations in subprocesses.

packages/cli-v3/src/entryPoints/managed/poller.ts (2)

5-12: Validate or document Poller options.

Consider clarifying validation requirements or providing defaults for fields like runFriendlyId and snapshotFriendlyId, ensuring they're never empty at runtime.


14-41: Initialization and logging approach looks solid.

Your constructor effectively initializes all required properties and logs helpful metadata. This sets a good example for maintainable code.

packages/cli-v3/src/dev/devSupervisor.ts (5)

52-52: Confirm nullability of config property.

Declaring config as optional implies call sites must handle undefined. Verify all usage carefully or consider throwing earlier if config is mandatory.


96-96: Handle potential socket creation errors.

If #createSocket() fails or returns undefined, your property assignment remains incomplete. Ensure you handle or log errors in socket creation routines.


190-192: Good defensive check for config property.

Throwing an error here ensures you fail fast if config is missing, preventing more obscure downstream issues.


423-423: Validate concurrency on run notifications.

Multiple run notifications could arrive close together. Verify if the code can handle race conditions or partial concurrency (e.g., notifications for the same run).


466-466: Check socket.io version for 'socket.active' support.

In some older socket.io releases or alternative TypeScript definitions, active may not exist. Confirm that this property is indeed reliable.

packages/cli-v3/src/entryPoints/managed/controller.ts (4)

23-88: Verify constructor incomplete-initialization scenarios.

If an error is thrown mid-constructor, partially initialized properties may remain in an inconsistent state. Consider a guarded approach or explicit cleanup for robust error handling.


90-115: Getter-based environment references are well-structured.

Exposing environment-based data through getters clarifies usage and avoids direct mutable access patterns.


499-507: Check partial cancellation edge cases.

If multiple attempts are queued or in partial states, cancelAttempt might not cover them all. Clarify or expand your logic to ensure a safe rollback for partially completed steps.


541-551: Enhanced logging with appended runner metrics looks good.

Attaching additional run metrics in each debug log fosters transparency and helps in post-incident analysis.

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

210-216: Double-check string comparison for snapshot.friendlyId.

You're comparing snapshot.friendlyId and this.currentSnapshotId with < (line 210). If these IDs are strings (e.g., UUIDs, timestamps, or any non-numeric format), the comparison may be lexicographical, not numeric. Confirm that friendlyId is guaranteed to be sequentially sortable as a string format, or convert it to a number (if valid) to ensure correct ordering.


670-714: Ensure test coverage for all run attempt statuses.

Multiple statuses (e.g., RUN_FINISHED, RUN_PENDING_CANCEL, RETRY_QUEUED, etc.) can produce different code paths. Confirm that unit tests or integration tests exist for each case to ensure correct handling of edge conditions, such as unexpected fallback or race conditions.

@nicktrn nicktrn merged commit 3f97cf4 into main Apr 15, 2025
12 checks passed
@nicktrn nicktrn deleted the fix/controller-rework branch April 15, 2025 15:08
matt-aitken pushed a commit that referenced this pull request Apr 16, 2025
* update nypm to support text-based bun lockfiles

* fix retry spans

* only download debug logs if admin

* add nypm changeset

* pull out env override logic

* use runner env gather helper

* handle dev flushing failures gracefully

* fix path normalization for init.ts

* add logger

* add execution heartbeat service

* add snapshot poller service

* fix poller

* add changesets

* create socket in constructor

* enable strictPropertyInitialization

* deprecate dequeue from version

* start is not async

* dependency injection in prep for tests

* add warm start count to all controller logs

* add restore count

* pull out run execution logic

* temp disable pre

* add a controller log when starting an execution

* refactor execution and squash some bugs

* cleanup completed docker containers by default

* execution fixes and logging improvements

* don't throw afet abort cleanup

* poller should use private interval

* rename heartbeat service file

* rename HeartbeatService to IntervalService

* restore old heartbeat service but deprecate it

* use the new interval service everywhere

* Revert "temp disable pre"

This reverts commit e03f417.

* add changeset

* replace all run engine find uniques with find first
matt-aitken added a commit that referenced this pull request Apr 17, 2025
* WIP on secret env vars

* Editing individual env var values is working

* Sort the env vars by the key

* Deleting values

* Allowing setting secret env vars

* Added medium switch style

* Many style changes to the env var form

* “Copy text” -> “Copy”

* Draw a divider between hidden buttons

* Env var tweaks

* Don’t show Dev:you anymore

* Grouping the same env var keys together

* Styles improved

* Improved styling of edit panel

* Fix bun detection, dev flushing, and init command (#1914)

* update nypm to support text-based bun lockfiles

* add nypm changeset

* handle dev flushing failures gracefully

* fix path normalization for init.ts

* add changesets

* chore: remove pre.json after exiting pre mode

* init command to install v4-beta packages

* Revert "chore: remove pre.json after exiting pre mode"

This reverts commit f5694fd.

* make init default to cli version for all packages

* Release 4.0.0-v4-beta.1 (#1916)

* chore: Update version for release (v4-beta)

* Release 4.0.0-v4-beta.1

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: nicktrn <55853254+nicktrn@users.noreply.github.com>

* Both run engines will only lock to versions they can handle (#1922)

* run engine v1 will only lock to v1 deployments

* run engine v2 will only lock to managed v2 deployments

* test: create background worker and deployment with correct engine version

* Add links to and from deployments (#1921)

* link from deployments tasks to filtered runs view

* jump to deployment

* don't add version links for dev (yet)

* Fix current worker deployment getter (#1924)

* only return last v1 deployment in the shared queue consumer

* be explicit about only returning managed deployments

* Add a docs page for the human-in-the-loop example project (#1919)

* Add a docs page for the human-in-the-loop example project

* Order guides, example projects and example tasks alphabetically in the docs list

* Managed run controller revamp (#1927)

* update nypm to support text-based bun lockfiles

* fix retry spans

* only download debug logs if admin

* add nypm changeset

* pull out env override logic

* use runner env gather helper

* handle dev flushing failures gracefully

* fix path normalization for init.ts

* add logger

* add execution heartbeat service

* add snapshot poller service

* fix poller

* add changesets

* create socket in constructor

* enable strictPropertyInitialization

* deprecate dequeue from version

* start is not async

* dependency injection in prep for tests

* add warm start count to all controller logs

* add restore count

* pull out run execution logic

* temp disable pre

* add a controller log when starting an execution

* refactor execution and squash some bugs

* cleanup completed docker containers by default

* execution fixes and logging improvements

* don't throw afet abort cleanup

* poller should use private interval

* rename heartbeat service file

* rename HeartbeatService to IntervalService

* restore old heartbeat service but deprecate it

* use the new interval service everywhere

* Revert "temp disable pre"

This reverts commit e03f417.

* add changeset

* replace all run engine find uniques with find first

* Release 4.0.0-v4-beta.2 (#1928)

* chore: Update version for release (v4-beta)

* Release 4.0.0-v4-beta.2

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: nicktrn <55853254+nicktrn@users.noreply.github.com>

* Remove batch ID carryover for non-batch waits (#1930)

* add failing test case

* do not carry over previous batch id when blocking with waitpoint

* delete irrelevant test

* Delete project (#1913)

* Delete project

- Don’t schedule tasks if the project is deleted
- Delete queues from the master queues
- Add the old delete project UI back in

* Mark the project as deleted last

* Fix for overriding local variable

* Added a todo for deleting env queues

* Remove todo

* Improve usage flushing (#1931)

* add flush to global usage api

* enable controller debug logs

* initialize usage manager after env overrides

* add previous run id to more debug logs

* add changeset

* For secret env vars, don’t return the value

* Added a new env var repository function for getting secrets with redactions

* Test task for env vars

* Delete heartbeat file, merge mess up

---------

Co-authored-by: nicktrn <55853254+nicktrn@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Saadi Myftija <saadi.myftija@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants