Run Engine 2.0 (alpha) #1575
…TimeUseToken, maxDurationInSeconds
Actionable comments posted: 1
♻️ Duplicate comments (1)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)
437-501: ⚠️ Potential issue: Fix switch statement variable scoping issues.
The switch statement has potential variable scoping issues with declarations that can leak between cases.
Wrap each case in block scope to prevent variable declaration leakage:
```diff
 switch (span.entity.type) {
   case "waitpoint": {
+    // Wrap in block to prevent variable declaration leakage
     const waitpoint = await this._replica.waitpoint.findFirst({
       where: {
         friendlyId: span.entity.id,
       },
       // ...
     });

     if (!waitpoint) {
       logger.error(`SpanPresenter: Waitpoint not found`, {
         spanId,
         waitpointFriendlyId: span.entity.id,
       });
       return { ...data, entity: null };
     }

     const output =
       waitpoint.outputType === "application/store"
         ? `/resources/packets/${environmentId}/${waitpoint.output}`
         : typeof waitpoint.output !== "undefined" && waitpoint.output !== null
         ? await prettyPrintPacket(waitpoint.output, waitpoint.outputType ?? undefined)
         : undefined;

     let isTimeout = false;
     if (waitpoint.outputIsError && output) {
       if (isWaitpointOutputTimeout(output)) {
         isTimeout = true;
       }
     }

     return {
       ...data,
       entity: {
         type: "waitpoint" as const,
         object: {
           // ...
         },
       },
     };
+  }
   default:
+    // Validate entity type
+    if (!["waitpoint", "span"].includes(span.entity.type)) {
+      logger.warn(`Unknown entity type: ${span.entity.type}`, { spanId });
+    }
     return { ...data, entity: null };
 }
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 439-455: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 465-470: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
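The rule is easy to reproduce outside the PR. A minimal sketch (hypothetical function, not taken from SpanPresenter) of the scoping hazard Biome flags, and the block-scoped fix it suggests:

```typescript
// Without the braces around each case body, both `const label`
// declarations below would share the switch's single scope and the
// compiler would reject the redeclaration; with braces, each case
// owns its own variables.
function describeEntity(kind: "waitpoint" | "other"): string {
  switch (kind) {
    case "waitpoint": {
      const label = "waitpoint"; // visible only inside this case
      return label;
    }
    default: {
      const label = "unknown entity"; // no conflict with the one above
      return label;
    }
  }
}

console.log(describeEntity("waitpoint")); // waitpoint
console.log(describeEntity("other")); // unknown entity
```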
🧹 Nitpick comments (14)
CONTRIBUTING.md (3)
233-236: Migration Creation Instruction Clarity and Code Block Language Specifier
The updated instruction "Create a migration" along with the command `pnpm run db:migrate:dev:create` clearly separates migration creation from its execution. This change improves clarity by prompting users to review the generated migration before applying it. Additionally, it would be beneficial to add a language specifier (e.g., `sh`) to the fenced code block to enhance readability and conform with markdownlint guidelines.

````diff
- ```
- pnpm run db:migrate:dev:create
- ```
+ ```sh
+ pnpm run db:migrate:dev:create
+ ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
235-235: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
239-240: Improve Indexing Guidance Wording
The note advising the use of `CONCURRENTLY` for adding indexes is crucial to prevent unintended table locks. However, the sentence could be rephrased for better clarity and to address the static analysis feedback regarding punctuation. For example:

```diff
- This creates a migration file. Check the migration file does only what you want. If you're adding any database indexes they must use `CONCURRENTLY`, otherwise they'll lock the table when executed.
+ This creates a migration file. Please review it to ensure it matches your requirements. If you're adding any database indexes, be sure to use `CONCURRENTLY` to avoid locking the table during execution.
```
🧰 Tools
🪛 LanguageTool
[typographical] ~239-~239: The word “otherwise” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...ding any database indexes they must use `CONCURRENTLY`, otherwise they'll lock the table when executed. ...
(THUS_SENTENCE)
241-247: Separation of Migration Execution Steps and Code Block Enhancements
The new step "Run the migration" with the subsequent commands (`pnpm run db:migrate:deploy` and `pnpm run generate`) enhances the instructional clarity by distinctly separating migration creation from its execution. To further improve the document, adding a language specifier to the fenced code block is recommended:

````diff
- ```
- pnpm run db:migrate:deploy
- pnpm run generate
- ```
+ ```sh
+ pnpm run db:migrate:deploy
+ pnpm run generate
+ ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
243-243: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (1)
1076-1284: Well-structured span entity rendering.
The new SpanEntity component cleanly handles different entity types, particularly for waitpoints, with appropriate status displays and controls.
There's an unnecessary Fragment at lines 1235-1252 that could be removed since it only contains one child element. You can simplify this by removing the fragment:

```diff
-{span.entity.object.type === "MANUAL" && (
-  <>
-    <Property.Item>
-      <Property.Label>Timeout at</Property.Label>
-      <Property.Value>
-        <div className="flex w-full flex-wrap items-center justify-between gap-1">
-          {span.entity.object.completedAfter ? (
-            <DateTimeAccurate date={span.entity.object.completedAfter} />
-          ) : (
-            "–"
-          )}
-          {span.entity.object.status === "PENDING" && (
-            <ForceTimeout waitpoint={span.entity.object} />
-          )}
-        </div>
-      </Property.Value>
-    </Property.Item>
-  </>
-)}
+{span.entity.object.type === "MANUAL" && (
+  <Property.Item>
+    <Property.Label>Timeout at</Property.Label>
+    <Property.Value>
+      <div className="flex w-full flex-wrap items-center justify-between gap-1">
+        {span.entity.object.completedAfter ? (
+          <DateTimeAccurate date={span.entity.object.completedAfter} />
+        ) : (
+          "–"
+        )}
+        {span.entity.object.status === "PENDING" && (
+          <ForceTimeout waitpoint={span.entity.object} />
+        )}
+      </div>
+    </Property.Value>
+  </Property.Item>
+)}
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 1235-1252: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
internal-packages/database/prisma/schema.prisma (5)
454-464: New Project Fields for Versioning and Worker Group Relationships
The `Project` model now includes several new fields:
- `version` with a default of `V2`
- `engine` with a default of `V1`
- Associations with `workerGroups`, `workers`, and the optional `defaultWorkerGroup` (plus its id field)

These additions extend the model to support Run Engine versioning and attach worker group information. Please verify that the relationships and default values (especially for run engine versioning) align with downstream logic and that any queries or mutations have been updated accordingly.
1714-1719: Enhanced Queue Fields in TaskRun
A new block has been added in `TaskRun`:
- `masterQueue` (defaulting to `"main"`)
- An optional `secondaryMasterQueue`
- `attemptNumber` (to be defined after a run is dequeued for engine v2+)

These additions will support finer control of queue positioning and tracking of attempts in Run Engine 2.0. Please ensure that the computation and downstream processing of these fields are consistent and that any related indexes or query logic are updated for optimal performance.
1770-1775: Waitpoint Relationships in TaskRun
Two new relationships have been introduced on `TaskRun`:
- `associatedWaitpoint`, which will be marked when a run finishes
- `blockedByWaitpoints`, which holds any waitpoints preventing the run's execution

These changes integrate task run execution with waitpoint handling. It is important to review how these relationships are used in job scheduling; ensure that they do not introduce circular dependencies or unwanted cascade effects.
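As a conceptual sketch of the blocking relationship described here (pure illustration with hypothetical types, not the engine's actual scheduling code): a run may proceed only once every waitpoint blocking it has completed.

```typescript
type Waitpoint = { id: string; status: "PENDING" | "COMPLETED" };

// A run is executable only when none of its blocking waitpoints
// are still pending.
function canExecute(blockedByWaitpoints: Waitpoint[]): boolean {
  return blockedByWaitpoints.every((w) => w.status === "COMPLETED");
}

console.log(canExecute([{ id: "wp_1", status: "COMPLETED" }])); // true
console.log(
  canExecute([
    { id: "wp_1", status: "COMPLETED" },
    { id: "wp_2", status: "PENDING" },
  ])
); // false
```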
1933-1988: TaskRunExecutionSnapshot Model for Run Engine v2
The `TaskRunExecutionSnapshot` model has been added to track execution states. Key points include:
- The `engine` field now defaults to `V2`, signifying that snapshots apply to the new engine.
- It captures execution status, debug description, validity indication, and relationships to `TaskRun`, batch, environment, and optionally a checkpoint and worker.

This model is central for tracking execution state changes; please ensure that maintenance (such as data retention policies) is addressed elsewhere in your system design.
2037-2099: Enhanced Waitpoint Model for Run Blocking and Idempotency
The updated `Waitpoint` model now includes several new fields:
- `idempotencyKey` with an expiration and a flag for user-provided input
- An `inactiveIdempotencyKey` field for when keys are rotated after completion
- Relations to mark completion by a task run or batch
- Indexes ensuring uniqueness on `[environmentId, idempotencyKey]`

These changes support robust waitpoint management. One note: the commented-out property (`deactivateIdempotencyKeyWhenCompleted`) is still present as a comment; if it's intended for future use, consider adding a "TODO" with an issue reference.
apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (1)
19-19: Unused request body parameter.
The `body` parameter in the callback is not referenced. If there is no plan to consume request data here, consider removing it to keep the code clean and avoid confusion.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (4)
45-57: Improve error handling in early return path.
The early return approach for a non-existent `parentRun` silently fails without providing any error detail. Since other code paths have explicit error throws, this inconsistency could make debugging more difficult. Consider replacing the silent return with an appropriate error:

```diff
 if (!parentRun) {
-  return;
+  throw new Error("Parent run not found");
 }
```
217-236: Simplify conditional logic with a cleaner approach.
The current approach creates redundant code between the `finishedData` variable and the existing `finishedAttempt`. The engine-based condition creates additional complexity. Consider consolidating the logic:

```diff
 const finishedData =
-  run.engine === "V2"
-    ? run
-    : isFinished
-    ? await this._replica.taskRunAttempt.findFirst({
-        select: {
-          output: true,
-          outputType: true,
-          error: true,
-        },
-        where: {
-          status: { in: FINAL_ATTEMPT_STATUSES },
-          taskRunId: run.id,
-        },
-        orderBy: {
-          createdAt: "desc",
-        },
-      })
-    : null;
+  run.engine === "V2" ? run : finishedAttempt;
```
457-463: Enhance error handling logic for better traceability.
The current error logging is good, but the return path doesn't adequately capture the error state for the UI layer. Consider enhancing the error return path to provide more context:

```diff
 if (!waitpoint) {
   logger.error(`SpanPresenter: Waitpoint not found`, {
     spanId,
     waitpointFriendlyId: span.entity.id,
   });
-  return { ...data, entity: null };
+  return {
+    ...data,
+    entity: {
+      type: "waitpoint" as const,
+      error: {
+        message: "Waitpoint not found",
+        code: "NOT_FOUND"
+      },
+      object: null
+    }
+  };
 }
```
472-477: Simplify timeout detection logic.
The current approach uses a variable assignment and conditional check that could be consolidated. Consider simplifying the timeout detection:

```diff
-let isTimeout = false;
-if (waitpoint.outputIsError && output) {
-  if (isWaitpointOutputTimeout(output)) {
-    isTimeout = true;
-  }
-}
+const isTimeout =
+  waitpoint.outputIsError && output ? isWaitpointOutputTimeout(output) : false;
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- CONTRIBUTING.md (1 hunks)
- apps/coordinator/src/checkpointer.ts (2 hunks)
- apps/docker-provider/src/index.ts (1 hunks)
- apps/webapp/app/assets/icons/AnimatedHourglassIcon.tsx (1 hunks)
- apps/webapp/app/presenters/v3/SpanPresenter.server.ts (13 hunks)
- apps/webapp/app/routes/api.v1.waitpoints.tokens.$waitpointFriendlyId.complete.ts (1 hunks)
- apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (11 hunks)
- internal-packages/database/package.json (1 hunks)
- internal-packages/database/prisma/schema.prisma (19 hunks)
- internal-packages/redis-worker/src/queue.ts (1 hunks)
- internal-packages/run-engine/src/engine/tests/waitpoints.test.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- apps/docker-provider/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- internal-packages/database/package.json
- apps/webapp/app/assets/icons/AnimatedHourglassIcon.tsx
- internal-packages/run-engine/src/engine/tests/waitpoints.test.ts
- apps/coordinator/src/checkpointer.ts
🧰 Additional context used
🪛 LanguageTool
CONTRIBUTING.md
[typographical] ~239-~239: The word “otherwise” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...ding any database indexes they must use CONCURRENTLY
, otherwise they'll lock the table when executed. ...
(THUS_SENTENCE)
🪛 markdownlint-cli2 (0.17.2)
CONTRIBUTING.md
235-235: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
243-243: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
🪛 Biome (1.9.4)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts
[error] 439-455: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 465-470: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
[error] 1235-1252: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
🔇 Additional comments (29)
internal-packages/redis-worker/src/queue.ts (1)
60-60: Increased Redis retry resilience for better reliability
The change from the default `maxRetriesPerRequest: 3` to `20` significantly improves the queue's resilience to transient Redis connection issues. This enhancement is valuable for the Run Engine 2.0 alpha, especially in distributed environments where network hiccups are common.
While this change improves reliability, it's worth verifying how it might affect failure detection and timeout behavior. Please confirm that increased retries won't mask persistent Redis issues or significantly delay final failure responses. You might want to consider adding metrics/monitoring for Redis connection quality to identify persistent issues that are being masked by retries.
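For reference, a minimal sketch of where this option lives, assuming the queue connects via ioredis (the option name is ioredis's; the host/port sourcing and actual connection setup in queue.ts are assumptions):

```typescript
import Redis from "ioredis";

// Illustrative connection options only, not the queue's real config.
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT ?? 6379),
  // A command is rejected after being retried this many times across
  // reconnection attempts. Raising it (here to 20, as the review notes)
  // tolerates longer transient outages, at the cost of slower failure
  // surfacing when Redis is persistently down.
  maxRetriesPerRequest: 20,
});
```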
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (10)
51-51: Access control implementation looks good.
Adding the `useHasAdminAccess` hook for permission-based rendering is a good security practice to ensure admin-only features are properly gated.

196-196: Code looks good.
Proper use of SpanTitle with appropriate props.

232-292: Improved structure with conditional rendering.
The refactoring of SpanBody to conditionally render either detailed properties or the new SpanEntity component improves code organization and separation of concerns.

311-311: Proper authorization check.
Good implementation of the admin access check that will be used for conditional rendering of admin-only information.

322-330: Enhanced UI for cached runs.
The improved RunIcon usage with conditional naming for cached runs provides better visual feedback to users. This makes the UI more informative by clearly distinguishing cached runs.

500-515: Good addition of idempotency information.
Adding idempotency key and expiration time improves observability, allowing users to track and debug idempotency-related issues more effectively.

548-551: Enhanced run information with engine version.
Adding the engine version display is useful for debugging and understanding which engine processed the run.

552-563: Admin-only information properly gated.
Good implementation of conditional rendering for admin-only properties (masterQueue and secondaryMasterQueue) based on user permissions.

701-709: Improved error and output handling.
The code now handles errors and outputs more cleanly with proper conditional rendering, ensuring outputs are only shown when there's no error and output data exists.

728-728: Better UX with context-aware button text.
Changing the button text to "Jump to original run" for cached runs improves clarity and user understanding.
internal-packages/database/prisma/schema.prisma (10)
151-152: Positive Naming for Worker Associations
The newly added fields `workerGroups` and `workerInstances` on the `Organization` model now use plural names, which addresses the previous comment ("Ideally this would be plural"). This naming clearly indicates that these fields hold arrays of associated entities.

1578-1579: BackgroundWorker Engine Version
The addition of the `engine` field (defaulting to `V1`) in the `BackgroundWorker` model explicitly captures the run engine version for background workers. This clear versioning will help differentiate behavior as features evolve.

1603-1605: Optional Worker Group Association in BackgroundWorker
The new optional relation fields `workerGroup` and `workerGroupId` in `BackgroundWorker` allow workers to be associated with a worker group. Be sure that the `onDelete: SetNull` behavior is correct for your use case and that any related business logic is updated to handle a missing worker group gracefully.

1741-1742: First Attempt Start Timestamp for Run Engine 2.0+
The new nullable field `firstAttemptStartedAt` in `TaskRun` provides an explicit marker for when the first attempt begins under Run Engine 2.0. This will be valuable for performance tracking and debugging. Consider whether indexing this field is required if you plan to filter or sort on it.

1755-1758: Priority Offset Field in TaskRun
The addition of the `priorityMs` field introduces a mechanism for adjusting the run's queue positioning via a negative time offset (in milliseconds). This design can offer flexible scheduling; however, please confirm that all components that compute or utilize the effective queue timestamp correctly accommodate both positive and negative values.
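To make the offset semantics concrete, here is a hypothetical sketch of how an effective queue timestamp could be derived. The function name and the sign convention (positive priority runs sooner) are illustrative assumptions, not the actual scheduler code:

```typescript
// A positive priorityMs moves the run earlier in the queue by
// subtracting from its enqueue timestamp; a negative value delays it.
function effectiveQueueTimestamp(queuedAtMs: number, priorityMs: number): number {
  return queuedAtMs - priorityMs;
}

const queuedAt = 1_000_000;
console.log(effectiveQueueTimestamp(queuedAt, 5_000)); // 995000 (sorts as if enqueued 5s earlier)
console.log(effectiveQueueTimestamp(queuedAt, -2_000)); // 1002000 (sorts as if enqueued 2s later)
```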
1868-1872: RunEngineVersion Enum Introduction
A new enum `RunEngineVersion` now distinguishes between legacy and new engine versions (V1 and V2). The inline comment ("The original version that uses marqs v1 and Graphile") is helpful. Consider documenting plans for future versions if applicable.

2010-2030: TaskRunCheckpoint Model Consistency
The `TaskRunCheckpoint` model now encapsulates details around checkpoints, including the friendly identifier, type, location, image reference, associated project, and runtime environment. It also relates to multiple execution snapshots. The structure is clear and appears consistent with other models.

2115-2145: TaskRunWaitpoint Model for Associating Runs with Waitpoints
The new `TaskRunWaitpoint` model bridges `TaskRun` and `Waitpoint`, adding support for batch-related fields (such as `batchId`, `batchIndex`) and a span identifier for cache completion. The unique constraint on `[taskRunId, waitpointId, batchIndex]` is critical to ensure each association is unique even in a batched context. Double-check that your SQL engine enforces this constraint as expected.

2720-2724: WorkerDeploymentType Enum Extension
The `WorkerDeploymentType` enum now includes three values: `MANAGED`, `UNMANAGED`, and `V1`. This explicit inclusion of `V1` clarifies which deployment type follows legacy behavior. Confirm that all logic conditioned on the deployment type is updated to handle the `V1` value properly.

2739-2740: Explicit Deployment Versioning in WorkerDeployment
A new field `type` of type `WorkerDeploymentType` has been added to the `WorkerDeployment` model, with a default value of `V1`. This change ensures that deployments are explicitly versioned and can be differentiated in the system. Review any deployment workflows to guarantee they respect this new field.
apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (2)
16-16: Question the 10KB content length limit.
While 10KB may be sufficient in most scenarios, consider verifying with stakeholders whether 10KB adequately covers all potential request data for this endpoint. If larger payloads are expected, increasing this limit or making it configurable might help avoid unexpected 413 errors.

37-39: Consider concurrency scenarios for blocking runs.
If multiple clients or processes can request to block the same run with the same waitpoint simultaneously, ensure that `engine.blockRunWithWaitpoint` handles concurrency gracefully (e.g., verifying the run's current state and re-checking waitpoint availability).
apps/webapp/app/routes/api.v1.waitpoints.tokens.$waitpointFriendlyId.complete.ts (2)
22-22: Verify environment variable usage.
`env.TASK_PAYLOAD_MAXIMUM_SIZE` governs the `maxContentLength` for this route. Confirm that this environment variable is set appropriately across all deployments to avoid unexpected 413 errors or overly permissive sizes.

62-64: Enhance error handling specificity.
Consider differentiating between validation errors, database conflicts, or other potential error types, rather than catching everything in a single generic 500 handler.
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
285-303: Complete the dependency array for submitForm callback.
You reference both `submit` and `formAction` inside `submitForm`, but only `[currentJson]` is in the dependency array. Including all referenced variables (e.g., `submit` and `formAction`) in the dependency array would prevent stale references if those ever changed.

346-348: Fix invalid HTML attributes.
React doesn't recognize `min-height` or `max-height` as valid JSX props. Replace them with `minHeight` and `maxHeight` to avoid console warnings and ensure proper styling.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (2)
82-84: Good job updating method signatures to match implementation needs.
The updated method signatures for `getRun` and `getSpan` now properly reflect the parameters they need. This improves the API's clarity and maintainability.
Also applies to: 405-407
472-477: ✅ Verification successful
Add test coverage for the timeout detection functionality.
The timeout detection logic is important but doesn't appear to have associated test coverage.
🏁 Script executed:

```shell
#!/bin/bash
# Check for tests covering the waitpoint timeout detection

# First, let's find files that might contain tests for this functionality
fd "test|spec" --type f | grep -i "waitpoint\|timeout" | xargs grep -l "isWaitpointOutputTimeout" || echo "No direct test references found"

# Let's also check if the isWaitpointOutputTimeout function itself is tested
fd "test|spec" --type f | xargs grep -l "isWaitpointOutputTimeout" || echo "No tests for isWaitpointOutputTimeout found"
```

Length of output: 294
Test coverage for timeout detection already exists.
The timeout detection behavior, via the `isWaitpointOutputTimeout` function, is exercised in the test file at internal-packages/run-engine/src/engine/tests/waitpoints.test.ts. This confirms that the functionality is appropriately covered. If any integration-specific cases within `SpanPresenter.server.ts` remain untested, please consider adding additional tests; otherwise, no further action is necessary.
🧰 Tools
🪛 Biome (1.9.4)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
```prisma
  /// Friendly IDs
  runIds       String[] @default([])
  runCount     Int      @default(0)
  payload      String?
  payloadType  String   @default("application/json")
  options      Json?
  batchVersion String   @default("v1")

  //engine v2
  /// Snapshots that reference this batch
  executionSnapshots TaskRunExecutionSnapshot[]
  /// Specific run blockers,
  runsBlocked        TaskRunWaitpoint[]
  /// Waitpoints that are blocked by this batch.
  /// When a Batch is created it blocks execution of the associated parent run (for andWait)
  waitpoints         Waitpoint[]
```
🛠️ Refactor suggestion
BatchTaskRun Model Updates for Engine v2 Batching
The `BatchTaskRun` model now has several new columns:
- Arrays for `runIds` and a counter `runCount` for friendly ID purposes
- Payload-related fields (`payload`, `payloadType`, `options`, `batchVersion`)
- Engine v2-specific fields: `executionSnapshots`, `runsBlocked`, and `waitpoints`

These additions aim to better manage batch processing in the new engine mode. Please review the integration points where batch runs are created and processed to ensure these fields are appropriately updated and queried.
Actionable comments posted: 1
♻️ Duplicate comments (2)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
314-314: 🛠️ Refactor suggestion
Complete the dependency array for submitForm.
The submitForm callback is missing dependencies in its dependency array.

```diff
 const submitForm = useCallback(
   (e: React.FormEvent<HTMLFormElement>) => {
     // ... function body ...
   },
-  [currentJson]
+  [currentJson, submit, formAction]
 );
```
358-359: 🛠️ Refactor suggestion
Fix invalid HTML attributes.
The JSONEditor component has invalid HTML attributes.

```diff
   height="100%"
-  min-height="100%"
-  max-height="100%"
+  minHeight="100%"
+  maxHeight="100%"
 />
```
🧹 Nitpick comments (2)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
340-341: Fix typo in the tooltip content.
There's a small typo in the tooltip text.

```diff
- "This is will immediately complete this waitpoint with the payload you specify. This is useful during development for testing."
+ "This will immediately complete this waitpoint with the payload you specify. This is useful during development for testing."
```
345-347: Remove duplicate scrollable container.
There are two nested scrollable containers with identical styles, which is unnecessary and could cause scrolling issues.

```diff
-<div className="overflow-y-auto bg-charcoal-900 scrollbar-thin scrollbar-track-transparent scrollbar-thumb-charcoal-600">
 <div className="max-h-[70vh] min-h-40 overflow-y-auto bg-charcoal-900 scrollbar-thin scrollbar-track-transparent scrollbar-thumb-charcoal-600">
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
```tsx
<div className="flex items-center gap-1">
  <AnimatedHourglassIcon
    className="text-dimmed-dimmed size-4"
    delay={(waitpoint.completedAfter.getMilliseconds() - Date.now()) / 1000}
```
Fix delay calculation in AnimatedHourglassIcon.
The delay calculation for the AnimatedHourglassIcon appears incorrect. It uses milliseconds from `completedAfter` instead of calculating the difference correctly.

```diff
- delay={(waitpoint.completedAfter.getMilliseconds() - Date.now()) / 1000}
+ delay={(waitpoint.completedAfter.getTime() - Date.now()) / 1000}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```tsx
delay={(waitpoint.completedAfter.getTime() - Date.now()) / 1000}
```
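The distinction driving this fix: `Date.prototype.getMilliseconds()` returns only the 0-999 fractional-second component of a date, while `getTime()` returns epoch milliseconds. A small self-contained illustration with hypothetical dates:

```typescript
const completedAfter = new Date("2025-01-01T00:00:05.250Z");
const nowMs = new Date("2025-01-01T00:00:00.000Z").getTime();

// The buggy call: just the milliseconds component of the date.
console.log(completedAfter.getMilliseconds()); // 250

// The fix: the epoch difference yields a real countdown in seconds.
const delaySeconds = (completedAfter.getTime() - nowMs) / 1000;
console.log(delaySeconds); // 5.25
```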
Actionable comments posted: 8
🧹 Nitpick comments (18)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (4)
37-39
: Improve error response details.The error response doesn't provide enough information about what went wrong. Consider adding more specific error details to help API consumers understand the issue.
if (!run) { - throw new Response("You don't have permissions for this run", { status: 401 }); + return new Response( + JSON.stringify({ error: "Run not found or you don't have access to it" }), + { status: 401, headers: { "Content-Type": "application/json" } } + ); }
56-63
: Add response headers and body for error cases.The response objects for error cases lack Content-Type headers and error details in the body, which could make debugging harder for API consumers.
switch (eventResult.code) { case "FAILED_TO_RECORD_EVENT": - return new Response(null, { status: 400 }); // send a 400 to prevent retries + return new Response( + JSON.stringify({ error: "Failed to record event" }), + { status: 400, headers: { "Content-Type": "application/json" } } + ); // send a 400 to prevent retries case "RUN_NOT_FOUND": - return new Response(null, { status: 404 }); + return new Response( + JSON.stringify({ error: "Run not found" }), + { status: 404, headers: { "Content-Type": "application/json" } } + ); default: return assertExhaustive(eventResult.code); }
64-70
: Enhance error handling with structured logging.
The current error logging might not properly serialize complex error objects. Consider using a more structured approach.
try {
  // ...existing code
} catch (error) {
  logger.error("Failed to record dev log", {
    environmentId: authentication.environment.id,
-    error,
+    error: error instanceof Error
+      ? { message: error.message, stack: error.stack }
+      : String(error),
  });
  throw error;
}
53-54
: Add Content-Type header to success response.
The success response lacks a Content-Type header, which is good practice even for empty responses.
if (eventResult.success) {
-  return new Response(null, { status: 204 });
+  return new Response(null, {
+    status: 204,
+    headers: { "Content-Type": "application/json" }
+  });
}
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.latest.ts (1)
1-28
: Consider enhancing the error handling with detailed logs.
The implementation for retrieving the latest snapshot is well-structured, but the error handling could be improved. When the snapshot retrieval fails, the code throws a generic error without logging details that would be helpful for debugging.
Consider adding error logging before throwing the error to make troubleshooting easier:
if (!executionData) {
+  logger.error("Failed to retrieve latest snapshot", { runFriendlyId });
  throw new Error("Failed to retrieve latest snapshot");
}
Also, consider adding a debug log at the beginning of the function to track when this API is being called, similar to other routes in this PR.
apps/webapp/app/routes/engine.v1.dev.config.ts (1)
18-23
: Add validation for environment variables
The code retrieves configuration values from environment variables but doesn't validate that they exist or have valid values.
Consider adding validation to ensure the environment variables are defined and have appropriate values:
try {
+  if (typeof env.DEV_DEQUEUE_INTERVAL_WITH_RUN !== 'number' ||
+      typeof env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN !== 'number') {
+    logger.warn("Missing or invalid dequeue interval environment variables", {
+      withRun: env.DEV_DEQUEUE_INTERVAL_WITH_RUN,
+      withoutRun: env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN,
+    });
+  }
  return json({
    environmentId: authentication.environment.id,
    dequeueIntervalWithRun: env.DEV_DEQUEUE_INTERVAL_WITH_RUN,
    dequeueIntervalWithoutRun: env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN,
  });
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (2)
34-36
: Improve the unauthorized error message.
The error message could be more specific about why permissions are insufficient. Consider providing more context about what permissions are required.
- throw new Response("You don't have permissions for this run", { status: 401 });
+ throw new Response("You don't have permissions to access this run in the current environment", { status: 401 });
47-53
: Enhance error handling for better debugging.
The current error handling catches, logs, and re-throws the error without adding context. Consider enriching the error with additional information before re-throwing.
} catch (error) {
  logger.error("Failed to get latest snapshot", {
    environmentId: authentication.environment.id,
    error,
  });
-  throw error;
+  throw error instanceof Error
+    ? new Error(`Failed to get latest snapshot: ${error.message}`)
+    : new Error("Failed to get latest snapshot");
}
apps/webapp/app/routes/engine.v1.worker-actions.deployments.$deploymentFriendlyId.dequeue.ts (1)
39-47
: Improve readability with if-else instead of ternary.
The ternary operator makes this code section harder to read. Consider using an if-else statement for better readability.
- const dequeuedMessages = (await isCurrentDeployment(deployment.id, deployment.environmentId))
-   ? await authenticatedWorker.dequeueFromEnvironment(
-       deployment.worker.id,
-       deployment.environmentId
-     )
-   : await authenticatedWorker.dequeueFromVersion(
-       deployment.worker.id,
-       searchParams.maxRunCount
-     );
+ let dequeuedMessages;
+ if (await isCurrentDeployment(deployment.id, deployment.environmentId)) {
+   dequeuedMessages = await authenticatedWorker.dequeueFromEnvironment(
+     deployment.worker.id,
+     deployment.environmentId
+   );
+ } else {
+   dequeuedMessages = await authenticatedWorker.dequeueFromVersion(
+     deployment.worker.id,
+     searchParams.maxRunCount
+   );
+ }
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (1)
15-37
: Consider logging success scenarios.
While successful recording returns a 204, adding a debug/info log upon success may help trace the lifecycle of debug logs and facilitate future troubleshooting.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1)
24-56
: Consider using 403 or 404 response code.
Currently returning 401 if the run is not found or not allowed. If authentication is valid but permissions are insufficient, using 403 typically better reflects the scenario. Alternatively, using 404 can help obscure resource existence.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1)
27-61
: Refine unauthorized response code.
Like in the heartbeat route, consider returning 403 or 404 instead of 401 if the user lacks permissions or the resource is not found, to differentiate between authentication vs. authorization issues.
apps/webapp/app/routes/engine.v1.dev.presence.ts (2)
43-45
: Address the in-code TODO comment
The code references a plan to set a string with expire on the same call. This can reduce complexity and avoid potential race conditions. Consider addressing this TODO or removing the comment if no longer necessary.
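For reference, ioredis can do the set-and-expire the TODO describes in a single atomic SET by passing the "EX" argument, avoiding the race between a separate SET and EXPIRE. A minimal sketch under stated assumptions: the touchPresence helper, the dev:presence:* key layout, and the TTL parameter are illustrative names, not code from this PR.

```typescript
// Minimal shape of the ioredis `set` call relied on here; a real `Redis`
// instance satisfies this interface. Passing "EX" makes SET apply the TTL
// atomically, replacing a separate SET + EXPIRE pair.
interface PresenceClient {
  set(key: string, value: string, mode: "EX", seconds: number): Promise<"OK" | null>;
}

async function touchPresence(
  redis: PresenceClient,
  environmentId: string,
  ttlInSeconds: number // illustrative TTL, not a value from the PR
): Promise<void> {
  const presenceKey = `dev:presence:${environmentId}`; // hypothetical key layout
  // One round trip: value and expiry are set together
  await redis.set(presenceKey, Date.now().toString(), "EX", ttlInSeconds);
}
```

Because the value and TTL land in one command, a crash between the two steps can no longer leave a presence key with no expiry.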
66-78
: Ensure proper cleanup for unexpected terminations
The cleanup handler is triggered during normal disconnections, but might not fire if the process abruptly terminates. Evaluate whether additional safeguards or keepalive checks are needed to handle abrupt termination scenarios.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
38-48
: Secure error handling when run is not found
Throwing a 401 is correct for unauthorized access, but consider returning a more descriptive message or distinct status code (e.g., 404 for not found if that is more aligned with domain logic). This helps clarify the reason for failures.
apps/webapp/app/routes/engine.v1.dev.dequeue.ts (3)
16-16
: Be cautious with a fixed max dequeue count
Hardcoding env.DEV_DEQUEUE_MAX_RUNS_PER_PULL could limit throughput unexpectedly for certain workloads. Provide a configuration surface or documented rationale for the chosen limit.
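If a configuration surface is wanted here, one option is to parse the per-pull limit from an environment variable with a validated fallback. A sketch; the maxRunsPerPull helper and the default of 10 are illustrative assumptions, not the PR's actual values.

```typescript
// Illustrative default, not the real project value.
const DEFAULT_MAX_RUNS_PER_PULL = 10;

// Parses a raw env-var string into a usable per-pull limit, falling back to
// the default for missing, non-numeric, zero, negative, or fractional input.
function maxRunsPerPull(raw: string | undefined): number {
  const parsed = raw === undefined ? NaN : Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    return DEFAULT_MAX_RUNS_PER_PULL;
  }
  return parsed;
}
```

This keeps the hardcoded number in one place as a documented fallback while letting operators tune throughput per deployment.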
30-62
: Add error handling for background worker dequeue operations
If any error happens while calling engine.dequeueFromBackgroundWorkerMasterQueue, the loop will break silently. Consider catching and logging errors more explicitly to ensure issues are visible.
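A sketch of the suggested error handling, with the loop body wrapped in try/catch so one failed pull is logged and the loop continues. The dequeueOnce and log callbacks stand in for the real engine call and logger, which this sketch does not assume.

```typescript
// Keeps a dequeue loop alive when a single pull fails: the error is surfaced
// through the log callback instead of silently terminating the loop.
async function dequeueLoop(
  dequeueOnce: () => Promise<number>, // returns how many runs were pulled
  log: (message: string, error?: unknown) => void,
  maxIterations: number
): Promise<number> {
  let totalDequeued = 0;
  for (let i = 0; i < maxIterations; i++) {
    try {
      totalDequeued += await dequeueOnce();
    } catch (error) {
      // Record the failure and continue with the next iteration
      log("dequeue iteration failed", error);
    }
  }
  return totalDequeued;
}
```

A failed iteration then shows up in the logs rather than as a loop that mysteriously stopped pulling work.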
64-74
: Consider factoring out resource checks to a dedicated module
The resource-based queuing logic is repeated for background workers and environment master queue. Refactoring it into a reusable method can improve maintainability and reduce duplication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
apps/webapp/app/routes/engine.v1.dev.config.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.presence.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.connect.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.deployments.$deploymentFriendlyId.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.continue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.restore.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.suspend.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.latest.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (23)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (2)
29-35
: LGTM! Good database query security.
The database query correctly filters by both friendlyId and runtimeEnvironmentId, ensuring that users can only access runs from their own environment.
41-50
: LGTM! Good use of structured logging parameters.
The function passes the necessary parameters to recordRunDebugLog in a structured manner, including properly formatting the run ID using RunId.fromFriendlyId().
apps/webapp/app/routes/engine.v1.worker-actions.heartbeat.ts (1)
1-13
: Clean implementation of worker heartbeat API route.
This implementation follows a good pattern for API route handlers with proper type safety using TypeScript. The route has a single responsibility and handles the worker heartbeat action effectively.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1)
1-26
: Well-structured run snapshot heartbeat API route with good parameter validation.
This implementation effectively uses Zod for route parameter validation and follows the established pattern for API routes. The destructuring of parameters and passing to the heartbeat method is clean and maintainable.
apps/webapp/app/routes/engine.v1.worker-actions.dequeue.ts (1)
1-16
: Clean implementation of worker dequeue API route with appropriate response handling.
This endpoint correctly follows the established pattern for API routes while properly handling the response from the dequeue operation. The direct return of the dequeued result as JSON is appropriate for this use case.
apps/webapp/app/routes/engine.v1.worker-actions.connect.ts (1)
1-19
: Well-implemented worker connect API route with informative response.
This endpoint follows the established pattern while providing a more detailed response that includes the worker group information. The structured response with both status and worker metadata is helpful for clients consuming this API.
apps/webapp/app/routes/engine.v1.dev.config.ts (1)
10-10
: Clarify the purpose of the findResource function
The findResource function simply returns 1 without any apparent purpose. This seems like a placeholder or a required parameter that doesn't serve a meaningful function in this context. Is this a required parameter for the
function? If so, consider adding a comment explaining why it returns1
or refactor it to serve a meaningful purpose.apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.restore.ts (1)
2-5
: Address schema naming mismatchThere's a mismatch between the route name ("restore") and the schemas being used ("WorkerApiSuspendRun*"). This creates confusion since suspension and restoration are opposite operations.
Confirm that these are the correct schemas to use for a restoration operation. If not, update to use the appropriate schemas for restoration.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
1-32
: Well-structured API route implementation!
The implementation correctly handles starting a run attempt with proper validation and error handling. The code follows good practices with type safety, input validation using Zod, and a focused handler function.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (1)
12-12
: Consider clarifying the resource finder implementation.
The resource finder always returns 1, which seems unusual. If this is intentional, consider adding a comment explaining why this approach is used rather than finding an actual resource.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1)
1-33
: Well-implemented API route!
The code correctly handles completing a run attempt with proper validation and error handling. The implementation follows good practices with type safety, input validation using Zod, and a focused handler function.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (2)
1-6
: Good import structure.
No issues found here. The imports are straightforward and properly scoped.
8-14
: Well-structured route definition.
The use of Zod schema validation for parameters and body is clear and robust.apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.suspend.ts (2)
1-9
: Straightforward imports.
No concerns. The imports are reasonable and well-organized.
10-17
: Clean route definition.
Parameter and body validators are clearly defined.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (3)
1-12
: Imports look good.
Nothing to note regarding syntax or usage.
16-23
: Route definition is concise.
The structure is consistent and clear.
59-60
: Module export is fine.
No issues found here.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (3)
1-17
: Import statements are valid.
Everything appears in order.
19-26
: Config and schema usage are clear.
Nicely defined parameters and body schema.
64-65
: Export statement is consistent.
No additional feedback.apps/webapp/app/routes/engine.v1.dev.presence.ts (1)
61-65
: Validate the periodic presence update frequency
Frequent calls to Redis (based on the defined interval) could increase load. Ensure the chosen interval balances real-time accuracy with resource usage.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
50-53
: Verify engine dependency injection
The engine.startRunAttempt call relies on an external engine import. Consider verifying that it’s correctly initialized with relevant configuration for the environment to prevent unexpected behavior.
async ({
  authentication,
  body,
  params,
}): Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>> => {
🛠️ Refactor suggestion
Fix return type inconsistency.
The function's return type is specified as Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>>, but the function actually returns regular Response objects with no body data. This could cause type errors.
Update the return type to match the actual implementation:
async ({
authentication,
body,
params,
- }): Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>> => {
+ }): Promise<Response> => {
  return json(continuationResult);
} catch (error) {
  logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
Fix incorrect error message
The function is for continuing run execution, but the error message says "Failed to suspend run". This appears to be a copy-paste error.
Update the error message to correctly reflect the operation:
- logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
+ logger.error("Failed to continue run", { runFriendlyId, snapshotFriendlyId, error });
if (!body.success) {
  // TODO: we could create a debug span here
  logger.error("Failed to restore run", {
    runFriendlyId,
    snapshotFriendlyId,
    error: body.error,
  });

  return json({ ok: true });
}
🛠️ Refactor suggestion
Revisit error handling strategy
The current implementation returns { ok: true } for both successful and failed operations (when body.success is false). This makes it difficult for API consumers to distinguish between success and failure.
Consider returning an error status or different response for failures:
logger.error("Failed to restore run", {
runFriendlyId,
snapshotFriendlyId,
error: body.error,
});
- return json({ ok: true });
+ return json({ ok: false, error: body.error?.message || "Unknown error" });
logger.debug("Restoring run", { runFriendlyId, snapshotFriendlyId, body });

if (!body.success) {
  // TODO: we could create a debug span here
💡 Verification agent
🧩 Analysis chain
Address TODO comment
There's a TODO comment about creating a debug span that should be addressed before finalizing this code.
🏁 Script executed:
#!/bin/bash
# Check if the debug span has been implemented in similar routes
grep -r "debug span" --include="*.ts" apps/webapp/app/routes/
Length of output: 379
Action Required: Implement Debug Span Functionality
The file still contains a TODO comment for creating a debug span (line 28) and no implementation has been added—as confirmed by similar examples in related routes. Please either implement the debug span or remove the TODO comment if it's no longer applicable.
if (!deployment) {
  throw new Error("Deployment not found");
}

if (!deployment.worker) {
  throw new Error("Worker not found");
}
🛠️ Refactor suggestion
Improve error handling with proper status codes.
Instead of throwing generic errors, consider using structured responses with appropriate status codes for better API consistency.
if (!deployment) {
- throw new Error("Deployment not found");
+ throw new Response("Deployment not found", { status: 404 });
}
if (!deployment.worker) {
- throw new Error("Worker not found");
+ throw new Response("Worker not found for this deployment", { status: 404 });
}
async ({
  authenticatedWorker,
  params,
  body,
}): Promise<TypedResponse<WorkerApiSuspendRunResponseBody>> => {
  const { runFriendlyId, snapshotFriendlyId } = params;

  logger.debug("Suspending run", { runFriendlyId, snapshotFriendlyId, body });

  if (!body.success) {
    // TODO: we could create a debug span here
    logger.error("Failed to suspend run", {
      runFriendlyId,
      snapshotFriendlyId,
      error: body.error,
    });

    return json({ ok: true });
  }

  try {
    await authenticatedWorker.createCheckpoint({
      runFriendlyId,
      snapshotFriendlyId,
      checkpoint: body.checkpoint,
    });

    return json({ ok: true });
  } catch (error) {
    logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
    throw error;
  }
}
🛠️ Refactor suggestion
Clarify handling of failure states.
When body.success is false, the code logs an error but still returns { ok: true }. This might mask underlying issues, making debugging more difficult. Consider returning a non-200 response or including a more descriptive payload to reflect the failed suspension.
const redis = new Redis({
  port: env.RUN_ENGINE_DEV_PRESENCE_REDIS_PORT ?? undefined,
  host: env.RUN_ENGINE_DEV_PRESENCE_REDIS_HOST ?? undefined,
  username: env.RUN_ENGINE_DEV_PRESENCE_REDIS_USERNAME ?? undefined,
  password: env.RUN_ENGINE_DEV_PRESENCE_REDIS_PASSWORD ?? undefined,
  enableAutoPipelining: true,
  ...(env.RUN_ENGINE_DEV_PRESENCE_REDIS_TLS_DISABLED === "true" ? {} : { tls: {} }),
});
🛠️ Refactor suggestion
Consider improved error handling when connecting to Redis
If Redis fails to connect due to incorrect host/port/password, the app could silently fail. Implementing retry logic or throwing an explicit error if the connection cannot be created would improve resiliency.
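ioredis accepts a retryStrategy option for exactly this: return a delay in milliseconds to schedule a reconnect, or null to give up and surface the failure. A sketch with illustrative caps (the attempt limit and delay values are assumptions, not project settings):

```typescript
// Bounded backoff for ioredis's `retryStrategy` option. Returning a number
// schedules a reconnect after that many milliseconds; returning null stops
// retrying, so a misconfigured host/port fails loudly instead of silently.
function presenceRedisRetryStrategy(times: number): number | null {
  if (times > 10) {
    return null; // give up; let the caller handle a hard failure
  }
  return Math.min(times * 200, 2000); // 200ms, 400ms, ... capped at 2s
}

// Usage sketch (not executed here):
//   new Redis({ ...options, retryStrategy: presenceRedisRetryStrategy })
```

Pairing this with an error listener on the client keeps connection failures visible in logs rather than swallowed by infinite reconnect attempts.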
}
);

async function getEnvVars(
  environment: RuntimeEnvironment,
  runId: string,
  machinePreset: MachinePreset
): Promise<Record<string, string>> {
  const variables = await resolveVariablesForEnvironment(environment);

  const jwt = await generateJWTTokenForEnvironment(environment, {
    run_id: runId,
    machine_preset: machinePreset.name,
  });

  variables.push(
    ...[
      { key: "TRIGGER_JWT", value: jwt },
      { key: "TRIGGER_RUN_ID", value: runId },
      { key: "TRIGGER_MACHINE_PRESET", value: machinePreset.name },
    ]
  );

  return variables.reduce((acc: Record<string, string>, curr) => {
    acc[curr.key] = curr.value;
    return acc;
  }, {});
}
Safeguard sensitive token data
You are generating JWT tokens and storing them in the environment variables array. Ensure they are not inadvertently logged or exposed through debug statements to avoid leaking tokens.
- logger.debug("Created environment vars:", variables);
+ // Avoid logging actual JWT tokens or any secret credentials
+ logger.debug("Created environment vars: [filtered for sensitive data]");
Committable suggestion skipped: line range outside the PR's diff.
…ot well for triggerAndWait yet
# Conflicts:
#	apps/coordinator/src/checkpointer.ts
#	apps/webapp/app/components/code/CodeBlock.tsx
#	apps/webapp/app/presenters/v3/SpanPresenter.server.ts
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.electric.$runParam/route.tsx
#	apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
#	apps/webapp/app/v3/eventRepository.server.ts
#	apps/webapp/app/v3/services/cancelTaskRun.server.ts
#	apps/webapp/app/v3/services/completeAttempt.server.ts
#	apps/webapp/app/v3/services/finalizeDeployment.server.ts
#	apps/webapp/app/v3/services/rollbackDeployment.server.ts
#	apps/webapp/app/v3/services/triggerTask.server.ts
#	apps/webapp/test/fairDequeuingStrategy.test.ts
#	apps/webapp/test/realtimeClient.test.ts
#	internal-packages/database/prisma/schema.prisma
#	packages/cli-v3/src/apiClient.ts
#	packages/cli-v3/src/cli/index.ts
#	packages/cli-v3/src/dev/workerRuntime.ts
#	packages/cli-v3/src/entryPoints/deploy-run-controller.ts
#	packages/cli-v3/src/entryPoints/dev-run-worker.ts
#	packages/cli-v3/src/entryPoints/managed-run-worker.ts
#	packages/core/src/v3/errors.ts
#	packages/core/src/v3/schemas/index.ts
#	packages/core/src/v3/schemas/schemas.ts
#	packages/core/src/v3/types/tasks.ts
#	packages/core/src/v3/workers/index.ts
#	packages/trigger-sdk/src/v3/shared.ts
Actionable comments posted: 6
♻️ Duplicate comments (3)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)
459-523
: 🛠️ Refactor suggestion
Wrap variable declarations within the switch case block to avoid scoping issues.
Biome flags that variable declarations in the "waitpoint" case can be accessed by other case clauses. Enclose each case body in a block {} to limit variable scope safely.
switch (span.entity.type) {
  case "waitpoint": {
    { const waitpoint = ... ... }
    { const output = ... ... }
    ...
    return ...;
  }
  default:
    return { ...data, entity: null };
}
🧰 Tools
🪛 Biome (1.9.4)
[error] 461-477: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 487-492: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 494-494: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
319-335
: 🛠️ Refactor suggestion
Consider updating batch status on max attempts.
The code logs an error when the max attempt threshold (MAX_ATTEMPTS) is reached but does not explicitly mark the batch as "failed" or "canceled." Consider updating the batch status to reflect a terminal state for better visibility and potentially halting further retries.
if ($attemptCount > MAX_ATTEMPTS) {
  logger.error("[BatchTriggerV3][processBatchTaskRun] Max attempts reached", {
    options,
    attemptCount: $attemptCount,
  });
  // You might want to update the batch status to failed here
+ await this._prisma.batchTaskRun.update({
+   where: { id: options.batchId },
+   data: { status: "FAILED" }
+ });
  return;
}
406-409
: 🛠️ Refactor suggestionConsider limiting recursion depth.
The method calls itself recursively when handling a duplicate idempotency key error, which could lead to stack overflow if the issue persists across many attempts.
if (error instanceof RunDuplicateIdempotencyKeyError) { + if (attempt >= MAX_ATTEMPTS - 1) { + throw new ServiceValidationError( + `Cannot trigger ${taskId} due to persistent idempotency key conflicts after ${attempt} attempts.` + ); + } return await this.call({ taskId, environment, body, options, attempt: attempt + 1 }); }
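The recursive retry can also be written as a bounded loop, which removes any stack-depth concern entirely. A sketch; the triggerOnce callback, error class name, and attempt cap are illustrative stand-ins for the real service code.

```typescript
// Stand-in for the service's duplicate-idempotency-key error.
class DuplicateIdempotencyKeyError extends Error {}

// Retries a trigger call up to maxAttempts times, but only for duplicate
// idempotency-key conflicts; any other error (or exhausted attempts) rethrows.
async function triggerWithRetry<T>(
  triggerOnce: () => Promise<T>,
  maxAttempts: number
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await triggerOnce();
    } catch (error) {
      if (!(error instanceof DuplicateIdempotencyKeyError) || attempt === maxAttempts) {
        throw error; // unrelated failure, or retries exhausted
      }
      // otherwise fall through and try again
    }
  }
  // Unreachable: the loop always returns or throws
  throw new Error("unreachable");
}
```

Because the loop counter is the only state, the worst case is maxAttempts iterations, never a deep call stack.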
🧹 Nitpick comments (49)
apps/webapp/test/GCRARateLimiter.test.ts (4)
11-13
: Consider adding Redis connection cleanup.
You're now creating a new Redis client for each test case, but there's no explicit cleanup of these connections when tests complete. This could potentially lead to resource leaks or connection pool exhaustion during test runs.
redisTest("should allow a single request when under the rate limit", async ({ redisOptions }) => { const redis = new Redis(redisOptions); + try { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, // 1 request per second on average burstTolerance: 3000, // Allows a burst of 4 requests (3 * 1000 + 1) keyPrefix: "test:ratelimit:", }); const result = await limiter.check("user:1"); expect(result.allowed).toBe(true); + } finally { + await redis.quit(); + } });
27-29
: Add Redis connection cleanup in burst test.Similar to the previous test, you should ensure proper cleanup of the Redis connection to prevent resource leaks.
async ({ redisOptions }) => { const redis = new Redis(redisOptions); + try { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, burstTolerance: 3000, // With an emission interval of 1000ms, burstTolerance of 3000ms allows 4 rapid requests. keyPrefix: "test:ratelimit:", }); // Test logic here... + } finally { + await redis.quit(); + } }
55-56
: Consider extracting Redis setup/teardown logic.Since the pattern of creating and (ideally) cleaning up Redis connections is repeated in every test, consider extracting this to a helper function or using test fixtures to reduce duplication and ensure consistent cleanup.
// Add this helper at the top of the file + async function withRedis(redisOptions, callback) { + const redis = new Redis(redisOptions); + try { + return await callback(redis); + } finally { + await redis.quit(); + } + } // Then use it in tests like: redisTest( "should allow a request after the required waiting period", async ({ redisOptions }) => { - const redis = new Redis(redisOptions); + await withRedis(redisOptions, async (redis) => { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, burstTolerance: 3000, keyPrefix: "test:ratelimit:", }); // Test logic here... + }); } );
232-243
: Consider using a cleaner mocking approach. The current approach of modifying and restoring the Redis object's methods works, but consider using Vitest's mocking capabilities more directly for cleaner test code.

```diff
-// @ts-expect-error
-const originalGcra = redis.gcra;
-// @ts-ignore
-redis.gcra = vi.fn(() => {
-  throw new Error("Simulated Redis error");
-});
-await expect(limiter.check("user:error")).rejects.toThrow("Simulated Redis error");
-// Restore the original command.
-// @ts-expect-error
-redis.gcra = originalGcra;
+// Using Vitest's spyOn for cleaner mocking
+const gcraSpy = vi.spyOn(redis, "gcra").mockImplementation(() => {
+  throw new Error("Simulated Redis error");
+});
+await expect(limiter.check("user:error")).rejects.toThrow("Simulated Redis error");
+// Vitest automatically restores mocks after each test when using the restoreMocks
+// option, but for clarity you can explicitly restore:
+gcraSpy.mockRestore();
```

internal-packages/run-engine/src/engine/eventBus.ts (4)
5-178
: Well-structured event type definitions with a few areas for improvement. The `EventBusEvents` type provides comprehensive coverage of the task run lifecycle events, with clear payload structures for each event type. The consistent inclusion of `time` and run identifiers across all events is a good practice. Consider adding JSDoc comments to document the purpose and usage context for each event type, which would improve maintainability and developer experience.

Examples of JSDoc comments you could add:

```diff
 export type EventBusEvents = {
+  /**
+   * Emitted when a run attempt is started
+   * @param time The time when the attempt started
+   * @param run Information about the run
+   * @param organization The organization the run belongs to
+   */
   runAttemptStarted: [
     {
       time: Date;
       run: {
         id: string;
         attemptNumber: number;
         baseCostInCents: number;
       };
       organization: {
         id: string;
       };
     },
   ];
```
162-177
: Consider enhancing incomingCheckpointDiscarded with more detailed metadata. The `incomingCheckpointDiscarded` event has a generic `metadata` field defined as `Record<string, unknown>`. Consider defining a more specific type for the checkpoint metadata if the structure is known, which would improve type safety and developer experience.

```diff
 checkpoint: {
-  metadata: Record<string, unknown>;
+  metadata: {
+    // Define specific fields if known
+    checkpointType?: string;
+    source?: string;
+    timestamp?: Date;
+    // Add other known fields
+    [key: string]: unknown;
+  };
   discardReason: string;
 };
```
5-178
: Consider using discriminated unions to improve type safety. Currently, event types are differentiated by their key in the `EventBusEvents` interface. Consider refactoring to use discriminated unions by adding a `type` field to each event payload, which would enhance type safety when handling events.

```diff
 export type EventBusEvents = {
   runAttemptStarted: [
     {
+      type: 'runAttemptStarted';
       time: Date;
       run: {
         id: string;
         attemptNumber: number;
         baseCostInCents: number;
       };
       organization: {
         id: string;
       };
     },
   ];
   // Apply similar changes to other event types
```
This would enable more robust type checking when handling events, especially in switch statements.
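As a hypothetical sketch of that payoff (the event shapes here are simplified stand-ins, not the real `EventBusEvents` payloads), a discriminated union lets the compiler narrow each case and enforce exhaustiveness:

```typescript
// Simplified, hypothetical event shapes for illustration only.
type RunEvent =
  | { type: "runAttemptStarted"; runId: string }
  | { type: "runFailed"; runId: string; error: string };

function handle(event: RunEvent): string {
  switch (event.type) {
    case "runAttemptStarted":
      // Narrowed: `event.error` is not accessible here.
      return `started ${event.runId}`;
    case "runFailed":
      return `failed ${event.runId}: ${event.error}`;
    default: {
      // Exhaustiveness check: adding a new variant without a case fails to compile.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

console.log(handle({ type: "runFailed", runId: "run_1", error: "boom" }));
// → failed run_1: boom
```

The `never`-typed default branch is the standard TypeScript idiom for making the compiler flag any union member a switch forgets to handle.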
67-68
: Ensure consistent error handling across related events. The `runFailed`, `runAttemptFailed`, and `runCancelled` events all include error information but with slight structural differences. Consider standardizing the error property structure across all error-related events for consistency.

Also applies to: 101-103
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (7)
33-115
: Overall structure is good, but consider transaction handling for atomic cancellation operations. This service is well-structured with clear separation of concerns. However, the cancellation process consists of multiple database operations that should ideally be atomic to maintain data consistency.
Consider wrapping the core cancellation operations in a transaction to ensure atomicity:
```diff
 public async call(taskRun: TaskRun, options?: CancelTaskRunServiceOptions) {
   const opts = {
     reason: "Task run was cancelled by user",
     cancelAttempts: true,
     cancelledAt: new Date(),
     ...options,
   };

   // Make sure the task run is in a cancellable state
   if (!isCancellableRunStatus(taskRun.status)) {
     logger.error("Task run is not in a cancellable state", {
       runId: taskRun.id,
       status: taskRun.status,
     });
     return;
   }

+  return this._prisma.$transaction(async (tx) => {
     const finalizeService = new FinalizeTaskRunService();
     const cancelledTaskRun = await finalizeService.call({
       id: taskRun.id,
       status: "CANCELED",
       // ...remaining parameters
     });

     // Continue with event cancellation and other operations
     // ...

     return {
       id: cancelledTaskRun.id,
     };
+  });
 }
```
42-49
: Return a more descriptive result when a task run can't be cancelled. Currently, the function returns `undefined` when a task run is not in a cancellable state, which doesn't give the caller any information about what happened. Consider returning a structured result that includes success/failure status and a reason:

```diff
 // Make sure the task run is in a cancellable state
 if (!isCancellableRunStatus(taskRun.status)) {
   logger.error("Task run is not in a cancellable state", {
     runId: taskRun.id,
     status: taskRun.status,
   });
-  return;
+  return {
+    success: false,
+    reason: `Task run with status ${taskRun.status} cannot be cancelled`,
+    id: taskRun.id
+  };
 }
```
87-105
: Add error handling for event cancellation. The code cancels in-progress events but doesn't handle potential failures during cancellation. If any event cancellation fails, the overall operation might continue without proper error reporting.

Add error handling to capture and log failures during event cancellation:

```diff
-await Promise.all(
-  inProgressEvents.map((event) => {
-    return eventRepository.cancelEvent(event, opts.cancelledAt, opts.reason);
-  })
-);
+try {
+  await Promise.all(
+    inProgressEvents.map((event) => {
+      return eventRepository.cancelEvent(event, opts.cancelledAt, opts.reason);
+    })
+  );
+} catch (error) {
+  logger.error("Failed to cancel some in-progress events", {
+    runId: taskRun.id,
+    error,
+  });
+  // Consider whether to continue or throw based on business requirements
+}
```
117-176
: Refactor the complex attempt cancellation logic for better maintainability. The current implementation is well-structured but contains nested conditionals and switches that could be refactored for better readability and maintainability.

Consider extracting the status-specific cancellation logic into separate methods:

```diff
 async #cancelPotentiallyRunningAttempts(
   run: ExtendedTaskRun,
   attempts: ExtendedTaskRunAttempt[]
 ) {
   for (const attempt of attempts) {
     await CancelTaskAttemptDependenciesService.enqueue(attempt.id, this._prisma);

     if (run.runtimeEnvironment.type === "DEVELOPMENT") {
       await this.#cancelAttemptInDevelopment(attempt, run);
     } else {
-      switch (attempt.status) {
-        case "EXECUTING": {
-          // We need to send a cancel message to the coordinator
-          socketIo.coordinatorNamespace.emit("REQUEST_ATTEMPT_CANCELLATION", {
-            version: "v1",
-            attemptId: attempt.id,
-            attemptFriendlyId: attempt.friendlyId,
-          });
-          break;
-        }
-        // ... remaining cases
-      }
+      await this.#cancelAttemptByStatus(attempt, run);
     }
   }
 }

+async #cancelAttemptInDevelopment(attempt: ExtendedTaskRunAttempt, run: ExtendedTaskRun) {
+  // Signal the task run attempt to stop
+  await devPubSub.publish(
+    `backgroundWorker:${attempt.backgroundWorkerId}:${attempt.id}`,
+    "CANCEL_ATTEMPT",
+    {
+      attemptId: attempt.friendlyId,
+      backgroundWorkerId: attempt.backgroundWorker.friendlyId,
+      taskRunId: run.friendlyId,
+    }
+  );
+}

+async #cancelAttemptByStatus(attempt: ExtendedTaskRunAttempt, run: ExtendedTaskRun) {
+  switch (attempt.status) {
+    case "EXECUTING": {
+      // We need to send a cancel message to the coordinator
+      socketIo.coordinatorNamespace.emit("REQUEST_ATTEMPT_CANCELLATION", {
+        version: "v1",
+        attemptId: attempt.id,
+        attemptFriendlyId: attempt.friendlyId,
+      });
+      break;
+    }
+    // ... remaining cases
+  }
+}
```
178-192
: Add error handling for socket communication failures. The method broadcasts a cancellation message to all coordinators but doesn't handle potential communication failures.

Consider adding error handling for socket communication:

```diff
 async #cancelRemainingRunWorkers(run: ExtendedTaskRun) {
   if (run.runtimeEnvironment.type === "DEVELOPMENT") {
     // Nothing to do
     return;
   }

   // Broadcast cancel message to all coordinators
+  try {
     socketIo.coordinatorNamespace.emit("REQUEST_RUN_CANCELLATION", {
       version: "v1",
       runId: run.id,
       // Give the attempts some time to exit gracefully. If the runs supports lazy attempts, it also supports exit delays.
       delayInMs: run.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined,
     });
+    logger.debug("Successfully broadcasted run cancellation request", {
+      runId: run.id,
+    });
+  } catch (error) {
+    logger.error("Failed to broadcast run cancellation request", {
+      runId: run.id,
+      error,
+    });
+    // Consider whether to throw or continue based on business requirements
+  }
 }
```
34-40
: Consider further validating cancellation options. The code sets default options but doesn't validate the provided options, such as checking if `cancelledAt` is in the past.

Add validation for the provided options:

```diff
 public async call(taskRun: TaskRun, options?: CancelTaskRunServiceOptions) {
   const opts = {
     reason: "Task run was cancelled by user",
     cancelAttempts: true,
     cancelledAt: new Date(),
     ...options,
   };

+  // Validate that cancelledAt is not in the future
+  if (opts.cancelledAt > new Date()) {
+    logger.warn("cancelledAt is in the future, using current time instead", {
+      providedTime: opts.cancelledAt,
+      currentTime: new Date(),
+    });
+    opts.cancelledAt = new Date();
+  }
```
112-115
: Enhance the return value with more context. The current return value only includes the task run ID. Consider providing more context about the cancellation result.

Enhance the return value with more information:

```diff
 return {
   id: cancelledTaskRun.id,
+  success: true,
+  cancelledAt: opts.cancelledAt,
+  cancelledAttempts: opts.cancelAttempts ? attempts.length : 0,
+  cancelledEvents: inProgressEvents.length,
 };
```

apps/webapp/app/env.server.ts (8)
262-263
: Consider avoiding hardcoded secrets in code defaults.
While providing a default secret ("managed-secret") may be convenient for local development, storing secrets in code can pose security risks if used in production. It is advisable to store real production secrets in a secure vault or environment variable management system to avoid accidental leakage.
413-422
: Validate default poll interval and concurrency for production.
The default of 100 ms (`RUN_ENGINE_WORKER_POLL_INTERVAL`) may become chatty under higher loads, and concurrency values (e.g., `RUN_ENGINE_WORKER_COUNT=4` with 10 tasks per worker) can quickly scale. Confirm these values align with expected production traffic to avert unnecessary resource usage.
423-453
: Reduce duplication in Redis configuration.
These transformations for worker Redis environment variables follow the same pattern seen elsewhere in the file. Consider extracting a reusable helper function to minimize repetitive code and improve maintainability.
454-484
: Consolidate redundant run queue Redis environment config.
Similar to the worker Redis pattern, these environment variable transformations repeat the same logic. Factoring them into a common helper would enhance readability and maintainability.
485-515
: Eliminate repetitive run lock Redis logic.
Again, each Redis configuration uses the same chain of optional transforms. A small shared utility could keep these environment declarations concise and consistent.
516-546
: Refactor Dev Presence Redis configuration.
The transformations here also duplicate the pattern from other Redis sections. A utility or shared function would centralize this logic, improve clarity, and prevent mistakes.
547-562
: Fix comment mismatches and spelling issues in rate limit configuration.
There is a minor inconsistency between the comment and the default value for `RUN_ENGINE_RATE_LIMIT_MAX` (1200 vs. 750) and a small spelling mistake ("refix" → "refill").

```diff
- RUN_ENGINE_RATE_LIMIT_MAX: z.coerce.number().int().default(1200), // allow bursts of 750 requests
+ RUN_ENGINE_RATE_LIMIT_MAX: z.coerce.number().int().default(1200), // allow bursts of 1200 requests

- RUN_ENGINE_RATE_LIMIT_REFILL_RATE: z.coerce.number().int().default(400), // refix 250 tokens every 10 seconds
+ RUN_ENGINE_RATE_LIMIT_REFILL_RATE: z.coerce.number().int().default(400), // refill 250 tokens every 10 seconds
```
563-573
: Double-check dev presence intervals and test coverage.
The frequent polling intervals (e.g., `DEV_DEQUEUE_INTERVAL_WITH_RUN=250 ms`) may be appropriate for local testing but might be expensive for certain environments. Consider verifying performance and adding test scenarios to ensure these defaults work as intended.

internal-packages/run-engine/src/engine/errors.ts (1)
5-8
: Consider adding validation for null/undefined error input. The current implementation assumes that a valid `TaskRunError` object is always passed. Consider adding a check to handle cases where `error` might be null or undefined to make the function more robust.

```diff
 export function runStatusFromError(error: TaskRunError): TaskRunStatus {
+  if (!error) {
+    return "SYSTEM_FAILURE";
+  }
   if (error.type !== "INTERNAL_ERROR") {
     return "COMPLETED_WITH_ERRORS";
   }
```

apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (1)
657-700
: Consider caching environment variables for performance. The `startRunAttempt` method fetches environment variables for each run attempt. For frequently used environments, consider caching these values to reduce database queries.

You could add a simple cache implementation for environment variables:

```diff
+private envVarsCache: Map<string, { variables: Record<string, string>; timestamp: number }> = new Map();
+private readonly ENV_VARS_CACHE_TTL = 5 * 60 * 1000; // 5 minutes
+
 private async getEnvVars(
   environment: RuntimeEnvironment,
   runId: string,
   machinePreset: MachinePreset
 ): Promise<Record<string, string>> {
+  const cacheKey = `${environment.id}:${machinePreset.name}`;
+  const cached = this.envVarsCache.get(cacheKey);
+
+  if (cached && (Date.now() - cached.timestamp) < this.ENV_VARS_CACHE_TTL) {
+    // Update run-specific variables
+    return {
+      ...cached.variables,
+      TRIGGER_RUN_ID: runId,
+    };
+  }
+
   const variables = await resolveVariablesForEnvironment(environment);

   const jwt = await generateJWTTokenForEnvironment(environment, {
     run_id: runId,
     machine_preset: machinePreset.name,
   });

   variables.push(
     ...[
       { key: "TRIGGER_JWT", value: jwt },
       { key: "TRIGGER_RUN_ID", value: runId },
       { key: "TRIGGER_MACHINE_PRESET", value: machinePreset.name },
     ]
   );

-  return variables.reduce((acc: Record<string, string>, curr) => {
+  const result = variables.reduce((acc: Record<string, string>, curr) => {
     acc[curr.key] = curr.value;
     return acc;
   }, {});
+
+  // Cache the result but exclude run-specific variables
+  this.envVarsCache.set(cacheKey, {
+    variables: { ...result },
+    timestamp: Date.now(),
+  });
+
+  return result;
 }
```

internal-packages/database/prisma/schema.prisma (2)
1758-1761
: Priority Field for TaskRun Scheduling
The addition of `priorityMs` (with a default of 0) to the `TaskRun` model appears intended to provide a negative offset for queue timestamps. Please verify that the logic interpreting this field (e.g., in the run scheduling system) properly handles negative offsets and that its documentation clearly outlines its role and expected values.
1930-1934
: RunEngineVersion Enum Introduction
The new `RunEngineVersion` enum, which distinguishes between `V1` and `V2`, neatly encapsulates the versioning strategy for the run engine. For clarity, consider adding more extensive inline comments or documentation that explain the behavioral differences between these versions.

apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (2)
71-71
: Check if this import is needed.
It appears that `SpanLink` is not referenced in the code. If it's truly unused, consider removing it.

```diff
-import { SpanLink } from "~/v3/eventRepository.server";
```
863-1076
: SpanEntity logic is solid; remove the unnecessary Fragment. The new component cleanly differentiates normal spans and waitpoints. However, lines 1026–1043 contain a single element `<Property.Item>` inside a Fragment, which is unnecessary. Removing that Fragment would simplify the code without altering functionality.

```diff
-{span.entity.object.type === "MANUAL" && (
-  <>
-    <Property.Item>
-      ...
-    </Property.Item>
-  </>
-)}
+{span.entity.object.type === "MANUAL" && (
+  <Property.Item>
+    ...
+  </Property.Item>
+)}
```

🧰 Tools
🪛 Biome (1.9.4)
[error] 1026-1043: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment (lint/complexity/noUselessFragments)
apps/webapp/app/v3/runEngineHandlers.server.ts (3)
21-58
: Consider wrapping event completion calls in a transaction. The `runSucceeded` handler writes to the event repository using multiple calls in a single block. If any of these calls fail, partial database updates can lead to inconsistent state. Consider wrapping them in a transaction to maintain atomicity and guarantee rollback on intermediate failures.
73-170
: Reduce duplication by extracting the shared completion logic. The `runFailed` handler completes events and in-progress child events with similar logic. Consider creating a helper function or utility within the repository to DRY (Don't Repeat Yourself) any repeated calls and ensure consistent error handling.
394-419
: Add error handling around snapshot event recording. While you handle the case when `eventResult.success` is false, an additional check could ensure partial writes or partially successful calls do not leave any inconsistent data. Log or handle partially updated states if the snapshot data fails to record.

apps/webapp/app/v3/taskEventStore.server.ts (1)
155-172
: Perform thorough testing of raw queries. The raw SQL for non-partitioned events includes the newly added `isDebug` field. Ensure integration tests confirm that returned trace events map accurately to the extended `TraceEvent` type, particularly under stress or concurrent usage scenarios.

apps/webapp/app/presenters/v3/SpanPresenter.server.ts (4)
47-62
: Upgrade null-check or throw for missing run details. After fetching `parentRun` at lines 47-62, you return early if no run is found. You might want a more explicit error message to help debugging, especially if the user expects a valid run for the given slug. This clarifies whether the run is missing or if there's some deeper issue.
82-90
: Confirm fallback logic for non-run spans. Here you call `#getSpan` if the entity is not recognized as a run. Make sure that there's no scenario where an entity is partially a run and partially something else, leading to an ambiguous fallback. Explicitly clarifying the logic with a short comment may prevent confusion.
102-108
: Use early return for clarity. When checking for the `span` result in `#getRun`, lines 123-125, you could reduce nesting by returning early upon a missing span, improving readability:

```ts
if (!span) {
  return;
}
// ...
```
168-168
: Check for potential null references in `metadata`. You safely parse `metadata`, though consider adding an extra safety check if the structure is unexpectedly large or partial. If the underlying data is invalid JSON, ensure that an error is logged or handled gracefully.

apps/webapp/app/v3/eventRepository.server.ts (3)
1664-1685
: Consider handling falsy values in rehydrateAttribute. The function immediately returns if `!value`, which will treat falsy values like `0`, `''`, and `false` as if the attribute doesn't exist. Consider checking strictly for undefined/null if you need to preserve these values.

```diff
 const value = properties[key];
-if (!value) return;
+if (value === undefined || value === null) return;

 return value as T;
```
1687-1711
: Consider adding caching for run lookups. The `findRunForEventCreation` function retrieves run information from the database. If this function is called frequently with the same runId, consider implementing a caching mechanism to reduce database load.

```diff
+// Simple in-memory cache for run lookups
+const runCache = new Map<string, any>();
+const CACHE_TTL_MS = 60000; // 1 minute cache
+
 export async function findRunForEventCreation(runId: string) {
+  // Check cache first
+  const cachedRun = runCache.get(runId);
+  if (cachedRun && cachedRun.timestamp > Date.now() - CACHE_TTL_MS) {
+    return cachedRun.data;
+  }

-  return prisma.taskRun.findFirst({
+  const run = await prisma.taskRun.findFirst({
     where: {
       id: runId,
     },
     // ...rest of the query
   });

+  // Store in cache if found
+  if (run) {
+    runCache.set(runId, { data: run, timestamp: Date.now() });
+  }
+
+  return run;
 }
```
1773-1798
: Document that recordRunDebugLog overwrites any existing isDebug value. The function sets `isDebug: true`, which will overwrite any existing value in `options?.attributes`. While likely intentional, this should be documented to prevent confusion.

```diff
 /**
  * Records a debug log event for a run
  * @param runId - The ID of the run
  * @param message - The message to log
  * @param options - The event options
+ * @note This function will always set isDebug to true, overriding any existing value
  */
 export async function recordRunDebugLog(
   runId: string,
   message: string,
   options: Omit<TraceEventOptions, "environment" | "taskSlug" | "startTime"> & {
     duration?: number;
     parentId?: string;
     startTime?: Date;
   }
 ): Promise<
   | {
       success: true;
     }
   | {
       success: false;
       code: "RUN_NOT_FOUND" | "FAILED_TO_RECORD_EVENT";
       error?: unknown;
     }
 > {
   return recordRunEvent(runId, message, {
     ...options,
     attributes: {
       ...options?.attributes,
       isDebug: true,
     },
   });
 }
```

apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
271-312
: Use Enum for processing strategy for improved type safety. The strategy parameter uses a string literal type but is used in switch statements. Consider using a proper TypeScript enum for better type safety and autocompletion.

```diff
-export const BatchProcessingStrategy = z.enum(["sequential", "parallel"]);
-export type BatchProcessingStrategy = z.infer<typeof BatchProcessingStrategy>;
+export enum BatchProcessingStrategy {
+  SEQUENTIAL = "sequential",
+  PARALLEL = "parallel"
+}
+export const BatchProcessingStrategySchema = z.nativeEnum(BatchProcessingStrategy);
```

Then update switch statements and references to use the enum values.
apps/webapp/app/v3/services/triggerTaskV1.server.ts (4)
47-48
: Use more specific redirection in deprecation notice. The deprecation comment should direct users to a specific alternative implementation rather than just mentioning a class name.

```diff
-/** @deprecated Use TriggerTaskService in `triggerTask.server.ts` instead. */
+/** @deprecated Use TriggerTaskService in `~/services/triggerTask.server.ts` instead, which contains version-specific routing. */
```
691-697
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used throughout the rest of the file.

```diff
-console.log("Failed to get queue name: No task found", {
+logger.debug("Failed to get queue name: No task found", {
   taskId,
   environmentId: environment.id,
 });
```
481-488
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used throughout the rest of the file.

```diff
-console.log("Failed to get queue name: Invalid queue config", {
+logger.debug("Failed to get queue name: Invalid queue config", {
   taskId,
   environmentId: environment.id,
   queueConfig: task.queueConfig,
 });
```
603-659
: Simplify complex error handling with early returns. The error handling logic for Prisma constraint violations is complex and deeply nested. Consider using early returns or extraction to separate methods for improved readability.

You could restructure the error handling to reduce nesting and improve clarity:

```ts
if (!(error instanceof Prisma.PrismaClientKnownRequestError)) {
  throw error;
}

logger.debug("TriggerTask: Prisma transaction error", {
  code: error.code,
  message: error.message,
  meta: error.meta,
});

if (error.code !== "P2002") {
  throw error;
}

const target = error.meta?.target;
if (!Array.isArray(target) || target.length === 0) {
  throw error;
}

// Handle one-time-use token case
if (typeof target[0] === "string" && target[0].includes("oneTimeUseToken")) {
  throw new ServiceValidationError(
    `Cannot trigger ${taskId} with a one-time use token as it has already been used.`
  );
}

// Handle queue creation case
if (/* queue creation error condition */) {
  // Handle queue creation error
}

// Handle idempotency key case
if (/* idempotency key error condition */) {
  // Retry with incremented attempt
  return await this.call(taskId, environment, body, options, attempt + 1);
}

// Default case
throw new ServiceValidationError(
  `Cannot trigger ${taskId} as it has already been triggered with the same idempotency key.`
);
```

apps/webapp/app/v3/services/triggerTaskV2.server.ts (4)
470-476
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used elsewhere.

```diff
-console.log("Failed to get queue name: No task found", {
+logger.debug("Failed to get queue name: No task found", {
   taskId,
   environmentId: environment.id,
 });
```
481-488
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used elsewhere.

```diff
-console.log("Failed to get queue name: Invalid queue config", {
+logger.debug("Failed to get queue name: Invalid queue config", {
   taskId,
   environmentId: environment.id,
   queueConfig: task.queueConfig,
 });
```
559-561
: Add safeguard against negative queue size calculations. If the engine returns an incorrect queue size or if there's an error in calculation, the projected size might be incorrectly calculated.

```diff
 const queueSize = await engine.lengthOfEnvQueue(environment);
-const projectedSize = queueSize + itemsToAdd;
+const projectedSize = Math.max(0, queueSize + itemsToAdd);
```
36-37
: Use more specific redirection in deprecation notice. The deprecation comment should direct users to a specific alternative implementation rather than just mentioning a class name.

```diff
-/** @deprecated Use TriggerTaskService in `triggerTask.server.ts` instead. */
+/** @deprecated Use TriggerTaskService in `~/services/triggerTask.server.ts` instead, which provides version-specific routing. */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (38)
apps/coordinator/src/checkpointer.ts (1 hunks)
apps/docker-provider/src/index.ts (1 hunks)
apps/kubernetes-provider/tsconfig.json (1 hunks)
apps/webapp/app/components/code/CodeBlock.tsx (1 hunks)
apps/webapp/app/components/runs/v3/RunFilters.tsx (1 hunks)
apps/webapp/app/components/runs/v3/SpanTitle.tsx (1 hunks)
apps/webapp/app/entry.server.tsx (2 hunks)
apps/webapp/app/env.server.ts (2 hunks)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (12 hunks)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx (1 hunks)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx (5 hunks)
apps/webapp/app/routes/admin.api.v1.environments.$environmentId.ts (2 hunks)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (11 hunks)
apps/webapp/app/utils/taskEvent.ts (1 hunks)
apps/webapp/app/v3/authenticatedSocketConnection.server.ts (2 hunks)
apps/webapp/app/v3/eventRepository.server.ts (19 hunks)
apps/webapp/app/v3/handleSocketIo.server.ts (4 hunks)
apps/webapp/app/v3/marqs/devQueueConsumer.server.ts (1 hunks)
apps/webapp/app/v3/runEngineHandlers.server.ts (1 hunks)
apps/webapp/app/v3/services/batchTriggerV3.server.ts (1 hunks)
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1 hunks)
apps/webapp/app/v3/services/cancelTaskRun.server.ts (2 hunks)
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (1 hunks)
apps/webapp/app/v3/services/changeCurrentDeployment.server.ts (1 hunks)
apps/webapp/app/v3/services/createCheckpoint.server.ts (2 hunks)
apps/webapp/app/v3/services/finalizeDeployment.server.ts (2 hunks)
apps/webapp/app/v3/services/triggerTaskV1.server.ts (1 hunks)
apps/webapp/app/v3/services/triggerTaskV2.server.ts (1 hunks)
apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (1 hunks)
apps/webapp/app/v3/taskEventStore.server.ts (5 hunks)
apps/webapp/test/GCRARateLimiter.test.ts (6 hunks)
apps/webapp/test/fairDequeuingStrategy.test.ts (8 hunks)
apps/webapp/test/realtimeClient.test.ts (3 hunks)
internal-packages/database/prisma/schema.prisma (19 hunks)
internal-packages/redis-worker/src/queue.ts (1 hunks)
internal-packages/run-engine/src/engine/errors.ts (1 hunks)
internal-packages/run-engine/src/engine/eventBus.ts (1 hunks)
internal-packages/run-engine/src/engine/types.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- apps/webapp/app/v3/services/changeCurrentDeployment.server.ts
🚧 Files skipped from review as they are similar to previous changes (20)
- apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx
- apps/webapp/app/components/runs/v3/RunFilters.tsx
- apps/webapp/app/routes/admin.api.v1.environments.$environmentId.ts
- apps/webapp/app/v3/services/batchTriggerV3.server.ts
- internal-packages/redis-worker/src/queue.ts
- apps/webapp/app/components/code/CodeBlock.tsx
- apps/webapp/test/realtimeClient.test.ts
- apps/webapp/app/components/runs/v3/SpanTitle.tsx
- internal-packages/run-engine/src/engine/types.ts
- apps/webapp/app/v3/marqs/devQueueConsumer.server.ts
- apps/coordinator/src/checkpointer.ts
- apps/webapp/app/entry.server.tsx
- apps/webapp/app/utils/taskEvent.ts
- apps/webapp/app/v3/services/finalizeDeployment.server.ts
- apps/webapp/test/fairDequeuingStrategy.test.ts
- apps/webapp/app/v3/services/createCheckpoint.server.ts
- apps/webapp/app/v3/authenticatedSocketConnection.server.ts
- apps/webapp/app/v3/services/cancelTaskRun.server.ts
- apps/docker-provider/src/index.ts
- apps/kubernetes-provider/tsconfig.json
🧰 Additional context used
🪛 Biome (1.9.4)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts
[error] 461-477: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 487-492: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 494-494: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
[error] 1026-1043: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
apps/webapp/app/v3/services/triggerTaskV1.server.ts
[error] 244-245: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: units / 🧪 Unit Tests
🔇 Additional comments (78)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx (5)
73-73
: Verified import addition provides better type access. The addition of `Run` to the imports from `RunPresenter.server` improves type access in the component, allowing for better type checking and IntelliSense support when working with run objects.
469-473
: Good admin-only debug feature implementation.The addition of the
showDebug
state variable and admin access check is well-implemented. This allows for enhanced debugging capabilities while properly restricting access to admin users only.
492-492
: Appropriate conditional filtering for debug events.The updated tree filtering logic now correctly toggles between showing all events and filtering out debug events based on the
showDebug
state. This ensures that debug information is only shown when explicitly requested by an admin.
517-525
: Well-implemented admin-only debug toggle with keyboard shortcut.The debug toggle is properly gated behind an admin check and includes a helpful keyboard shortcut (
Shift+D
). This provides a convenient way for admins to access additional debugging information while keeping the UI clean for regular users.
228-230
: Enhanced user experience with keyboard shortcut.Adding the keyboard shortcut
C
to the cancel run button improves accessibility and efficiency for power users. This is a good addition that aligns with existing keyboard shortcut patterns in the application.apps/webapp/app/v3/handleSocketIo.server.ts (7)
51-76
: Excellent security implementation for worker namespaces. This approach of separate namespaces with distinct authentication mechanisms provides proper security isolation between production and development workers. The production worker namespace using `WorkerGroupTokenService` ensures proper authentication, while the dev worker namespace has appropriate environment type verification.
447-456
: Clean implementation of header extraction utility. The `headersFromHandshake` function is well-implemented, with proper handling of non-string values in the headers. This creates a standard Headers object from Socket.io handshake data, making it easier to use with the Request API.
470-499
: Robust authentication middleware implementation. The middleware properly authenticates socket connections using the provided authentication function. It has good error handling and logging, and correctly disconnects unauthorized sockets.
537-543
: Prevent potential log overflow and consider concurrency impacts. In `createWorkerNamespace`, the debug interval (every 5s) logging active rooms might create noise at scale. You could throttle or conditionally enable this at debug level only. Also, the eventBus subscription is properly unsubscribed on disconnect, preventing memory leaks. The overall approach is sound, but be mindful that multiple worker connections could introduce concurrency overhead.
579-612
: Well-implemented error handling for batch operations. The use of `Promise.allSettled` is excellent for handling errors in batch operations without failing the entire process. The implementation properly logs any individual subscription failures while allowing successful subscriptions to proceed.
506-533
: Consider enabling conditional worker notifications based on environment. The worker notification handler has a good conditional check based on the `RUN_ENGINE_DEBUG_WORKER_NOTIFICATIONS` environment variable. This allows for detailed debugging when needed without overwhelming logs in production environments.
670-672
: Simple but effective room naming strategy. The `roomFromFriendlyRunId` function provides a consistent way to generate room names from run IDs, which helps maintain a standardized approach throughout the codebase.
apps/webapp/test/GCRARateLimiter.test.ts (4)
5-5
: Import of Redis client is appropriate. Adding the Redis import supports the new pattern of creating Redis clients from options.
111-112
: Test validates gradual rate limit recovery. Good test case for validating that the retry time gradually decreases as time passes, which is an important aspect of the GCRA algorithm's behavior. The implementation is correct.
169-170
: LGTM: Testing key prefix isolation. This test correctly verifies that different rate limiters with different key prefixes don't interfere with each other, which is important for safely using multiple rate limiters in the same Redis instance.
221-222
: Error handling test is robust. This test effectively verifies error propagation when Redis operations fail. The mock implementation and error assertion are well implemented.
internal-packages/run-engine/src/engine/eventBus.ts (3)
1-4
: Dependencies look good. The imports appropriately bring in the necessary types from database and core packages, which will be used in the event definitions.
180-180
: Great utility type for type-safe event argument extraction. The `EventBusEventArgs` type is a well-designed utility type that enables type-safe extraction of event arguments based on the event type key. This will make it easier to implement type-safe event handlers.
129-140
:❓ Verification inconclusive
Consider adding more context to workerNotification payload.
The `workerNotification` event contains minimal information compared to other events. Depending on its usage, you might want to include additional context like notification type, message, or severity to make it more informative for consumers.
🏁 Script executed:
#!/bin/bash
# Find usages of the workerNotification event to understand if additional context is needed

# Look for implementations or usages of workerNotification
echo "Searching for workerNotification usages..."
rg -A 3 -B 3 "workerNotification" --type ts

Length of output: 14264
Action: Evaluate Enhancing the `workerNotification` Payload
The `workerNotification` event currently carries only the minimal fields: `time`, `run` (with `id`), and `snapshot` (with `id` and `executionStatus`).
Our search across the codebase shows that both the event emitter (in files such as internal-packages/run-engine/src/engine/eventBus.ts and related tests) and the consumers (e.g., in apps/webapp/app/v3/runEngineHandlers.server.ts) rely on this limited structure. Although its minimalism appears sufficient for the current usage, adding additional context—such as a notification type, message, or severity—could improve debuggability and align with the enriched payloads used by other events.
Please review whether enhancing this payload would benefit consumers in terms of clarity and future extensibility, and if so, update both the emitter and listener implementations accordingly.
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (1)
51-85
: Validate concurrency edge cases for finalizing runs. When calling `FinalizeTaskRunService` to mark the task run as canceled, ensure concurrent requests don't overwrite the updated status before or after `finalizeService.call` finishes. If multiple components call `cancel` simultaneously, record-based locking or transaction-level consistency may be needed to prevent race conditions.
internal-packages/run-engine/src/engine/errors.ts (3)
1-58
: Well-structured error handling implementation! The `runStatusFromError` function effectively maps `TaskRunError` types to appropriate `TaskRunStatus` values with clear organization and thorough error code coverage. The comments explaining the distinction between "CRASHED" and "SYSTEM_FAILURE" are particularly helpful.
55-57
: Good use of exhaustive type checking. Using `assertExhaustive` in the default case ensures proper type safety by verifying that all possible error codes are handled. This helps catch any missing mappings when new error codes are added to the system.
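The pattern the review praises can be sketched as follows; this `assertExhaustive` is a local stand-in for the helper the codebase imports, and the two-member `ErrorCode` union is invented for illustration:

```typescript
type ErrorCode = "TASK_RUN_CANCELLED" | "MAX_DURATION_EXCEEDED";

// If every member of the union is handled above, `code` narrows to
// `never` in the default branch and this call type-checks; adding a
// new ErrorCode member without a matching case is a compile error.
function assertExhaustive(value: never): never {
  throw new Error(`Unhandled case: ${String(value)}`);
}

function statusFromCode(code: ErrorCode): string {
  switch (code) {
    case "TASK_RUN_CANCELLED":
      return "CANCELED";
    case "MAX_DURATION_EXCEEDED":
      return "TIMED_OUT";
    default:
      return assertExhaustive(code);
  }
}
```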
42-42
:✅ Verification successful
Verify if "N0" is a typo in the error code
I noticed that "TASK_HAS_N0_EXECUTION_SNAPSHOT" uses the digit "0" instead of the letter "O". If this is unintentional, it could lead to incorrect error handling.
🏁 Script executed:
#!/bin/bash
# Search for other occurrences of this error code to verify if it's consistently using "N0"
echo "Searching for TASK_HAS_N0_EXECUTION_SNAPSHOT:"
rg "TASK_HAS_N0_EXECUTION_SNAPSHOT" --type ts -A 1 -B 1

echo -e "\nSearching for similar error codes with 'NO' instead of 'N0':"
rg "TASK_HAS_NO_EXECUTION_SNAPSHOT" --type ts -A 1 -B 1

echo -e "\nChecking TaskRunError definition to confirm correct error code:"
rg "TaskRunError.*=.*{" --type ts -A 50 | grep -E "('|\")TASK_HAS_N0_EXECUTION_SNAPSHOT('|\")"

Length of output: 1440
Action: Confirm that "TASK_HAS_N0_EXECUTION_SNAPSHOT" is intentional
After verifying the codebase, we found that the error code is used consistently in several files:
packages/core/src/v3/schemas/common.ts
packages/core/src/v3/errors.ts
internal-packages/run-engine/src/engine/errors.ts
The search for a similar error code using
"TASK_HAS_NO_EXECUTION_SNAPSHOT"
returned no results, which indicates that the usage of"N0"
(digit zero) is deliberate rather than a typo. Although the regex check for theTaskRunError
definition produced an error (due to a regex syntax issue), the available evidence confirms that the codebase uniformly uses"TASK_HAS_N0_EXECUTION_SNAPSHOT"
.Please confirm with the domain requirements or documentation if this naming convention is correct. Otherwise, no change is necessary.
apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (14)
31-36
: Well-structured token configuration.Good job on setting up a clear token configuration with constants for prefix, length, and allowed characters. Using
customAlphabet
from nanoid provides a secure way to generate random tokens with the specified character set.
37-51
: Token generation and storage is well-implemented.The token creation flow is secure, generating a random token with a hash that's stored in the database while returning both the hash and plaintext version to the caller. This approach prevents storing plaintext tokens in the database while still allowing for token verification.
53-70
: Effective token validation logic with appropriate logging.The
findWorkerGroup
method properly validates tokens by computing the hash and looking it up. The warning log when no match is found is helpful for debugging.
72-105
: Token rotation is implemented securely.The token rotation flow correctly checks for the existence of the worker group and token before attempting rotation, with appropriate error logging if issues occur. This ensures that token rotation is robust and secure.
107-119
: Secure token hashing.Using SHA-256 for token hashing is a secure approach. The token generation function encapsulates the logic for creating both the plaintext token and its hash, which is a good practice.
121-199
: Authentication flow has thorough validation and security checks.The authentication method thoroughly validates the request, checking for the presence of the token, proper token prefix, instance name, and additional security measures for managed workers. The use of proper error logging provides good visibility for debugging authentication issues.
155-185
: Secure verification of managed worker secrets.The implementation uses
timingSafeEqual
for comparing secrets, which is a crucial security practice to prevent timing attacks. The length check before comparison is also important to ensure the timing-safe comparison works correctly.
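The comparison pattern described above can be sketched like this; the function name, token value, and hashing scheme are illustrative, not the service's actual code:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash the presented secret to a fixed-length buffer, check lengths,
// then use timingSafeEqual so the comparison time does not leak how
// many leading bytes matched (a timing side channel).
function secretsMatch(presented: string, storedHashHex: string): boolean {
  const presentedHash = createHash("sha256").update(presented).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  if (presentedHash.length !== stored.length) {
    // timingSafeEqual throws on unequal lengths, so guard first.
    return false;
  }
  return timingSafeEqual(presentedHash, stored);
}
```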
200-252
: Comprehensive validation for different worker types.The worker type-specific validation ensures that each type has the required context (environment, deployment, etc.) before returning the authenticated instance. This helps prevent issues with missing data in later operations.
254-463
: Robust worker instance creation with transaction support.The
getOrCreateWorkerInstance
method properly handles both managed and unmanaged worker types, with appropriate validation for each. The use of database transactions ($transaction
) ensures atomicity when creating/updating related records. The handling of race conditions withPrisma.PrismaClientKnownRequestError
is a good practice.
465-475
: Secure header sanitization.The
sanitizeHeaders
method prevents sensitive headers like authorization from being logged, which is a good security practice to avoid exposing sensitive information in logs.
533-585
: Conditional dequeue implementation based on worker type.The
dequeue
method handles both managed and unmanaged workers appropriately, with specific dequeue strategies for each type. The check for required properties and update of thelastDequeueAt
timestamp are good practices.
587-629
: Good separation of dequeue methods for different contexts.The specialized dequeue methods (
dequeueFromVersion
anddequeueFromEnvironment
) for managed workers provide good flexibility while maintaining appropriate type checks and error handling.
781-805
: Efficient variable resolution and injection.The method for resolving environment variables and injecting run-specific values is well-implemented. The approach of using
reduce
to convert the array of variables to an object is clean and efficient.
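The `reduce`-based conversion described above looks roughly like this; the `EnvVar` shape is a hypothetical stand-in for the service's actual variable type:

```typescript
type EnvVar = { key: string; value: string };

// Fold the array of resolved variables into a plain lookup object,
// later entries overwriting earlier ones with the same key.
function toEnvObject(variables: EnvVar[]): Record<string, string> {
  return variables.reduce<Record<string, string>>((acc, variable) => {
    acc[variable.key] = variable.value;
    return acc;
  }, {});
}
```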
808-824
: Clear type definitions for authentication responses.The union type for
WorkerGroupTokenAuthenticationResponse
clearly defines the structure for both managed and unmanaged worker types, which is important for type safety when consuming these responses.internal-packages/database/prisma/schema.prisma (10)
151-152
: New Relationships in Organization Model
The addition of theworkerGroups
andworkerInstances
fields to theOrganization
model is well aligned with the goal of managing multiple worker entities per organization. Please verify that all queries or mutations that previously assumed a singular worker structure are updated to handle these array relationships appropriately.
454-463
: Enhancements in the Project Model for Engine and Worker Associations
Introducing theengine
field (of typeRunEngineVersion
with defaultV1
) augments theProject
model with versioning support for the new run engine. In addition, adding relationships forworkerGroups
,workers
, and the optionaldefaultWorkerGroup
(with its associated ID) is a strong move toward improved resource management. Please ensure that any business logic or migration scripts interacting with this model take these new fields into account.
1578-1579
: BackgroundWorker Engine Field Addition
The newengine
field in theBackgroundWorker
model (defaulting toV1
) helps in distinguishing between legacy and new engine operations. Confirm that this field is consistently used across the codebase for any logic that conditionally handles background workers based on engine version.
1603-1605
: Optional Worker Group Relation for BackgroundWorker
Adding theworkerGroup
relation (and its correspondingworkerGroupId
field) in theBackgroundWorker
model is a beneficial change. The use ofonDelete: SetNull
is appropriate if a worker group is removed, but please double-check that your deployment and cleanup logic properly handles cases where this relationship becomes null.
1687-1688
: TaskRun Model Engine Field Integration
The inclusion of theengine
field in theTaskRun
model (defaulting toV1
) ensures that each run is tagged with the correct engine version. This change is essential for downstream processing—please verify that run processors and metrics correctly use this field for conditional logic.
1744-1746
: Tracking First Attempt Start for Run Engine 2.0
The newfirstAttemptStartedAt
field inTaskRun
, marked for use with Run Engine 2.0+, will be valuable for tracking the execution lifecycle. Ensure that when a run is processed by the new engine, this timestamp is correctly recorded.
1944-1950
: TaskRunExecutionSnapshot Engine Field Update
Within theTaskRunExecutionSnapshot
model, theengine
field now defaults toV2
, reflecting its intended use with the new run engine architecture. This change is critical for differentiating snapshot data based on execution contexts; please ensure that any snapshot retention or processing routines are aware of this versioning.
2730-2734
: WorkerDeploymentType Enum Update
The updatedWorkerDeploymentType
enum now includes an additionalV1
value alongsideMANAGED
andUNMANAGED
. This adjustment appears to support legacy behavior in deployments. Make sure that any conditional logic based on this enum is updated accordingly.
2749-2775
: WorkerDeployment Model Enhancements
In theWorkerDeployment
model, introducing thetype
field (defaulting toV1
) along with new relationships forpromotions
,alerts
, andworkerInstance
enriches the deployment metadata. These additions will likely facilitate more nuanced deployment tracking and alerting. Please ensure that the orchestration and alerting systems are revised to incorporate these new fields, and that any related indexing remains performant.
2541-2557
:❓ Verification inconclusive
Extensive Enhancements in BatchTaskRun for Engine v2
TheBatchTaskRun
model now includes several new columns:
• Friendly Identifiers:runIds
(a string array) andrunCount
for easier tracking of batch items.
• Payload & Options: Fields likepayload
,payloadType
,options
, andbatchVersion
standardize batch data handling.
• Engine v2 Fields: IntroducingexecutionSnapshots
,runsBlocked
, andwaitpoints
will enable finer control over batch processing and its dependencies.These changes are significant and should be accompanied by thorough testing—both in data migration scripts and in runtime scenarios—to ensure consistency and performance.
Action Required: Ensure Comprehensive Testing for Engine v2 Enhancements in BatchTaskRun
The enhancements in the
BatchTaskRun
model—introducing fields for friendly IDs (runIds
,runCount
), standardized payload handling (payload
,payloadType
,options
,batchVersion
), and new engine v2-specific relationships (executionSnapshots
,runsBlocked
,waitpoints
)—are significant. These changes clearly improve tracking and batch processing control but come with the inherent risk of migration and runtime inconsistencies.Please confirm that exhaustive testing (both for data migration and runtime behavior) has been planned or implemented to validate these enhancements.
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (13)
62-62
: No concerns with the added import.
72-75
: Import statements look good.
Confirmed thatCompleteWaitpointForm
andForceTimeout
are used downstream.
198-198
: Use of<SpanTitle>
is appropriate.
No issues identified in the component usage here.
234-294
: The new “detail” tab logic seems correct.
All status checks and property displays for the span appear to be handled well. No obvious issues.
313-313
: Verify correctness of admin-only access.
You’re checkingisAdmin
to conditionally display sensitive metadata. Confirm that this is aligned with your org policies, ensuring only the correct users see these fields.
324-324
: Different icon for cached runs is well-implemented.
No further suggestions.
329-332
: Conditionally appending "(cached)" is clear.
This improves clarity for users observing cached run identifiers.
502-517
: Idempotency fields appear well-handled.
No concerns regarding the logic to display and expire the key.
550-565
: Field additions for engine and master queues look good.
The conditional rendering for admins only is a good practice.
703-704
: Run error display logic is correct.
Conditionally rendering the error component is appropriate.
709-711
: Properly showing output only when no error is present.
This clarifies the run’s result state.
720-725
: Dynamic link for focusing or jumping to the run is correct.
No issues noted.
730-730
: Label change for cached vs. non-cached is straightforward.
Good user experience improvement.apps/webapp/app/v3/runEngineHandlers.server.ts (3)
60-71
: Ensure queued alerts are idempotent.In the
runFailed
handler's alert logic, multiple identical calls in a short time window could lead to duplicate alerts if not guarded by a check or unique constraint. Confirm that repeated calls won't create redundant or conflicting alerts.
172-206
: Evaluate concurrency for partial run failures.The
runAttemptFailed
handler marks in-progress events as crashed. If multiple parallel failures occur, overlapping attempts might concurrently crash the same events, leading to potential race conditions. Confirm that the underlying repository methods are concurrency-safe or implement idempotent updates to avoid inconsistent states.
299-325
: Prevent double cancellation.The
runCancelled
handler loops over in-progress events and cancels them. If the same run is canceled multiple times in quick succession, calls toeventRepository.cancelEvent
could conflict. Consider concurrency checks (e.g., version numbers or optimistic locking) to avoid double cancellation side effects.apps/webapp/app/v3/taskEventStore.server.ts (2)
23-23
: Confirm usage of theisDebug
field.Introducing the
isDebug
attribute in theTraceEvent
type can help classify debug logs versus production logs. Ensure that any related filtering or access controls are adjusted so that only authorized contexts retrieve debug data where necessary.
126-142
: Validate time window alignment when partitioning is enabled.While selecting partitioned events, the code adds a time buffer to handle partition window offsets. Confirm that the additional window logic aligns with boundary conditions (e.g., crossing the partition boundary at exact times) to avoid missing or duplicating trace events.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (3)
66-74
: Check for concurrency or partial updates when retrieving run detail.You call
#getRun
with a final event store and timestamps. If another process updates the run concurrently (e.g., changing completion times), ensure that the final data reflects these changes or that you handle them gracefully.
115-121
: Consider referencing concurrency checks for getSpan.Similar to other event-based functions, concurrent updates might cause stale data. Confirm the
eventRepository.getSpan
call supports concurrency-safe queries or use version checks to ensure data consistency.
130-156
: Validate that the run model is consistent with the new engine fields.The newly added fields
engine
,masterQueue
,secondaryMasterQueue
,error
, andoutput
are fetched here. Confirm these fields are set consistently across the codebase, and that partial updates or missing fields won't cause undefined behaviors in views or downstream logic.apps/webapp/app/v3/eventRepository.server.ts (8)
13-13
: Type refinement from AuthenticatedEnvironment to TaskEventEnvironment.The change from
AuthenticatedEnvironment
toTaskEventEnvironment
provides a more specific and appropriate type for event handling. This is a good refinement to improve type safety and better represent the environment context needed for task events.Also applies to: 88-88
62-62
: Well-implemented debug flag propagation through type system.The
isDebug
property has been consistently added to all relevant type definitions, ensuring proper type checking throughout the event flow. This enables better debugging capabilities while maintaining type safety.Also applies to: 129-129, 174-174, 206-206
289-314
: Good use of TypeScript'ssatisfies
operator.Using the
satisfies
operator ensures thecompletedEvent
object conforms to theCreatableEvent
type while preserving the specific literal types of the object properties, which is more type-safe than a simple cast.
665-677
: Clean implementation of entity metadata extraction.The code properly extracts entity type and ID from the event properties, which enhances the event data model. This maintains good separation of concerns by isolating the extraction logic.
835-848
: Enhanced recordEvent method with improved flexibility.The method signature now includes optional
duration
andparentId
parameters, providing more flexibility in event recording. The implementation now properly calculates duration when an endTime is provided.
868-872
: Consistent debug flag handling for UI presentation.The code now properly updates the icon based on the debug status, which helps distinguish debug events visually in the UI. This is a good UX improvement.
886-891
: Proper propagation of debug state to event level.Debug events are now correctly assigned "WARN" level instead of "TRACE", making them more visible in logging systems. This is a good practice for ensuring debug information stands out appropriately.
1713-1771
: Well-structured error handling in recordRunEvent.The function has a clear return type that explicitly describes success and failure cases, with proper error codes and error information. The try/catch block with appropriate logging enhances the reliability of the event recording system.
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
531-537
:❓ Verification inconclusive
Handle potential NULL run case.
If
processBatchTaskRunItem
returnsundefined
, the code throws a generic error but continues to try processing subsequent items. This could lead to inconsistent batch states.Consider handling this more robustly by either:
- Using an early return pattern after logging the error
- Implementing a retry mechanism for the specific failed item
- Ensuring proper transaction rollback to maintain data consistency
🏁 Script executed:
#!/bin/bash
# Check if there are any instances where a batch run contains null/undefined run IDs
rg -A 10 $'runIds: \[' apps/webapp/app

Length of output: 1944
Review Action Required: Enhance Handling of Undefined Task Runs
The current implementation logs an error and throws a generic exception when a task run returns undefined. This approach can lead to continued processing of subsequent items and potential batch state inconsistencies. Based on the grep output (which shows initialized empty run arrays), it looks like there’s attention given to run identifiers, but explicit handling for undefined task runs remains absent.
Please consider one of the following improvements:
- Early Exit: After logging, return immediately to halt further processing.
- Retry Mechanism: Implement retry logic for the failed item.
- Transaction Rollback: Ensure a proper rollback of the batch process if an item fails.
Review these changes to ensure they align with your transaction and error-handling strategies.
apps/webapp/app/v3/services/triggerTaskV2.server.ts (1)
43-45
: Check default attempts and over-retries. There's an `attempt` parameter that defaults to 0. Combined with `MAX_ATTEMPTS`, the code raises an error after `attempt > MAX_ATTEMPTS`. Carefully verify that you don't introduce an endless loop if an unexpected race condition resets it.
I'd recommend adding a safety mechanism like a retry counter to ensure we don't enter an infinite loop due to race conditions. For instance, you might want to introduce logic to detect cyclical retry patterns.
//e.g. a bug
switch (error.code) {
  case "RECURSIVE_WAIT_DEADLOCK":
    return "COMPLETED_WITH_ERRORS";
  case "TASK_RUN_CANCELLED":
    return "CANCELED";
  case "MAX_DURATION_EXCEEDED":
    return "TIMED_OUT";
  case "TASK_PROCESS_OOM_KILLED":
  case "TASK_PROCESS_MAYBE_OOM_KILLED":
  case "TASK_PROCESS_SIGSEGV":
  case "DISK_SPACE_EXCEEDED":
  case "OUTDATED_SDK_VERSION":
  case "HANDLE_ERROR_ERROR":
  case "TASK_RUN_CRASHED":
  case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
    return "CRASHED";
  case "COULD_NOT_FIND_EXECUTOR":
  case "COULD_NOT_FIND_TASK":
  case "COULD_NOT_IMPORT_TASK":
  case "CONFIGURED_INCORRECTLY":
  case "TASK_ALREADY_RUNNING":
  case "TASK_PROCESS_SIGKILL_TIMEOUT":
  case "TASK_RUN_HEARTBEAT_TIMEOUT":
  case "TASK_DEQUEUED_INVALID_STATE":
  case "TASK_DEQUEUED_QUEUE_NOT_FOUND":
  case "TASK_RUN_DEQUEUED_MAX_RETRIES":
  case "TASK_RUN_STALLED_EXECUTING":
  case "TASK_RUN_STALLED_EXECUTING_WITH_WAITPOINTS":
  case "TASK_HAS_N0_EXECUTION_SNAPSHOT":
  case "GRACEFUL_EXIT_TIMEOUT":
  case "TASK_INPUT_ERROR":
  case "TASK_OUTPUT_ERROR":
  case "POD_EVICTED":
  case "POD_UNKNOWN_ERROR":
  case "TASK_EXECUTION_ABORTED":
  case "TASK_EXECUTION_FAILED":
  case "TASK_PROCESS_SIGTERM":
  case "TASK_DEQUEUED_INVALID_RETRY_CONFIG":
  case "TASK_DEQUEUED_NO_RETRY_CONFIG":
  case "TASK_DID_CONCURRENT_WAIT":
    return "SYSTEM_FAILURE";
💡 Verification agent
🧩 Analysis chain
Consider adding unit tests
This function handles critical error mapping logic that affects how task runs are reported. To ensure reliability, consider adding comprehensive unit tests that cover all error code mappings, especially for edge cases.
🏁 Script executed:
#!/bin/bash
# Search for existing tests for this function
echo "Searching for tests for runStatusFromError:"
rg "runStatusFromError" --type test
rg "runStatusFromError" --type spec
# Look for test directory structure
echo -e "\nLooking for test files in the run-engine package:"
find internal-packages/run-engine -name "*.test.ts" -o -name "*.spec.ts" | grep -i error
Length of output: 552
Action Required: Add Comprehensive Unit Tests for Error Mapping Logic
The function in `internal-packages/run-engine/src/engine/errors.ts` (lines 13–54) that maps error codes to statuses currently appears to lack targeted unit tests. Our initial search for tests referencing `runStatusFromError` did not yield any results. Given the critical nature of this error-mapping functionality—which directly impacts task-run reporting—we strongly recommend adding comprehensive unit tests that cover all error code cases as well as potential edge cases.
- Location: `internal-packages/run-engine/src/engine/errors.ts` (inside the error mapping switch statement)
- Recommendation: Implement unit tests that validate each case, ensuring that unexpected or edge-case error codes are handled appropriately.
// FIXME: Required for unmanaged workers
readonly isLatestDeployment = true;
|
💡 Verification agent
🧩 Analysis chain
Address the FIXME comment for `isLatestDeployment`.
There's a FIXME comment indicating that the `isLatestDeployment` property is hardcoded to `true`. This could lead to incorrect behavior if not properly addressed.
Consider implementing a proper check to determine if a deployment is the latest version:
🏁 Script executed:
#!/bin/bash
# Description: Search for any existing logic in the codebase that determines if a deployment is the latest
# Check for logic that determines latest deployment
echo "Searching for code that determines latest deployment..."
rg -A 10 -B 10 "latest.*deployment|deployment.*latest" --type ts
Length of output: 5128
Action Required: Replace the hardcoded `true` with a dynamic check for the latest deployment
The search results reveal that our codebase already implements logic to verify the latest deployment (e.g., in `apps/webapp/app/v3/services/retryDeploymentIndexing.server.ts` and within the API client in `packages/cli-v3/src/apiClient.ts`). Instead of forcing `isLatestDeployment` to always be `true`, it would be more robust to align its behavior with the existing implementations that compare deployment IDs or fetch the latest deployment status.
- Change Required: Refactor the `isLatestDeployment` property in `apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts` (lines 505-507) to dynamically assess whether the deployment is indeed the latest.
- Suggestion: Consider reusing or adapting the logic from the retry deployment indexing service or the API client methods that fetch the latest deployment details. If unmanaged workers can't leverage the full dynamic check, update the FIXME comment with an explanation and a plan to address this gap later.
const batch = await this._prisma.batchTaskRun.create({
  data: {
    id: BatchId.fromFriendlyId(batchId),
    friendlyId: batchId,
    runtimeEnvironmentId: environment.id,
    runCount: body.items.length,
    runIds: [],
    payload: payloadPacket.data,
    payloadType: payloadPacket.dataType,
    options,
    batchVersion: "runengine:v1",
    oneTimeUseToken: options.oneTimeUseToken,
  },
});

if (body.parentRunId && body.resumeParentOnCompletion) {
  await this._engine.blockRunWithCreatedBatch({
    runId: RunId.fromFriendlyId(body.parentRunId),
    batchId: batch.id,
    environmentId: environment.id,
    projectId: environment.projectId,
    organizationId: environment.organizationId,
  });
}
🛠️ Refactor suggestion
Refactor duplicated batch creation logic.
There's significant code duplication between the inline and async batch creation paths. Consider extracting the common batch creation logic into a private method.
+async #createBatchTaskRun(
+  batchId: string,
+  runCount: number,
+  payloadPacket: IOPacket,
+  environment: AuthenticatedEnvironment,
+  oneTimeUseToken?: string,
+  options?: any
+) {
+  return await this._prisma.batchTaskRun.create({
+    data: {
+      id: BatchId.fromFriendlyId(batchId),
+      friendlyId: batchId,
+      runtimeEnvironmentId: environment.id,
+      runCount,
+      runIds: [],
+      payload: payloadPacket.data,
+      payloadType: payloadPacket.dataType,
+      options,
+      batchVersion: "runengine:v1",
+      oneTimeUseToken,
+    },
+  });
+}
Then call this method from both places.
Also applies to: 245-269
start: result.workingIndex,
// This will be the remainder of the slice
// for example if the original range was 0-50 and the workingIndex is 25, the new range will be 25-25
// if the original range was 51-100 and the workingIndex is 75, the new range will be 75-25
count: options.range.count - result.workingIndex - options.range.start,
},
Fix incorrect calculation of remaining range count.
The current calculation for determining the new count in parallel processing could lead to incorrect values. Only the number of items already consumed within the slice (`workingIndex - start`) should be subtracted from the original count.
-count: options.range.count - result.workingIndex - options.range.start,
+count: options.range.count - (result.workingIndex - options.range.start),
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
start: result.workingIndex,
// This will be the remainder of the slice
// for example if the original range was 0-50 and the workingIndex is 25, the new range will be 25-25
// if the original range was 51-100 and the workingIndex is 75, the new range will be 75-25
count: options.range.count - (result.workingIndex - options.range.start),
},
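To sanity-check the corrected arithmetic, here is a small standalone sketch; the helper name and shape are illustrative only, not the actual service code:

```typescript
// Illustrative helper: given the original slice (start, count) and the index
// where processing stopped, compute the remaining slice. Only the items
// already consumed within this slice (workingIndex - start) are subtracted.
function remainingRange(start: number, count: number, workingIndex: number) {
  return {
    start: workingIndex,
    count: count - (workingIndex - start),
  };
}

// Range 0-50 stopped at 25 → 25 items left starting at 25.
console.log(remainingRange(0, 50, 25)); // { start: 25, count: 25 }
// Range 50-100 stopped at 75 → 25 items left starting at 75.
console.log(remainingRange(50, 50, 75)); // { start: 75, count: 25 }
// The old formula (count - workingIndex - start) would give -75 in the second case.
```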
dependentBatchRun &&
dependentBatchRun.dependentTaskAttempt &&
(isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
  isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
Use optional chaining for safer property access.
The explicit `&&` chain already guards against `dependentBatchRun.dependentTaskAttempt` being undefined, but optional chaining expresses the same guard more concisely (this is what the Biome `useOptionalChain` lint flags).
if (
dependentBatchRun &&
- dependentBatchRun.dependentTaskAttempt &&
- (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
- isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
+ dependentBatchRun.dependentTaskAttempt?.status &&
+ (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
+ isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
📝 Committable suggestion
if (
  dependentBatchRun &&
  dependentBatchRun.dependentTaskAttempt?.status &&
  (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
    isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
🧰 Tools
🪛 Biome (1.9.4)
[error] 244-245: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
if (!workerGroup) {
  logger.error("Default worker group not found", {
    projectId: environment.projectId,
  });

  return;
}
Handle missing worker group more gracefully.
If no default worker group is found, the function logs an error and returns undefined, which could lead to a null reference exception when the caller attempts to use the return value.
if (!workerGroup) {
logger.error("Default worker group not found", {
projectId: environment.projectId,
});
- return;
+ throw new ServiceValidationError(
+ `Cannot trigger ${taskId}: no default worker group found for project ${environment.projectId}`
+ );
}
📝 Committable suggestion
if (!workerGroup) {
  logger.error("Default worker group not found", {
    projectId: environment.projectId,
  });

  throw new ServiceValidationError(
    `Cannot trigger ${taskId}: no default worker group found for project ${environment.projectId}`
  );
}
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal-packages/database/prisma/schema.prisma (2)
1755-1757: Priority Management in TaskRun
The new `priorityMs` field in `TaskRun` enables adjusting the effective queue timestamp by applying a negative offset. This is a useful mechanism for prioritizing certain runs. It might be beneficial to document the expected range or behavior of this value so that future maintainers understand its impact.
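As a concrete illustration of the negative-offset semantics, a minimal sketch (the function name is hypothetical; the engine's actual ordering logic may differ):

```typescript
// Hypothetical sketch of how a priority offset could shift queue ordering:
// a run with priorityMs = 60_000 is ordered as if enqueued one minute earlier.
function effectiveQueueTimestamp(queueTimestamp: Date, priorityMs: number): Date {
  return new Date(queueTimestamp.getTime() - priorityMs);
}

const enqueuedAt = new Date("2025-01-01T00:10:00.000Z");
console.log(effectiveQueueTimestamp(enqueuedAt, 60_000).toISOString());
// → 2025-01-01T00:09:00.000Z
```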
2538-2554: BatchTaskRun Model Enhancements for Engine v2
The `BatchTaskRun` model has been significantly extended:
- New fields for friendly IDs (`runIds`), `runCount`, `payload`, `payloadType`, `options`, and `batchVersion` have been added.
- For engine v2, arrays to capture `executionSnapshots`, `runsBlocked`, and `waitpoints` are included.
This expanded model should provide the necessary data to manage batch processing more effectively. Please ensure that the relationships and unique constraints between batch items, snapshots, and waitpoints are well tested and that migration scripts are provided if needed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- internal-packages/database/prisma/migrations/20250304184614_remove_task_run_first_attempt_started_at_column/migration.sql (1 hunks)
- internal-packages/database/prisma/schema.prisma (18 hunks)
✅ Files skipped from review due to trivial changes (1)
- internal-packages/database/prisma/migrations/20250304184614_remove_task_run_first_attempt_started_at_column/migration.sql
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (15)
internal-packages/database/prisma/schema.prisma (15)
151-152: Addition of Worker Fields to Organization
The new fields `workerGroups` and `workerInstances` have been added to the `Organization` model to associate worker groups and instances. This design seems appropriate for grouping and managing these entities.
454-456: Introduction of Engine Field in Project Model
The `Project` model now includes a new field `engine` of type `RunEngineVersion` with a default of `V1` alongside the `version` field set to `V2`. Given that this PR is for Run Engine 2.0 (alpha), please verify that the default engine version is intentional. Consider whether the alpha testing should instead default to `V2` or if explicit switching is desired.
459-464: Enhancements to Project Worker Associations
Additional arrays for `workerGroups` and `workers`, as well as the optional relation `defaultWorkerGroup` (with its ID field), have been added to the `Project` model. These changes improve the linkage between projects and worker group entities. Ensure that these relations are in sync with your application logic and that any index or constraint considerations for optimizing query performance are addressed.
424-427: Expanded RuntimeEnvironment Relationships
The `RuntimeEnvironment` model now includes arrays for `taskRunCheckpoints`, `waitpoints`, `workerInstances`, and `executionSnapshots`. This extension should help in tracking task execution progress and system state at the environment level. Please verify that the engine logic which updates these fields maintains consistency, and consider adding documentation on how they interact.
1578-1579: BackgroundWorker Engine Version Field Added
The `BackgroundWorker` model now declares an `engine` field of type `RunEngineVersion` with a default value of `V1`. This change sets the groundwork for differentiating between run engine versions at the worker level. Confirm that this default is aligned with your intended rollout strategy for Engine 2.0 alpha.
1603-1605: Linking BackgroundWorkers to Worker Groups
A new relation field `workerGroup` (and its corresponding `workerGroupId`) has been added to the `BackgroundWorker` model. This improves the grouping and management of background workers. Ensure that cascading and nullability behavior (i.e. using `onDelete: SetNull`) is consistent with your overall data integrity requirements.
1687-1687: TaskRun Engine Field Introduction
The `TaskRun` model now gets an `engine` field (of type `RunEngineVersion`) with a default of `V1`. This is an important addition for enabling engine version-based behavior throughout task runs. Please verify that related business logic correctly discriminates between V1 and V2 as needed.
1714-1719: New Queue and Attempt Fields in TaskRun
The additions of `masterQueue` (with default `"main"`), an optional `secondaryMasterQueue`, and `attemptNumber` fields to the `TaskRun` model provide enhanced control over task queuing and execution tracking for engine v2+. These fields will be key for managing the run order and retries. Ensure that their usage is thoroughly documented and tested in the engine workflow.
1770-1774: Waitpoint Integration in TaskRun
By introducing the `associatedWaitpoint` (as an optional relation) and `blockedByWaitpoints` (as an array of `TaskRunWaitpoint`), the model now embeds waitpoint logic directly into the task run. This should help in preventing execution when there are pending waitpoints. Please make sure that transactional updates properly reflect the completion state of these waitpoints across the run lifecycle.
1784-1784: Execution Snapshots Captured in TaskRun
The addition of the `executionSnapshots` array in `TaskRun` further aligns with the new tracking mechanisms introduced in the run engine. Confirm that data consistency is maintained and that snapshots are not duplicated or left stale between `TaskRun` and `TaskRunExecutionSnapshot`.
1927-1931: New Enum: RunEngineVersion
The new `RunEngineVersion` enum distinguishes between V1 and V2 versions of the run engine. This is fundamental to supporting multiple engine implementations simultaneously. Ensure that all engine-dependent logic across your codebase correctly references this enum, and update documentation accordingly.
1938-1944: TaskRunExecutionSnapshot Enhancements
In the `TaskRunExecutionSnapshot` model, the default engine is now set to `V2`, which indicates that snapshots are intended for runs using the new engine. The inline comments clarifying that V1 did not use snapshots are very helpful. Please verify that any historical data or migration scripts are adjusted accordingly.
2727-2731: New WorkerDeploymentType Enum Values
The enum `WorkerDeploymentType` now includes a `V1` member alongside `MANAGED` and `UNMANAGED`. This change appears designed to provide backwards compatibility for deployment records. Ensure that your migration strategy covers upgrading legacy deployments if needed, and consider whether a similar option for V2 might be required in future iterations.
2746-2747: WorkerDeployment Type Field Update
The `WorkerDeployment` model now explicitly includes a `type` field of type `WorkerDeploymentType` with a default of `V1`. This clarifies the deployment model and ties it to the appropriate engine version. Double-check that your deployment orchestration logic respects this field when triggering builds and rollouts.
2070-2105: ❓ Verification inconclusive
Comprehensive Updates to the Waitpoint Model
Several new fields have been introduced:
- `inactiveIdempotencyKey` for storing expired keys,
- `completedByTaskRunId` with its relation to mark the run that completed the waitpoint,
- `completedAfter` for DATETIME/MANUAL waitpoints,
- `completedByBatchId` for batch associations,
- `blockingTaskRuns` to list task runs blocked by this waitpoint, and
- `completedExecutionSnapshots`, along with output fields for storing waitpoint results.
These additions significantly enhance waitpoint tracking and actionability. Please verify that the unique constraint `@@unique([environmentId, idempotencyKey])` and other indexes do not introduce conflicts when multiple waitpoints are created. Also, ensure that downstream logic properly deactivates and re-generates idempotency keys as documented in the comments.
🏁 Script executed:
#!/bin/bash
# Search across the repository for occurrences of "idempotencyKey"
rg "idempotencyKey" .
# Search for occurrences of "inactiveIdempotencyKey" to compare references
rg "inactiveIdempotencyKey" .
Length of output: 65749
Ensure Downstream Idempotency Key Logic Handles New Constraints Correctly
The new fields in the schema, including `inactiveIdempotencyKey`, `completedByTaskRunId`, `completedAfter`, `completedByBatchId`, `blockingTaskRuns`, and `completedExecutionSnapshots`, integrate well with the updated waitpoint model. The unique constraint (`@@unique([environmentId, idempotencyKey])`) aligns with our migration files (e.g. `Waitpoint_environmentId_idempotencyKey_key`), and similar logic for deactivating and regenerating idempotency keys is evident in the engine (see `internal-packages/run-engine/src/engine/index.ts`, where `inactiveIdempotencyKey` is set and `idempotencyKey` regenerated).
Next Steps:
- Verify that when a waitpoint is completed, the active `idempotencyKey` is properly deactivated (saved in `inactiveIdempotencyKey`) and regenerated to avoid conflicts, especially under concurrent waitpoint creations.
- Check that downstream processes (e.g., task triggers and waitpoint handling) consistently use the new fields and adhere to the unique constraint.
- Ensure that any tests covering waitpoint key regeneration and conflict resolution are updated and passing.
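The deactivate-and-regenerate flow in the first bullet can be sketched as follows; the `WaitpointKeys` shape and UUID-based replacement key are assumptions for illustration, while the real logic lives in `internal-packages/run-engine/src/engine/index.ts`:

```typescript
import { randomUUID } from "node:crypto";

// Minimal illustrative shape; the real Prisma model has many more fields.
type WaitpointKeys = {
  idempotencyKey: string;
  inactiveIdempotencyKey?: string;
};

// On completion, preserve the old key for reference and mint a fresh one so
// the @@unique([environmentId, idempotencyKey]) constraint cannot collide
// with a later waitpoint reusing the same user-supplied key.
function rotateIdempotencyKey(wp: WaitpointKeys): WaitpointKeys {
  return {
    inactiveIdempotencyKey: wp.idempotencyKey,
    idempotencyKey: randomUUID(),
  };
}

const rotated = rotateIdempotencyKey({ idempotencyKey: "user-key-123" });
console.log(rotated.inactiveIdempotencyKey); // → user-key-123
```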
We're going to merge this to make working on it easier, but this does not mean that Run Engine 2.0 is available (yet). We're in the last stages before it will go into cloud for beta (and then we'll get the self-hosted version released).
Good luck @coderabbitai
Summary by CodeRabbit
New Features
Improvements
Chores