Run Engine 2.0 (alpha) #1575
…TimeUseToken, maxDurationInSeconds
Actionable comments posted: 1
♻️ Duplicate comments (1)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)
437-501: ⚠️ Potential issue: Fix switch statement variable scoping issues.
The switch statement has potential variable scoping issues with declarations that can leak between cases.
Wrap each case in block scope to prevent variable declaration leakage:
```diff
 switch (span.entity.type) {
   case "waitpoint": {
+    // Wrap in block to prevent variable declaration leakage
     const waitpoint = await this._replica.waitpoint.findFirst({
       where: {
         friendlyId: span.entity.id,
       },
       // ...
     });

     if (!waitpoint) {
       logger.error(`SpanPresenter: Waitpoint not found`, {
         spanId,
         waitpointFriendlyId: span.entity.id,
       });
       return { ...data, entity: null };
     }

     const output =
       waitpoint.outputType === "application/store"
         ? `/resources/packets/${environmentId}/${waitpoint.output}`
         : typeof waitpoint.output !== "undefined" && waitpoint.output !== null
         ? await prettyPrintPacket(waitpoint.output, waitpoint.outputType ?? undefined)
         : undefined;

     let isTimeout = false;
     if (waitpoint.outputIsError && output) {
       if (isWaitpointOutputTimeout(output)) {
         isTimeout = true;
       }
     }

     return {
       ...data,
       entity: {
         type: "waitpoint" as const,
         object: {
           // ...
         },
       },
     };
+  }
   default:
+    // Validate entity type
+    if (!["waitpoint", "span"].includes(span.entity.type)) {
+      logger.warn(`Unknown entity type: ${span.entity.type}`, { spanId });
+    }
     return { ...data, entity: null };
 }
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 439-455: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 465-470: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
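The rule is easy to reproduce outside the PR. A minimal sketch (hypothetical function, not taken from SpanPresenter) of the scoping hazard Biome flags, and the block-scoped fix it suggests:

```typescript
// Without the braces around each case body, both `const label`
// declarations below would share the switch's single scope and the
// compiler would reject the redeclaration; with braces, each case
// owns its own variables.
function describeEntity(kind: "waitpoint" | "other"): string {
  switch (kind) {
    case "waitpoint": {
      const label = "waitpoint"; // visible only inside this case
      return label;
    }
    default: {
      const label = "unknown entity"; // no conflict with the one above
      return label;
    }
  }
}

console.log(describeEntity("waitpoint")); // waitpoint
console.log(describeEntity("other")); // unknown entity
```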
🧹 Nitpick comments (14)
CONTRIBUTING.md (3)
233-236: Migration Creation Instruction Clarity and Code Block Language Specifier
The updated instruction "Create a migration" along with the command `pnpm run db:migrate:dev:create` clearly separates migration creation from its execution. This change improves clarity by prompting users to review the generated migration before applying it. Additionally, it would be beneficial to add a language specifier (e.g., `sh`) to the fenced code block to enhance readability and conform with markdownlint guidelines.

````diff
- ```
- pnpm run db:migrate:dev:create
- ```
+ ```sh
+ pnpm run db:migrate:dev:create
+ ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
235-235: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
239-240: Improve Indexing Guidance Wording
The note advising the use of `CONCURRENTLY` for adding indexes is crucial to prevent unintended table locks. However, the sentence could be rephrased for better clarity and to address the static analysis feedback regarding punctuation. For example:

```diff
- This creates a migration file. Check the migration file does only what you want. If you're adding any database indexes they must use `CONCURRENTLY`, otherwise they'll lock the table when executed.
+ This creates a migration file. Please review it to ensure it matches your requirements. If you're adding any database indexes, be sure to use `CONCURRENTLY` to avoid locking the table during execution.
```
🧰 Tools
🪛 LanguageTool
[typographical] ~239-~239: The word “otherwise” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...ding any database indexes they must use `CONCURRENTLY`, otherwise they'll lock the table when executed. ...
(THUS_SENTENCE)
241-247: Separation of Migration Execution Steps and Code Block Enhancements
The new step "Run the migration" with the subsequent commands (`pnpm run db:migrate:deploy` and `pnpm run generate`) enhances the instructional clarity by distinctly separating migration creation from its execution. To further improve the document, adding a language specifier to the fenced code block is recommended:

````diff
- ```
- pnpm run db:migrate:deploy
- pnpm run generate
- ```
+ ```sh
+ pnpm run db:migrate:deploy
+ pnpm run generate
+ ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
243-243: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (1)
1076-1284: Well-structured span entity rendering.
The new SpanEntity component cleanly handles different entity types, particularly for waitpoints, with appropriate status displays and controls.
There's an unnecessary Fragment at lines 1235-1252 that could be removed since it only contains one child element. You can simplify this by removing the fragment:

```diff
-{span.entity.object.type === "MANUAL" && (
-  <>
-    <Property.Item>
-      <Property.Label>Timeout at</Property.Label>
-      <Property.Value>
-        <div className="flex w-full flex-wrap items-center justify-between gap-1">
-          {span.entity.object.completedAfter ? (
-            <DateTimeAccurate date={span.entity.object.completedAfter} />
-          ) : (
-            "–"
-          )}
-          {span.entity.object.status === "PENDING" && (
-            <ForceTimeout waitpoint={span.entity.object} />
-          )}
-        </div>
-      </Property.Value>
-    </Property.Item>
-  </>
-)}
+{span.entity.object.type === "MANUAL" && (
+  <Property.Item>
+    <Property.Label>Timeout at</Property.Label>
+    <Property.Value>
+      <div className="flex w-full flex-wrap items-center justify-between gap-1">
+        {span.entity.object.completedAfter ? (
+          <DateTimeAccurate date={span.entity.object.completedAfter} />
+        ) : (
+          "–"
+        )}
+        {span.entity.object.status === "PENDING" && (
+          <ForceTimeout waitpoint={span.entity.object} />
+        )}
+      </div>
+    </Property.Value>
+  </Property.Item>
+)}
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 1235-1252: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
internal-packages/database/prisma/schema.prisma (5)
454-464: New Project Fields for Versioning and Worker Group Relationships
The `Project` model now includes several new fields:
- `version` with a default of `V2`
- `engine` with a default of `V1`
- Associations with `workerGroups`, `workers`, and the optional `defaultWorkerGroup` (plus its id field)

These additions extend the model to support Run Engine versioning and attach worker group information. Please verify that the relationships and default values (especially for run engine versioning) align with downstream logic and that any queries or mutations have been updated accordingly.
1714-1719: Enhanced Queue Fields in TaskRun
A new block has been added in `TaskRun`:
- `masterQueue` (defaulting to `"main"`)
- An optional `secondaryMasterQueue`
- `attemptNumber` (to be defined after a run is dequeued for engine v2+)

These additions will support finer control of queue positioning and tracking of attempts in Run Engine 2.0. Please ensure that the computation and downstream processing of these fields are consistent and that any related indexes or query logic are updated for optimal performance.
1770-1775: Waitpoint Relationships in TaskRun
Two new relationships have been introduced on `TaskRun`:
- `associatedWaitpoint`, which will be marked when a run finishes
- `blockedByWaitpoints`, which holds any waitpoints preventing the run's execution

These changes integrate task run execution with waitpoint handling. It is important to review how these relationships are used in job scheduling; ensure that they do not introduce circular dependencies or unwanted cascade effects.
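As a conceptual sketch of the blocking relationship described here (pure illustration with hypothetical types, not the engine's actual scheduling code): a run may proceed only once every waitpoint blocking it has completed.

```typescript
type Waitpoint = { id: string; status: "PENDING" | "COMPLETED" };

// A run is executable only when none of its blocking waitpoints
// are still pending.
function canExecute(blockedByWaitpoints: Waitpoint[]): boolean {
  return blockedByWaitpoints.every((w) => w.status === "COMPLETED");
}

console.log(canExecute([{ id: "wp_1", status: "COMPLETED" }])); // true
console.log(
  canExecute([
    { id: "wp_1", status: "COMPLETED" },
    { id: "wp_2", status: "PENDING" },
  ])
); // false
```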
1933-1988: TaskRunExecutionSnapshot Model for Run Engine v2
The `TaskRunExecutionSnapshot` model has been added to track execution states. Key points include:
- The `engine` field now defaults to `V2`, signifying that snapshots apply to the new engine.
- It captures execution status, debug description, validity indication, and relationships to `TaskRun`, batch, environment, and optionally a checkpoint and worker.

This model is central for tracking execution state changes; please ensure that maintenance (such as data retention policies) is addressed elsewhere in your system design.
2037-2099: Enhanced Waitpoint Model for Run Blocking and Idempotency
The updated `Waitpoint` model now includes several new fields:
- `idempotencyKey` with an expiration and a flag for user-provided input
- An `inactiveIdempotencyKey` field for when keys are rotated after completion
- Relations to mark completion by a task run or batch
- Indexes ensuring uniqueness on `[environmentId, idempotencyKey]`

These changes support robust waitpoint management. One note: the commented-out property (`deactivateIdempotencyKeyWhenCompleted`) is still present as a comment; if it's intended for future use, consider adding a "TODO" with an issue reference.
apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (1)
19-19: Unused request body parameter.
The `body` parameter in the callback is not referenced. If there is no plan to consume request data here, consider removing it to keep the code clean and avoid confusion.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (4)
45-57: Improve error handling in early return path.
The early return approach for a non-existent `parentRun` silently fails without providing any error detail. Since other code paths have explicit error throws, this inconsistency could make debugging more difficult. Consider replacing the silent return with an appropriate error:

```diff
 if (!parentRun) {
-  return;
+  throw new Error("Parent run not found");
 }
```
217-236: Simplify conditional logic with a cleaner approach.
The current approach creates redundant code between the `finishedData` variable and the existing `finishedAttempt`. The engine-based condition creates additional complexity. Consider consolidating the logic:

```diff
 const finishedData =
-  run.engine === "V2"
-    ? run
-    : isFinished
-    ? await this._replica.taskRunAttempt.findFirst({
-        select: {
-          output: true,
-          outputType: true,
-          error: true,
-        },
-        where: {
-          status: { in: FINAL_ATTEMPT_STATUSES },
-          taskRunId: run.id,
-        },
-        orderBy: {
-          createdAt: "desc",
-        },
-      })
-    : null;
+  run.engine === "V2" ? run : finishedAttempt;
```
457-463: Enhance error handling logic for better traceability.
The current error logging is good, but the return path doesn't adequately capture the error state for the UI layer. Consider enhancing the error return path to provide more context:

```diff
 if (!waitpoint) {
   logger.error(`SpanPresenter: Waitpoint not found`, {
     spanId,
     waitpointFriendlyId: span.entity.id,
   });
-  return { ...data, entity: null };
+  return {
+    ...data,
+    entity: {
+      type: "waitpoint" as const,
+      error: {
+        message: "Waitpoint not found",
+        code: "NOT_FOUND"
+      },
+      object: null
+    }
+  };
 }
```
472-477: Simplify timeout detection logic.
The current approach uses a variable assignment and conditional check that could be consolidated. Consider simplifying the timeout detection:

```diff
-let isTimeout = false;
-if (waitpoint.outputIsError && output) {
-  if (isWaitpointOutputTimeout(output)) {
-    isTimeout = true;
-  }
-}
+const isTimeout =
+  waitpoint.outputIsError && output ? isWaitpointOutputTimeout(output) : false;
```
🧰 Tools
🪛 Biome (1.9.4)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- CONTRIBUTING.md (1 hunks)
- apps/coordinator/src/checkpointer.ts (2 hunks)
- apps/docker-provider/src/index.ts (1 hunks)
- apps/webapp/app/assets/icons/AnimatedHourglassIcon.tsx (1 hunks)
- apps/webapp/app/presenters/v3/SpanPresenter.server.ts (13 hunks)
- apps/webapp/app/routes/api.v1.waitpoints.tokens.$waitpointFriendlyId.complete.ts (1 hunks)
- apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (11 hunks)
- internal-packages/database/package.json (1 hunks)
- internal-packages/database/prisma/schema.prisma (19 hunks)
- internal-packages/redis-worker/src/queue.ts (1 hunks)
- internal-packages/run-engine/src/engine/tests/waitpoints.test.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- apps/docker-provider/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- internal-packages/database/package.json
- apps/webapp/app/assets/icons/AnimatedHourglassIcon.tsx
- internal-packages/run-engine/src/engine/tests/waitpoints.test.ts
- apps/coordinator/src/checkpointer.ts
🧰 Additional context used
🪛 LanguageTool
CONTRIBUTING.md
[typographical] ~239-~239: The word “otherwise” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...ding any database indexes they must use CONCURRENTLY
, otherwise they'll lock the table when executed. ...
(THUS_SENTENCE)
🪛 markdownlint-cli2 (0.17.2)
CONTRIBUTING.md
235-235: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
243-243: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
🪛 Biome (1.9.4)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts
[error] 439-455: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 465-470: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
[error] 1235-1252: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
🔇 Additional comments (29)
internal-packages/redis-worker/src/queue.ts (1)
60-60: Increased Redis retry resilience for better reliability
The change from the default `maxRetriesPerRequest: 3` to `20` significantly improves the queue's resilience to transient Redis connection issues. This enhancement is valuable for the Run Engine 2.0 alpha, especially in distributed environments where network hiccups are common.
While this change improves reliability, it's worth verifying how it might affect failure detection and timeout behavior. Please confirm that increased retries won't mask persistent Redis issues or significantly delay final failure responses. You might want to consider adding metrics/monitoring for Redis connection quality to identify persistent issues that are being masked by retries.
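For reference, a minimal sketch of where this option lives, assuming the queue connects via ioredis (the option name is ioredis's; the host/port sourcing and actual connection setup in queue.ts are assumptions):

```typescript
import Redis from "ioredis";

// Illustrative connection options only, not the queue's real config.
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT ?? 6379),
  // A command is rejected after being retried this many times across
  // reconnection attempts. Raising it (here to 20, as the review notes)
  // tolerates longer transient outages, at the cost of slower failure
  // surfacing when Redis is persistently down.
  maxRetriesPerRequest: 20,
});
```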
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (10)
51-51: Access control implementation looks good.
Adding the `useHasAdminAccess` hook for permission-based rendering is a good security practice to ensure admin-only features are properly gated.

196-196: Code looks good.
Proper use of SpanTitle with appropriate props.

232-292: Improved structure with conditional rendering.
The refactoring of SpanBody to conditionally render either detailed properties or the new SpanEntity component improves code organization and separation of concerns.

311-311: Proper authorization check.
Good implementation of the admin access check that will be used for conditional rendering of admin-only information.

322-330: Enhanced UI for cached runs.
The improved RunIcon usage with conditional naming for cached runs provides better visual feedback to users. This makes the UI more informative by clearly distinguishing cached runs.

500-515: Good addition of idempotency information.
Adding idempotency key and expiration time improves observability, allowing users to track and debug idempotency-related issues more effectively.

548-551: Enhanced run information with engine version.
Adding the engine version display is useful for debugging and understanding which engine processed the run.

552-563: Admin-only information properly gated.
Good implementation of conditional rendering for admin-only properties (masterQueue and secondaryMasterQueue) based on user permissions.

701-709: Improved error and output handling.
The code now handles errors and outputs more cleanly with proper conditional rendering, ensuring outputs are only shown when there's no error and output data exists.

728-728: Better UX with context-aware button text.
Changing the button text to "Jump to original run" for cached runs improves clarity and user understanding.
internal-packages/database/prisma/schema.prisma (10)
151-152: Positive Naming for Worker Associations
The newly added fields `workerGroups` and `workerInstances` on the `Organization` model now use plural names, which addresses the previous comment ("Ideally this would be plural"). This naming clearly indicates that these fields hold arrays of associated entities.

1578-1579: BackgroundWorker Engine Version
The addition of the `engine` field (defaulting to `V1`) in the `BackgroundWorker` model explicitly captures the run engine version for background workers. This clear versioning will help differentiate behavior as features evolve.

1603-1605: Optional Worker Group Association in BackgroundWorker
The new optional relation fields `workerGroup` and `workerGroupId` in `BackgroundWorker` allow workers to be associated with a worker group. Be sure that the `onDelete: SetNull` behavior is correct for your use case and that any related business logic is updated to handle a missing worker group gracefully.

1741-1742: First Attempt Start Timestamp for Run Engine 2.0+
The new nullable field `firstAttemptStartedAt` in `TaskRun` provides an explicit marker for when the first attempt begins under Run Engine 2.0. This will be valuable for performance tracking and debugging. Consider whether indexing this field is required if you plan to filter or sort on it.

1755-1758: Priority Offset Field in TaskRun
The addition of the `priorityMs` field introduces a mechanism for adjusting the run's queue positioning via a negative time offset (in milliseconds). This design can offer flexible scheduling; however, please confirm that all components that compute or utilize the effective queue timestamp correctly accommodate both positive and negative values.
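To make the offset semantics concrete, here is a hypothetical sketch of how an effective queue timestamp could be derived. The function name and the sign convention (positive priority runs sooner) are illustrative assumptions, not the actual scheduler code:

```typescript
// A positive priorityMs moves the run earlier in the queue by
// subtracting from its enqueue timestamp; a negative value delays it.
function effectiveQueueTimestamp(queuedAtMs: number, priorityMs: number): number {
  return queuedAtMs - priorityMs;
}

const queuedAt = 1_000_000;
console.log(effectiveQueueTimestamp(queuedAt, 5_000)); // 995000 (sorts as if enqueued 5s earlier)
console.log(effectiveQueueTimestamp(queuedAt, -2_000)); // 1002000 (sorts as if enqueued 2s later)
```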
1868-1872: RunEngineVersion Enum Introduction
A new enum `RunEngineVersion` now distinguishes between legacy and new engine versions (V1 and V2). The inline comment ("The original version that uses marqs v1 and Graphile") is helpful. Consider documenting plans for future versions if applicable.

2010-2030: TaskRunCheckpoint Model Consistency
The `TaskRunCheckpoint` model now encapsulates details around checkpoints, including the friendly identifier, type, location, image reference, associated project, and runtime environment. It also relates to multiple execution snapshots. The structure is clear and appears consistent with other models.

2115-2145: TaskRunWaitpoint Model for Associating Runs with Waitpoints
The new `TaskRunWaitpoint` model bridges `TaskRun` and `Waitpoint`, adding support for batch-related fields (such as `batchId`, `batchIndex`) and a span identifier for cache completion. The unique constraint on `[taskRunId, waitpointId, batchIndex]` is critical to ensure each association is unique even in a batched context. Double-check that your SQL engine enforces this constraint as expected.

2720-2724: WorkerDeploymentType Enum Extension
The `WorkerDeploymentType` enum now includes three values: `MANAGED`, `UNMANAGED`, and `V1`. This explicit inclusion of `V1` clarifies which deployment type follows legacy behavior. Confirm that all logic conditioned on the deployment type is updated to handle the `V1` value properly.

2739-2740: Explicit Deployment Versioning in WorkerDeployment
A new field `type` of type `WorkerDeploymentType` has been added to the `WorkerDeployment` model, with a default value of `V1`. This change ensures that deployments are explicitly versioned and can be differentiated in the system. Review any deployment workflows to guarantee they respect this new field.
apps/webapp/app/routes/engine.v1.runs.$runFriendlyId.waitpoints.tokens.$waitpointFriendlyId.wait.ts (2)
16-16: Question the 10KB content length limit.
While 10KB may be sufficient in most scenarios, consider verifying with stakeholders whether 10KB adequately covers all potential request data for this endpoint. If larger payloads are expected, increasing this limit or making it configurable might help avoid unexpected 413 errors.

37-39: Consider concurrency scenarios for blocking runs.
If multiple clients or processes can request to block the same run with the same waitpoint simultaneously, ensure that `engine.blockRunWithWaitpoint` handles concurrency gracefully (e.g., verifying the run's current state and re-checking waitpoint availability).
apps/webapp/app/routes/api.v1.waitpoints.tokens.$waitpointFriendlyId.complete.ts (2)
22-22: Verify environment variable usage.
`env.TASK_PAYLOAD_MAXIMUM_SIZE` governs the `maxContentLength` for this route. Confirm that this environment variable is set appropriately across all deployments to avoid unexpected 413 errors or overly permissive sizes.

62-64: Enhance error handling specificity.
Consider differentiating between validation errors, database conflicts, or other potential error types, rather than catching everything in a single generic 500 handler.
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
285-303: Complete the dependency array for submitForm callback.
You reference both `submit` and `formAction` inside `submitForm`, but only `[currentJson]` is in the dependency array. Including all referenced variables (e.g., `submit` and `formAction`) in the dependency array would prevent stale references if those ever changed.

346-348: Fix invalid HTML attributes.
React doesn't recognize `min-height` or `max-height` as valid JSX props. Replace them with `minHeight` and `maxHeight` to avoid console warnings and ensure proper styling.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (2)
82-84: Good job updating method signatures to match implementation needs.
The updated method signatures for `getRun` and `getSpan` now properly reflect the parameters they need. This improves the API's clarity and maintainability.
Also applies to: 405-407
472-477: ✅ Verification successful
Add test coverage for the timeout detection functionality.
The timeout detection logic is important but doesn't appear to have associated test coverage.
🏁 Script executed:

```shell
#!/bin/bash
# Check for tests covering the waitpoint timeout detection

# First, let's find files that might contain tests for this functionality
fd "test|spec" --type f | grep -i "waitpoint\|timeout" | xargs grep -l "isWaitpointOutputTimeout" || echo "No direct test references found"

# Let's also check if the isWaitpointOutputTimeout function itself is tested
fd "test|spec" --type f | xargs grep -l "isWaitpointOutputTimeout" || echo "No tests for isWaitpointOutputTimeout found"
```

Length of output: 294
Test coverage for timeout detection already exists.
The timeout detection behavior, via the `isWaitpointOutputTimeout` function, is exercised in the test file at internal-packages/run-engine/src/engine/tests/waitpoints.test.ts. This confirms that the functionality is appropriately covered. If any integration-specific cases within `SpanPresenter.server.ts` remain untested, please consider adding additional tests; otherwise, no further action is necessary.
🧰 Tools
🪛 Biome (1.9.4)
[error] 472-472: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
```prisma
  /// Friendly IDs
  runIds       String[] @default([])
  runCount     Int      @default(0)
  payload      String?
  payloadType  String   @default("application/json")
  options      Json?
  batchVersion String   @default("v1")

  //engine v2
  /// Snapshots that reference this batch
  executionSnapshots TaskRunExecutionSnapshot[]
  /// Specific run blockers,
  runsBlocked        TaskRunWaitpoint[]
  /// Waitpoints that are blocked by this batch.
  /// When a Batch is created it blocks execution of the associated parent run (for andWait)
  waitpoints         Waitpoint[]
```
🛠️ Refactor suggestion
BatchTaskRun Model Updates for Engine v2 Batching
The `BatchTaskRun` model now has several new columns:
- Arrays for `runIds` and a counter `runCount` for friendly ID purposes
- Payload-related fields (`payload`, `payloadType`, `options`, `batchVersion`)
- Engine v2-specific fields: `executionSnapshots`, `runsBlocked`, and `waitpoints`

These additions aim to better manage batch processing in the new engine mode. Please review the integration points where batch runs are created and processed to ensure these fields are appropriately updated and queried.
Actionable comments posted: 1
♻️ Duplicate comments (2)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
314-314: 🛠️ Refactor suggestion
Complete the dependency array for submitForm.
The submitForm callback is missing dependencies in its dependency array.

```diff
 const submitForm = useCallback(
   (e: React.FormEvent<HTMLFormElement>) => {
     // ... function body ...
   },
-  [currentJson]
+  [currentJson, submit, formAction]
 );
```
358-359: 🛠️ Refactor suggestion
Fix invalid HTML attributes.
The JSONEditor component has invalid HTML attributes.

```diff
   height="100%"
-  min-height="100%"
-  max-height="100%"
+  minHeight="100%"
+  maxHeight="100%"
 />
```
🧹 Nitpick comments (2)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (2)
340-341: Fix typo in the tooltip content.
There's a small typo in the tooltip text.

```diff
- "This is will immediately complete this waitpoint with the payload you specify. This is useful during development for testing."
+ "This will immediately complete this waitpoint with the payload you specify. This is useful during development for testing."
```
345-347: Remove duplicate scrollable container.
There are two nested scrollable containers with identical styles, which is unnecessary and could cause scrolling issues.

```diff
-<div className="overflow-y-auto bg-charcoal-900 scrollbar-thin scrollbar-track-transparent scrollbar-thumb-charcoal-600">
 <div className="max-h-[70vh] min-h-40 overflow-y-auto bg-charcoal-900 scrollbar-thin scrollbar-track-transparent scrollbar-thumb-charcoal-600">
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.$projectParam.waitpoints.$waitpointFriendlyId.complete/route.tsx (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
```tsx
<div className="flex items-center gap-1">
  <AnimatedHourglassIcon
    className="text-dimmed-dimmed size-4"
    delay={(waitpoint.completedAfter.getMilliseconds() - Date.now()) / 1000}
```
Fix delay calculation in AnimatedHourglassIcon.
The delay calculation for the AnimatedHourglassIcon appears incorrect. It uses milliseconds from `completedAfter` instead of calculating the difference correctly.

```diff
- delay={(waitpoint.completedAfter.getMilliseconds() - Date.now()) / 1000}
+ delay={(waitpoint.completedAfter.getTime() - Date.now()) / 1000}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```tsx
delay={(waitpoint.completedAfter.getTime() - Date.now()) / 1000}
```
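The distinction driving this fix: `Date.prototype.getMilliseconds()` returns only the 0-999 fractional-second component of a date, while `getTime()` returns epoch milliseconds. A small self-contained illustration with hypothetical dates:

```typescript
const completedAfter = new Date("2025-01-01T00:00:05.250Z");
const nowMs = new Date("2025-01-01T00:00:00.000Z").getTime();

// The buggy call: just the milliseconds component of the date.
console.log(completedAfter.getMilliseconds()); // 250

// The fix: the epoch difference yields a real countdown in seconds.
const delaySeconds = (completedAfter.getTime() - nowMs) / 1000;
console.log(delaySeconds); // 5.25
```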
Actionable comments posted: 8
🧹 Nitpick comments (18)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (4)
37-39
: Improve error response details.The error response doesn't provide enough information about what went wrong. Consider adding more specific error details to help API consumers understand the issue.
if (!run) { - throw new Response("You don't have permissions for this run", { status: 401 }); + return new Response( + JSON.stringify({ error: "Run not found or you don't have access to it" }), + { status: 401, headers: { "Content-Type": "application/json" } } + ); }
56-63
: Add response headers and body for error cases.The response objects for error cases lack Content-Type headers and error details in the body, which could make debugging harder for API consumers.
switch (eventResult.code) { case "FAILED_TO_RECORD_EVENT": - return new Response(null, { status: 400 }); // send a 400 to prevent retries + return new Response( + JSON.stringify({ error: "Failed to record event" }), + { status: 400, headers: { "Content-Type": "application/json" } } + ); // send a 400 to prevent retries case "RUN_NOT_FOUND": - return new Response(null, { status: 404 }); + return new Response( + JSON.stringify({ error: "Run not found" }), + { status: 404, headers: { "Content-Type": "application/json" } } + ); default: return assertExhaustive(eventResult.code); }
64-70
: Enhance error handling with structured logging.
The current error logging might not properly serialize complex error objects. Consider using a more structured approach.
try {
  // ...existing code
} catch (error) {
  logger.error("Failed to record dev log", {
    environmentId: authentication.environment.id,
-    error,
+    error: error instanceof Error
+      ? { message: error.message, stack: error.stack }
+      : String(error),
  });
  throw error;
}
53-54
: Add Content-Type header to success response.
The success response lacks a Content-Type header, which is good practice even for empty responses.
if (eventResult.success) {
-  return new Response(null, { status: 204 });
+  return new Response(null, {
+    status: 204,
+    headers: { "Content-Type": "application/json" }
+  });
}
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.latest.ts (1)
1-28
: Consider enhancing the error handling with detailed logs.
The implementation for retrieving the latest snapshot is well-structured, but the error handling could be improved. When the snapshot retrieval fails, the code throws a generic error without logging details that would be helpful for debugging.
Consider adding error logging before throwing the error to make troubleshooting easier:
if (!executionData) {
+  logger.error("Failed to retrieve latest snapshot", { runFriendlyId });
  throw new Error("Failed to retrieve latest snapshot");
}
Also, consider adding a debug log at the beginning of the function to track when this API is being called, similar to other routes in this PR.
apps/webapp/app/routes/engine.v1.dev.config.ts (1)
18-23
: Add validation for environment variables
The code retrieves configuration values from environment variables but doesn't validate that they exist or have valid values.
Consider adding validation to ensure the environment variables are defined and have appropriate values:
try {
+  if (typeof env.DEV_DEQUEUE_INTERVAL_WITH_RUN !== 'number' ||
+      typeof env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN !== 'number') {
+    logger.warn("Missing or invalid dequeue interval environment variables", {
+      withRun: env.DEV_DEQUEUE_INTERVAL_WITH_RUN,
+      withoutRun: env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN,
+    });
+  }
  return json({
    environmentId: authentication.environment.id,
    dequeueIntervalWithRun: env.DEV_DEQUEUE_INTERVAL_WITH_RUN,
    dequeueIntervalWithoutRun: env.DEV_DEQUEUE_INTERVAL_WITHOUT_RUN,
  });
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (2)
34-36
: Improve the unauthorized error message.
The error message could be more specific about why permissions are insufficient. Consider providing more context about what permissions are required.
- throw new Response("You don't have permissions for this run", { status: 401 });
+ throw new Response("You don't have permissions to access this run in the current environment", { status: 401 });
47-53
: Enhance error handling for better debugging.
The current error handling catches, logs, and re-throws the error without adding context. Consider enriching the error with additional information before re-throwing.
} catch (error) {
  logger.error("Failed to get latest snapshot", {
    environmentId: authentication.environment.id,
    error,
  });
-  throw error;
+  throw error instanceof Error
+    ? new Error(`Failed to get latest snapshot: ${error.message}`)
+    : new Error("Failed to get latest snapshot");
}
apps/webapp/app/routes/engine.v1.worker-actions.deployments.$deploymentFriendlyId.dequeue.ts (1)
39-47
: Improve readability with if-else instead of ternary.
The ternary operator makes this code section harder to read. Consider using an if-else statement for better readability.
- const dequeuedMessages = (await isCurrentDeployment(deployment.id, deployment.environmentId))
-   ? await authenticatedWorker.dequeueFromEnvironment(
-       deployment.worker.id,
-       deployment.environmentId
-     )
-   : await authenticatedWorker.dequeueFromVersion(
-       deployment.worker.id,
-       searchParams.maxRunCount
-     );
+ let dequeuedMessages;
+ if (await isCurrentDeployment(deployment.id, deployment.environmentId)) {
+   dequeuedMessages = await authenticatedWorker.dequeueFromEnvironment(
+     deployment.worker.id,
+     deployment.environmentId
+   );
+ } else {
+   dequeuedMessages = await authenticatedWorker.dequeueFromVersion(
+     deployment.worker.id,
+     searchParams.maxRunCount
+   );
+ }
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (1)
15-37
: Consider logging success scenarios.
While successful recording returns a 204, adding a debug/info log upon success may help trace the lifecycle of debug logs and facilitate future troubleshooting.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1)
24-56
: Consider using 403 or 404 response code.
Currently returning 401 if the run is not found or not allowed. If authentication is valid but permissions are insufficient, using 403 typically better reflects the scenario. Alternatively, using 404 can help obscure resource existence.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1)
27-61
: Refine unauthorized response code.
Like in the heartbeat route, consider returning 403 or 404 instead of 401 if the user lacks permissions or the resource is not found, to differentiate between authentication vs. authorization issues.
apps/webapp/app/routes/engine.v1.dev.presence.ts (2)
43-45
: Address the in-code TODO comment
The code references a plan to set a string with expire on the same call. This can reduce complexity and avoid potential race conditions. Consider addressing this TODO or removing the comment if no longer necessary.
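For reference, ioredis can do the set-and-expire the TODO describes in a single atomic SET by passing the "EX" argument, avoiding the race between a separate SET and EXPIRE. A minimal sketch under stated assumptions: the touchPresence helper, the dev:presence:* key layout, and the TTL parameter are illustrative names, not code from this PR.

```typescript
// Minimal shape of the ioredis `set` call relied on here; a real `Redis`
// instance satisfies this interface. Passing "EX" makes SET apply the TTL
// atomically, replacing a separate SET + EXPIRE pair.
interface PresenceClient {
  set(key: string, value: string, mode: "EX", seconds: number): Promise<"OK" | null>;
}

async function touchPresence(
  redis: PresenceClient,
  environmentId: string,
  ttlInSeconds: number // illustrative TTL, not a value from the PR
): Promise<void> {
  const presenceKey = `dev:presence:${environmentId}`; // hypothetical key layout
  // One round trip: value and expiry are set together
  await redis.set(presenceKey, Date.now().toString(), "EX", ttlInSeconds);
}
```

Because the value and TTL land in one command, a crash between the two steps can no longer leave a presence key with no expiry.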
66-78
: Ensure proper cleanup for unexpected terminations
The cleanup handler is triggered during normal disconnections, but might not fire if the process abruptly terminates. Evaluate whether additional safeguards or keepalive checks are needed to handle abrupt termination scenarios.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
38-48
: Secure error handling when run is not found
Throwing a 401 is correct for unauthorized access, but consider returning a more descriptive message or distinct status code (e.g., 404 for not found if that is more aligned with domain logic). This helps clarify the reason for failures.
apps/webapp/app/routes/engine.v1.dev.dequeue.ts (3)
16-16
: Be cautious with a fixed max dequeue count
Hardcoding env.DEV_DEQUEUE_MAX_RUNS_PER_PULL could limit throughput unexpectedly for certain workloads. Provide a configuration surface or documented rationale for the chosen limit.
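If a configuration surface is wanted here, one option is to parse the per-pull limit from an environment variable with a validated fallback. A sketch; the maxRunsPerPull helper and the default of 10 are illustrative assumptions, not the PR's actual values.

```typescript
// Illustrative default, not the real project value.
const DEFAULT_MAX_RUNS_PER_PULL = 10;

// Parses a raw env-var string into a usable per-pull limit, falling back to
// the default for missing, non-numeric, zero, negative, or fractional input.
function maxRunsPerPull(raw: string | undefined): number {
  const parsed = raw === undefined ? NaN : Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    return DEFAULT_MAX_RUNS_PER_PULL;
  }
  return parsed;
}
```

This keeps the hardcoded number in one place as a documented fallback while letting operators tune throughput per deployment.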
30-62
: Add error handling for background worker dequeue operations
If any error happens while calling engine.dequeueFromBackgroundWorkerMasterQueue, the loop will break silently. Consider catching and logging errors more explicitly to ensure issues are visible.
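A sketch of the suggested error handling, with the loop body wrapped in try/catch so one failed pull is logged and the loop continues. The dequeueOnce and log callbacks stand in for the real engine call and logger, which this sketch does not assume.

```typescript
// Keeps a dequeue loop alive when a single pull fails: the error is surfaced
// through the log callback instead of silently terminating the loop.
async function dequeueLoop(
  dequeueOnce: () => Promise<number>, // returns how many runs were pulled
  log: (message: string, error?: unknown) => void,
  maxIterations: number
): Promise<number> {
  let totalDequeued = 0;
  for (let i = 0; i < maxIterations; i++) {
    try {
      totalDequeued += await dequeueOnce();
    } catch (error) {
      // Record the failure and continue with the next iteration
      log("dequeue iteration failed", error);
    }
  }
  return totalDequeued;
}
```

A failed iteration then shows up in the logs rather than as a loop that mysteriously stopped pulling work.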
64-74
: Consider factoring out resource checks to a dedicated module
The resource-based queuing logic is repeated for background workers and environment master queue. Refactoring it into a reusable method can improve maintainability and reduce duplication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
apps/webapp/app/routes/engine.v1.dev.config.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.presence.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.connect.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.deployments.$deploymentFriendlyId.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.dequeue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.continue.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.restore.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.suspend.ts (1 hunks)
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.latest.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (23)
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.logs.debug.ts (2)
29-35
: LGTM! Good database query security.
The database query correctly filters by both friendlyId and runtimeEnvironmentId, ensuring that users can only access runs from their own environment.
41-50
: LGTM! Good use of structured logging parameters.
The function passes the necessary parameters to recordRunDebugLog in a structured manner, including properly formatting the run ID using RunId.fromFriendlyId().
apps/webapp/app/routes/engine.v1.worker-actions.heartbeat.ts (1)
1-13
: Clean implementation of worker heartbeat API route.
This implementation follows a good pattern for API route handlers with proper type safety using TypeScript. The route has a single responsibility and handles the worker heartbeat action effectively.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (1)
1-26
: Well-structured run snapshot heartbeat API route with good parameter validation.
This implementation effectively uses Zod for route parameter validation and follows the established pattern for API routes. The destructuring of parameters and passing to the heartbeat method is clean and maintainable.
apps/webapp/app/routes/engine.v1.worker-actions.dequeue.ts (1)
1-16
: Clean implementation of worker dequeue API route with appropriate response handling.
This endpoint correctly follows the established pattern for API routes while properly handling the response from the dequeue operation. The direct return of the dequeued result as JSON is appropriate for this use case.
apps/webapp/app/routes/engine.v1.worker-actions.connect.ts (1)
1-19
: Well-implemented worker connect API route with informative response.
This endpoint follows the established pattern while providing a more detailed response that includes the worker group information. The structured response with both status and worker metadata is helpful for clients consuming this API.
apps/webapp/app/routes/engine.v1.dev.config.ts (1)
10-10
: Clarify the purpose of the findResource function
The findResource function simply returns 1 without any apparent purpose. This seems like a placeholder or a required parameter that doesn't serve a meaningful function in this context. Is this a required parameter for the
function? If so, consider adding a comment explaining why it returns1
or refactor it to serve a meaningful purpose.apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.restore.ts (1)
2-5
: Address schema naming mismatchThere's a mismatch between the route name ("restore") and the schemas being used ("WorkerApiSuspendRun*"). This creates confusion since suspension and restoration are opposite operations.
Confirm that these are the correct schemas to use for a restoration operation. If not, update to use the appropriate schemas for restoration.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
1-32
: Well-structured API route implementation!
The implementation correctly handles starting a run attempt with proper validation and error handling. The code follows good practices with type safety, input validation using Zod, and a focused handler function.
apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.latest.ts (1)
12-12
: Consider clarifying the resource finder implementation.
The resource finder always returns 1, which seems unusual. If this is intentional, consider adding a comment explaining why this approach is used rather than finding an actual resource.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (1)
1-33
: Well-implemented API route!
The code correctly handles completing a run attempt with proper validation and error handling. The implementation follows good practices with type safety, input validation using Zod, and a focused handler function.
apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.logs.debug.ts (2)
1-6
: Good import structure.
No issues found here. The imports are straightforward and properly scoped.
8-14
: Well-structured route definition.
The use of Zod schema validation for parameters and body is clear and robust.apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.suspend.ts (2)
1-9
: Straightforward imports.
No concerns. The imports are reasonable and well-organized.
10-17
: Clean route definition.
Parameter and body validators are clearly defined.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.heartbeat.ts (3)
1-12
: Imports look good.
Nothing to note regarding syntax or usage.
16-23
: Route definition is concise.
The structure is consistent and clear.
59-60
: Module export is fine.
No issues found here.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.complete.ts (3)
1-17
: Import statements are valid.
Everything appears in order.
19-26
: Config and schema usage are clear.
Nicely defined parameters and body schema.
64-65
: Export statement is consistent.
No additional feedback.apps/webapp/app/routes/engine.v1.dev.presence.ts (1)
61-65
: Validate the periodic presence update frequency
Frequent calls to Redis (based on the defined interval) could increase load. Ensure the chosen interval balances real-time accuracy with resource usage.apps/webapp/app/routes/engine.v1.dev.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.attempts.start.ts (1)
50-53
: Verify engine dependency injection
The engine.startRunAttempt call relies on an external engine import. Consider verifying that it’s correctly initialized with relevant configuration for the environment to prevent unexpected behavior.
async ({
  authentication,
  body,
  params,
}): Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>> => {
🛠️ Refactor suggestion
Fix return type inconsistency.
The function's return type is specified as Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>>, but the function actually returns regular Response objects with no body data. This could cause type errors.
Update the return type to match the actual implementation:
async ({
authentication,
body,
params,
- }): Promise<TypedResponse<WorkerApiRunAttemptStartResponseBody>> => {
+ }): Promise<Response> => {
  return json(continuationResult);
} catch (error) {
  logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
Fix incorrect error message
The function is for continuing run execution, but the error message says "Failed to suspend run". This appears to be a copy-paste error.
Update the error message to correctly reflect the operation:
- logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
+ logger.error("Failed to continue run", { runFriendlyId, snapshotFriendlyId, error });
if (!body.success) {
  // TODO: we could create a debug span here
  logger.error("Failed to restore run", {
    runFriendlyId,
    snapshotFriendlyId,
    error: body.error,
  });

  return json({ ok: true });
}
🛠️ Refactor suggestion
Revisit error handling strategy
The current implementation returns { ok: true } for both successful and failed operations (when body.success is false). This makes it difficult for API consumers to distinguish between success and failure.
Consider returning an error status or different response for failures:
logger.error("Failed to restore run", {
runFriendlyId,
snapshotFriendlyId,
error: body.error,
});
- return json({ ok: true });
+ return json({ ok: false, error: body.error?.message || "Unknown error" });
logger.debug("Restoring run", { runFriendlyId, snapshotFriendlyId, body });

if (!body.success) {
  // TODO: we could create a debug span here
💡 Verification agent
🧩 Analysis chain
Address TODO comment
There's a TODO comment about creating a debug span that should be addressed before finalizing this code.
🏁 Script executed:
#!/bin/bash
# Check if the debug span has been implemented in similar routes
grep -r "debug span" --include="*.ts" apps/webapp/app/routes/
Length of output: 379
Action Required: Implement Debug Span Functionality
The file still contains a TODO comment for creating a debug span (line 28) and no implementation has been added—as confirmed by similar examples in related routes. Please either implement the debug span or remove the TODO comment if it's no longer applicable.
if (!deployment) {
  throw new Error("Deployment not found");
}

if (!deployment.worker) {
  throw new Error("Worker not found");
}
🛠️ Refactor suggestion
Improve error handling with proper status codes.
Instead of throwing generic errors, consider using structured responses with appropriate status codes for better API consistency.
if (!deployment) {
- throw new Error("Deployment not found");
+ throw new Response("Deployment not found", { status: 404 });
}
if (!deployment.worker) {
- throw new Error("Worker not found");
+ throw new Response("Worker not found for this deployment", { status: 404 });
}
async ({
  authenticatedWorker,
  params,
  body,
}): Promise<TypedResponse<WorkerApiSuspendRunResponseBody>> => {
  const { runFriendlyId, snapshotFriendlyId } = params;

  logger.debug("Suspending run", { runFriendlyId, snapshotFriendlyId, body });

  if (!body.success) {
    // TODO: we could create a debug span here
    logger.error("Failed to suspend run", {
      runFriendlyId,
      snapshotFriendlyId,
      error: body.error,
    });

    return json({ ok: true });
  }

  try {
    await authenticatedWorker.createCheckpoint({
      runFriendlyId,
      snapshotFriendlyId,
      checkpoint: body.checkpoint,
    });

    return json({ ok: true });
  } catch (error) {
    logger.error("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
    throw error;
  }
}
🛠️ Refactor suggestion
Clarify handling of failure states.
When body.success is false, the code logs an error but still returns { ok: true }. This might mask underlying issues, making debugging more difficult. Consider returning a non-200 response or including a more descriptive payload to reflect the failed suspension.
const redis = new Redis({
  port: env.RUN_ENGINE_DEV_PRESENCE_REDIS_PORT ?? undefined,
  host: env.RUN_ENGINE_DEV_PRESENCE_REDIS_HOST ?? undefined,
  username: env.RUN_ENGINE_DEV_PRESENCE_REDIS_USERNAME ?? undefined,
  password: env.RUN_ENGINE_DEV_PRESENCE_REDIS_PASSWORD ?? undefined,
  enableAutoPipelining: true,
  ...(env.RUN_ENGINE_DEV_PRESENCE_REDIS_TLS_DISABLED === "true" ? {} : { tls: {} }),
});
🛠️ Refactor suggestion
Consider improved error handling when connecting to Redis
If Redis fails to connect due to incorrect host/port/password, the app could silently fail. Implementing retry logic or throwing an explicit error if the connection cannot be created would improve resiliency.
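ioredis accepts a retryStrategy option for exactly this: return a delay in milliseconds to schedule a reconnect, or null to give up and surface the failure. A sketch with illustrative caps (the attempt limit and delay values are assumptions, not project settings):

```typescript
// Bounded backoff for ioredis's `retryStrategy` option. Returning a number
// schedules a reconnect after that many milliseconds; returning null stops
// retrying, so a misconfigured host/port fails loudly instead of silently.
function presenceRedisRetryStrategy(times: number): number | null {
  if (times > 10) {
    return null; // give up; let the caller handle a hard failure
  }
  return Math.min(times * 200, 2000); // 200ms, 400ms, ... capped at 2s
}

// Usage sketch (not executed here):
//   new Redis({ ...options, retryStrategy: presenceRedisRetryStrategy })
```

Pairing this with an error listener on the client keeps connection failures visible in logs rather than swallowed by infinite reconnect attempts.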
}
);

async function getEnvVars(
  environment: RuntimeEnvironment,
  runId: string,
  machinePreset: MachinePreset
): Promise<Record<string, string>> {
  const variables = await resolveVariablesForEnvironment(environment);

  const jwt = await generateJWTTokenForEnvironment(environment, {
    run_id: runId,
    machine_preset: machinePreset.name,
  });

  variables.push(
    ...[
      { key: "TRIGGER_JWT", value: jwt },
      { key: "TRIGGER_RUN_ID", value: runId },
      { key: "TRIGGER_MACHINE_PRESET", value: machinePreset.name },
    ]
  );

  return variables.reduce((acc: Record<string, string>, curr) => {
    acc[curr.key] = curr.value;
    return acc;
  }, {});
}
Safeguard sensitive token data
You are generating JWT tokens and storing them in the environment variables array. Ensure they are not inadvertently logged or exposed through debug statements to avoid leaking tokens.
- logger.debug("Created environment vars:", variables);
+ // Avoid logging actual JWT tokens or any secret credentials
+ logger.debug("Created environment vars: [filtered for sensitive data]");
Committable suggestion skipped: line range outside the PR's diff.
…ot well for triggerAndWait yet
# Conflicts:
#	apps/coordinator/src/checkpointer.ts
#	apps/webapp/app/components/code/CodeBlock.tsx
#	apps/webapp/app/presenters/v3/SpanPresenter.server.ts
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx
#	apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.electric.$runParam/route.tsx
#	apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
#	apps/webapp/app/v3/eventRepository.server.ts
#	apps/webapp/app/v3/services/cancelTaskRun.server.ts
#	apps/webapp/app/v3/services/completeAttempt.server.ts
#	apps/webapp/app/v3/services/finalizeDeployment.server.ts
#	apps/webapp/app/v3/services/rollbackDeployment.server.ts
#	apps/webapp/app/v3/services/triggerTask.server.ts
#	apps/webapp/test/fairDequeuingStrategy.test.ts
#	apps/webapp/test/realtimeClient.test.ts
#	internal-packages/database/prisma/schema.prisma
#	packages/cli-v3/src/apiClient.ts
#	packages/cli-v3/src/cli/index.ts
#	packages/cli-v3/src/dev/workerRuntime.ts
#	packages/cli-v3/src/entryPoints/deploy-run-controller.ts
#	packages/cli-v3/src/entryPoints/dev-run-worker.ts
#	packages/cli-v3/src/entryPoints/managed-run-worker.ts
#	packages/core/src/v3/errors.ts
#	packages/core/src/v3/schemas/index.ts
#	packages/core/src/v3/schemas/schemas.ts
#	packages/core/src/v3/types/tasks.ts
#	packages/core/src/v3/workers/index.ts
#	packages/trigger-sdk/src/v3/shared.ts
Actionable comments posted: 6
♻️ Duplicate comments (3)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)
459-523
: 🛠️ Refactor suggestion
Wrap variable declarations within the switch case block to avoid scoping issues.
Biome flags that variable declarations in the "waitpoint" case can be accessed by other case clauses. Enclose each case body in a block {} to limit variable scope safely.
switch (span.entity.type) {
  case "waitpoint": {
    { const waitpoint = ... ... }
    { const output = ... ... }
    ...
    return ...;
  }
  default:
    return { ...data, entity: null };
}
🧰 Tools
🪛 Biome (1.9.4)
[error] 461-477: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 487-492: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 494-494: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. The declaration is defined in this switch clause.
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
319-335
: 🛠️ Refactor suggestion
Consider updating batch status on max attempts.
The code logs an error when the max attempt threshold (MAX_ATTEMPTS) is reached but does not explicitly mark the batch as "failed" or "canceled." Consider updating the batch status to reflect a terminal state for better visibility and potentially halting further retries.
if ($attemptCount > MAX_ATTEMPTS) {
  logger.error("[BatchTriggerV3][processBatchTaskRun] Max attempts reached", {
    options,
    attemptCount: $attemptCount,
  });
  // You might want to update the batch status to failed here
+ await this._prisma.batchTaskRun.update({
+   where: { id: options.batchId },
+   data: { status: "FAILED" }
+ });
  return;
}
406-409
: 🛠️ Refactor suggestionConsider limiting recursion depth.
The method calls itself recursively when handling a duplicate idempotency key error, which could lead to stack overflow if the issue persists across many attempts.
if (error instanceof RunDuplicateIdempotencyKeyError) { + if (attempt >= MAX_ATTEMPTS - 1) { + throw new ServiceValidationError( + `Cannot trigger ${taskId} due to persistent idempotency key conflicts after ${attempt} attempts.` + ); + } return await this.call({ taskId, environment, body, options, attempt: attempt + 1 }); }
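The recursive retry can also be written as a bounded loop, which removes any stack-depth concern entirely. A sketch; the triggerOnce callback, error class name, and attempt cap are illustrative stand-ins for the real service code.

```typescript
// Stand-in for the service's duplicate-idempotency-key error.
class DuplicateIdempotencyKeyError extends Error {}

// Retries a trigger call up to maxAttempts times, but only for duplicate
// idempotency-key conflicts; any other error (or exhausted attempts) rethrows.
async function triggerWithRetry<T>(
  triggerOnce: () => Promise<T>,
  maxAttempts: number
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await triggerOnce();
    } catch (error) {
      if (!(error instanceof DuplicateIdempotencyKeyError) || attempt === maxAttempts) {
        throw error; // unrelated failure, or retries exhausted
      }
      // otherwise fall through and try again
    }
  }
  // Unreachable: the loop always returns or throws
  throw new Error("unreachable");
}
```

Because the loop counter is the only state, the worst case is maxAttempts iterations, never a deep call stack.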
🧹 Nitpick comments (49)
apps/webapp/test/GCRARateLimiter.test.ts (4)
11-13
: Consider adding Redis connection cleanup.
You're now creating a new Redis client for each test case, but there's no explicit cleanup of these connections when tests complete. This could potentially lead to resource leaks or connection pool exhaustion during test runs.
redisTest("should allow a single request when under the rate limit", async ({ redisOptions }) => { const redis = new Redis(redisOptions); + try { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, // 1 request per second on average burstTolerance: 3000, // Allows a burst of 4 requests (3 * 1000 + 1) keyPrefix: "test:ratelimit:", }); const result = await limiter.check("user:1"); expect(result.allowed).toBe(true); + } finally { + await redis.quit(); + } });
27-29
: Add Redis connection cleanup in burst test.Similar to the previous test, you should ensure proper cleanup of the Redis connection to prevent resource leaks.
async ({ redisOptions }) => { const redis = new Redis(redisOptions); + try { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, burstTolerance: 3000, // With an emission interval of 1000ms, burstTolerance of 3000ms allows 4 rapid requests. keyPrefix: "test:ratelimit:", }); // Test logic here... + } finally { + await redis.quit(); + } }
55-56
: Consider extracting Redis setup/teardown logic.Since the pattern of creating and (ideally) cleaning up Redis connections is repeated in every test, consider extracting this to a helper function or using test fixtures to reduce duplication and ensure consistent cleanup.
// Add this helper at the top of the file + async function withRedis(redisOptions, callback) { + const redis = new Redis(redisOptions); + try { + return await callback(redis); + } finally { + await redis.quit(); + } + } // Then use it in tests like: redisTest( "should allow a request after the required waiting period", async ({ redisOptions }) => { - const redis = new Redis(redisOptions); + await withRedis(redisOptions, async (redis) => { const limiter = new GCRARateLimiter({ redis, emissionInterval: 1000, burstTolerance: 3000, keyPrefix: "test:ratelimit:", }); // Test logic here... + }); } );
232-243
: Consider using a cleaner mocking approach. The current approach of modifying and restoring the Redis object's methods works, but consider using Vitest's mocking capabilities more directly for cleaner test code.

```diff
-// @ts-expect-error
-const originalGcra = redis.gcra;
-// @ts-ignore
-redis.gcra = vi.fn(() => {
-  throw new Error("Simulated Redis error");
-});
-await expect(limiter.check("user:error")).rejects.toThrow("Simulated Redis error");
-// Restore the original command.
-// @ts-expect-error
-redis.gcra = originalGcra;
+// Using Vitest's spyOn for cleaner mocking
+const gcraSpy = vi.spyOn(redis, "gcra").mockImplementation(() => {
+  throw new Error("Simulated Redis error");
+});
+await expect(limiter.check("user:error")).rejects.toThrow("Simulated Redis error");
+// Vitest automatically restores mocks after each test when using the restoreMocks
+// option, but for clarity you can explicitly restore:
+gcraSpy.mockRestore();
```

internal-packages/run-engine/src/engine/eventBus.ts (4)
5-178
: Well-structured event type definitions with a few areas for improvement. The `EventBusEvents` type provides comprehensive coverage of the task run lifecycle events, with clear payload structures for each event type. The consistent inclusion of `time` and run identifiers across all events is a good practice. Consider adding JSDoc comments to document the purpose and usage context for each event type, which would improve maintainability and developer experience.

Examples of JSDoc comments you could add:

```diff
 export type EventBusEvents = {
+  /**
+   * Emitted when a run attempt is started
+   * @param time The time when the attempt started
+   * @param run Information about the run
+   * @param organization The organization the run belongs to
+   */
   runAttemptStarted: [
     {
       time: Date;
       run: {
         id: string;
         attemptNumber: number;
         baseCostInCents: number;
       };
       organization: {
         id: string;
       };
     },
   ];
```
162-177
: Consider enhancing incomingCheckpointDiscarded with more detailed metadata. The `incomingCheckpointDiscarded` event has a generic `metadata` field defined as `Record<string, unknown>`. Consider defining a more specific type for the checkpoint metadata if the structure is known, which would improve type safety and developer experience.

```diff
 checkpoint: {
-  metadata: Record<string, unknown>;
+  metadata: {
+    // Define specific fields if known
+    checkpointType?: string;
+    source?: string;
+    timestamp?: Date;
+    // Add other known fields
+    [key: string]: unknown;
+  };
   discardReason: string;
 };
```
5-178
: Consider using discriminated unions to improve type safety. Currently, event types are differentiated by their key in the `EventBusEvents` interface. Consider refactoring to use discriminated unions by adding a `type` field to each event payload, which would enhance type safety when handling events.

```diff
 export type EventBusEvents = {
   runAttemptStarted: [
     {
+      type: 'runAttemptStarted';
       time: Date;
       run: {
         id: string;
         attemptNumber: number;
         baseCostInCents: number;
       };
       organization: {
         id: string;
       };
     },
   ];
   // Apply similar changes to other event types
```
This would enable more robust type checking when handling events, especially in switch statements.
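As a hypothetical sketch of that payoff (the event shapes here are simplified stand-ins, not the real `EventBusEvents` payloads), a discriminated union lets the compiler narrow each case and enforce exhaustiveness:

```typescript
// Simplified, hypothetical event shapes for illustration only.
type RunEvent =
  | { type: "runAttemptStarted"; runId: string }
  | { type: "runFailed"; runId: string; error: string };

function handle(event: RunEvent): string {
  switch (event.type) {
    case "runAttemptStarted":
      // Narrowed: `event.error` is not accessible here.
      return `started ${event.runId}`;
    case "runFailed":
      return `failed ${event.runId}: ${event.error}`;
    default: {
      // Exhaustiveness check: adding a new variant without a case fails to compile.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

console.log(handle({ type: "runFailed", runId: "run_1", error: "boom" }));
// → failed run_1: boom
```

The `never`-typed default branch is the standard TypeScript idiom for making the compiler flag any union member a switch forgets to handle.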
67-68
: Ensure consistent error handling across related events. The `runFailed`, `runAttemptFailed`, and `runCancelled` events all include error information but with slight structural differences. Consider standardizing the error property structure across all error-related events for consistency.

Also applies to: 101-103
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (7)
33-115
: Overall structure is good, but consider transaction handling for atomic cancellation operations. This service is well-structured with clear separation of concerns. However, the cancellation process consists of multiple database operations that should ideally be atomic to maintain data consistency.
Consider wrapping the core cancellation operations in a transaction to ensure atomicity:
```diff
 public async call(taskRun: TaskRun, options?: CancelTaskRunServiceOptions) {
   const opts = {
     reason: "Task run was cancelled by user",
     cancelAttempts: true,
     cancelledAt: new Date(),
     ...options,
   };

   // Make sure the task run is in a cancellable state
   if (!isCancellableRunStatus(taskRun.status)) {
     logger.error("Task run is not in a cancellable state", {
       runId: taskRun.id,
       status: taskRun.status,
     });
     return;
   }

+  return this._prisma.$transaction(async (tx) => {
     const finalizeService = new FinalizeTaskRunService();
     const cancelledTaskRun = await finalizeService.call({
       id: taskRun.id,
       status: "CANCELED",
       // ...remaining parameters
     });

     // Continue with event cancellation and other operations
     // ...

     return {
       id: cancelledTaskRun.id,
     };
+  });
 }
```
42-49
: Return a more descriptive result when a task run can't be cancelled. Currently, the function returns `undefined` when a task run is not in a cancellable state, which doesn't give the caller any information about what happened. Consider returning a structured result that includes success/failure status and a reason:

```diff
 // Make sure the task run is in a cancellable state
 if (!isCancellableRunStatus(taskRun.status)) {
   logger.error("Task run is not in a cancellable state", {
     runId: taskRun.id,
     status: taskRun.status,
   });
-  return;
+  return {
+    success: false,
+    reason: `Task run with status ${taskRun.status} cannot be cancelled`,
+    id: taskRun.id
+  };
 }
```
87-105
: Add error handling for event cancellation. The code cancels in-progress events but doesn't handle potential failures during cancellation. If any event cancellation fails, the overall operation might continue without proper error reporting.

Add error handling to capture and log failures during event cancellation:

```diff
-await Promise.all(
-  inProgressEvents.map((event) => {
-    return eventRepository.cancelEvent(event, opts.cancelledAt, opts.reason);
-  })
-);
+try {
+  await Promise.all(
+    inProgressEvents.map((event) => {
+      return eventRepository.cancelEvent(event, opts.cancelledAt, opts.reason);
+    })
+  );
+} catch (error) {
+  logger.error("Failed to cancel some in-progress events", {
+    runId: taskRun.id,
+    error,
+  });
+  // Consider whether to continue or throw based on business requirements
+}
```
117-176
: Refactor the complex attempt cancellation logic for better maintainability. The current implementation is well-structured but contains nested conditionals and switches that could be refactored for better readability and maintainability.

Consider extracting the status-specific cancellation logic into separate methods:

```diff
 async #cancelPotentiallyRunningAttempts(
   run: ExtendedTaskRun,
   attempts: ExtendedTaskRunAttempt[]
 ) {
   for (const attempt of attempts) {
     await CancelTaskAttemptDependenciesService.enqueue(attempt.id, this._prisma);

     if (run.runtimeEnvironment.type === "DEVELOPMENT") {
       await this.#cancelAttemptInDevelopment(attempt, run);
     } else {
-      switch (attempt.status) {
-        case "EXECUTING": {
-          // We need to send a cancel message to the coordinator
-          socketIo.coordinatorNamespace.emit("REQUEST_ATTEMPT_CANCELLATION", {
-            version: "v1",
-            attemptId: attempt.id,
-            attemptFriendlyId: attempt.friendlyId,
-          });
-          break;
-        }
-        // ... remaining cases
-      }
+      await this.#cancelAttemptByStatus(attempt, run);
     }
   }
 }

+async #cancelAttemptInDevelopment(attempt: ExtendedTaskRunAttempt, run: ExtendedTaskRun) {
+  // Signal the task run attempt to stop
+  await devPubSub.publish(
+    `backgroundWorker:${attempt.backgroundWorkerId}:${attempt.id}`,
+    "CANCEL_ATTEMPT",
+    {
+      attemptId: attempt.friendlyId,
+      backgroundWorkerId: attempt.backgroundWorker.friendlyId,
+      taskRunId: run.friendlyId,
+    }
+  );
+}

+async #cancelAttemptByStatus(attempt: ExtendedTaskRunAttempt, run: ExtendedTaskRun) {
+  switch (attempt.status) {
+    case "EXECUTING": {
+      // We need to send a cancel message to the coordinator
+      socketIo.coordinatorNamespace.emit("REQUEST_ATTEMPT_CANCELLATION", {
+        version: "v1",
+        attemptId: attempt.id,
+        attemptFriendlyId: attempt.friendlyId,
+      });
+      break;
+    }
+    // ... remaining cases
+  }
+}
```
178-192
: Add error handling for socket communication failures. The method broadcasts a cancellation message to all coordinators but doesn't handle potential communication failures.

Consider adding error handling for socket communication:

```diff
 async #cancelRemainingRunWorkers(run: ExtendedTaskRun) {
   if (run.runtimeEnvironment.type === "DEVELOPMENT") {
     // Nothing to do
     return;
   }

   // Broadcast cancel message to all coordinators
+  try {
     socketIo.coordinatorNamespace.emit("REQUEST_RUN_CANCELLATION", {
       version: "v1",
       runId: run.id,
       // Give the attempts some time to exit gracefully. If the runs supports lazy attempts, it also supports exit delays.
       delayInMs: run.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined,
     });
+    logger.debug("Successfully broadcasted run cancellation request", {
+      runId: run.id,
+    });
+  } catch (error) {
+    logger.error("Failed to broadcast run cancellation request", {
+      runId: run.id,
+      error,
+    });
+    // Consider whether to throw or continue based on business requirements
+  }
 }
```
34-40
: Consider further validating cancellation options. The code sets default options but doesn't validate the provided options, such as checking if `cancelledAt` is in the past.

Add validation for the provided options:

```diff
 public async call(taskRun: TaskRun, options?: CancelTaskRunServiceOptions) {
   const opts = {
     reason: "Task run was cancelled by user",
     cancelAttempts: true,
     cancelledAt: new Date(),
     ...options,
   };

+  // Validate that cancelledAt is not in the future
+  if (opts.cancelledAt > new Date()) {
+    logger.warn("cancelledAt is in the future, using current time instead", {
+      providedTime: opts.cancelledAt,
+      currentTime: new Date(),
+    });
+    opts.cancelledAt = new Date();
+  }
```
112-115
: Enhance the return value with more context. The current return value only includes the task run ID. Consider providing more context about the cancellation result.

Enhance the return value with more information:

```diff
 return {
   id: cancelledTaskRun.id,
+  success: true,
+  cancelledAt: opts.cancelledAt,
+  cancelledAttempts: opts.cancelAttempts ? attempts.length : 0,
+  cancelledEvents: inProgressEvents.length,
 };
```

apps/webapp/app/env.server.ts (8)
262-263
: Consider avoiding hardcoded secrets in code defaults.
While providing a default secret ("managed-secret") may be convenient for local development, storing secrets in code can pose security risks if used in production. It is advisable to store real production secrets in a secure vault or environment variable management system to avoid accidental leakage.
413-422
: Validate default poll interval and concurrency for production.
The default of 100 ms (`RUN_ENGINE_WORKER_POLL_INTERVAL`) may become chatty under higher loads, and concurrency values (e.g., `RUN_ENGINE_WORKER_COUNT=4` with 10 tasks per worker) can quickly scale. Confirm these values align with expected production traffic to avert unnecessary resource usage.
423-453
: Reduce duplication in Redis configuration.
These transformations for worker Redis environment variables follow the same pattern seen elsewhere in the file. Consider extracting a reusable helper function to minimize repetitive code and improve maintainability.
454-484
: Consolidate redundant run queue Redis environment config.
Similar to the worker Redis pattern, these environment variable transformations repeat the same logic. Factoring them into a common helper would enhance readability and maintainability.
485-515
: Eliminate repetitive run lock Redis logic.
Again, each Redis configuration uses the same chain of optional transforms. A small shared utility could keep these environment declarations concise and consistent.
516-546
: Refactor Dev Presence Redis configuration.
The transformations here also duplicate the pattern from other Redis sections. A utility or shared function would centralize this logic, improve clarity, and prevent mistakes.
547-562
: Fix comment mismatches and spelling issues in rate limit configuration.
There is a minor inconsistency between the comment and the default value for `RUN_ENGINE_RATE_LIMIT_MAX` (1200 vs. 750) and a small spelling mistake ("refix" → "refill").

```diff
- RUN_ENGINE_RATE_LIMIT_MAX: z.coerce.number().int().default(1200), // allow bursts of 750 requests
+ RUN_ENGINE_RATE_LIMIT_MAX: z.coerce.number().int().default(1200), // allow bursts of 1200 requests

- RUN_ENGINE_RATE_LIMIT_REFILL_RATE: z.coerce.number().int().default(400), // refix 250 tokens every 10 seconds
+ RUN_ENGINE_RATE_LIMIT_REFILL_RATE: z.coerce.number().int().default(400), // refill 250 tokens every 10 seconds
```
563-573
: Double-check dev presence intervals and test coverage.
The frequent polling intervals (e.g., `DEV_DEQUEUE_INTERVAL_WITH_RUN=250 ms`) may be appropriate for local testing but might be expensive for certain environments. Consider verifying performance and adding test scenarios to ensure these defaults work as intended.

internal-packages/run-engine/src/engine/errors.ts (1)
5-8
: Consider adding validation for null/undefined error input. The current implementation assumes that a valid `TaskRunError` object is always passed. Consider adding a check to handle cases where `error` might be null or undefined to make the function more robust.

```diff
 export function runStatusFromError(error: TaskRunError): TaskRunStatus {
+  if (!error) {
+    return "SYSTEM_FAILURE";
+  }
   if (error.type !== "INTERNAL_ERROR") {
     return "COMPLETED_WITH_ERRORS";
   }
```

apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (1)
657-700
: Consider caching environment variables for performance. The `startRunAttempt` method fetches environment variables for each run attempt. For frequently used environments, consider caching these values to reduce database queries.

You could add a simple cache implementation for environment variables:

```diff
+private envVarsCache: Map<string, { variables: Record<string, string>; timestamp: number }> = new Map();
+private readonly ENV_VARS_CACHE_TTL = 5 * 60 * 1000; // 5 minutes
+
 private async getEnvVars(
   environment: RuntimeEnvironment,
   runId: string,
   machinePreset: MachinePreset
 ): Promise<Record<string, string>> {
+  const cacheKey = `${environment.id}:${machinePreset.name}`;
+  const cached = this.envVarsCache.get(cacheKey);
+
+  if (cached && (Date.now() - cached.timestamp) < this.ENV_VARS_CACHE_TTL) {
+    // Update run-specific variables
+    return {
+      ...cached.variables,
+      TRIGGER_RUN_ID: runId,
+    };
+  }
+
   const variables = await resolveVariablesForEnvironment(environment);

   const jwt = await generateJWTTokenForEnvironment(environment, {
     run_id: runId,
     machine_preset: machinePreset.name,
   });

   variables.push(
     ...[
       { key: "TRIGGER_JWT", value: jwt },
       { key: "TRIGGER_RUN_ID", value: runId },
       { key: "TRIGGER_MACHINE_PRESET", value: machinePreset.name },
     ]
   );

-  return variables.reduce((acc: Record<string, string>, curr) => {
+  const result = variables.reduce((acc: Record<string, string>, curr) => {
     acc[curr.key] = curr.value;
     return acc;
   }, {});
+
+  // Cache the result but exclude run-specific variables
+  this.envVarsCache.set(cacheKey, {
+    variables: { ...result },
+    timestamp: Date.now(),
+  });
+
+  return result;
 }
```

internal-packages/database/prisma/schema.prisma (2)
1758-1761
: Priority Field for TaskRun Scheduling
The addition of `priorityMs` (with a default of 0) to the `TaskRun` model appears intended to provide a negative offset for queue timestamps. Please verify that the logic interpreting this field (e.g., in the run scheduling system) properly handles negative offsets and that its documentation clearly outlines its role and expected values.
1930-1934
: RunEngineVersion Enum Introduction
The new `RunEngineVersion` enum, which distinguishes between `V1` and `V2`, neatly encapsulates the versioning strategy for the run engine. For clarity, consider adding more extensive inline comments or documentation that explain the behavioral differences between these versions.

apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (2)
71-71
: Check if this import is needed.
It appears that `SpanLink` is not referenced in the code. If it's truly unused, consider removing it.

```diff
-import { SpanLink } from "~/v3/eventRepository.server";
```
863-1076
: SpanEntity logic is solid; remove the unnecessary Fragment. The new component cleanly differentiates normal spans and waitpoints. However, lines 1026–1043 contain a single element `<Property.Item>` inside a Fragment, which is unnecessary. Removing that Fragment would simplify the code without altering functionality.

```diff
-{span.entity.object.type === "MANUAL" && (
-  <>
-    <Property.Item>
-      ...
-    </Property.Item>
-  </>
-)}
+{span.entity.object.type === "MANUAL" && (
+  <Property.Item>
+    ...
+  </Property.Item>
+)}
```

🧰 Tools
🪛 Biome (1.9.4)
[error] 1026-1043: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment (lint/complexity/noUselessFragments)
apps/webapp/app/v3/runEngineHandlers.server.ts (3)
21-58
: Consider wrapping event completion calls in a transaction. The `runSucceeded` handler writes to the event repository using multiple calls in a single block. If any of these calls fail, partial database updates can lead to inconsistent state. Consider wrapping them in a transaction to maintain atomicity and guarantee rollback on intermediate failures.
73-170
: Reduce duplication by extracting the shared completion logic. The `runFailed` handler completes events and in-progress child events with similar logic. Consider creating a helper function or utility within the repository to DRY (Don't Repeat Yourself) any repeated calls and ensure consistent error handling.
394-419
: Add error handling around snapshot event recording. While you handle the case when `eventResult.success` is false, an additional check could ensure partial writes or partially successful calls do not leave any inconsistent data. Log or handle partially updated states if the snapshot data fails to record.

apps/webapp/app/v3/taskEventStore.server.ts (1)
155-172
: Perform thorough testing of raw queries. The raw SQL for non-partitioned events includes the newly added `isDebug` field. Ensure integration tests confirm that returned trace events map accurately to the extended `TraceEvent` type, particularly under stress or concurrent usage scenarios.

apps/webapp/app/presenters/v3/SpanPresenter.server.ts (4)
47-62
: Upgrade null-check or throw for missing run details. After fetching `parentRun` at lines 47-62, you return early if no run is found. You might want a more explicit error message to help debugging, especially if the user expects a valid run for the given slug. This clarifies whether the run is missing or if there's some deeper issue.
82-90
: Confirm fallback logic for non-run spans. Here you call `#getSpan` if the entity is not recognized as a run. Make sure that there's no scenario where an entity is partially a run and partially something else, leading to an ambiguous fallback. Explicitly clarifying the logic with a short comment may prevent confusion.
102-108
: Use early return for clarity. When checking for the `span` result in `#getRun`, lines 123-125, you could reduce nesting by returning early upon a missing span, improving readability:

```ts
if (!span) {
  return;
}
// ...
```
168-168
: Check for potential null references in `metadata`. You safely parse `metadata`, though consider adding an extra safety check if the structure is unexpectedly large or partial. If the underlying data is invalid JSON, ensure that an error is logged or handled gracefully.

apps/webapp/app/v3/eventRepository.server.ts (3)
1664-1685
: Consider handling falsy values in rehydrateAttribute. The function immediately returns if `!value`, which will treat falsy values like `0`, `''`, and `false` as if the attribute doesn't exist. Consider checking strictly for undefined/null if you need to preserve these values.

```diff
 const value = properties[key];
-if (!value) return;
+if (value === undefined || value === null) return;

 return value as T;
```
1687-1711
: Consider adding caching for run lookups. The `findRunForEventCreation` function retrieves run information from the database. If this function is called frequently with the same runId, consider implementing a caching mechanism to reduce database load.

```diff
+// Simple in-memory cache for run lookups
+const runCache = new Map<string, any>();
+const CACHE_TTL_MS = 60000; // 1 minute cache
+
 export async function findRunForEventCreation(runId: string) {
+  // Check cache first
+  const cachedRun = runCache.get(runId);
+  if (cachedRun && cachedRun.timestamp > Date.now() - CACHE_TTL_MS) {
+    return cachedRun.data;
+  }

-  return prisma.taskRun.findFirst({
+  const run = await prisma.taskRun.findFirst({
     where: {
       id: runId,
     },
     // ...rest of the query
   });

+  // Store in cache if found
+  if (run) {
+    runCache.set(runId, { data: run, timestamp: Date.now() });
+  }
+
+  return run;
 }
```
1773-1798
: Document that recordRunDebugLog overwrites any existing isDebug value. The function sets `isDebug: true`, which will overwrite any existing value in `options?.attributes`. While likely intentional, this should be documented to prevent confusion.

```diff
 /**
  * Records a debug log event for a run
  * @param runId - The ID of the run
  * @param message - The message to log
  * @param options - The event options
+ * @note This function will always set isDebug to true, overriding any existing value
  */
 export async function recordRunDebugLog(
   runId: string,
   message: string,
   options: Omit<TraceEventOptions, "environment" | "taskSlug" | "startTime"> & {
     duration?: number;
     parentId?: string;
     startTime?: Date;
   }
 ): Promise<
   | {
       success: true;
     }
   | {
       success: false;
       code: "RUN_NOT_FOUND" | "FAILED_TO_RECORD_EVENT";
       error?: unknown;
     }
 > {
   return recordRunEvent(runId, message, {
     ...options,
     attributes: {
       ...options?.attributes,
       isDebug: true,
     },
   });
 }
```

apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
271-312
: Use Enum for processing strategy for improved type safety. The strategy parameter uses a string literal type but is used in switch statements. Consider using a proper TypeScript enum for better type safety and autocompletion.

```diff
-export const BatchProcessingStrategy = z.enum(["sequential", "parallel"]);
-export type BatchProcessingStrategy = z.infer<typeof BatchProcessingStrategy>;
+export enum BatchProcessingStrategy {
+  SEQUENTIAL = "sequential",
+  PARALLEL = "parallel"
+}
+export const BatchProcessingStrategySchema = z.nativeEnum(BatchProcessingStrategy);
```

Then update switch statements and references to use the enum values.
apps/webapp/app/v3/services/triggerTaskV1.server.ts (4)
47-48
: Use more specific redirection in deprecation notice. The deprecation comment should direct users to a specific alternative implementation rather than just mentioning a class name.

```diff
-/** @deprecated Use TriggerTaskService in `triggerTask.server.ts` instead. */
+/** @deprecated Use TriggerTaskService in `~/services/triggerTask.server.ts` instead, which contains version-specific routing. */
```
691-697
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used throughout the rest of the file.

```diff
-console.log("Failed to get queue name: No task found", {
+logger.debug("Failed to get queue name: No task found", {
   taskId,
   environmentId: environment.id,
 });
```
481-488
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used throughout the rest of the file.

```diff
-console.log("Failed to get queue name: Invalid queue config", {
+logger.debug("Failed to get queue name: Invalid queue config", {
   taskId,
   environmentId: environment.id,
   queueConfig: task.queueConfig,
 });
```
603-659
: Simplify complex error handling with early returns. The error handling logic for Prisma constraint violations is complex and deeply nested. Consider using early returns or extraction to separate methods for improved readability.

You could restructure the error handling to reduce nesting and improve clarity:

```ts
if (!(error instanceof Prisma.PrismaClientKnownRequestError)) {
  throw error;
}

logger.debug("TriggerTask: Prisma transaction error", {
  code: error.code,
  message: error.message,
  meta: error.meta,
});

if (error.code !== "P2002") {
  throw error;
}

const target = error.meta?.target;
if (!Array.isArray(target) || target.length === 0) {
  throw error;
}

// Handle one-time-use token case
if (typeof target[0] === "string" && target[0].includes("oneTimeUseToken")) {
  throw new ServiceValidationError(
    `Cannot trigger ${taskId} with a one-time use token as it has already been used.`
  );
}

// Handle queue creation case
if (/* queue creation error condition */) {
  // Handle queue creation error
}

// Handle idempotency key case
if (/* idempotency key error condition */) {
  // Retry with incremented attempt
  return await this.call(taskId, environment, body, options, attempt + 1);
}

// Default case
throw new ServiceValidationError(
  `Cannot trigger ${taskId} as it has already been triggered with the same idempotency key.`
);
```

apps/webapp/app/v3/services/triggerTaskV2.server.ts (4)
470-476
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used elsewhere.

```diff
-console.log("Failed to get queue name: No task found", {
+logger.debug("Failed to get queue name: No task found", {
   taskId,
   environmentId: environment.id,
 });
```
481-488
: Use consistent logging patterns. The code uses `console.log` here instead of the logger service used elsewhere.

```diff
-console.log("Failed to get queue name: Invalid queue config", {
+logger.debug("Failed to get queue name: Invalid queue config", {
   taskId,
   environmentId: environment.id,
   queueConfig: task.queueConfig,
 });
```
559-561
: Add safeguard against negative queue size calculations. If the engine returns an incorrect queue size or if there's an error in calculation, the projected size might be incorrectly calculated.

```diff
 const queueSize = await engine.lengthOfEnvQueue(environment);
-const projectedSize = queueSize + itemsToAdd;
+const projectedSize = Math.max(0, queueSize + itemsToAdd);
```
36-37
: Use more specific redirection in deprecation notice. The deprecation comment should direct users to a specific alternative implementation rather than just mentioning a class name.

```diff
-/** @deprecated Use TriggerTaskService in `triggerTask.server.ts` instead. */
+/** @deprecated Use TriggerTaskService in `~/services/triggerTask.server.ts` instead, which provides version-specific routing. */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (38)
apps/coordinator/src/checkpointer.ts (1 hunks)
apps/docker-provider/src/index.ts (1 hunks)
apps/kubernetes-provider/tsconfig.json (1 hunks)
apps/webapp/app/components/code/CodeBlock.tsx (1 hunks)
apps/webapp/app/components/runs/v3/RunFilters.tsx (1 hunks)
apps/webapp/app/components/runs/v3/SpanTitle.tsx (1 hunks)
apps/webapp/app/entry.server.tsx (2 hunks)
apps/webapp/app/env.server.ts (2 hunks)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (12 hunks)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx (1 hunks)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx (5 hunks)
apps/webapp/app/routes/admin.api.v1.environments.$environmentId.ts (2 hunks)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (11 hunks)
apps/webapp/app/utils/taskEvent.ts (1 hunks)
apps/webapp/app/v3/authenticatedSocketConnection.server.ts (2 hunks)
apps/webapp/app/v3/eventRepository.server.ts (19 hunks)
apps/webapp/app/v3/handleSocketIo.server.ts (4 hunks)
apps/webapp/app/v3/marqs/devQueueConsumer.server.ts (1 hunks)
apps/webapp/app/v3/runEngineHandlers.server.ts (1 hunks)
apps/webapp/app/v3/services/batchTriggerV3.server.ts (1 hunks)
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1 hunks)
apps/webapp/app/v3/services/cancelTaskRun.server.ts (2 hunks)
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (1 hunks)
apps/webapp/app/v3/services/changeCurrentDeployment.server.ts (1 hunks)
apps/webapp/app/v3/services/createCheckpoint.server.ts (2 hunks)
apps/webapp/app/v3/services/finalizeDeployment.server.ts (2 hunks)
apps/webapp/app/v3/services/triggerTaskV1.server.ts (1 hunks)
apps/webapp/app/v3/services/triggerTaskV2.server.ts (1 hunks)
apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (1 hunks)
apps/webapp/app/v3/taskEventStore.server.ts (5 hunks)
apps/webapp/test/GCRARateLimiter.test.ts (6 hunks)
apps/webapp/test/fairDequeuingStrategy.test.ts (8 hunks)
apps/webapp/test/realtimeClient.test.ts (3 hunks)
internal-packages/database/prisma/schema.prisma (19 hunks)
internal-packages/redis-worker/src/queue.ts (1 hunks)
internal-packages/run-engine/src/engine/errors.ts (1 hunks)
internal-packages/run-engine/src/engine/eventBus.ts (1 hunks)
internal-packages/run-engine/src/engine/types.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- apps/webapp/app/v3/services/changeCurrentDeployment.server.ts
🚧 Files skipped from review as they are similar to previous changes (20)
- apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.deployments/route.tsx
- apps/webapp/app/components/runs/v3/RunFilters.tsx
- apps/webapp/app/routes/admin.api.v1.environments.$environmentId.ts
- apps/webapp/app/v3/services/batchTriggerV3.server.ts
- internal-packages/redis-worker/src/queue.ts
- apps/webapp/app/components/code/CodeBlock.tsx
- apps/webapp/test/realtimeClient.test.ts
- apps/webapp/app/components/runs/v3/SpanTitle.tsx
- internal-packages/run-engine/src/engine/types.ts
- apps/webapp/app/v3/marqs/devQueueConsumer.server.ts
- apps/coordinator/src/checkpointer.ts
- apps/webapp/app/entry.server.tsx
- apps/webapp/app/utils/taskEvent.ts
- apps/webapp/app/v3/services/finalizeDeployment.server.ts
- apps/webapp/test/fairDequeuingStrategy.test.ts
- apps/webapp/app/v3/services/createCheckpoint.server.ts
- apps/webapp/app/v3/authenticatedSocketConnection.server.ts
- apps/webapp/app/v3/services/cancelTaskRun.server.ts
- apps/docker-provider/src/index.ts
- apps/kubernetes-provider/tsconfig.json
🧰 Additional context used
🪛 Biome (1.9.4)
apps/webapp/app/presenters/v3/SpanPresenter.server.ts
[error] 461-477: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 487-492: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 494-494: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Unsafe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
[error] 1026-1043: Avoid using unnecessary Fragment.
A fragment is redundant if it contains only one child, or if it is the child of a html element, and is not a keyed fragment.
Unsafe fix: Remove the Fragment
(lint/complexity/noUselessFragments)
apps/webapp/app/v3/services/triggerTaskV1.server.ts
[error] 244-245: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: units / 🧪 Unit Tests
🔇 Additional comments (78)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam/route.tsx (5)
73-73
: Verified import addition provides better type access. The addition of `Run` to the imports from `RunPresenter.server` improves type access in the component, allowing for better type checking and IntelliSense support when working with run objects.
469-473
: Good admin-only debug feature implementation.The addition of the
showDebug
state variable and admin access check is well-implemented. This allows for enhanced debugging capabilities while properly restricting access to admin users only.
492-492
: Appropriate conditional filtering for debug events.The updated tree filtering logic now correctly toggles between showing all events and filtering out debug events based on the
showDebug
state. This ensures that debug information is only shown when explicitly requested by an admin.
517-525
: Well-implemented admin-only debug toggle with keyboard shortcut.The debug toggle is properly gated behind an admin check and includes a helpful keyboard shortcut (
Shift+D
). This provides a convenient way for admins to access additional debugging information while keeping the UI clean for regular users.
228-230
: Enhanced user experience with keyboard shortcut.Adding the keyboard shortcut
C
to the cancel run button improves accessibility and efficiency for power users. This is a good addition that aligns with existing keyboard shortcut patterns in the application.apps/webapp/app/v3/handleSocketIo.server.ts (7)
51-76
: Excellent security implementation for worker namespaces. This approach of separate namespaces with distinct authentication mechanisms provides proper security isolation between production and development workers. The production worker namespace using `WorkerGroupTokenService` ensures proper authentication, while the dev worker namespace has appropriate environment type verification.
447-456
: Clean implementation of header extraction utility. The `headersFromHandshake` function is well-implemented, with proper handling of non-string values in the headers. This creates a standard Headers object from Socket.io handshake data, making it easier to use with the Request API.
470-499
: Robust authentication middleware implementation. The middleware properly authenticates socket connections using the provided authentication function. It has good error handling and logging, and correctly disconnects unauthorized sockets.
537-543
: Prevent potential log overflow and consider concurrency impacts. In `createWorkerNamespace`, the debug interval (every 5s) logging active rooms might create noise at scale. You could throttle or conditionally enable this at debug level only. Also, the eventBus subscription is properly unsubscribed on disconnect, preventing memory leaks. The overall approach is sound, but be mindful that multiple worker connections could introduce concurrency overhead.
579-612
: Well-implemented error handling for batch operations. The use of `Promise.allSettled` is excellent for handling errors in batch operations without failing the entire process. The implementation properly logs any individual subscription failures while allowing successful subscriptions to proceed.
506-533
: Consider enabling conditional worker notifications based on environment. The worker notification handler has a good conditional check based on the `RUN_ENGINE_DEBUG_WORKER_NOTIFICATIONS` environment variable. This allows for detailed debugging when needed without overwhelming logs in production environments.
670-672
: Simple but effective room naming strategy. The `roomFromFriendlyRunId` function provides a consistent way to generate room names from run IDs, which helps maintain a standardized approach throughout the codebase.
apps/webapp/test/GCRARateLimiter.test.ts (4)
5-5
: Import of Redis client is appropriate. Adding the Redis import supports the new pattern of creating Redis clients from options.
111-112
: Test validates gradual rate limit recovery. Good test case for validating that the retry time gradually decreases as time passes, which is an important aspect of the GCRA algorithm's behavior. The implementation is correct.
169-170
: LGTM: Testing key prefix isolation. This test correctly verifies that different rate limiters with different key prefixes don't interfere with each other, which is important for safely using multiple rate limiters in the same Redis instance.
221-222
: Error handling test is robust. This test effectively verifies error propagation when Redis operations fail. The mock implementation and error assertion are well implemented.
internal-packages/run-engine/src/engine/eventBus.ts (3)
1-4
: Dependencies look good. The imports appropriately bring in the necessary types from database and core packages, which will be used in the event definitions.
180-180
: Great utility type for type-safe event argument extraction. The `EventBusEventArgs` type is a well-designed utility type that enables type-safe extraction of event arguments based on the event type key. This will make it easier to implement type-safe event handlers.
129-140
:❓ Verification inconclusive
Consider adding more context to workerNotification payload.
The `workerNotification` event contains minimal information compared to other events. Depending on its usage, you might want to include additional context like notification type, message, or severity to make it more informative for consumers.
🏁 Script executed:
#!/bin/bash
# Find usages of the workerNotification event to understand if additional context is needed

# Look for implementations or usages of workerNotification
echo "Searching for workerNotification usages..."
rg -A 3 -B 3 "workerNotification" --type ts

Length of output: 14264
Action: Evaluate Enhancing the `workerNotification` Payload
The `workerNotification` event currently carries only the minimal fields: `time`, `run` (with `id`), and `snapshot` (with `id` and `executionStatus`).
Our search across the codebase shows that both the event emitter (in files such as internal-packages/run-engine/src/engine/eventBus.ts and related tests) and the consumers (e.g., in apps/webapp/app/v3/runEngineHandlers.server.ts) rely on this limited structure. Although its minimalism appears sufficient for the current usage, adding additional context—such as a notification type, message, or severity—could improve debuggability and align with the enriched payloads used by other events.
Please review whether enhancing this payload would benefit consumers in terms of clarity and future extensibility, and if so, update both the emitter and listener implementations accordingly.
apps/webapp/app/v3/services/cancelTaskRunV1.server.ts (1)
51-85
: Validate concurrency edge cases for finalizing runs. When calling `FinalizeTaskRunService` to mark the task run as canceled, ensure concurrent requests don't overwrite the updated status before or after `finalizeService.call` finishes. If multiple components call `cancel` simultaneously, record-based locking or transaction-level consistency may be needed to prevent race conditions.
internal-packages/run-engine/src/engine/errors.ts (3)
1-58
: Well-structured error handling implementation! The `runStatusFromError` function effectively maps `TaskRunError` types to appropriate `TaskRunStatus` values with clear organization and thorough error code coverage. The comments explaining the distinction between "CRASHED" and "SYSTEM_FAILURE" are particularly helpful.
55-57
: Good use of exhaustive type checking. Using `assertExhaustive` in the default case ensures proper type safety by verifying that all possible error codes are handled. This helps catch any missing mappings when new error codes are added to the system.
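The pattern the review praises can be sketched as follows; this `assertExhaustive` is a local stand-in for the helper the codebase imports, and the two-member `ErrorCode` union is invented for illustration:

```typescript
type ErrorCode = "TASK_RUN_CANCELLED" | "MAX_DURATION_EXCEEDED";

// If every member of the union is handled above, `code` narrows to
// `never` in the default branch and this call type-checks; adding a
// new ErrorCode member without a matching case is a compile error.
function assertExhaustive(value: never): never {
  throw new Error(`Unhandled case: ${String(value)}`);
}

function statusFromCode(code: ErrorCode): string {
  switch (code) {
    case "TASK_RUN_CANCELLED":
      return "CANCELED";
    case "MAX_DURATION_EXCEEDED":
      return "TIMED_OUT";
    default:
      return assertExhaustive(code);
  }
}
```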
42-42
:✅ Verification successful
Verify if "N0" is a typo in the error code
I noticed that "TASK_HAS_N0_EXECUTION_SNAPSHOT" uses the digit "0" instead of the letter "O". If this is unintentional, it could lead to incorrect error handling.
🏁 Script executed:
#!/bin/bash
# Search for other occurrences of this error code to verify if it's consistently using "N0"
echo "Searching for TASK_HAS_N0_EXECUTION_SNAPSHOT:"
rg "TASK_HAS_N0_EXECUTION_SNAPSHOT" --type ts -A 1 -B 1

echo -e "\nSearching for similar error codes with 'NO' instead of 'N0':"
rg "TASK_HAS_NO_EXECUTION_SNAPSHOT" --type ts -A 1 -B 1

echo -e "\nChecking TaskRunError definition to confirm correct error code:"
rg "TaskRunError.*=.*{" --type ts -A 50 | grep -E "('|\")TASK_HAS_N0_EXECUTION_SNAPSHOT('|\")"

Length of output: 1440
Action: Confirm that "TASK_HAS_N0_EXECUTION_SNAPSHOT" is intentional
After verifying the codebase, we found that the error code is used consistently in several files:
packages/core/src/v3/schemas/common.ts
packages/core/src/v3/errors.ts
internal-packages/run-engine/src/engine/errors.ts
The search for a similar error code using
"TASK_HAS_NO_EXECUTION_SNAPSHOT"
returned no results, which indicates that the usage of"N0"
(digit zero) is deliberate rather than a typo. Although the regex check for theTaskRunError
definition produced an error (due to a regex syntax issue), the available evidence confirms that the codebase uniformly uses"TASK_HAS_N0_EXECUTION_SNAPSHOT"
.Please confirm with the domain requirements or documentation if this naming convention is correct. Otherwise, no change is necessary.
apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts (14)
31-36
: Well-structured token configuration.Good job on setting up a clear token configuration with constants for prefix, length, and allowed characters. Using
customAlphabet
from nanoid provides a secure way to generate random tokens with the specified character set.
37-51
: Token generation and storage is well-implemented.The token creation flow is secure, generating a random token with a hash that's stored in the database while returning both the hash and plaintext version to the caller. This approach prevents storing plaintext tokens in the database while still allowing for token verification.
53-70
: Effective token validation logic with appropriate logging.The
findWorkerGroup
method properly validates tokens by computing the hash and looking it up. The warning log when no match is found is helpful for debugging.
72-105
: Token rotation is implemented securely.The token rotation flow correctly checks for the existence of the worker group and token before attempting rotation, with appropriate error logging if issues occur. This ensures that token rotation is robust and secure.
107-119
: Secure token hashing.Using SHA-256 for token hashing is a secure approach. The token generation function encapsulates the logic for creating both the plaintext token and its hash, which is a good practice.
121-199
: Authentication flow has thorough validation and security checks.The authentication method thoroughly validates the request, checking for the presence of the token, proper token prefix, instance name, and additional security measures for managed workers. The use of proper error logging provides good visibility for debugging authentication issues.
155-185
: Secure verification of managed worker secrets.The implementation uses
timingSafeEqual
for comparing secrets, which is a crucial security practice to prevent timing attacks. The length check before comparison is also important to ensure the timing-safe comparison works correctly.
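The comparison pattern described above can be sketched like this; the function name, token value, and hashing scheme are illustrative, not the service's actual code:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash the presented secret to a fixed-length buffer, check lengths,
// then use timingSafeEqual so the comparison time does not leak how
// many leading bytes matched (a timing side channel).
function secretsMatch(presented: string, storedHashHex: string): boolean {
  const presentedHash = createHash("sha256").update(presented).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  if (presentedHash.length !== stored.length) {
    // timingSafeEqual throws on unequal lengths, so guard first.
    return false;
  }
  return timingSafeEqual(presentedHash, stored);
}
```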
200-252
: Comprehensive validation for different worker types.The worker type-specific validation ensures that each type has the required context (environment, deployment, etc.) before returning the authenticated instance. This helps prevent issues with missing data in later operations.
254-463
: Robust worker instance creation with transaction support.The
getOrCreateWorkerInstance
method properly handles both managed and unmanaged worker types, with appropriate validation for each. The use of database transactions ($transaction
) ensures atomicity when creating/updating related records. The handling of race conditions withPrisma.PrismaClientKnownRequestError
is a good practice.
465-475
: Secure header sanitization.The
sanitizeHeaders
method prevents sensitive headers like authorization from being logged, which is a good security practice to avoid exposing sensitive information in logs.
533-585
: Conditional dequeue implementation based on worker type.The
dequeue
method handles both managed and unmanaged workers appropriately, with specific dequeue strategies for each type. The check for required properties and update of thelastDequeueAt
timestamp are good practices.
587-629
: Good separation of dequeue methods for different contexts.The specialized dequeue methods (
dequeueFromVersion
anddequeueFromEnvironment
) for managed workers provide good flexibility while maintaining appropriate type checks and error handling.
781-805
: Efficient variable resolution and injection.The method for resolving environment variables and injecting run-specific values is well-implemented. The approach of using
reduce
to convert the array of variables to an object is clean and efficient.
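The `reduce`-based conversion described above looks roughly like this; the `EnvVar` shape is a hypothetical stand-in for the service's actual variable type:

```typescript
type EnvVar = { key: string; value: string };

// Fold the array of resolved variables into a plain lookup object,
// later entries overwriting earlier ones with the same key.
function toEnvObject(variables: EnvVar[]): Record<string, string> {
  return variables.reduce<Record<string, string>>((acc, variable) => {
    acc[variable.key] = variable.value;
    return acc;
  }, {});
}
```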
808-824
: Clear type definitions for authentication responses.The union type for
WorkerGroupTokenAuthenticationResponse
clearly defines the structure for both managed and unmanaged worker types, which is important for type safety when consuming these responses.internal-packages/database/prisma/schema.prisma (10)
151-152
: New Relationships in Organization Model
The addition of theworkerGroups
andworkerInstances
fields to theOrganization
model is well aligned with the goal of managing multiple worker entities per organization. Please verify that all queries or mutations that previously assumed a singular worker structure are updated to handle these array relationships appropriately.
454-463
: Enhancements in the Project Model for Engine and Worker Associations
Introducing theengine
field (of typeRunEngineVersion
with defaultV1
) augments theProject
model with versioning support for the new run engine. In addition, adding relationships forworkerGroups
,workers
, and the optionaldefaultWorkerGroup
(with its associated ID) is a strong move toward improved resource management. Please ensure that any business logic or migration scripts interacting with this model take these new fields into account.
1578-1579
: BackgroundWorker Engine Field Addition
The newengine
field in theBackgroundWorker
model (defaulting toV1
) helps in distinguishing between legacy and new engine operations. Confirm that this field is consistently used across the codebase for any logic that conditionally handles background workers based on engine version.
1603-1605
: Optional Worker Group Relation for BackgroundWorker
Adding theworkerGroup
relation (and its correspondingworkerGroupId
field) in theBackgroundWorker
model is a beneficial change. The use ofonDelete: SetNull
is appropriate if a worker group is removed, but please double-check that your deployment and cleanup logic properly handles cases where this relationship becomes null.
1687-1688
: TaskRun Model Engine Field Integration
The inclusion of theengine
field in theTaskRun
model (defaulting toV1
) ensures that each run is tagged with the correct engine version. This change is essential for downstream processing—please verify that run processors and metrics correctly use this field for conditional logic.
1744-1746
: Tracking First Attempt Start for Run Engine 2.0
The newfirstAttemptStartedAt
field inTaskRun
, marked for use with Run Engine 2.0+, will be valuable for tracking the execution lifecycle. Ensure that when a run is processed by the new engine, this timestamp is correctly recorded.
1944-1950
: TaskRunExecutionSnapshot Engine Field Update
Within theTaskRunExecutionSnapshot
model, theengine
field now defaults toV2
, reflecting its intended use with the new run engine architecture. This change is critical for differentiating snapshot data based on execution contexts; please ensure that any snapshot retention or processing routines are aware of this versioning.
2730-2734
: WorkerDeploymentType Enum Update
The updatedWorkerDeploymentType
enum now includes an additionalV1
value alongsideMANAGED
andUNMANAGED
. This adjustment appears to support legacy behavior in deployments. Make sure that any conditional logic based on this enum is updated accordingly.
2749-2775
: WorkerDeployment Model Enhancements
In theWorkerDeployment
model, introducing thetype
field (defaulting toV1
) along with new relationships forpromotions
,alerts
, andworkerInstance
enriches the deployment metadata. These additions will likely facilitate more nuanced deployment tracking and alerting. Please ensure that the orchestration and alerting systems are revised to incorporate these new fields, and that any related indexing remains performant.
2541-2557
:❓ Verification inconclusive
Extensive Enhancements in BatchTaskRun for Engine v2
TheBatchTaskRun
model now includes several new columns:
• Friendly Identifiers:runIds
(a string array) andrunCount
for easier tracking of batch items.
• Payload & Options: Fields likepayload
,payloadType
,options
, andbatchVersion
standardize batch data handling.
• Engine v2 Fields: IntroducingexecutionSnapshots
,runsBlocked
, andwaitpoints
will enable finer control over batch processing and its dependencies.These changes are significant and should be accompanied by thorough testing—both in data migration scripts and in runtime scenarios—to ensure consistency and performance.
Action Required: Ensure Comprehensive Testing for Engine v2 Enhancements in BatchTaskRun
The enhancements in the
BatchTaskRun
model—introducing fields for friendly IDs (runIds
,runCount
), standardized payload handling (payload
,payloadType
,options
,batchVersion
), and new engine v2-specific relationships (executionSnapshots
,runsBlocked
,waitpoints
)—are significant. These changes clearly improve tracking and batch processing control but come with the inherent risk of migration and runtime inconsistencies.Please confirm that exhaustive testing (both for data migration and runtime behavior) has been planned or implemented to validate these enhancements.
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (13)
62-62
: No concerns with the added import.
72-75
: Import statements look good.
Confirmed thatCompleteWaitpointForm
andForceTimeout
are used downstream.
198-198
: Use of<SpanTitle>
is appropriate.
No issues identified in the component usage here.
234-294
: The new “detail” tab logic seems correct.
All status checks and property displays for the span appear to be handled well. No obvious issues.
313-313
: Verify correctness of admin-only access.
You’re checkingisAdmin
to conditionally display sensitive metadata. Confirm that this is aligned with your org policies, ensuring only the correct users see these fields.
324-324
: Different icon for cached runs is well-implemented.
No further suggestions.
329-332
: Conditionally appending "(cached)" is clear.
This improves clarity for users observing cached run identifiers.
502-517
: Idempotency fields appear well-handled.
No concerns regarding the logic to display and expire the key.
550-565
: Field additions for engine and master queues look good.
The conditional rendering for admins only is a good practice.
703-704
: Run error display logic is correct.
Conditionally rendering the error component is appropriate.
709-711
: Properly showing output only when no error is present.
This clarifies the run’s result state.
720-725
: Dynamic link for focusing or jumping to the run is correct.
No issues noted.
730-730
: Label change for cached vs. non-cached is straightforward.
Good user experience improvement.apps/webapp/app/v3/runEngineHandlers.server.ts (3)
60-71
: Ensure queued alerts are idempotent.In the
runFailed
handler's alert logic, multiple identical calls in a short time window could lead to duplicate alerts if not guarded by a check or unique constraint. Confirm that repeated calls won't create redundant or conflicting alerts.
172-206
: Evaluate concurrency for partial run failures.The
runAttemptFailed
handler marks in-progress events as crashed. If multiple parallel failures occur, overlapping attempts might concurrently crash the same events, leading to potential race conditions. Confirm that the underlying repository methods are concurrency-safe or implement idempotent updates to avoid inconsistent states.
299-325
: Prevent double cancellation.The
runCancelled
handler loops over in-progress events and cancels them. If the same run is canceled multiple times in quick succession, calls toeventRepository.cancelEvent
could conflict. Consider concurrency checks (e.g., version numbers or optimistic locking) to avoid double cancellation side effects.apps/webapp/app/v3/taskEventStore.server.ts (2)
23-23
: Confirm usage of theisDebug
field.Introducing the
isDebug
attribute in theTraceEvent
type can help classify debug logs versus production logs. Ensure that any related filtering or access controls are adjusted so that only authorized contexts retrieve debug data where necessary.
126-142
: Validate time window alignment when partitioning is enabled.While selecting partitioned events, the code adds a time buffer to handle partition window offsets. Confirm that the additional window logic aligns with boundary conditions (e.g., crossing the partition boundary at exact times) to avoid missing or duplicating trace events.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (3)
66-74
: Check for concurrency or partial updates when retrieving run detail.You call
#getRun
with a final event store and timestamps. If another process updates the run concurrently (e.g., changing completion times), ensure that the final data reflects these changes or that you handle them gracefully.
115-121
: Consider referencing concurrency checks for getSpan.Similar to other event-based functions, concurrent updates might cause stale data. Confirm the
eventRepository.getSpan
call supports concurrency-safe queries or use version checks to ensure data consistency.
130-156
: Validate that the run model is consistent with the new engine fields.The newly added fields
engine
,masterQueue
,secondaryMasterQueue
,error
, andoutput
are fetched here. Confirm these fields are set consistently across the codebase, and that partial updates or missing fields won't cause undefined behaviors in views or downstream logic.apps/webapp/app/v3/eventRepository.server.ts (8)
13-13
: Type refinement from AuthenticatedEnvironment to TaskEventEnvironment.The change from
AuthenticatedEnvironment
toTaskEventEnvironment
provides a more specific and appropriate type for event handling. This is a good refinement to improve type safety and better represent the environment context needed for task events.Also applies to: 88-88
62-62
: Well-implemented debug flag propagation through type system.The
isDebug
property has been consistently added to all relevant type definitions, ensuring proper type checking throughout the event flow. This enables better debugging capabilities while maintaining type safety.Also applies to: 129-129, 174-174, 206-206
289-314
: Good use of TypeScript'ssatisfies
operator.Using the
satisfies
operator ensures thecompletedEvent
object conforms to theCreatableEvent
type while preserving the specific literal types of the object properties, which is more type-safe than a simple cast.
665-677
: Clean implementation of entity metadata extraction.The code properly extracts entity type and ID from the event properties, which enhances the event data model. This maintains good separation of concerns by isolating the extraction logic.
835-848
: Enhanced recordEvent method with improved flexibility.The method signature now includes optional
duration
andparentId
parameters, providing more flexibility in event recording. The implementation now properly calculates duration when an endTime is provided.
868-872
: Consistent debug flag handling for UI presentation.The code now properly updates the icon based on the debug status, which helps distinguish debug events visually in the UI. This is a good UX improvement.
886-891
: Proper propagation of debug state to event level.Debug events are now correctly assigned "WARN" level instead of "TRACE", making them more visible in logging systems. This is a good practice for ensuring debug information stands out appropriately.
1713-1771
: Well-structured error handling in recordRunEvent.The function has a clear return type that explicitly describes success and failure cases, with proper error codes and error information. The try/catch block with appropriate logging enhances the reliability of the event recording system.
apps/webapp/app/v3/services/batchTriggerV4.server.ts (1)
531-537
:❓ Verification inconclusive
Handle potential NULL run case.
If
processBatchTaskRunItem
returnsundefined
, the code throws a generic error but continues to try processing subsequent items. This could lead to inconsistent batch states.Consider handling this more robustly by either:
- Using an early return pattern after logging the error
- Implementing a retry mechanism for the specific failed item
- Ensuring proper transaction rollback to maintain data consistency
🏁 Script executed:
#!/bin/bash
# Check if there are any instances where a batch run contains null/undefined run IDs
rg -A 10 $'runIds: \[' apps/webapp/app

Length of output: 1944
Review Action Required: Enhance Handling of Undefined Task Runs
The current implementation logs an error and throws a generic exception when a task run returns undefined. This approach can lead to continued processing of subsequent items and potential batch state inconsistencies. Based on the grep output (which shows initialized empty run arrays), it looks like there’s attention given to run identifiers, but explicit handling for undefined task runs remains absent.
Please consider one of the following improvements:
- Early Exit: After logging, return immediately to halt further processing.
- Retry Mechanism: Implement retry logic for the failed item.
- Transaction Rollback: Ensure a proper rollback of the batch process if an item fails.
Review these changes to ensure they align with your transaction and error-handling strategies.
apps/webapp/app/v3/services/triggerTaskV2.server.ts (1)
43-45
: Check default attempts and over-retries. There's an `attempt` parameter that defaults to 0. Combined with `MAX_ATTEMPTS`, the code raises an error after `attempt > MAX_ATTEMPTS`. Carefully verify that you don't introduce an endless loop if an unexpected race condition resets it.
I'd recommend adding a safety mechanism like a retry counter to ensure we don't enter an infinite loop due to race conditions. For instance, you might want to introduce logic to detect cyclical retry patterns.
//e.g. a bug
switch (error.code) {
  case "RECURSIVE_WAIT_DEADLOCK":
    return "COMPLETED_WITH_ERRORS";
  case "TASK_RUN_CANCELLED":
    return "CANCELED";
  case "MAX_DURATION_EXCEEDED":
    return "TIMED_OUT";
  case "TASK_PROCESS_OOM_KILLED":
  case "TASK_PROCESS_MAYBE_OOM_KILLED":
  case "TASK_PROCESS_SIGSEGV":
  case "DISK_SPACE_EXCEEDED":
  case "OUTDATED_SDK_VERSION":
  case "HANDLE_ERROR_ERROR":
  case "TASK_RUN_CRASHED":
  case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
    return "CRASHED";
  case "COULD_NOT_FIND_EXECUTOR":
  case "COULD_NOT_FIND_TASK":
  case "COULD_NOT_IMPORT_TASK":
  case "CONFIGURED_INCORRECTLY":
  case "TASK_ALREADY_RUNNING":
  case "TASK_PROCESS_SIGKILL_TIMEOUT":
  case "TASK_RUN_HEARTBEAT_TIMEOUT":
  case "TASK_DEQUEUED_INVALID_STATE":
  case "TASK_DEQUEUED_QUEUE_NOT_FOUND":
  case "TASK_RUN_DEQUEUED_MAX_RETRIES":
  case "TASK_RUN_STALLED_EXECUTING":
  case "TASK_RUN_STALLED_EXECUTING_WITH_WAITPOINTS":
  case "TASK_HAS_N0_EXECUTION_SNAPSHOT":
  case "GRACEFUL_EXIT_TIMEOUT":
  case "TASK_INPUT_ERROR":
  case "TASK_OUTPUT_ERROR":
  case "POD_EVICTED":
  case "POD_UNKNOWN_ERROR":
  case "TASK_EXECUTION_ABORTED":
  case "TASK_EXECUTION_FAILED":
  case "TASK_PROCESS_SIGTERM":
  case "TASK_DEQUEUED_INVALID_RETRY_CONFIG":
  case "TASK_DEQUEUED_NO_RETRY_CONFIG":
  case "TASK_DID_CONCURRENT_WAIT":
    return "SYSTEM_FAILURE";
💡 Verification agent
🧩 Analysis chain
Consider adding unit tests
This function handles critical error mapping logic that affects how task runs are reported. To ensure reliability, consider adding comprehensive unit tests that cover all error code mappings, especially for edge cases.
🏁 Script executed:
#!/bin/bash
# Search for existing tests for this function
echo "Searching for tests for runStatusFromError:"
rg "runStatusFromError" --type test
rg "runStatusFromError" --type spec
# Look for test directory structure
echo -e "\nLooking for test files in the run-engine package:"
find internal-packages/run-engine -name "*.test.ts" -o -name "*.spec.ts" | grep -i error
Length of output: 552
Action Required: Add Comprehensive Unit Tests for Error Mapping Logic
The function in `internal-packages/run-engine/src/engine/errors.ts` (lines 13–54) that maps error codes to statuses currently appears to lack targeted unit tests. Our initial search for tests referencing `runStatusFromError` did not yield any results. Given the critical nature of this error-mapping functionality—which directly impacts task-run reporting—we strongly recommend adding comprehensive unit tests that cover all error code cases as well as potential edge cases.
- Location: `internal-packages/run-engine/src/engine/errors.ts` (inside the error mapping switch statement)
- Recommendation: Implement unit tests that validate each case, ensuring that unexpected or edge-case error codes are handled appropriately.
// FIXME: Required for unmanaged workers
readonly isLatestDeployment = true;
|
💡 Verification agent
🧩 Analysis chain
Address the FIXME comment for `isLatestDeployment`.
There's a FIXME comment indicating that the `isLatestDeployment` property is hardcoded to `true`. This could lead to incorrect behavior if not properly addressed.
Consider implementing a proper check to determine if a deployment is the latest version:
🏁 Script executed:
#!/bin/bash
# Description: Search for any existing logic in the codebase that determines if a deployment is the latest
# Check for logic that determines latest deployment
echo "Searching for code that determines latest deployment..."
rg -A 10 -B 10 "latest.*deployment|deployment.*latest" --type ts
Length of output: 5128
Action Required: Replace the hardcoded `true` with a dynamic check for the latest deployment
The search results reveal that our codebase already implements logic to verify the latest deployment (e.g., in `apps/webapp/app/v3/services/retryDeploymentIndexing.server.ts` and within the API client in `packages/cli-v3/src/apiClient.ts`). Instead of forcing `isLatestDeployment` to always be `true`, it would be more robust to align its behavior with the existing implementations that compare deployment IDs or fetch the latest deployment status.
- Change Required: Refactor the `isLatestDeployment` property in `apps/webapp/app/v3/services/worker/workerGroupTokenService.server.ts` (lines 505-507) to dynamically assess whether the deployment is indeed the latest.
- Suggestion: Consider reusing or adapting the logic from the retry deployment indexing service or the API client methods that fetch the latest deployment details. If unmanaged workers can't leverage the full dynamic check, update the FIXME comment with an explanation and a plan to address this gap later.
const batch = await this._prisma.batchTaskRun.create({
  data: {
    id: BatchId.fromFriendlyId(batchId),
    friendlyId: batchId,
    runtimeEnvironmentId: environment.id,
    runCount: body.items.length,
    runIds: [],
    payload: payloadPacket.data,
    payloadType: payloadPacket.dataType,
    options,
    batchVersion: "runengine:v1",
    oneTimeUseToken: options.oneTimeUseToken,
  },
});

if (body.parentRunId && body.resumeParentOnCompletion) {
  await this._engine.blockRunWithCreatedBatch({
    runId: RunId.fromFriendlyId(body.parentRunId),
    batchId: batch.id,
    environmentId: environment.id,
    projectId: environment.projectId,
    organizationId: environment.organizationId,
  });
}
🛠️ Refactor suggestion
Refactor duplicated batch creation logic.
There's significant code duplication between the inline and async batch creation paths. Consider extracting the common batch creation logic into a private method.
+async #createBatchTaskRun(
+  batchId: string,
+  runCount: number,
+  payloadPacket: IOPacket,
+  environment: AuthenticatedEnvironment,
+  oneTimeUseToken?: string,
+  options?: any
+) {
+  return await this._prisma.batchTaskRun.create({
+    data: {
+      id: BatchId.fromFriendlyId(batchId),
+      friendlyId: batchId,
+      runtimeEnvironmentId: environment.id,
+      runCount,
+      runIds: [],
+      payload: payloadPacket.data,
+      payloadType: payloadPacket.dataType,
+      options,
+      batchVersion: "runengine:v1",
+      oneTimeUseToken,
+    },
+  });
+}
Then call this method from both places.
Also applies to: 245-269
start: result.workingIndex,
// This will be the remainder of the slice
// for example if the original range was 0-50 and the workingIndex is 25, the new range will be 25-25
// if the original range was 51-100 and the workingIndex is 75, the new range will be 75-25
count: options.range.count - result.workingIndex - options.range.start,
},
Fix incorrect calculation of remaining range count.
The current calculation for determining the new count in parallel processing could lead to incorrect values. Only the number of items already consumed within the slice (`workingIndex - start`) should be subtracted from the original count.
-count: options.range.count - result.workingIndex - options.range.start,
+count: options.range.count - (result.workingIndex - options.range.start),
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
start: result.workingIndex,
// This will be the remainder of the slice
// for example if the original range was 0-50 and the workingIndex is 25, the new range will be 25-25
// if the original range was 51-100 and the workingIndex is 75, the new range will be 75-25
count: options.range.count - (result.workingIndex - options.range.start),
},
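To sanity-check the corrected arithmetic, here is a small standalone sketch; the helper name and shape are illustrative only, not the actual service code:

```typescript
// Illustrative helper: given the original slice (start, count) and the index
// where processing stopped, compute the remaining slice. Only the items
// already consumed within this slice (workingIndex - start) are subtracted.
function remainingRange(start: number, count: number, workingIndex: number) {
  return {
    start: workingIndex,
    count: count - (workingIndex - start),
  };
}

// Range 0-50 stopped at 25 → 25 items left starting at 25.
console.log(remainingRange(0, 50, 25)); // { start: 25, count: 25 }
// Range 50-100 stopped at 75 → 25 items left starting at 75.
console.log(remainingRange(50, 50, 75)); // { start: 75, count: 25 }
// The old formula (count - workingIndex - start) would give -75 in the second case.
```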
dependentBatchRun &&
dependentBatchRun.dependentTaskAttempt &&
(isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
  isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
Use optional chaining for safer property access.
The explicit `&&` chain already guards against `dependentBatchRun.dependentTaskAttempt` being undefined, but optional chaining expresses the same guard more concisely (this is what the Biome `useOptionalChain` lint flags).
if (
dependentBatchRun &&
- dependentBatchRun.dependentTaskAttempt &&
- (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
- isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
+ dependentBatchRun.dependentTaskAttempt?.status &&
+ (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
+ isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
📝 Committable suggestion
if (
  dependentBatchRun &&
  dependentBatchRun.dependentTaskAttempt?.status &&
  (isFinalAttemptStatus(dependentBatchRun.dependentTaskAttempt.status) ||
    isFinalRunStatus(dependentBatchRun.dependentTaskAttempt.taskRun.status))
) {
🧰 Tools
🪛 Biome (1.9.4)
[error] 244-245: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
if (!workerGroup) {
  logger.error("Default worker group not found", {
    projectId: environment.projectId,
  });

  return;
}
Handle missing worker group more gracefully.
If no default worker group is found, the function logs an error and returns undefined, which could lead to a null reference exception when the caller attempts to use the return value.
if (!workerGroup) {
logger.error("Default worker group not found", {
projectId: environment.projectId,
});
- return;
+ throw new ServiceValidationError(
+ `Cannot trigger ${taskId}: no default worker group found for project ${environment.projectId}`
+ );
}
📝 Committable suggestion
if (!workerGroup) {
  logger.error("Default worker group not found", {
    projectId: environment.projectId,
  });

  throw new ServiceValidationError(
    `Cannot trigger ${taskId}: no default worker group found for project ${environment.projectId}`
  );
}
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal-packages/database/prisma/schema.prisma (2)
1755-1757: Priority Management in TaskRun
The new `priorityMs` field in `TaskRun` enables adjusting the effective queue timestamp by applying a negative offset. This is a useful mechanism for prioritizing certain runs. It might be beneficial to document the expected range or behavior of this value so that future maintainers understand its impact.
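As a concrete illustration of the negative-offset semantics, a minimal sketch (the function name is hypothetical; the engine's actual ordering logic may differ):

```typescript
// Hypothetical sketch of how a priority offset could shift queue ordering:
// a run with priorityMs = 60_000 is ordered as if enqueued one minute earlier.
function effectiveQueueTimestamp(queueTimestamp: Date, priorityMs: number): Date {
  return new Date(queueTimestamp.getTime() - priorityMs);
}

const enqueuedAt = new Date("2025-01-01T00:10:00.000Z");
console.log(effectiveQueueTimestamp(enqueuedAt, 60_000).toISOString());
// → 2025-01-01T00:09:00.000Z
```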
2538-2554: BatchTaskRun Model Enhancements for Engine v2
The `BatchTaskRun` model has been significantly extended:
- New fields for friendly IDs (`runIds`), `runCount`, `payload`, `payloadType`, `options`, and `batchVersion` have been added.
- For engine v2, arrays to capture `executionSnapshots`, `runsBlocked`, and `waitpoints` are included.
This expanded model should provide the necessary data to manage batch processing more effectively. Please ensure that the relationships and unique constraints between batch items, snapshots, and waitpoints are well tested and that migration scripts are provided if needed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- internal-packages/database/prisma/migrations/20250304184614_remove_task_run_first_attempt_started_at_column/migration.sql (1 hunks)
- internal-packages/database/prisma/schema.prisma (18 hunks)
✅ Files skipped from review due to trivial changes (1)
- internal-packages/database/prisma/migrations/20250304184614_remove_task_run_first_attempt_started_at_column/migration.sql
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (15)
internal-packages/database/prisma/schema.prisma (15)
151-152: Addition of Worker Fields to Organization
The new fields `workerGroups` and `workerInstances` have been added to the `Organization` model to associate worker groups and instances. This design seems appropriate for grouping and managing these entities.
454-456: Introduction of Engine Field in Project Model
The `Project` model now includes a new field `engine` of type `RunEngineVersion` with a default of `V1` alongside the `version` field set to `V2`. Given that this PR is for Run Engine 2.0 (alpha), please verify that the default engine version is intentional. Consider whether the alpha testing should instead default to `V2` or if explicit switching is desired.
459-464: Enhancements to Project Worker Associations
Additional arrays for `workerGroups` and `workers`, as well as the optional relation `defaultWorkerGroup` (with its ID field), have been added to the `Project` model. These changes improve the linkage between projects and worker group entities. Ensure that these relations are in sync with your application logic and that any index or constraint considerations for optimizing query performance are addressed.
424-427: Expanded RuntimeEnvironment Relationships
The `RuntimeEnvironment` model now includes arrays for `taskRunCheckpoints`, `waitpoints`, `workerInstances`, and `executionSnapshots`. This extension should help in tracking task execution progress and system state at the environment level. Please verify that the engine logic which updates these fields maintains consistency, and consider adding documentation on how they interact.
1578-1579: BackgroundWorker Engine Version Field Added
The `BackgroundWorker` model now declares an `engine` field of type `RunEngineVersion` with a default value of `V1`. This change sets the groundwork for differentiating between run engine versions at the worker level. Confirm that this default is aligned with your intended rollout strategy for Engine 2.0 alpha.
1603-1605: Linking BackgroundWorkers to Worker Groups
A new relation field `workerGroup` (and its corresponding `workerGroupId`) has been added to the `BackgroundWorker` model. This improves the grouping and management of background workers. Ensure that cascading and nullability behavior (i.e. using `onDelete: SetNull`) is consistent with your overall data integrity requirements.
1687-1687: TaskRun Engine Field Introduction
The `TaskRun` model now gets an `engine` field (of type `RunEngineVersion`) with a default of `V1`. This is an important addition for enabling engine version-based behavior throughout task runs. Please verify that related business logic correctly discriminates between V1 and V2 as needed.
1714-1719: New Queue and Attempt Fields in TaskRun
The additions of `masterQueue` (with default `"main"`), an optional `secondaryMasterQueue`, and `attemptNumber` fields to the `TaskRun` model provide enhanced control over task queuing and execution tracking for engine v2+. These fields will be key for managing the run order and retries. Ensure that their usage is thoroughly documented and tested in the engine workflow.
1770-1774: Waitpoint Integration in TaskRun
By introducing the `associatedWaitpoint` (as an optional relation) and `blockedByWaitpoints` (as an array of `TaskRunWaitpoint`), the model now embeds waitpoint logic directly into the task run. This should help in preventing execution when there are pending waitpoints. Please make sure that transactional updates properly reflect the completion state of these waitpoints across the run lifecycle.
1784-1784: Execution Snapshots Captured in TaskRun
The addition of the `executionSnapshots` array in `TaskRun` further aligns with the new tracking mechanisms introduced in the run engine. Confirm that data consistency is maintained and that snapshots are not duplicated or left stale between `TaskRun` and `TaskRunExecutionSnapshot`.
1927-1931: New Enum: RunEngineVersion
The new `RunEngineVersion` enum distinguishes between V1 and V2 versions of the run engine. This is fundamental to supporting multiple engine implementations simultaneously. Ensure that all engine-dependent logic across your codebase correctly references this enum, and update documentation accordingly.
1938-1944: TaskRunExecutionSnapshot Enhancements
In the `TaskRunExecutionSnapshot` model, the default engine is now set to `V2`, which indicates that snapshots are intended for runs using the new engine. The inline comments clarifying that V1 did not use snapshots are very helpful. Please verify that any historical data or migration scripts are adjusted accordingly.
2727-2731: New WorkerDeploymentType Enum Values
The enum `WorkerDeploymentType` now includes a `V1` member alongside `MANAGED` and `UNMANAGED`. This change appears designed to provide backwards compatibility for deployment records. Ensure that your migration strategy covers upgrading legacy deployments if needed, and consider whether a similar option for V2 might be required in future iterations.
2746-2747: WorkerDeployment Type Field Update
The `WorkerDeployment` model now explicitly includes a `type` field of type `WorkerDeploymentType` with a default of `V1`. This clarifies the deployment model and ties it to the appropriate engine version. Double-check that your deployment orchestration logic respects this field when triggering builds and rollouts.
2070-2105: ❓ Verification inconclusive
Comprehensive Updates to the Waitpoint Model
Several new fields have been introduced:
- `inactiveIdempotencyKey` for storing expired keys,
- `completedByTaskRunId` with its relation to mark the run that completed the waitpoint,
- `completedAfter` for DATETIME/MANUAL waitpoints,
- `completedByBatchId` for batch associations,
- `blockingTaskRuns` to list task runs blocked by this waitpoint, and
- `completedExecutionSnapshots`, along with output fields for storing waitpoint results.
These additions significantly enhance waitpoint tracking and actionability. Please verify that the unique constraint `@@unique([environmentId, idempotencyKey])` and other indexes do not introduce conflicts when multiple waitpoints are created. Also, ensure that downstream logic properly deactivates and re-generates idempotency keys as documented in the comments.
🏁 Script executed:
#!/bin/bash
# Search across the repository for occurrences of "idempotencyKey"
rg "idempotencyKey" .
# Search for occurrences of "inactiveIdempotencyKey" to compare references
rg "inactiveIdempotencyKey" .
Length of output: 65749
Ensure Downstream Idempotency Key Logic Handles New Constraints Correctly
The new fields in the schema, including `inactiveIdempotencyKey`, `completedByTaskRunId`, `completedAfter`, `completedByBatchId`, `blockingTaskRuns`, and `completedExecutionSnapshots`, integrate well with the updated waitpoint model. The unique constraint (`@@unique([environmentId, idempotencyKey])`) aligns with our migration files (e.g. `Waitpoint_environmentId_idempotencyKey_key`), and similar logic for deactivating and regenerating idempotency keys is evident in the engine (see `internal-packages/run-engine/src/engine/index.ts`, where `inactiveIdempotencyKey` is set and `idempotencyKey` regenerated).
Next Steps:
- Verify that when a waitpoint is completed, the active `idempotencyKey` is properly deactivated (saved in `inactiveIdempotencyKey`) and regenerated to avoid conflicts, especially under concurrent waitpoint creations.
- Check that downstream processes (e.g., task triggers and waitpoint handling) consistently use the new fields and adhere to the unique constraint.
- Ensure that any tests covering waitpoint key regeneration and conflict resolution are updated and passing.
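The deactivate-and-regenerate flow in the first bullet can be sketched as follows; the `WaitpointKeys` shape and UUID-based replacement key are assumptions for illustration, while the real logic lives in `internal-packages/run-engine/src/engine/index.ts`:

```typescript
import { randomUUID } from "node:crypto";

// Minimal illustrative shape; the real Prisma model has many more fields.
type WaitpointKeys = {
  idempotencyKey: string;
  inactiveIdempotencyKey?: string;
};

// On completion, preserve the old key for reference and mint a fresh one so
// the @@unique([environmentId, idempotencyKey]) constraint cannot collide
// with a later waitpoint reusing the same user-supplied key.
function rotateIdempotencyKey(wp: WaitpointKeys): WaitpointKeys {
  return {
    inactiveIdempotencyKey: wp.idempotencyKey,
    idempotencyKey: randomUUID(),
  };
}

const rotated = rotateIdempotencyKey({ idempotencyKey: "user-key-123" });
console.log(rotated.inactiveIdempotencyKey); // → user-key-123
```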
We're going to merge this to make working on it easier, but this does not mean that Run Engine 2.0 is available (yet). We're in the last stages before it will go into cloud for beta (and then we'll get the self-hosted version released).
Good luck @coderabbitai
Summary by CodeRabbit
New Features
Improvements
Chores