@icecrasher321
Collaborator

@icecrasher321 icecrasher321 commented Jan 2, 2026

Summary

  • When the HITL pause function failed, costs leaked through unlogged. Patched so the cost is always logged.
  • Trigger.dev OOM and timeout errors left the log in the running state and did not track cost. Added incremental cost logging so the cost is always captured.
  • Added a cron job to clean up the state of executions that died because of a worker crash [standard practice].
  • Cleaned up the types import structure in execution.

Type of Change

  • Bug fix

Testing

Tested all three improvements manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Jan 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Jan 2, 2026 9:56pm |

@greptile-apps
Contributor

greptile-apps bot commented Jan 2, 2026

Greptile Summary

This PR implements three critical improvements to prevent cost leakage and execution state corruption when worker processes crash or timeout:

Key Changes:

  • Incremental cost logging: Added onBlockComplete and flushAccumulatedCost to LoggingSession to write costs to the database after each block execution, ensuring costs are captured even if the worker crashes mid-execution
  • HITL pause error handling: Wrapped persistPauseResult calls in try-catch blocks across all execution paths (API, webhooks, schedules, Trigger.dev) to mark executions as failed if pause state cannot be persisted, preventing cost loss
  • Stale execution cleanup: New cron job at apps/sim/app/api/cron/cleanup-stale-executions/route.ts identifies executions stuck in running state for 30+ minutes and marks them as failed with appropriate error messages
  • Type consolidation: Moved ExecutionMetadata, SerializableExecutionState, and ExecutionCallbacks to executor/execution/types.ts for better organization
  • Simplified cost tracking: Removed the mergeCostModels method and its test coverage; cost is now initialized with BASE_EXECUTION_CHARGE upfront

The changes ensure that costs are always captured and execution state is properly cleaned up, even when Trigger.dev workers experience OOM errors or timeouts.
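
The incremental cost logging described above can be illustrated with a minimal sketch. Only the names onBlockComplete, flushAccumulatedCost, and BASE_EXECUTION_CHARGE come from this PR; the class names, the in-memory store standing in for the database, the shape of the block output, and the charge value are all assumptions made for illustration. The real methods presumably write to the database asynchronously.

```typescript
// Hypothetical sketch, not the PR's implementation.
// Assumed value for illustration only; the real constant lives in the codebase.
const BASE_EXECUTION_CHARGE = 0.001

// In-memory stand-in for the execution log table.
class FakeLogStore {
  private totals = new Map<string, number>()

  create(executionId: string, initialCost: number): void {
    this.totals.set(executionId, initialCost)
  }

  addCost(executionId: string, delta: number): void {
    const current = this.totals.get(executionId) ?? 0
    this.totals.set(executionId, current + delta)
  }

  get(executionId: string): number | undefined {
    return this.totals.get(executionId)
  }
}

// Assumed shape of a block's output; only the cost matters here.
interface BlockOutputSketch {
  cost?: { total: number }
}

class LoggingSessionSketch {
  private store: FakeLogStore
  private executionId: string
  private pendingCost = 0
  private completedBlocks: string[] = []

  constructor(store: FakeLogStore, executionId: string) {
    this.store = store
    this.executionId = executionId
  }

  // The log entry starts with the base charge, so even an execution
  // that dies before its first block has a recorded cost.
  start(): void {
    this.store.create(this.executionId, BASE_EXECUTION_CHARGE)
  }

  // Called after every block: accumulate the block's cost and flush at once.
  onBlockComplete(blockId: string, output: BlockOutputSketch): void {
    this.completedBlocks.push(blockId)
    this.pendingCost += output.cost?.total ?? 0
    this.flushAccumulatedCost()
  }

  // Persist whatever has accumulated since the last flush, then reset.
  flushAccumulatedCost(): void {
    if (this.pendingCost === 0) return
    this.store.addCost(this.executionId, this.pendingCost)
    this.pendingCost = 0
  }
}
```

Because each block's cost is written as soon as the block finishes, a worker that is killed mid-execution loses at most the cost of the block that was in flight.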

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-architected defensive improvements that add error handling and incremental persistence without modifying core execution logic. The incremental cost tracking pattern is robust, the HITL error handling is comprehensive across all execution paths, and the cron cleanup follows standard database maintenance patterns.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/logs/execution/logging-session.ts | Added incremental cost tracking via onBlockComplete and flushAccumulatedCost to capture costs as execution progresses, preventing cost loss on worker crashes |
| apps/sim/lib/logs/execution/logger.ts | Simplified cost tracking by removing mergeCostModels method and initializing executions with BASE_EXECUTION_CHARGE upfront |
| apps/sim/app/api/cron/cleanup-stale-executions/route.ts | New cron endpoint that identifies executions stuck in running state for 30+ minutes and marks them as failed |
| apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts | Added try-catch around persistPauseResult to prevent unhandled failures, now marks execution as failed if pause persistence fails |
| apps/sim/app/api/workflows/[id]/execute/route.ts | Added error handling for pause persistence failures in execution route |
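
The pause-persistence handling noted for human-in-the-loop-manager.ts and the execute route might look roughly like the sketch below. persistPauseResult and markAsFailed are named in this PR, but their signatures, the two interfaces, the handlePause helper, and the PauseOutcome type are assumptions introduced for illustration.

```typescript
// Hypothetical sketch, not the PR's implementation.

// Assumed minimal surface of the pause manager.
interface PauseManagerLike {
  persistPauseResult(executionId: string, snapshot: unknown): Promise<void>
}

// Assumed minimal surface of the logging session.
interface LoggingSessionLike {
  markAsFailed(message: string): Promise<void>
}

type PauseOutcome = { status: 'paused' } | { status: 'failed'; error: string }

async function handlePause(
  executionId: string,
  snapshot: unknown,
  pauseManager: PauseManagerLike,
  loggingSession: LoggingSessionLike
): Promise<PauseOutcome> {
  try {
    await pauseManager.persistPauseResult(executionId, snapshot)
    return { status: 'paused' }
  } catch (err) {
    // If the pause state cannot be saved, the execution can never be resumed,
    // so it is closed out as failed instead of being left in limbo.
    const reason = err instanceof Error ? err.message : String(err)
    const message = `Failed to persist pause state: ${reason}`
    await loggingSession.markAsFailed(message)
    return { status: 'failed', error: message }
  }
}
```

Returning a tagged outcome rather than rethrowing is a design choice of this sketch: it lets each caller (API route, webhook, schedule, Trigger.dev task) decide how to report the failure while the log entry is already finalized.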

Sequence Diagram

sequenceDiagram
    participant Client
    participant ExecutionCore
    participant LoggingSession
    participant Database
    participant PauseManager
    participant CronJob

    Note over Client,CronJob: Normal Execution Flow with Incremental Cost Tracking
    
    Client->>ExecutionCore: Execute Workflow
    ExecutionCore->>LoggingSession: start()
    LoggingSession->>Database: Create log entry with BASE_EXECUTION_CHARGE
    
    loop For each block execution
        ExecutionCore->>ExecutionCore: Execute Block
        ExecutionCore->>LoggingSession: onBlockComplete(blockId, output)
        LoggingSession->>LoggingSession: Accumulate cost
        LoggingSession->>Database: flushAccumulatedCost()
        Note over Database: Cost saved incrementally<br/>to prevent loss on crash
    end
    
    Note over Client,CronJob: HITL Pause Flow with Error Handling
    
    ExecutionCore->>ExecutionCore: Pause detected
    ExecutionCore->>PauseManager: persistPauseResult()
    
    alt Pause persistence succeeds
        PauseManager->>Database: Save pause state & snapshot
        PauseManager-->>ExecutionCore: Success
    else Pause persistence fails
        PauseManager-->>ExecutionCore: Error
        ExecutionCore->>LoggingSession: markAsFailed("Failed to persist pause state")
        LoggingSession->>Database: Update status to 'failed'
        Note over Database: Cost already saved<br/>from incremental tracking
    end
    
    Note over Client,CronJob: Worker Crash Protection
    
    rect rgb(255, 240, 240)
        Note over ExecutionCore: Worker crashes/times out
        ExecutionCore->>ExecutionCore: Process terminates
        Note over Database: Log remains in 'running' state<br/>Cost partially saved via incremental flush
    end
    
    Note over Client,CronJob: Stale Execution Cleanup (Cron)
    
    CronJob->>Database: Query executions in 'running' state > 30 min
    Database-->>CronJob: Return stale executions
    
    loop For each stale execution
        CronJob->>Database: Update status to 'failed'
        CronJob->>Database: Set error message
        CronJob->>Database: Set endedAt & totalDurationMs
    end
    
    Note over CronJob,Database: Cleanup ensures no executions<br/>remain stuck in running state
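
The cleanup step at the end of the diagram can be sketched as a pure function over log rows. The 30-minute threshold and the endedAt and totalDurationMs fields come from the summary above; the row shape, the function name, the error text, and the use of an in-memory array in place of a database query are assumptions for illustration.

```typescript
// Hypothetical sketch, not the PR's implementation.
const STALE_THRESHOLD_MS = 30 * 60 * 1000 // 30 minutes, per the summary above

// Assumed row shape; the real schema is not shown in this conversation.
interface ExecutionLogRow {
  executionId: string
  status: 'running' | 'completed' | 'failed'
  startedAt: Date
  endedAt?: Date
  totalDurationMs?: number
  error?: string
}

// Returns a new list in which every execution that has been 'running'
// for at least the threshold is marked as failed. Other rows pass through.
function markStaleExecutions(rows: ExecutionLogRow[], now: Date): ExecutionLogRow[] {
  const cutoff = now.getTime() - STALE_THRESHOLD_MS
  return rows.map((row): ExecutionLogRow => {
    const isStale = row.status === 'running' && row.startedAt.getTime() <= cutoff
    if (!isStale) return row
    return {
      ...row,
      status: 'failed',
      // Placeholder wording; the actual error message is an assumption.
      error: 'Execution did not finish and was marked as failed by cleanup',
      endedAt: now,
      totalDurationMs: now.getTime() - row.startedAt.getTime(),
    }
  })
}
```

In a real cron route the same predicate would more likely be expressed as a single database update with a WHERE clause on status and start time, so that stale rows are never loaded into memory.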

@icecrasher321
Collaborator Author

@greptile

@icecrasher321 icecrasher321 merged commit dc3de95 into staging Jan 2, 2026
11 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/logging-gaps branch January 3, 2026 01:54
waleedlatif1 added a commit that referenced this pull request Jan 3, 2026
…ext menu (#2672)

* feat(logs-context-menu): consolidated logs utils and types, added logs record context menu (#2659)

* feat(email): welcome email; improvement(emails): ui/ux (#2658)

* feat(email): welcome email; improvement(emails): ui/ux

* improvement(emails): links, accounts, preview

* refactor(emails): file structure and wrapper components

* added envvar for personal emails sent, added isHosted gate

* fixed failing tests, added env mock

* fix: removed comment

---------

Co-authored-by: waleed <walif6@gmail.com>

* fix(logging): hitl + trigger dev crash protection (#2664)

* hitl gaps

* deal with trigger worker crashes

* cleanup import strcuture

* feat(imap): added support for imap trigger (#2663)

* feat(tools): added support for imap trigger

* feat(imap): added parity, tested

* ack PR comments

* final cleanup

* feat(i18n): update translations (#2665)

Co-authored-by: waleedlatif1 <waleedlatif1@users.noreply.github.com>

* fix(grain): updated grain trigger to auto-establish trigger (#2666)

Co-authored-by: aadamgough <adam@sim.ai>

* feat(admin): routes to manage deployments (#2667)

* feat(admin): routes to manage deployments

* fix naming fo deployed by

* feat(time-picker): added timepicker emcn component, added to playground, added searchable prop for dropdown, added more timezones for schedule, updated license and notice date (#2668)

* feat(time-picker): added timepicker emcn component, added to playground, added searchable prop for dropdown, added more timezones for schedule, updated license and notice date

* removed unused params, cleaned up redundant utils

* improvement(invite): aligned styling (#2669)

* improvement(invite): aligned with rest of app

* fix(invite): error handling

* fix: addressed comments

---------

Co-authored-by: Emir Karabeg <78010029+emir-karabeg@users.noreply.github.com>
Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com>
Co-authored-by: waleedlatif1 <waleedlatif1@users.noreply.github.com>
Co-authored-by: Adam Gough <77861281+aadamgough@users.noreply.github.com>
Co-authored-by: aadamgough <adam@sim.ai>
waleedlatif1 pushed a commit that referenced this pull request Jan 8, 2026
* hitl gaps

* deal with trigger worker crashes

* cleanup import strcuture