Skip to content

GraphFlow State Persistence Bug: Workflow Gets Stuck After Interruption During Agent Transitions #7043

@droideronline

Description

@droideronline

What happened?


Summary

GraphFlow workflows become unrecoverable when interrupted during agent transitions, leading to a corrupted state where no agents are ready to execute despite having remaining work.


Environment

  • Framework: autogen-agentchat with GraphFlow
  • Python: 3.11
  • Issue Type: State persistence & resume functionality

Problem Description

When a GraphFlow workflow is interrupted (e.g., via KeyboardInterrupt) during the transition between agents, the saved state becomes corrupted.

On resume, the workflow terminates immediately with:

Digraph execution is complete

—even though agents still have remaining work.


Steps to Reproduce

  1. Create a GraphFlow with multiple agents in sequence.

  2. Start the workflow and let the first agent complete successfully.

  3. Interrupt the process (Ctrl+C) during the transition to the next agent.

  4. Attempt to resume using:

    team.load_state(saved_state)
    run_stream(task="continue")

Expected Behavior

  • Workflow should resume from the next agent in the sequence and continue execution seamlessly.

Actual Behavior

  • Workflow immediately terminates with "Digraph execution is complete".
  • All agents receive the "continue" message but none execute.
  • The ready queue is empty despite remaining work.

State File Analysis

The corrupted state shows:

{
  "GraphManager": {
    "remaining": {
      "next_agent": { "next_agent": 1 },
      // ... other agents with work remaining
    },
    "enqueued_any": {
      "next_agent": { "next_agent": false },
      // ... all agents show false
    },
    "ready": []  
  }
}
  • First agent completed – properly recorded in message history
  • Next agent not enqueued – transition interrupted before coordination
  • Workflow coordination lost – no agents ready despite remaining work

Root Cause

The GraphFlow coordination mechanism is interrupted before it can enqueue the next agent, leaving the system in an inconsistent state:

  • Remaining work exists
  • No agents are enqueued
  • The workflow appears "complete" but is actually stuck

Which packages was the bug in?

Python Core (autogen-core)

AutoGen library version.

Python dev (main branch)

Other library version.

No response

Model used

No response

Model provider

None

Other model provider

No response

Python version

3.11

.NET version

None

Operating system

MacOS

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions