fix: prevent MCP cancel scope errors from crashing agent-to-agent calls#1314
Open
jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
Open
fix: prevent MCP cancel scope errors from crashing agent-to-agent calls#1314jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical bug where MCP session cleanup crashes during agent-to-agent calls when sub-agents use MCP tools. The issue stems from anyio cancel scope violations when cleanup occurs across different asyncio tasks, causing 500 Internal Server Errors.
Changes:
- Added defensive exception handling in
KAgentMcpToolset.close()to catchBaseException(includingCancelledError) during MCP cleanup - Introduced
_safe_close_runner()method that isolates runner cleanup in a separate task usingasyncio.gather(return_exceptions=True)to prevent cancel scope corruption from propagating - Added logging for non-fatal cleanup errors to aid debugging without crashing the system
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py |
Overrides close() method to catch BaseException during MCP session cleanup, preventing CancelledError from escaping and corrupting cancel state |
python/packages/kagent-adk/src/kagent/adk/_agent_executor.py |
Adds _safe_close_runner() method that isolates runner cleanup in a separate task to prevent cancel scope errors from propagating to event queue teardown |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
When an orchestrator agent calls a sub-agent that uses MCP tools, the
MCP session cleanup can trigger anyio CancelScope violations. This
happens because cancel scopes entered in one asyncio task context get
exited in a different task (created by asyncio.wait_for in the upstream
google-adk _cleanup_toolsets). The resulting CancelledError propagates
upward and crashes the A2A event queue teardown, causing 500 errors.
This fix applies multiple layers of defense:
1. _agent_executor.py: Wrap execute() with a top-level CancelledError
guard that clears all pending task cancellations via Task.uncancel()
(Python 3.11+) and publishes a failed status event instead of letting
the error propagate to _run_event_stream in the A2A SDK.
2. _agent_executor.py: Run runner.close() in an isolated asyncio task
via asyncio.gather(return_exceptions=True), so any CancelledError or
cancel scope corruption stays contained. Only suppress the specific
cross-task cancel scope error ("cancel scope" + "different task"),
re-raise everything else.
3. _mcp_toolset.py: Override close() to catch BaseException (not just
Exception), since CancelledError is a BaseException in Python 3.9+
and the upstream McpToolset.close() only catches Exception. Only
suppress known anyio cross-task cancel scope errors.
4. _agent_executor.py: Widen _publish_failed_status_event catch from
Exception to BaseException (re-raising KeyboardInterrupt/SystemExit)
so residual CancelledError cannot escape the failure event publisher.
Fixes kagent-dev#1276
Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
df8b4b9 to
e4e1deb
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
runner.close()in a separate asyncio task usingasyncio.gather(return_exceptions=True)to prevent cancel scope corruption from propagating to the A2A event queue teardownKAgentMcpToolset.close()to catchBaseException(includingCancelledError) since the upstreamMcpToolset.close()only catchesExceptionRoot Cause
When an orchestrator agent delegates to a sub-agent with MCP tools:
asyncio.create_taskfromDefaultRequestHandler), enteringanyio.CancelScopecontexts_cleanup_toolsets()in google-adk wrapstoolset.close()inasyncio.wait_for(), which can create a new internal taskAsyncExitStack.aclose()in that new task tries to exit theCancelScopefrom Task P — a different taskAttempted to exit a cancel scope that isn't the current task's current cancel scopeCancelledErrorpropagates and crashesqueue.close()→queue.join(), causing a 500 Internal Server ErrorFix
Two layers of defense:
_agent_executor.py:_safe_close_runner()runsrunner.close()in an isolated asyncio task viaasyncio.gather(return_exceptions=True). AnyCancelledErroror cancel scope corruption is collected as a result rather than raised, keeping the event queue teardown safe._mcp_toolset.py: Overrideclose()to catchBaseException(not justException). In Python 3.9+,CancelledErroris aBaseException, so the upstreamMcpToolset.close()which only catchesExceptionlets it escape.Test plan
Fixes #1276