Conversation

@SongChiYoung (Contributor) commented Apr 12, 2025

Why are these changes needed?

This PR introduces a reusable, event-driven McpSessionActor component for AutoGen’s MCP tool adapters.
Previously, each tool instance created its own isolated ClientSession, making it difficult to share a single session context across tools. This change solves that by:

  • Introducing a McpSessionActor that manages a single session per Agent via an async task.
  • Allowing safe concurrent usage of the same session by multiple tools.
  • Supporting serialization/deserialization of the session actor for config-based workflows.
  • Ensuring graceful shutdown of the session, even in atexit or test teardown scenarios.

This addresses hanging issues during test teardown and resolves `RuntimeError: Attempted to exit cancel scope in a different task` errors in anyio.
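
A rough usage sketch under this PR's design (exact signatures and method placement may differ from the final code; per the notes below, close() must be called explicitly when the tools are used outside an Agent):

```python
import asyncio

from autogen_core import CancellationToken
from autogen_ext.tools.mcp import StdioServerParams, mcp_server_tools


async def main() -> None:
    params = StdioServerParams(
        command="uvx",
        args=["mcp-server-fetch"],
        read_timeout_seconds=60,
    )
    # Under this PR's design, all tools returned here share one actor-managed session.
    tools = await mcp_server_tools(server_params=params)
    result = await tools[0].run_json({"url": "https://example.com"}, CancellationToken())
    print(result)
    # Outside an Agent, the shared session must be shut down explicitly.
    await tools[0].close()


asyncio.run(main())
```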

Related issue number

Closes #6198

Checks

Notes on Lifecycle Management and Design Decisions

  • This PR does not include a routine to reinitialize the session when an Agent or Team stops for various reasons. Instead, the actor shuts down gracefully when the process terminates.
    It's unclear whether the on_reset method is expected to be invoked in such cases, so I took a conservative approach. I'm happy to adjust this if there's guidance from the maintainers.

  • As seen in the test cases, when the session is used outside of an Agent (i.e., called directly), users must explicitly call the close() method.
    This design assumes lazy initialization and does not cause hangs as long as the session is constructed but not used.
    Support for use as an async context manager (with statement) was not included in this PR to keep things simple; it can be added in a separate PR if the community sees value in it.

  • Currently, the session is not reset inside on_reset() since doing so would require another initialize() on the next call, adding more complexity.
    If needed, I’m open to revisiting this, either in this PR or a follow-up, depending on guidance from domain experts or maintainers.

In summary: there may be edge cases that lead to a hang only if the session is used without an Agent and not closed explicitly.
This is considered a usage-side responsibility. Adding more complexity to guard against such nonstandard usage might lead to over-engineering.
As for on_reset, I tried to follow the “simple is best” philosophy by relying on graceful shutdown rather than adding extra lifecycle management.

If there’s a better idiomatic approach for AutoGen or MCP usage patterns, I’d greatly appreciate your advice.

@ekzhu (Contributor) commented Apr 13, 2025

> Currently, the session is not reset inside on_reset() since doing so would require another initialize() on the next call, adding more complexity.

Makes sense.

@codecov codecov bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 80.70175% with 22 lines in your changes missing coverage. Please review.

Project coverage is 77.27%. Comparing base (7e8472f) to head (646e75f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../autogen-ext/src/autogen_ext/tools/mcp/_session.py | 81.57% | 14 Missing ⚠️ |
| .../autogen-ext/src/autogen_ext/tools/mcp/_factory.py | 14.28% | 6 Missing ⚠️ |
| ...ges/autogen-ext/src/autogen_ext/tools/mcp/_base.py | 87.50% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #6284      +/-   ##
==========================================
+ Coverage   77.24%   77.27%   +0.02%     
==========================================
  Files         200      200              
  Lines       14473    14556      +83     
==========================================
+ Hits        11180    11248      +68     
- Misses       3293     3308      +15     
```

| Flag | Coverage Δ |
|---|---|
| unittests | 77.27% <80.70%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown.


@ekzhu (Contributor) left a review comment

I have changed my mind regarding the overall design I approved earlier.

I tried this with some examples and found that when there is an error from the tool, the tools are not properly closed, causing event loop closed errors -- as I expected, because I forgot to include a try/finally to close the tools.

Try this simple example:

```python
# Run inside an async context (e.g. wrap in async def main() and call asyncio.run(main())).
from autogen_core import CancellationToken
from autogen_ext.tools.mcp import StdioServerParams, mcp_server_tools

params = StdioServerParams(
    command="uvx",
    args=["mcp-server-fetch"],
    read_timeout_seconds=60,
)
tools = await mcp_server_tools(server_params=params)
assert tools is not None
assert tools[0].name == "fetch"
# Fetching a non-existent page makes the tool return an error; this should not break the session.
result = await tools[0].run_json({"url": "https://github.com/microsoft/not_exist"}, CancellationToken())
assert result is not None
```

Another weird issue: when one of the tools is closed, all of the tools are closed -- this can be counter-intuitive.

I think we still need to use a context manager for the session. We can still use the McpSessionActor wrapper; this way the implementation will be much easier and avoid confusion.

One more note: I think McpSession instead of McpSessionActor is more succinct.

@ekzhu (Contributor) left a review comment
Let's update the API doc to use a context manager for the McpSessionActor.

```python
server_params: McpServerParams


class McpSessionActor(ComponentBase[BaseModel], Component[McpSessionConfig]):
```

@ekzhu (Contributor) left a review comment

The McpSessionActor should be used as an async context manager.

It is very easy for users to accidentally forget to call close().

I think the new usage should be like this:

```python
async with McpSessionActor(server_params) as session:
    tools = await mcp_server_tools(session)
    # rest of code.
```

Otherwise we will need to call close() on every tool directly, which is fine in some cases but really easy to forget. So we should support the async context manager usage.

@ekzhu (Contributor) commented Apr 15, 2025

Also, let's add some unit tests that test mcp_server_tools directly to improve the code coverage.

@SongChiYoung (Contributor, Author) commented
@ekzhu

Thanks for the detailed suggestions! I think I now better understand your preference for using async with on McpSessionActor directly — it’s a clean way to enforce graceful shutdown and avoid forgetting close().

However, I’d like to share one constraint from real-world usage that makes that pattern difficult in practice:

🧩 Tools are often injected into Agents — not used inline

In most AutoGen use cases, we don’t call tools immediately after creation.
Instead, tools are passed into Agents or Teams as dependencies.

In this pattern, the actor must outlive the entire agent execution, which makes it very difficult to scope the McpSessionActor within an async with block before injecting it.

If we did, the session would be closed the moment the with block ends, even before the agent starts running.
This breaks composability.

🛠️ Why the current design favors robustness

To solve that, I moved session lifecycle outside of tool logic, and into the McpSessionActor, which:
• Shares a single session across multiple tools
• Allows lazy initialization
• Can be reused across agents
• Can be explicitly closed via await close() at safe points (e.g. at the end of a test or at app shutdown)

And yes, I absolutely agree that we should add __aenter__/__aexit__ support as optional sugar, so advanced users can still scope the actor explicitly (a sketch follows below).
But I believe the actor-based pattern is necessary to make tool injection work safely.
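
A minimal sketch of what that optional sugar could look like (hypothetical, not part of this PR's code; only the dunder methods are new here):

```python
class McpSessionActor:
    # ... existing actor state and methods (sketched) ...

    async def close(self) -> None:
        ...  # shut down the session task, as in this PR

    async def __aenter__(self) -> "McpSessionActor":
        # Entering the context does not force a connection (initialization stays lazy).
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Graceful shutdown when the block exits, even on error.
        await self.close()
```

With that in place, async with McpSessionActor(params) as actor: ... would close the session automatically, while the injection pattern above keeps working unchanged.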

Would love to hear your thoughts, and happy to iterate on this further!

@ekzhu (Contributor) commented Apr 15, 2025

> In most AutoGen use cases, we don’t call tools immediately after creation. Instead, tools are passed into Agents or Teams as dependencies.

Then all of the running of agents and teams should be within the session's context. I don't see why this is an issue. The session context clearly indicates the lifecycle of the session, and anything that depends on it should live within it.

If the session is not created as a context, then we can lazily initialize it. But how do you solve the issue that closing one tool closes all the tools?

@SongChiYoung (Contributor, Author) commented Apr 16, 2025

@ekzhu
Thanks for the suggestion.
I tried testing with a with block, but hit a few issues - so I wanted to first explain where things stand.

TL;DR

  1. To fix #6198 (McpToolAdapter Connects to the MCP Server in a Stateless Mode), all MCP tools must share the same session.
    --> That’s why closing one tool closes all others.
  2. Using with solves that, but it breaks existing user code and makes serialization hard.
    --> Found this thanks to your comment - turns out even my version has deserialization issues.
  3. The current Actor pattern avoids both problems.
    --> Runs in its own task and owns the session.
    --> Can inject the same session into multiple tools.

I’ll still try to support with in a clean way if possible.


1. #6198 root issue

The original issue was that the session closed too early when tools used a with block internally inside an Agent.

Later I found a bigger problem:
Tools in the same MCP server need to share the same session.

For example, web_surf and save_pdf are separate tools — but the PDF tool can’t access browsed pages unless they share a session.
That’s why I had to inject the same session into all tools when using mcp_server_tools().

2. Why `with` was tricky in practice

Yes, wrapping the whole thing in a with block technically works, but it breaks two things:

  • It changes the public API, so existing user code has to change
  • It breaks serialization/deserialization
    (Sessions are created outside the Agent/Team, so they can’t be deserialized)

Right now I’m working on making this session sharing work even in deserialization — your comment helped me catch this. Appreciate it.

3. Why I went with Actor

MCP SDK raises errors when the session is closed from a different task:

```
RuntimeError: athrow(): asynchronous generator is already running
RuntimeError: Attempted to exit cancel scope in a different task than it was entered in
```

That’s why I put the session inside a dedicated task - the Actor.
That way it’s always closed from the right place. Tools just talk to the Actor, and it owns the session.
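
A minimal sketch of the idea, assuming a generic async session factory (illustrative only, not this PR's actual implementation):

```python
import asyncio
from typing import Any


class SessionActor:
    """One dedicated task owns the session, so the session is entered and exited from
    the same task, which avoids the anyio cancel-scope errors quoted above."""

    def __init__(self, make_session) -> None:
        self._make_session = make_session  # async-context-manager factory for the session
        self._commands: asyncio.Queue = asyncio.Queue()
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        async with self._make_session() as session:  # entered and exited in this task only
            while True:
                name, args, fut = await self._commands.get()
                if name == "close":
                    fut.set_result(None)
                    return
                try:
                    fut.set_result(await getattr(session, name)(*args))
                except Exception as e:  # report tool errors without killing the actor
                    fut.set_exception(e)

    async def call(self, name: str, *args: Any) -> Any:
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._commands.put((name, args, fut))
        return await fut

    async def close(self) -> None:
        await self.call("close")
        await self._task  # the session is torn down inside its own task
```

Tools then delegate their calls to the actor (e.g. await actor.call("call_tool", name, arguments)) instead of holding the session themselves.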

4. About “closing one tool closes all”

This is expected - because tools share the same session.

I thought about tracking tool lifetimes and only closing the session when all are done, but it adds a lot of complexity and probably isn’t worth it.
If there’s a cleaner idea I’m open though.

Let me know if you think anything here’s off.
Happy to adjust direction.

@SongChiYoung (Contributor, Author) commented
@ekzhu
I think the current implementation is now in a pretty good place, and I’d love your thoughts.

✅ Summary of changes:

  • The session now lives inside McpToolAdapter, but is not closed at the end of with — this ensures it stays alive for as long as the Agent is running.
  • I updated the atexit handler to avoid event loop closed errors — it should now exit more gracefully.
  • I also fixed the issue where closing one tool would shut down the entire session: I'm now tracking a simple ref_count so the session only shuts down when all tools are done (see the sketch after this list).
  • Serialization and deserialization are working correctly again!
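
A minimal sketch of the ref-counting idea (names are illustrative and do not mirror the PR's exact code):

```python
class SharedMcpSession:
    """Reference-counted wrapper around the shared session actor: each tool acquires it
    when created and releases it when closed; the actor shuts down only after the last
    tool has released it."""

    def __init__(self, actor) -> None:
        self._actor = actor
        self._ref_count = 0

    def acquire(self) -> None:
        self._ref_count += 1

    async def release(self) -> None:
        self._ref_count -= 1
        if self._ref_count == 0:
            await self._actor.close()
```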

One note on the async with usage pattern:
As I mentioned in an earlier comment, I’m hesitant to require users to wrap tools in a with block at a higher level — mainly because it introduces real complications with serialization/deserialization.
That’s why I’m keeping the session lifecycle managed independently for now.

🤔 A quick question about tests:

From what I saw in the previous codecov results, the coverage didn’t drop significantly after the changes.
That said, if there are specific test cases you’d still like to see, I’d be happy to add them!
Many of the new raise statements were added for safety, and might be tricky to trigger intentionally.

I’ve also documented more details, but figured I’d hold off unless it’s helpful now (since it might be too much information all at once). Just let me know if you’d like me to share more!

@ekzhu (Contributor) commented Apr 16, 2025

> It breaks serialization/deserialization
> (Sessions are created outside the Agent/Team - so they can’t be deserialized)

I think the fundamental issue is that MCP server is not compatible with how AssistantAgent uses tools, which assumes tools are separate and stateless.

It's not something we can solve with this PR -- it will require changes to the AssistantAgent. We could pass the MCP server parameters as a constructor argument to AssistantAgent, which then sets up the session internally and manages it as part of its own lifecycle. This way, it will work perfectly with serialization and deserialization.
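
A rough sketch of that idea (purely hypothetical; mcp_server_params is not an existing AssistantAgent parameter):

```python
# Hypothetical: the agent would own the MCP session as part of its own lifecycle,
# so dumping and loading the agent's config could recreate the session from the params.
agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    mcp_server_params=params,  # hypothetical constructor argument
)
```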

@SongChiYoung (Contributor, Author) commented Apr 16, 2025

@ekzhu
You're absolutely right — I totally agree with your point.

But since passing MCP server parameters directly into Agent would introduce a breaking change,
how about making stateful tools optional instead?

(That way, the current PR wouldn't need to change much at all.)

Also, we could add close() hooks to Agent:

  • close() would terminate the session
  • on_reset() would re-initialize it

BTW even with your design, I think we’ll still need the Agent.close() hook for cleaning up the session,
and on_reset() must re-initialize it anyway.

So this might work either way.


Just one question:
Is there an existing hook when the Team terminates that could be used to trigger agent-level shutdowns?
I feel like there must be one based on State, but maybe I just don’t know the right place to look yet.

@ekzhu (Contributor) commented Apr 16, 2025

I think we can do that in a different PR.

For this PR, I want to avoid the complexity in the current implementation: reference counts, actors ... too many moving parts and potential for bugs.

Let's just focus on addressing the original issue with minimal fix, without worrying about serialization, which will be addressed in a separate PR.

So, for tools that require sharing a persistent session:

```python
async with create_mcp_server_session(server_params) as session:
    tools = await mcp_server_tools(server_params, session=session)
    agent = AssistantAgent(..., tools=tools)
```

This won't work with serialization but that's not the point. It addresses the immediate issue.

For tools that can use new session on each invocation:

```python
tools = await mcp_server_tools(server_params)
agent = AssistantAgent(..., tools=tools)
```

Same thing as before.

I believe my issue description in #6198 already described this.

@ekzhu (Contributor) commented Apr 16, 2025

I'd rather have a quick fix to this problem so I can make a release today :D

I can take on this if it's too late for you.

@SongChiYoung (Contributor, Author) commented Apr 16, 2025

@ekzhu
Totally fair — I understand the goal is to keep this PR minimal and focused. 👍

That said, I’ve been working on a use case where MCP + serialization is actually super important 😅
I’m building an internal tool, CompanyDeepResearch (it researches data and produces reports from the company's own data), on top of AutoGen + MCP, using the autogen-oaiapi wrapper I shared earlier on Discord.
For concurrent access and persistence, team (de)serialization (for copying teams) is pretty much essential.

So… just wanted to flag that there’s at least one user (me 😅) really hoping to see serialization support without introducing breaking changes to the API.

I’ll hold off for now and follow up in a separate PR as suggested!

By the way — it’s currently 4AM KST here 😅
I’m worried I might break something if I try to fix this right now — would you mind taking care of it for this issue? I’d really appreciate it! 🙏

@ekzhu (Contributor) commented Apr 16, 2025

> That said, I’ve been working on a use case where MCP + serialization is actually super important 😅

Understood. Let's get this issue over with and work on a more complete fix as a separate PR.

@ekzhu ekzhu closed this Apr 16, 2025
@SongChiYoung SongChiYoung deleted the FIX/McpToolAdapter_Connects_to_the_MCP_Server_in_a_StatelessMode branch April 16, 2025 22:38
ekzhu added a commit that referenced this pull request Apr 17, 2025
Resolves #6232, #6198

This PR introduces an optional parameter `session` to `mcp_server_tools`
to support reuse of the same session.

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.tools.mcp import StdioServerParams, create_mcp_server_session, mcp_server_tools


async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o", parallel_tool_calls=False)  # type: ignore
    params = StdioServerParams(
        command="npx",
        args=["@playwright/mcp@latest"],
        read_timeout_seconds=60,
    )
    async with create_mcp_server_session(params) as session:
        await session.initialize()
        tools = await mcp_server_tools(server_params=params, session=session)
        print(f"Tools: {[tool.name for tool in tools]}")

        agent = AssistantAgent(
            name="Assistant",
            model_client=model_client,
            tools=tools,  # type: ignore
        )

        termination = TextMentionTermination("TERMINATE")
        team = RoundRobinGroupChat([agent], termination_condition=termination)
        await Console(
            team.run_stream(
                task="Go to https://ekzhu.com/, visit the first link in the page, then tell me about the linked page."
            )
        )


asyncio.run(main())
``` 

Based on discussion in this thread: #6284, we will consider
serialization and deserialization of MCP server tools when used in this
manner in a separate issue.

This PR also replaces the `json_schema_to_pydantic` dependency with
built-in utils.
