
feat: Some kind of risk level returned by servers #114

Open
domdomegg opened this issue Dec 13, 2024 · 5 comments

Labels
enhancement New feature or request

Comments

@domdomegg

domdomegg commented Dec 13, 2024

Is your feature request related to a problem? Please describe.

Most MCP client applications (such as the Claude Desktop app) ask users to approve many minor actions. Example dialog:

[Screenshot: example dialog asking the user to allow a tool call]

This can be frustrating for users if there are many tools they're trying to use. Having to do this many times will likely result in users defaulting to allowing everything (alarm fatigue). Some other MCP client apps might choose not to ask users for permission at all, which seems dangerous.

Ideally we want some way to:

  • Enable users to grant permission to run low-risk actions automatically
  • Flag to users when actions really are potentially high-risk, and the consequences (without alarm fatigue)

Describe the solution you'd like

Currently, the protocol does not provide a way for servers to indicate how 'risky' an action is (apart from maybe in a non-structured way in the description). There's also no straightforward way for the server to provide context about how risky a particular action would be.

One idea might be to add properties to the Tool data type, something like the following (a rough sketch appears after this list):

  • An expression of how risky an action is in general.
    • Intuitively, this might be thought of as a risk_level of low, moderate or high
    • In practice we might want something richer, and I invite ideas on how we can do this better! Maybe an array of risk types, each with an impact value. For example (ideally 'low' or 'high' would be defined on an impact scale for each category):
      • make_amazon_purchase => financial_risk: ~£100-£1000
      • access_google_drive => privacy_risk: moderate?
      • access_medical_records => privacy_risk: high
      • control_smart_light_state => disruption_risk: low
      • start_aws_server => financial_risk: ~£100, cyber_risk: moderate
      • [I think these categories could do with a lot of refinement, and I don't stand by them - I'm sure there are better papers exploring risk taxonomies of AI agents available!]
    • Another way this might be thought of is maybe like oauth scopes, which again define a class of actions usually by how impactful they can be.
  • A way for the server to easily respond with a more precise risk level for a given call
    • e.g. if making an Amazon purchase of a specific item, it might return an impact statement like 'This will authorise a payment of £25.99 to buy Cuddly Stuffed Animal Sloth Soft Toy'. It might also attach a structured risk statement of financial_risk: £25.99.
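For illustration only, here is a rough sketch of what those additions could look like in TypeScript. None of these fields exist in the spec; the names (risk, riskFactors, impactStatement) and the categories are invented to make the idea concrete:

```typescript
// Hypothetical extension of the Tool data type; nothing here is part of the current spec.
type RiskCategory = "financial_risk" | "privacy_risk" | "disruption_risk" | "cyber_risk";
type RiskLevel = "low" | "moderate" | "high";

interface ToolRiskAnnotation {
  // Coarse default risk for the tool as a whole.
  riskLevel?: RiskLevel;
  // Richer per-category estimates, e.g. { financial_risk: "high" }.
  riskFactors?: Partial<Record<RiskCategory, RiskLevel>>;
}

interface ToolWithRisk /* extends Tool */ {
  name: string;
  description?: string;
  risk?: ToolRiskAnnotation;
}

// Hypothetical per-call assessment a server could return for a specific request.
interface CallRiskAssessment {
  impactStatement?: string; // e.g. "This will authorise a payment of £25.99 to buy ..."
  riskFactors?: Partial<Record<RiskCategory, string>>; // e.g. { financial_risk: "£25.99" }
}
```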

Clients could then have more flexibility for how they want to warn users of actions. E.g.

  • maybe a user is happy auto-approving an AI system taking any actions with moderate privacy risk (corresponding to a level of accessing general documents), but zero financial risk.
  • a user might be happy with AI systems reading any of the data in their database, but not editing any of it without checking with them

(In the future, AI systems might themselves be able to make these judgements based on a risk profile set by the user - e.g. evaluating the request against a user's risk appetite statement. Returning the risk information would then help such a system evaluate more complex rules, such as 'Autoapprove edits to database table X, but only allow read access to table Y' OR 'Autoapprove creating email drafts, but ask me before sending them.')
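To make the client-side half concrete, here is a minimal sketch (assuming the hypothetical riskFactors field above) of how a client might check a per-call assessment against a user-defined risk appetite before deciding whether to prompt. All names and thresholds are illustrative:

```typescript
// Hypothetical client-side policy check; field names and thresholds are illustrative only.
type Level = "low" | "moderate" | "high";
const order: Level[] = ["low", "moderate", "high"];

interface UserRiskAppetite {
  // Highest level the user is happy to auto-approve, per risk category.
  maxAutoApprove: Partial<Record<string, Level>>;
}

function shouldPrompt(
  appetite: UserRiskAppetite,
  callRisk: Partial<Record<string, Level>>
): boolean {
  return Object.entries(callRisk).some(([category, level]) => {
    const allowed = appetite.maxAutoApprove[category];
    // Unknown categories, or levels above the user's threshold, require a prompt.
    return allowed === undefined || order.indexOf(level!) > order.indexOf(allowed);
  });
}

// Example: moderate privacy risk is auto-approved, but any financial risk prompts.
shouldPrompt(
  { maxAutoApprove: { privacy_risk: "moderate" } },
  { financial_risk: "low" }
); // => true (prompt the user)
```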

Describe alternatives you've considered

I'm open to other ways of solving the problem (improving the safety of MCP servers by avoiding alarm fatigue).

@domdomegg domdomegg added the enhancement New feature or request label Dec 13, 2024
@jspahrsummers
Member

Thanks for filing this. Your proposal makes a lot of sense. I wonder if perhaps we could support an open-ended set of risk tags, on a normalized scale [0, 1] (similar to model preferences in sampling), with some predefined "well-known" tags codified in the spec.
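To illustrate, a sketch of what open-ended tags on a normalized scale could look like; only the [0, 1] scale is taken from the comment above, and the tag names and shape are invented:

```typescript
// Hypothetical open-ended risk tags on a normalized [0, 1] scale.
// A few "well-known" tags could be codified in the spec; servers could add their own.
interface RiskTags {
  // Well-known tags (illustrative names only):
  financial?: number;   // 0 = no financial impact, 1 = maximum
  privacy?: number;
  destructive?: number;
  // Open-ended extension point for server-specific tags:
  [tag: string]: number | undefined;
}

const makeAmazonPurchaseRisk: RiskTags = {
  financial: 0.6,
  privacy: 0.2,
};
```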

@jspahrsummers jspahrsummers added this to the DRAFT: 202X-XX-XX milestone Dec 16, 2024
@g0t4

g0t4 commented Jan 17, 2025

FTR we had a similar discussion here: https://github.com/orgs/modelcontextprotocol/discussions/69. I will compare it to the above and add any specific feedback.

@g0t4

g0t4 commented Jan 17, 2025

Obviously this is a repo about the spec alone; that said, here is how I see the bigger picture and what would be useful to consider. Simple would be best: if these changes are overly complicated, users won't bother.

Client config for prompts

Users should be able to configure each tool to: never approve, always approve, approve once per chat, conditionally approve (per tool use request), etc

For example, for a fetch server that simply downloads web pages, I would want it to never prompt me.

Tool risk

A tool can instruct users on risk/approval. This could apply as the default if the user doesn't configure an override. So, same choices: never approve, always approve, approve once per chat, conditionally approve (per tool use request).
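A minimal sketch of how that might fit together on the client side, assuming an invented config shape (nothing here is defined by the spec or any existing client):

```typescript
// Hypothetical per-tool approval policy, with user config overriding any
// server-suggested default.
type ApprovalPolicy =
  | "never_approve"
  | "always_approve"
  | "approve_once_per_chat"
  | "conditionally_approve";

interface ToolApprovalConfig {
  [toolName: string]: ApprovalPolicy;
}

const userOverrides: ToolApprovalConfig = {
  // A fetch tool that only downloads web pages should never prompt.
  fetch: "always_approve",
  // Anything that runs shell commands gets evaluated per tool use request.
  run_command: "conditionally_approve",
};
```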

Models can assess risk per tool use

When a user specifies to conditionally approve, why not let the model assess the risk of each tool use request?!

For example, in my mcp-server-commands, Claude knows well enough to mark the rm command as requires approval. Whereas cat/ls can be no approval, unless reading a sensitive file (e.g. passwords), which Claude can mark as maybe approve/requires approval. And let the client decide how to handle maybe approve (e.g. if I already selected approve once per chat for that tool, then don't prompt me).

Best part is, users can customize instructions in a system prompt to specify how they expect the model to interpret risk. For example, by default across all of my projects:

Treat low risk commands like ls and cat as no approval, unless handling sensitive files in which case mark as maybe approve. And, always mark commands that remove files as requires approval

Whereas, a project specific prompt might be:

Commands that modify files underneath the directory /tmp/scratchpad can all be marked no approval because it's a temp directory. If you feel that a command is still risky, you can overrule and prompt me.

Model trust

This would be the last feature I might consider adding if the above is insufficient.

For the most part I trust the models I am using and wouldn't use an untrusted model to do anything risky. That said, people could use a separate model/server solely to score the risk of each tool use request. So, some sort of mechanism to round trip a score (requires approval, no approval, maybe approve) right before the tool is used.
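A rough sketch of such a round trip, with an invented scoring interface (not something defined by MCP today); the heuristic inside stands in for a call to the separate scoring model or server:

```typescript
// Hypothetical: ask a separate, trusted scorer to rate a tool use request
// right before the client executes it.
type ApprovalVerdict = "requires_approval" | "maybe_approve" | "no_approval";

interface ToolUseRequest {
  toolName: string;
  arguments: Record<string, unknown>;
}

// In practice this would call out to the scoring model/server; how it does so
// would be a client implementation detail.
async function scoreToolUse(request: ToolUseRequest): Promise<ApprovalVerdict> {
  // Placeholder heuristic standing in for a real scorer.
  if (request.toolName === "run_command" && String(request.arguments.command).startsWith("rm")) {
    return "requires_approval";
  }
  return "no_approval";
}
```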

@tadasant

I totally agree with the problem statement here, namely:

This can be frustrating for users if there are many tools they're trying to use. Having to do this many times likely will result in users defaulting to allowing decisions (alarm fatigue).

One concern I have with the proposed solutions' direction is whether or not servers can have the right context to define risk. I think it's hard for a server to properly define the "risk" associated with a tool call in a general way that serves any client. For example, if I build a DuckDuckGo-like "private chat client" that promises to be privacy-forward, then something as simple as a fetch call that exposes the fact that I visited an external website might be considered outside normal "risk" bounds, whereas it's a very low/no-risk action in most other contexts.

It's certainly possible to work around this by well-defining categories and a risk taxonomy that everyone agrees on, as @domdomegg started to get into, but I worry it's a lot of complexity to introduce into the spec.

An alternative: what if we leave this up to client applications to manage?

The spec is currently pretty strongly worded on this topic:

Applications SHOULD: ... Present confirmation prompts to the user for operations, to ensure a human is in the loop

An example tweak that might resolve this issue, as far as the spec is concerned:

Applications SHOULD: ... Present confirmation prompts or configuration options to the user for risky operations

If we designate risk to be a concern of the client, I think the client already has everything it needs to manage the risk:

  • A client already understands the intent of a tool call because it has the tool name/description (case in point: all the examples in this thread so far have communicated the level of risk with one-liner descriptions of the tool). It also knows any arguments it is considering passing into the tool (e.g. private data like an email address), so it actually gets a fuller picture of the possible risk before the server ever could.
  • A client can store information from a user like "this server is trusted" or "this server is only partially trusted to do what it says" and incorporate that layer of risk accordingly
  • This can all be done separately from invoking the tool call, so we avoid the need for extending the Tool data type

For a client like Claude Desktop, I could imagine the implementation looking like:

  • Add a field to claude_desktop_config.json per-server to do one of "never approve, always approve, approve once per chat, conditionally approve"
  • Optionally allow some nested config to set more granular configs per-tool-call (and when that's missing, keep it in the on-the-fly UX like it is now)
  • In the "conditionally approve" case, you could allow another config option that allows a free-form description of the conditions, e.g. "when the command argument in this tool call is rm, require approval; if it's ls, don't require; everything else, approve once per chat." Though this one is tricky in that I'm not sure you'd want to introduce a separate inference step here, so maybe you'd want a more strictly defined taxonomy... either way, that'd be Claude Desktop's UX concern, not MCP's.

If it so chose, Claude Desktop could smooth over the "JSON config" part of the UX by just running inference on app startup, "based on what I know about my user's risk tolerance, let me configure the current never approve/always approve/etc JSON settings for every visible tool accordingly" and then re-run that step whenever it receives a listChanged notification.
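Purely as illustration, that per-server config might look like the sketch below (expressed as a TypeScript object for readability); the field names, server name, and tool names are invented and not options Claude Desktop actually supports:

```typescript
// Hypothetical shape for per-server approval settings in claude_desktop_config.json;
// server and tool names are made up for illustration.
const mcpServerApprovalConfig = {
  "my-database-server": {
    approval: "conditionally_approve",
    tools: {
      // Optional per-tool overrides; tools without one keep the on-the-fly UX.
      read_rows: { approval: "always_approve" },
      delete_rows: { approval: "never_approve" },
    },
    // Free-form conditions the client could interpret; as noted above, this
    // would be Claude Desktop's UX concern, not MCP's.
    conditions: "approve edits to table X automatically, but require approval for anything touching table Y",
  },
};
```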

Maybe down the line when there are more examples of popular clients, it'll be worth better-defining best practices for client implementations in the spec.

@allenporter

Yes, the server doesn't get to have a say in defining the risk level. The point of the confirmation is just as much about telling the user that you're sending their private data from a conversation to a remote tool as it is about the actual riskiness of the action a tool might perform. I agree with the framing that this is a client decision; the client just needs enough info in the spec to do this effectively, and then the rest is a client implementation detail.
