
feat: Some kind of risk level returned by servers #114

Open
domdomegg opened this issue Dec 13, 2024 · 5 comments

Labels
enhancement New feature or request

Comments

@domdomegg

domdomegg commented Dec 13, 2024

Is your feature request related to a problem? Please describe.

Most MCP client applications (such as the Claude Desktop app) ask users to approve many minor actions. Example dialog:

[Screenshot: example dialog asking the user to allow a tool call]

This can be frustrating for users if there are many tools they're trying to use. Having to do this many times will likely result in users defaulting to allowing everything (alarm fatigue). Some other MCP client apps might choose not to ask users for permission at all, which seems dangerous.

Ideally we want some way to:

  • Enable users to grant permission to run low-risk actions automatically
  • Flag to users when actions really are potentially high-risk, and the consequences (without alarm fatigue)

Describe the solution you'd like

Currently, the protocol does not provide a way for servers to indicate how 'risky' an action is (apart from maybe in a non-structured way in the description). There's also no straightforward way for the server to provide context about how risky a particular action would be.

One idea might be to add properties to the Tool data type, something like the following (a rough sketch appears after this list):

  • An expression of how risky an action is in general.
    • Intuitively, this might be thought of as a risk_level of low, moderate or high
    • In practice we might want something richer, and I invite ideas on how we can do this better! Maybe an array of risk types, each with an impact value. For example (ideally 'low' or 'high' would be defined on an impact scale for each category):
      • make_amazon_purchase => financial_risk: ~£100-£1000
      • access_google_drive => privacy_risk: moderate?
      • access_medical_records => privacy_risk: high
      • control_smart_light_state => disruption_risk: low
      • start_aws_server => financial_risk: ~£100, cyber_risk: moderate
      • [I think these categories could do with a lot of refinement, and I don't stand by them - I'm sure there are better papers exploring risk taxonomies of AI agents available!]
    • Another way this might be thought of is maybe like oauth scopes, which again define a class of actions usually by how impactful they can be.
  • A way for the server to easily respond with a more precise risk level for a given call
    • e.g. if making an Amazon purchase of a specific item, it might return an impact statement like 'This will authorise a payment of £25.99 to buy Cuddly Stuffed Animal Sloth Soft Toy'. It might also attach a structured risk statement of financial_risk: £25.99.
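For illustration only, here is a rough sketch of what those additions could look like in TypeScript. None of these fields exist in the spec; the names (risk, riskFactors, impactStatement) and the categories are invented to make the idea concrete:

```typescript
// Hypothetical extension of the Tool data type; nothing here is part of the current spec.
type RiskCategory = "financial_risk" | "privacy_risk" | "disruption_risk" | "cyber_risk";
type RiskLevel = "low" | "moderate" | "high";

interface ToolRiskAnnotation {
  // Coarse default risk for the tool as a whole.
  riskLevel?: RiskLevel;
  // Richer per-category estimates, e.g. { financial_risk: "high" }.
  riskFactors?: Partial<Record<RiskCategory, RiskLevel>>;
}

interface ToolWithRisk /* extends Tool */ {
  name: string;
  description?: string;
  risk?: ToolRiskAnnotation;
}

// Hypothetical per-call assessment a server could return for a specific request.
interface CallRiskAssessment {
  impactStatement?: string; // e.g. "This will authorise a payment of £25.99 to buy ..."
  riskFactors?: Partial<Record<RiskCategory, string>>; // e.g. { financial_risk: "£25.99" }
}
```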

Clients could then have more flexibility for how they want to warn users of actions. E.g.

  • maybe a user is happy auto-approving an AI system taking any actions with moderate privacy risk (corresponding to a level of accessing general documents), but zero financial risk.
  • a user might be happy with AI systems reading any of the data in their database, but not editing any of it without checking with them

(In the future, AI systems might themselves be able to make these judgements based on a risk profile set by the user - e.g. evaluating the request against a user's risk appetite statement. Returning the risk information would then help such a system evaluate more complex rules, such as 'Autoapprove edits to database table X, but only allow read access to table Y' OR 'Autoapprove creating email drafts, but ask me before sending them.')
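To make the client-side half concrete, here is a minimal sketch (assuming the hypothetical riskFactors field above) of how a client might check a per-call assessment against a user-defined risk appetite before deciding whether to prompt. All names and thresholds are illustrative:

```typescript
// Hypothetical client-side policy check; field names and thresholds are illustrative only.
type Level = "low" | "moderate" | "high";
const order: Level[] = ["low", "moderate", "high"];

interface UserRiskAppetite {
  // Highest level the user is happy to auto-approve, per risk category.
  maxAutoApprove: Partial<Record<string, Level>>;
}

function shouldPrompt(
  appetite: UserRiskAppetite,
  callRisk: Partial<Record<string, Level>>
): boolean {
  return Object.entries(callRisk).some(([category, level]) => {
    const allowed = appetite.maxAutoApprove[category];
    // Unknown categories, or levels above the user's threshold, require a prompt.
    return allowed === undefined || order.indexOf(level!) > order.indexOf(allowed);
  });
}

// Example: moderate privacy risk is auto-approved, but any financial risk prompts.
shouldPrompt(
  { maxAutoApprove: { privacy_risk: "moderate" } },
  { financial_risk: "low" }
); // => true (prompt the user)
```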

Describe alternatives you've considered

I'm open to other ways of solving the problem (improving the safety of MCP servers by avoiding alarm fatigue).

@domdomegg domdomegg added the enhancement New feature or request label Dec 13, 2024
@jspahrsummers
Member

Thanks for filing this. Your proposal makes a lot of sense. I wonder if perhaps we could support an open-ended set of risk tags, on a normalized scale [0, 1] (similar to model preferences in sampling), with some predefined "well-known" tags codified in the spec.
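To illustrate, a sketch of what open-ended tags on a normalized scale could look like; only the [0, 1] scale is taken from the comment above, and the tag names and shape are invented:

```typescript
// Hypothetical open-ended risk tags on a normalized [0, 1] scale.
// A few "well-known" tags could be codified in the spec; servers could add their own.
interface RiskTags {
  // Well-known tags (illustrative names only):
  financial?: number;   // 0 = no financial impact, 1 = maximum
  privacy?: number;
  destructive?: number;
  // Open-ended extension point for server-specific tags:
  [tag: string]: number | undefined;
}

const makeAmazonPurchaseRisk: RiskTags = {
  financial: 0.6,
  privacy: 0.2,
};
```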

@jspahrsummers jspahrsummers added this to the DRAFT: 202X-XX-XX milestone Dec 16, 2024
@g0t4

g0t4 commented Jan 17, 2025

FTR we had a similar discussion here: https://github.com/orgs/modelcontextprotocol/discussions/69. I will compare it to the above and add any specific feedback.

@g0t4

g0t4 commented Jan 17, 2025

Obviously this is a repo about the spec alone; that said, here is how I see the bigger picture and what would be useful to consider. Simple would be best: if these changes are overly complicated, users won't bother.

Client config for prompts

Users should be able to configure each tool to: never approve, always approve, approve once per chat, conditionally approve (per tool use request), etc

For example, for a fetch server that simply downloads web pages, I would want it to never prompt me.

Tool risk

A tool can instruct users on risk/approval. This could apply as the default if the user doesn't configure an override. So, same choices: never approve, always approve, approve once per chat, conditionally approve (per tool use request).
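A minimal sketch of how that might fit together on the client side, assuming an invented config shape (nothing here is defined by the spec or any existing client):

```typescript
// Hypothetical per-tool approval policy, with user config overriding any
// server-suggested default.
type ApprovalPolicy =
  | "never_approve"
  | "always_approve"
  | "approve_once_per_chat"
  | "conditionally_approve";

interface ToolApprovalConfig {
  [toolName: string]: ApprovalPolicy;
}

const userOverrides: ToolApprovalConfig = {
  // A fetch tool that only downloads web pages should never prompt.
  fetch: "always_approve",
  // Anything that runs shell commands gets evaluated per tool use request.
  run_command: "conditionally_approve",
};
```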

Models can assess risk per tool use

When a user specifies to conditionally approve, why not let the model assess the risk of each tool use request?!

For example, in my mcp-server-commands, Claude knows well enough to mark the rm command as requires approval. Whereas cat/ls can be no approval, unless reading a sensitive file (e.g. passwords), which Claude can mark as maybe approve/requires approval. And let the client decide how to handle maybe approve (e.g. if I already selected approve once per chat for that tool, then don't prompt me).

Best part is, users can customize instructions in a system prompt to specify how they expect the model to interpret risk. For example, by default across all of my projects:

Treat low risk commands like ls and cat as no approval, unless handling sensitive files in which case mark as maybe approve. And, always mark commands that remove files as requires approval

Whereas, a project specific prompt might be:

Commands that modify files underneath the directory /tmp/scratchpad can all be marked no approval because it's a temp directory. If you feel that a command is still risky, you can overrule and prompt me.

Model trust

This would be the last feature I might consider adding if the above is insufficient.

For the most part I trust the models I am using and wouldn't use an untrusted model to do anything risky. That said, people could use a separate model/server solely to score the risk of each tool use request. So, some sort of mechanism to round trip a score (requires approval, no approval, maybe approve) right before the tool is used.
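A rough sketch of such a round trip, with an invented scoring interface (not something defined by MCP today); the heuristic inside stands in for a call to the separate scoring model or server:

```typescript
// Hypothetical: ask a separate, trusted scorer to rate a tool use request
// right before the client executes it.
type ApprovalVerdict = "requires_approval" | "maybe_approve" | "no_approval";

interface ToolUseRequest {
  toolName: string;
  arguments: Record<string, unknown>;
}

// In practice this would call out to the scoring model/server; how it does so
// would be a client implementation detail.
async function scoreToolUse(request: ToolUseRequest): Promise<ApprovalVerdict> {
  // Placeholder heuristic standing in for a real scorer.
  if (request.toolName === "run_command" && String(request.arguments.command).startsWith("rm")) {
    return "requires_approval";
  }
  return "no_approval";
}
```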

@tadasant

I totally agree with the problem statement here, namely:

This can be frustrating for users if there are many tools they're trying to use. Having to do this many times likely will result in users defaulting to allowing decisions (alarm fatigue).

One concern I have with the proposed solutions' direction is whether or not servers can have the right context to define risk. I think it's hard for a server to properly define the "risk" associated with a tool call in a general way that serves any client. For example, if I build a DuckDuckGo-like "private chat client" that promises to be privacy-forward, then something as simple as a fetch call that exposes the fact that I visited an external website might be considered outside normal "risk" bounds, whereas it's a very low/no-risk action in most other contexts.

It's certainly possible to work around this by well-defining categories and a risk taxonomy that everyone agrees on, as @domdomegg started to get into, but I worry it's a lot of complexity to introduce into the spec.

An alternative: what if we leave this up to client applications to manage?

The spec is currently pretty strongly worded on this topic:

Applications SHOULD: ... Present confirmation prompts to the user for operations, to ensure a human is in the loop

An example tweak that might resolve this issue, as far as the spec is concerned:

Applications SHOULD: ... Present confirmation prompts or configuration options to the user for risky operations

If we designate risk to be a concern of the client, I think the client already has everything it needs to manage the risk:

  • A client already understands the intent of a tool call because it has the tool name/description (case in point: all the examples in this thread so far have communicated the level of risk with one-liner descriptions of the tool). It also knows any arguments it is considering passing into the tool (e.g. private data like an email address), so it actually gets a fuller picture of the possible risk before the server ever could.
  • A client can store information from a user like "this server is trusted" or "this server is only partially trusted to do what it says" and incorporate that layer of risk accordingly
  • This can all be done separately from invoking the tool call, so we avoid the need for extending the Tool data type

For a client like Claude Desktop, I could imagine the implementation looking like:

  • Add a field to claude_desktop_config.json per-server to do one of "never approve, always approve, approve once per chat, conditionally approve"
  • Optionally allow some nested config to set more granular configs per-tool-call (and when that's missing, keep it in the on-the-fly UX like it is now)
  • In the "conditionally approve" case, you could allow another config option that allows a free-form description of the conditions, e.g. "when the command argument in this tool call is rm, require approval; if it's ls, don't require; everything else, approve once per chat." Though this one is tricky in that I'm not sure you'd want to introduce a separate inference step here, so maybe you'd want a more strictly defined taxonomy... either way, that'd be Claude Desktop's UX concern, not MCP's.

If it so chose, Claude Desktop could smooth over the "JSON config" part of the UX by just running inference on app startup, "based on what I know about my user's risk tolerance, let me configure the current never approve/always approve/etc JSON settings for every visible tool accordingly" and then re-run that step whenever it receives a listChanged notification.
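Purely as illustration, that per-server config might look like the sketch below (expressed as a TypeScript object for readability); the field names, server name, and tool names are invented and not options Claude Desktop actually supports:

```typescript
// Hypothetical shape for per-server approval settings in claude_desktop_config.json;
// server and tool names are made up for illustration.
const mcpServerApprovalConfig = {
  "my-database-server": {
    approval: "conditionally_approve",
    tools: {
      // Optional per-tool overrides; tools without one keep the on-the-fly UX.
      read_rows: { approval: "always_approve" },
      delete_rows: { approval: "never_approve" },
    },
    // Free-form conditions the client could interpret; as noted above, this
    // would be Claude Desktop's UX concern, not MCP's.
    conditions: "approve edits to table X automatically, but require approval for anything touching table Y",
  },
};
```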

Maybe down the line when there are more examples of popular clients, it'll be worth better-defining best practices for client implementations in the spec.

@allenporter

Yes, the server doesn't get to have a say in defining the risk level. The point of the confirmation is just as much about telling the user that you're sending their private data from a conversation to a remote tool as it is about the actual riskiness of the action a tool might perform. I agree with the framing that this is a client decision; the client just needs enough info in the spec to do this effectively, and then the rest is a client implementation detail.
