MSC3059: Limits API — Part 3: Federated rate limiting #3059

Closed
wants to merge 13 commits into from
247 changes: 247 additions & 0 deletions proposals/3059-federated-rate-limits.md
@@ -0,0 +1,247 @@
# MSC3059: Limits API — Part 3: Federated per-user ratelimiting on Matrix

Not every server can scale on demand the way matrix.org can,
so some will need to place rate limits on users and rooms.
Users and admins should be able to check and modify those limits in a
standardised way, and servers should be able to communicate them
to each other in a standardised way. This has been mentioned
in **[#803](https://github.com/issues/803)**.

## The basics

As **[@ara4n](https://github.com/ara4n)** said in **[#803](https://github.com/issues/803)**:

> Suggestions are:
>
> * **Rate limit per-user**. This has the disadvantage that a server admin will have to
> reach down from the heavens and explicitly configure config or something for
> particular sets of users. This feels fragile and kludgy, and overlaps with
> the current AS-configuration stuff where we can configure rate limiting for particular
> namespaces of users (but only if they're an AS).
> * **Rate limit per-room(-per-user)**. This is nicer as we can just store it in room state,
> and people can set it based on power levels. It has the disadvantage though that after
> rate-limiting has been disabled in a huge room, someone can accidentally/deliberately
> still DoS the server out of existence. This could be extended to per-room-per-user rules
> too (i.e. let this particular user talk fast in this particular room) but that
> feels a bit overkill.
> * **Rate limit per-room(-per-user), but with units being egress-msg/s
> rather than ingress-msgs/s**
> This might be quite an elegant solution to prevent server overload. By specifying the limit
> in egress-msg/s, you can be confident that a room won't sprout lots of users and then overload
> the server — and it lets server admins specify a meaningful global cap per-server too.
> (i.e. configure that no user is allowed to trigger more than 100 egress messages
> per second, or whatever).

All rate limits in this proposal are applied by the homeserver.
In more detail: the initiating user's homeserver shall enforce the limits,
based on its own time of receiving the event. Other servers and receiving
clients check them, again based on the same `origin_server_ts` value. An event
is to be rejected immediately if it exceeds the rate limit set for that kind of event.
Member

Prefer active voice over passive - in this particular case, it's not clear if it's just the initiating user's homeserver that rejects the event, or also other parties you mentioned in the previous sentence (and in that latter case you really should leave a note on how that works given that homeservers' wallclocks can, and do, come in ridiculous discrepancy with each other; also in that latter case there's an opportunity to spoof timestamps, thereby evading rate limits).

Rate limiting events with negative values are ill-formed. The unit of rate
limits is events per second, i.e. `ev Hz`.
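
As an illustration only, the check on a single homeserver could look like the sketch below. The proposal does not name an algorithm, so the one-second sliding window, the class name `SlidingWindowLimiter` and the assumption that timestamps arrive in non-decreasing order are all hypothetical choices made for this example.

```python
from collections import deque


class SlidingWindowLimiter:
    """Accepts at most `limit` events in any one-second window (assumed algorithm)."""

    WINDOW_MS = 1000  # the proposal's unit is events per second

    def __init__(self, limit):
        if limit < 0:
            raise ValueError("negative rate limits are ill-formed")
        self.limit = limit
        self._accepted = deque()  # origin_server_ts (ms) of accepted events

    def allow(self, origin_server_ts):
        # Forget accepted events that fell out of the window ending at this event.
        while self._accepted and self._accepted[0] <= origin_server_ts - self.WINDOW_MS:
            self._accepted.popleft()
        # Reject if accepting this event would exceed the limit.
        if len(self._accepted) + 1 > self.limit:
            return False
        self._accepted.append(origin_server_ts)
        return True
```

With this sketch a limit of zero rejects every event, and a negative limit is refused when the limiter is constructed.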

A rate limit of zero for a particular event type means that the event type is
completely disallowed for the users the rate limit applies to. Rate limiting
events with universal scope and value zero are ill-formed, and so are rate
limiting events with value zero that cover only rate limiting events. If a
rate limiting event with value zero includes rate limiting events within its
scope, that limit is ignored for rate limiting events: the previous rate limit
continues to apply to them, within the scope of the previous rate limiting
events that covered them. Rate limit increases are retroactive, while
decreases apply going forward from the point the request is made.
Rate limiting events themselves follow a different rate limit excess policy
from other events:

Rate limiting events will not be rejected for exceeding the rate limits
unless the limit has been spent entirely by rate limiting events.
If a rate limiting event reaches a rate limit otherwise, the following
actions are to be taken in order, until the rate limits are obeyed again:
1. The last rate limit of the same scope sent from the same homeserver
Member

"Last" by which order?

will be replaced as if by editing, if the new limit has a greater
limit value than the old one.
2. The previous non-rate limiting, non-membership events are
invalidated according to state resolution order starting from the tip.

Rate limiting events in an end-to-end encrypted room that cover only end-to-end
encrypted events shall also be sent end-to-end encrypted; otherwise the
homeserver shall reject them as unauthorised.
Member

How would homeservers enforce such rate limits if they can't decrypt those, by definition?


## Per-user per-server event rate limiting semantics
Member

For this whole group the same question as for Limits Part 1 applies: how much should this really be unified across homeserver implementations, as long as it's down to homeserver admins to set these anyway? And by the way, how do you check authorisation?


To modify the per-user event rate limit of all users:
```
PUT /_matrix/client/r0/admin/limits/ HTTP/1.1

{
  "type": "m.limits.rate.user",
  "value": 123.4567
}
```

To modify the per-user event rate limit of all users for some event types:
```
PUT /_matrix/client/r0/admin/limits/scoped HTTP/1.1

{
  "type": "m.limits.rate.user",
  "limits": {"m.room.message": 123.4567, "m.ban": 1.234567}
}
```

To modify the per-user event rate limit of a particular user:
```
PUT /_matrix/client/r0/admin/limits/{user_id} HTTP/1.1

{
  "type": "m.limits.rate.user",
  "value": 123.4567
}
```

To modify the per-user event rate limit of a particular user for some event types:
```
PUT /_matrix/client/r0/admin/limits/{user_id}/scoped HTTP/1.1

{
  "type": "m.limits.rate.user",
  "limits": {"m.room.message": 123.4567, "m.ban": 1.234567}
}
```
Queries are made to the same paths, using the GET method instead.
Users can query the rate limits of users on the same homeserver.
To clear a limit, either DELETE the rate limit or send an
undefined value. A server is free to apply lower limits
than those set by these endpoints at some or all times.
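
A minimal sketch of how a client might assemble the four request shapes above. The helper names `limits_path` and `limits_body` are hypothetical, and percent-encoding the user ID in the path is an assumption based on common practice for Matrix path parameters, not something this proposal states.

```python
import json
from urllib.parse import quote

BASE_PATH = "/_matrix/client/r0/admin/limits"


def limits_path(user_id=None, scoped=False):
    """Build the endpoint path for all users or for one user."""
    if user_id is None:
        return BASE_PATH + ("/scoped" if scoped else "/")
    # Assumption: the user ID is percent-encoded as a single path segment.
    return f"{BASE_PATH}/{quote(user_id, safe='')}" + ("/scoped" if scoped else "")


def limits_body(value=None, limits=None):
    """Build the JSON body; pass either a single value or a per-type mapping."""
    if (value is None) == (limits is None):
        raise ValueError("pass exactly one of value or limits")
    rates = [value] if limits is None else list(limits.values())
    if any(rate < 0 for rate in rates):
        raise ValueError("negative rate limits are ill-formed")
    body = {"type": "m.limits.rate.user"}
    if limits is None:
        body["value"] = value
    else:
        body["limits"] = dict(limits)
    return json.dumps(body)
```

For example, `limits_path("@alice:example.org", scoped=True)` together with `limits_body(limits={"m.room.message": 123.4567})` would form the last request shown above for a hypothetical user.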

## Per-user per-room event rate limiting semantics

The event bodies are very similar to the per-user per-server limits above.
An example state event follows (it specifies both the power
level and roles for illustration, but normally the two will not appear
at the same time):

```
{
  "type": "m.limits.rate.user",
  "power_level": 0,
  "power_level.operator": "maximum",
  "users": [],
  "users.operator": "include",
  "roles": [],
  "roles.operator": "include_min(1)",
  "limits": {"m.room.message": 123.4567, "m.ban": 1.234567}
}
```

Limits are cleared and edited following the usual message editing conventions.

`users` is an unordered list of user MXIDs and/or aliases.
`power_level` is the power level to which the rate limit is applied.
`power_level.operator` is the comparison operator used with that power level.
Valid operators are:

* greater than or equal: `minimum`, `min`, `gte`, `greater_or_equal`
* equal: `equals`, `equal`, `exact`
* less than or equal: `maximum`, `max`, `lte`, `less_or_equal`
* greater: `greater`, `minimum_exclusive`, `minex`
* less: `less`, `maximum_exclusive`, `maxex`

`roles` is reserved and meant to include an unordered list of roles for a future
role-based access control. `roles.operator` and `users.operator` are
combinatoric operators that are applied to the user's MXID (`users`), or to the
user's roles and role limiting scope (`roles`), to evaluate whether the rate
limit applies to a given user. The valid combinatoric operators for roles are:

* `include_min({n})`: include at least `n` roles from the list
* `include_max({n})`: include at most `n` roles from the list
* `include({m},{n})`: include at least `m` and at most `n` roles from the list
* `include({n})`: include exactly `n` roles from the list
* `include_only_min({n})`: include at least `n` roles from the list and no others
* `include_only_max({n})`: include at most `n` roles from the list and no others
* `include_only({m},{n})`: include at least `m` and at most `n` roles from the list and no others
* `include_only({n})`: include exactly `n` roles from the list and no others
* `exclude_min({n})`: exclude at least `n` roles from the list
* `exclude_max({n})`: exclude at most `n` roles from the list
* `exclude({m},{n})`: exclude at least `m` and at most `n` roles from the list
* `exclude({n})`: exclude exactly `n` roles from the list

`all` is a placeholder representing all the roles in the list
and can be used as a parameter in those inclusion or exclusion operators.
The valid combinatoric operators for users are `include` and `exclude`.
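
To make the operator vocabulary concrete, here is a hedged sketch of how an implementation might normalise the power level aliases and parse a roles operator string. The function names are hypothetical, the comparison direction (user's level on the left, `power_level` on the right) is an assumption, and so is reading `all` as the size of the `roles` list.

```python
import operator
import re

# Alias table taken from the operator list in the proposal text.
POWER_LEVEL_OPERATORS = {
    **dict.fromkeys(("minimum", "min", "gte", "greater_or_equal"), operator.ge),
    **dict.fromkeys(("equals", "equal", "exact"), operator.eq),
    **dict.fromkeys(("maximum", "max", "lte", "less_or_equal"), operator.le),
    **dict.fromkeys(("greater", "minimum_exclusive", "minex"), operator.gt),
    **dict.fromkeys(("less", "maximum_exclusive", "maxex"), operator.lt),
}


def power_level_matches(user_level, threshold, op_name):
    """Assumed direction: compare the user's level against `power_level`."""
    if op_name not in POWER_LEVEL_OPERATORS:
        raise ValueError(f"unknown power_level.operator: {op_name!r}")
    return POWER_LEVEL_OPERATORS[op_name](user_level, threshold)


_ROLES_OP = re.compile(
    r"^(include_only|include|exclude)(_min|_max)?\((\w+)(?:,(\w+))?\)$"
)


def parse_roles_operator(text, list_size):
    """Return (kind, lowest_count, highest_count) for a roles operator string."""
    match = _ROLES_OP.match(text)
    if match is None:
        raise ValueError(f"unknown roles.operator: {text!r}")
    kind, bound, first, second = match.groups()

    def count(token):
        # Assumption: `all` stands for every role in the list.
        return list_size if token == "all" else int(token)

    if second is not None:
        if bound is not None:
            raise ValueError("a range cannot be combined with _min or _max")
        return kind, count(first), count(second)
    if bound == "_min":
        return kind, count(first), list_size
    if bound == "_max":
        return kind, 0, count(first)
    return kind, count(first), count(first)
```

How the parsed counts are matched against a user's roles is left open here, since roles are reserved for future use in this proposal.
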
Member

Is there a practical need for that? That defines some elaborate machinery - have you seen a need for that in real-world situations?

Author

@erkinalp Jun 20, 2021

Microsoft Exchange and on-premises installations of Office 365 (and by extension, Microsoft Teams) have a group policy feature allowing domain managers and team managers to enable/disable various features using policies against users/groups. The proposed scoping mechanism is intended to provide a similar override mechanism on Matrix (though, only for lowering the limits or disabling a feature completely, not for increasing the limits), as a second layer over usual level or role based access control.

Member

Group policies in Active Directory are a huge beast, developed over a couple of decades(!). I really urge you to not try to stuff it in here. Also, "because has it" is a rather poor rationale. Again, you should explain WHY on this MSC's own merits, not something else's.

Users having at least one of the `limits.rate` or `limits` powers can change
the per-room rate limit. However, all users can impose rate limits on themselves,
and other users cannot raise those self-imposed limits above
the self-imposed values. Both the `users` and `roles` parameters can be used together,
in which case the applicable user set is the union of the user set
from user filtering and the user set from role filtering.

The applicable user set of a rate limit is then calculated as follows:
* A rate limiting event that defines both power level based filtering and role
  based filtering (`power_level` and `roles`) at the same time
  is rejected.
* If both `power_level` and `roles` are omitted, `users` is defined,
  and `users.operator` is `exclude`, then the limit applies to all users
  except the ones in the list.
* If `power_level` is defined, users are filtered according to `power_level`
  and `power_level.operator`, then the `users` are added to or subtracted from
  that set according to `users.operator`.
* If `roles` is defined, the following rules apply:
  * If `roles.operator` is an exclusion operator, then take the set of all
    users and subtract the users whose roles match `roles` under `roles.operator`.
  * If `roles.operator` is an inclusion operator, then start with the empty set
    and add the users whose roles match `roles` under `roles.operator`.
  * If `users` is also defined, then the `users` are added to or subtracted
    from the resulting set, depending on `users.operator`.
* For burst definitions, the same rules apply, except that the maximum applicable
  user set is the set of users limited by the parent rate limit.
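
The rules above can be sketched as a set computation. This is an illustration under stated assumptions: `room_power_levels` is a hypothetical mapping from every room member's MXID to their power level, only one alias per power level operator is included to keep the example short, role filtering is not implemented because roles are reserved, and the `include`-only case with neither `power_level` nor `roles` is assumed to select just the listed users.

```python
import operator

# One alias per comparison; the proposal lists several more aliases.
COMPARATORS = {
    "minimum": operator.ge,
    "equals": operator.eq,
    "maximum": operator.le,
    "greater": operator.gt,
    "less": operator.lt,
}


def applicable_users(event, room_power_levels):
    """Return the set of MXIDs that a rate limiting event applies to."""
    has_level = "power_level" in event
    has_roles = "roles" in event
    if has_level and has_roles:
        raise ValueError("power_level and roles must not both be defined")
    if has_roles:
        raise NotImplementedError("roles are reserved for future use")

    listed = set(event.get("users", []))
    users_op = event.get("users.operator", "include")
    if users_op not in ("include", "exclude"):
        raise ValueError(f"unknown users.operator: {users_op!r}")

    if has_level:
        compare = COMPARATORS[event["power_level.operator"]]
        threshold = event["power_level"]
        base = {u for u, lvl in room_power_levels.items() if compare(lvl, threshold)}
        # Add or subtract the listed users, depending on users.operator.
        return (base | listed) if users_op == "include" else (base - listed)

    everyone = set(room_power_levels)
    if users_op == "exclude":
        return everyone - listed
    # Assumption: with only an include list, the limit covers the listed members.
    return listed & everyone
```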

## Rate limit burst capability

Bursting can be defined using the `burst` property, which holds an array
of JSON objects, each defining a burst group over a base limit. The
bursting coefficient `burst.coef` is multiplied by the base limit to
calculate the effective rate limit during the burst period, for up to
`burst.duration` seconds.

An example state event with bursting follows (it specifies both the power
level and roles for illustration, but normally the two will not appear
at the same time):

```json
{
  "type": "m.limits.rate.user",
  "power_level": 0,
  "power_level.operator": "maximum",
  "roles": [],
  "roles.operator": "exclude(all)",
  "limits": {"m.room.message": 123.4567, "m.ban": 1.234567},
  "burst": [
    {
      "burst.coef": 1.00,
      "burst.duration": 0,
      "users": [],
      "users.operator": "exclude",
      "roles": [],
      "roles.operator": "include_min(1)"
    }
  ]
}
```
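
The burst arithmetic can be illustrated with a short sketch. The proposal does not say when a burst period starts, so the `seconds_into_burst` parameter and the function name are hypothetical.

```python
def effective_limit(base_limit, burst_coef, burst_duration, seconds_into_burst):
    """Effective events-per-second limit for a user covered by a burst group."""
    # During the burst period the base limit is scaled by the coefficient.
    if 0 <= seconds_into_burst < burst_duration:
        return base_limit * burst_coef
    # Outside the burst period only the base limit applies.
    return base_limit
```

With the values from the example event (`burst.coef` of 1.00 and `burst.duration` of 0), the burst period is empty and the base limit always applies.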

## Potential issues

As **[@ara4n](https://github.com/ara4n)** said:

> However, implementation-wise, i'm a bit worried that different APIs will have
> different limiting thresholds depending on the room that they interact with
> — and that the HS will have to query the room state every time someone says
> something to decide how limited they should be.

Because rate limiting events themselves obey the rate limits, the limiting
logic may become quite complex, and the practical limit may end up lower than
the one the request allows.

## Alternatives considered

None considered yet.

## Security considerations

Spam of rate limit requests may degrade both server-side and client-side
performance.

## Unstable prefix

`m.limits.rate` should be replaced by `org.matrix.msc3059.rate`, and unstable API
endpoints should have `r0` replaced by `unstable` in the endpoint paths.
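
A small sketch of the renaming, assuming that the prefix rule also covers derived types such as `m.limits.rate.user`; both helper names are hypothetical.

```python
STABLE_PREFIX = "m.limits.rate"
UNSTABLE_PREFIX = "org.matrix.msc3059.rate"


def to_unstable_type(event_type):
    """Swap the stable type prefix for the unstable one; leave other types alone."""
    if event_type == STABLE_PREFIX or event_type.startswith(STABLE_PREFIX + "."):
        return UNSTABLE_PREFIX + event_type[len(STABLE_PREFIX):]
    return event_type


def to_unstable_path(path):
    """Replace the r0 version segment of a client API path with unstable."""
    return path.replace("/_matrix/client/r0/", "/_matrix/client/unstable/", 1)
```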