MSC2732: Olm fallback keys #2732

Merged
merged 7 commits into from
Jun 13, 2021
97 changes: 97 additions & 0 deletions proposals/2732-olm-fallback-keys.md
@@ -0,0 +1,97 @@
# MSC2732: Olm fallback keys

It's not quite clear to me what exactly the purpose and tradeoffs of this entire design are (hence attaching this thread to the first line, as good a place as any).

What does this improve over just having a number of one-time keys? As I understand it, the benefit is that fallback keys can be used indefinitely and so there is no limit on established sessions like with pre-generated keys.

But then why not drop the pre-generated key mechanism in favour of this mechanism entirely? As I understand it, because the security guarantees of this model are weaker, and so it is preferable to use pre-generated keys where possible.

But then doesn't this weaken the overall security model, by making it possible for an attacker to intentionally exhaust all of someone's pre-generated keys, essentially carrying out a downgrade attack and forcing them (or rather, the people communicating with them) into fallback keys being used instead, which would weaken security properties?

I have difficulty squaring this circle and understanding how this can be both a) useful, b) secure, and c) additive to the current model. Can you shed some light on this?

Member

+1. I have vague memories from staring at the signal protocol (which does the same trick) that this wasn't as much of a disaster as you might think, but I can't remember why. @uhoreg I think this is worth spelling out.

Member Author

Right, I keep forgetting to respond to this.

> But then why not drop the pre-generated key mechanism in favour of this mechanism entirely? As I understand it, because the security guarantees of this model are weaker, and so it is preferable to use pre-generated keys where possible.

My understanding is that it is slightly weaker security, but it's not a huge difference. So it is "better" to use one-time keys, but when one is unavailable, a fallback key is "good enough" in many cases. But this proposal allows clients to make up their own mind about the tradeoffs. If they don't think that a fallback key is secure enough, then they don't need to use it. If they don't care at all about the extra security from one-time keys vs. fallback keys, they can drop one-time keys completely and just rely on fallback keys. If they want the extra security from one-time keys when available, but don't want to inconvenience the user when one-time keys aren't available, they can choose to do that too.

As an explanation of the security difference: under an assumption that an attacker cannot break Curve25519, and so must attack the client directly to get the private keys, there is not much difference between using a one-time key and a fallback key. If an attacker is able to extract the keys from the client, then they will have the client's private identity key, all the one-time keys that have already been used, and the current fallback key. And the only difference between having a private key for a one-time key and for a fallback key is that the fallback key may have been used already in a session that was already processed. But if a client promptly replaces a fallback key after it has been used and forgets the private key quickly (after it's reasonably sure that it has received all the sessions that use it), then the difference is small.

So I think this comes down to a case of cryptographic ideal vs. pragmatism. From a cryptographic standpoint, one-time keys are the way to go, because we're paranoid and worry about Curve25519 being broken. But practically speaking, there is a tradeoff between the possibility of someone cracking Curve25519 or attacking the client at just the right moment that they can decrypt some extra sessions, and the inconvenience of having undecryptable messages because we ran out of one-time keys. For most people, I think, that tradeoff leans towards convenience in this case. But again, the client can make their choice about what to do, and the user can choose a client that matches their paranoia level.

Member

> As an explanation of the security difference: under an assumption that an attacker cannot break Curve25519, and so must attack the client directly to get the private keys, there is not much difference between using a one-time key and a fallback key.

How do OTKs help if an attacker can break Curve25519? If they have the encrypted payloads, it seems reasonable that they'll also see the OTK go past as it's claimed? Or am I missing something here?

Contributor

With OTKs you have to break every single Curve25519 keypair. With fallback keys, if you break the Curve25519 fallback key you immediately have all sessions with devices that used that fallback key to initiate communication.

Contributor

@poljar poljar Mar 18, 2021

> I want to correct a misconception here: Signal added fallback keys to their protocol after we forked it into Olm. The same might apply to the MAC length recommendation; I don't remember offhand.

Is this correction of the misconception correct (🥁)? I have not been around back then but as far as I can tell libolm was forked from Axolotl (v2?) which did have a concept of a fallback key, Axolotl (v3?) which introduced x3dh appeared a couple of months after the fork. The removal of the fallback key can be found in this commit.

I don't know if the truncation has been changed or not, but I think that one case of such changes (the removal of the fallback key) and one potential such change (the removal of one-time keys) is enough to get my point across.

> So, the relevant question here isn't "why were fallback keys stripped out", but rather "why didn't we immediately follow suit when Signal added them", which can be answered as (a) lack of tuits, but more fundamentally (b) I don't accept that we should automatically do everything Signal does just because Signal does it. They have some excellent ideas cryptographically, but at the end of the day their product is a bit different from ours, and certainly some of their decisions have been at least open to debate in the past.

I don't think this reversal is true given the above so I won't address everything, but my complaint isn't about following Signal step by step, it's about introducing changes to complex protocols without proper justification.

Fallback keys are used to defend against denial of service attacks. We should know why we don't want this property in the protocol.

One-time keys are used to offer better forward secrecy. Again, we should know why this isn't desirable.

Member

> Forward secrecy protects past sessions against future compromises of keys or passwords.
>
> So given that we have established shared secrets, a compromise of secret keys in the future wouldn't compromise the shared secret. In this case the one-time keys that established the session are gone, if the fallback key is used it might not be gone.

To be almost unhelpfully pedantic here, OTKs don't strictly allow this as they're not deleted immediately on key use, they're deleted once the client has become aware that they've been used. Sure, this is likely a short window of time, but then if the fallback keys are only kept for a short amount of time the situations become effectively equivalent. I don't believe any of our forward secrecy actually immediately protects all old messages, rather it protects sufficiently old messages, i.e. there's always a period of time in the past where an attacker can still get messages for.

This is the crux of what confuses me: if OTKs only slightly reduce this window of time, and we're saying we're happy with the window that having fallback keys gives us (since we're using them), then are OTKs worth the additional complexity?

However:

> Given that we don't know how many people have used our fallback key it becomes hard to know when we can throw the key away.

I think this is what I've been missing here. I've been assuming that we can delete the fallback private keys quickly, and so the window of attack for OTKs vs. fallback keys is effectively the same. If we have to keep the fallback keys around for hours, then that's probably(?) a sufficiently large window that using OTKs to reduce that window makes sense.

The other piece here is that while an attacker can drain a device's OTKs, and so force new sessions to use the fallback keys, that is an active attack that can be observed by the servers and device. That will at least give some breadcrumbs that something fishy is going on, rather than allowing completely passive attacks. Though since draining of OTKs happens sufficiently often that we're making this MSC, I don't know if anybody would actually notice the draining of OTKs.

Basically: adding fallback keys weakens security, and while removing OTKs would weaken security some more, do they provide enough meaningful protection to warrant their complexity? It sounds like the answer is "yes, just", so I'm OK keeping them long term if people agree with the above analysis.


> I'll stop being off topic in this MSC now and won't answer here anymore.

I'd rather these thoughts were recorded on the MSC for future reference, so that they can be linked back.

Member

> The removal of the fallback key can be found in this commit.

oh. I'll get back in my box. Sorry.

Member

> I think this is what I've been missing here. I've been assuming that we can delete the fallback private keys quickly, and so the window of attack for OTKs vs. fallback keys is effectively the same. If we have to keep the fallback keys around for hours, then that's probably(?) a sufficiently large window that using OTKs to reduce that window makes sense.

Worse than that, since there is no way to determine the number of times the fallback key was retrieved (as far as I'm aware), we have no deterministic way of concluding that all sessions involving that fallback key have been created. So the choices are between keeping a given fallback key for an indefinite amount of time or else risk having undecryptable messages.

This doesn't happen with OTKs due to their property that they will be used at most once, so when a session is created for a given OTK, the client can conclude with certainty that it can drop that OTK.

Member Author

> Worse than that, since there is no way to determine the number of times the fallback key was retrieved (as far as I'm aware), we have no deterministic way of concluding that all sessions involving that fallback key have been created. So the choices are between keeping a given fallback key for an indefinite amount of time or else risk having undecryptable messages.

Generally, it should be fairly safe to assume that when a user claims a one-time key (whether it's actually an OTK or ends up being a fallback key), they're going to use it right away. So clients should only need to keep the fallback key for a little time after it's used (maybe a couple minutes after the sync in which they notice that it's been used, to allow for network delays). The risk of having undecryptable messages is somewhat mitigated by having olm unwedging and key resharing, though that requires the sender to come back online.

> This doesn't happen with OTKs due to their property that they will be used at most once, so when a session is created for a given OTK, the client can conclude with certainty that it can drop that OTK.

We actually do have a similar problem with OTKs. libolm tries to use a constant amount of memory, so it only has limited space for OTKs. That means that libolm will sometimes evict OTKs when it generates new ones, so if someone claims an OTK and waits too long to use it, it may have been evicted by the time they finally get around to using it.


Olm uses a set of one-time keys when initializing a session between two
devices: Alice uploads one-time keys to her homeserver, and Bob claims one of
them to perform a Diffie-Hellman to generate a shared key. As implied by the
name, a one-time key is only to be used once. However, if all of Alice's
one-time keys are claimed, Bob will not be able to create a session with Alice.

This can be addressed by Alice uploading a fallback key that is used in place
of a one-time key when no one-time keys are available.

## Proposal

A new request parameter, `fallback_keys`, is added to the body of the
[`/keys/upload` client-server API](https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-upload), which is in the same format as the
Contributor

Does this also add any new response to having successfully uploaded the fallback keys, similar to OTKs?

Member Author

We don't currently have any addition to the response.

`one_time_keys` parameter with the exception that there must be at most one key
per key algorithm. If the user had previously uploaded a fallback key for a
given algorithm, it is replaced -- the server will only keep one fallback key
per algorithm for each user.

When uploading fallback keys for algorithms whose key format is a signed JSON
object, clients should include a property named `fallback` with a value of
`true`.

Example:

`POST /keys/upload`

```json
{
  "fallback_keys": {
    "signed_curve25519:AAAAAA": {
      "key": "base64+public+key",
      "fallback": true,
      "signatures": {
        "@alice:example.org": {
          "ed25519:DEVICEID": "base64+signature"
        }
      }
    }
  }
}
```

When Bob calls `/keys/claim` to claim one of Alice's one-time keys, but Alice
has no one-time keys left, the homeserver will return the fallback key instead,
if Alice had previously uploaded one. Unlike with one-time keys, fallback keys
are not deleted when they are returned by `/keys/claim`. However, the server
marks that they have been used.
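
The claim behaviour described above can be sketched as follows. This is only an illustration, not how any particular homeserver implements it: the `DeviceKeys` class and the `upload_fallback_key` and `claim_key` functions are hypothetical names, and the store is assumed to cover a single device and a single key algorithm.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class DeviceKeys:
    """Hypothetical server-side key store for one device and one algorithm."""
    one_time_keys: dict = field(default_factory=dict)  # key ID -> key object
    fallback_key: Optional[Tuple[str, dict]] = None    # (key ID, key object)
    fallback_used: bool = False


def upload_fallback_key(store, key_id, key):
    """Replace any existing fallback key; the new key starts out unused."""
    store.fallback_key = (key_id, key)
    store.fallback_used = False


def claim_key(store):
    """Return (key ID, key) for a claim, or None if no key is available."""
    if store.one_time_keys:
        # One-time keys are deleted as soon as they are handed out.
        key_id = next(iter(store.one_time_keys))
        return key_id, store.one_time_keys.pop(key_id)
    if store.fallback_key is not None:
        # The fallback key is kept on the server, but marked as used.
        store.fallback_used = True
        return store.fallback_key
    return None
```

In this sketch, repeated claims after the one-time keys run out keep returning the same fallback key until the device uploads a replacement.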

A new response parameter, `device_unused_fallback_key_types`, is added to
`/sync`. This is an array listing the key algorithms for which the server has
an unused fallback key for the device. If the client wants the server to have a
fallback key for a given key algorithm, but that algorithm is not listed in
`device_unused_fallback_key_types`, the client will upload a new key as above.

The `device_unused_fallback_key_types` parameter must be present if the server
supports fallback keys. Clients can thus treat this field as an indication
that the server supports fallback keys, and so only upload fallback keys to
servers that support them.

Example:

`GET /sync`

Response:

```jsonc
{
  // other fields...
  "device_unused_fallback_key_types": ["signed_curve25519"]
}
```
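
As a rough client-side sketch of the rule above, the following function decides which algorithms need a fresh fallback key after a `/sync` response. The function name and the `wanted_algorithms` argument are hypothetical; only the `device_unused_fallback_key_types` field comes from this proposal.

```python
def algorithms_needing_fallback_key(sync_response, wanted_algorithms):
    """Return the algorithms for which the client should upload a new fallback key."""
    unused = sync_response.get("device_unused_fallback_key_types")
    if unused is None:
        # Field absent: treat the server as not supporting fallback keys,
        # so do not upload any.
        return []
    # Upload a key for every wanted algorithm the server has no unused key for.
    return [alg for alg in wanted_algorithms if alg not in unused]
```

An empty array in the response therefore means "supported, but every fallback key has been used", which is different from the field being absent.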

## Security considerations

Using a fallback key rather than a one-time key has security implications. An
attacker can replay a message that was originally sent with a fallback key, and
the receiving client will accept it as a new message if the fallback key is
still active. Also, an attacker that compromises a client may be able to
retrieve the private part of the fallback key to decrypt past messages if the
client has still retained the private part of the fallback key.

For this reason, clients should not store the private part of the fallback key
indefinitely. For example, clients should store at most two fallback keys:
the current fallback key (that it has not yet received any messages for) and
the previous fallback key, and should remove the previous fallback key once it
is reasonably certain that it has received all the messages that use it (for
example, one hour after receiving the first message that used it).
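
One possible way to structure this on the client is sketched below. The `FallbackKeyStore` class and its method names are hypothetical, the keys are opaque placeholders rather than real key material, and the one-hour default simply mirrors the example figure above.

```python
import time


class FallbackKeyStore:
    """Holds at most two fallback private keys: the current and the previous one."""

    def __init__(self, retention_seconds=3600.0, clock=time.monotonic):
        self._retention = retention_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self.current = None
        self.previous = None
        self._previous_expiry = None

    def rotate(self, new_key):
        """Install a new fallback key; any older 'previous' key is forgotten."""
        self.previous = self.current
        self._previous_expiry = None
        self.current = new_key

    def note_message_for_previous(self):
        """Start the countdown on the first message that used the previous key."""
        if self.previous is not None and self._previous_expiry is None:
            self._previous_expiry = self._clock() + self._retention

    def expire(self):
        """Forget the previous key once its retention period has elapsed."""
        if self._previous_expiry is not None and self._clock() >= self._previous_expiry:
            self.previous = None
            self._previous_expiry = None
```

A client following this shape would call `rotate` when it uploads a replacement key and `expire` periodically, for example after each sync. Rotating again before the previous key expires drops that key early, which is what bounds the store at two keys.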
Comment on lines +85 to +89
Member

@richvdh richvdh Mar 23, 2021

is the client supposed to rotate the fallback key as soon as the current one is used? If so, that implies it needs to store more than two keys at once.

I think it would be good to give some clearer guidance on how often the client should rotate the keys.

Member Author

> is the client supposed to rotate the fallback key as soon as the current one is used?

Yes. We could say that it should be rotated after it has finished processing the /sync request in which the current fallback key was used. At the same time, it should replenish the one-time keys. So my thinking was that it would be unlikely (under normal circumstances) that all the one-time keys get exhausted in an hour.

However, this changes if the client opts not to use one-time keys and to only use fallback keys. And an attacker could possibly run through all the one-time keys quickly. So, I think we could either

  • reduce the time recommended for a client to keep the fallback key, maybe to ~5 minutes, and rate-limit rotating the key. If all new sessions created within a 5-minute period end up using the same key, that doesn't seem too terrible.
  • say that if you receive a new session created using the new fallback key, then discard the old fallback key, regardless of when it was last rotated -- if a session comes in with the new fallback key, it's probably less likely that there will be remaining sessions using the old fallback key, unless there are major federation delays (clients should generally start a new session very shortly after claiming the key, so the delay before the user receives the session should be fairly short).

I think that I prefer the first option. I can't really think of other options, aside from variations of the theme, that allow an upper bound on the number of keys a client has to keep.

Comment on lines +84 to +89
Member

I've no objections to this from the POV of the protocol, but still think it would be nice to clarify how key rotation is expected to work.


For addressing replay attacks, clients can also keep track of inbound sessions
to detect replays.
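
A minimal sketch of such tracking is shown below, under several assumptions: the `InboundSessionTracker` class is hypothetical, and it assumes the client's crypto library can report a stable identifier for the session that a pre-key message would create. How that identifier is obtained is outside this sketch.

```python
class InboundSessionTracker:
    """Remembers which inbound sessions have already been created (hypothetical)."""

    def __init__(self):
        self._seen = set()

    def should_create_session(self, sender_identity_key, session_id):
        """Return True the first time a (sender, session) pair is seen.

        A repeat of the same pair suggests a replayed pre-key message, so the
        caller should not create a second inbound session from it.
        """
        marker = (sender_identity_key, session_id)
        if marker in self._seen:
            return False
        self._seen.add(marker)
        return True
```

The tracker only helps for as long as the client retains the set, so in practice it would need to be persisted for at least as long as the corresponding fallback key is kept.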

## Unstable prefix

The `fallback_keys` request parameter and the `device_unused_fallback_key_types`
response parameter will be prefixed by `org.matrix.msc2732.`.
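
For illustration, a client that supports both the unstable and the eventual stable names might derive field names with a small helper like the one below. The `field_name` function and its `use_unstable` flag are hypothetical; only the prefix and the two parameter names come from this proposal.

```python
UNSTABLE_PREFIX = "org.matrix.msc2732."


def field_name(stable_name, use_unstable=True):
    """Return the parameter name to use on the wire."""
    return UNSTABLE_PREFIX + stable_name if use_unstable else stable_name


# Name used in the /keys/upload request body while the proposal is unstable.
upload_field = field_name("fallback_keys")
# Name looked up in the /sync response while the proposal is unstable.
sync_field = field_name("device_unused_fallback_key_types")
```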