Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

It is impossible to trace to-device messages from client->server->server->client #558

Closed
ara4n opened this issue Aug 19, 2022 · 7 comments
Closed
Labels
A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Enhancement Z-UISI Unable to decrypt errors

Comments

@ara4n
Copy link
Member

ara4n commented Aug 19, 2022

In order to diagnose element-hq/element-web#23113 I just went on a mission to trace the lifetime of to-device messages from me (EI) -> matrix.org -> vector.modular.im -> rick (EW) in order to find out where they got lost. (Turns out they got lost at Rick's end; see the bug for details).

However, the process of tracing the to-device messages is utterly miserable due to insufficient debugging. Looking at the logs, it starts off on Element iOS okay with encrypting the room_key for the appropriate destination device:

2022-08-18 22:41:07.611 Element[5250:589660]  [MXMegolmEncryption] shareKey: Sharing keys with device @rxl881:vector.modular.im:ZAUPWAYFGG
2022-08-18 22:41:07.625 Element[5250:589660]  [MXOlmDevice] encryptMessage: Olm Session id NPkU5Y8QyqmlDSKkgZ/hTROQ1cGWdrpvOcx/2vWkU0Q to ZRFNAAnDjik0Z5QO74hvi901ULy4OmnVu50jD7tUJTg
2022-08-18 22:41:07.626 Element[5250:589660]  [MXBackgroundTask] Start background task #9663 - [MXRealmCryptoStore] performSessionOperationWithDevice
2022-08-18 22:41:07.626 Element[5250:589660]  [MXBackgroundTask] Background task [MXRealmCryptoStore] performSessionOperationWithDevice started with app state: active and estimated background time remaining: undetermined
2022-08-18 22:41:07.649 Element[5250:589660]  [MXBackgroundTask] End background task #9663 - [MXRealmCryptoStore] performSessionOperationWithDevice after 22.861ms

...and then it bundles it up with a bunch of others...

2022-08-18 22:41:08.257 Element[5250:589660]  [MXMegolmEncryption] shareKey: Actually share with 2 users and 28 devices

...and sends it to the server in a to-device request...:

2022-08-18 22:41:08.278 Element[5250:589660]  [MXHTTPClient] #2963 - PUT _matrix/client/r0/sendToDevice/m.room.encrypted/m1692882576.68

However, while we have a transaction ID for this set of to-device messages, we don't have any other IDs for tracing them to their various recipients.

We can see the inbound request in synapse's logs on matrix.org:

encryption-1
2022-08-18 19:41:09,001 - synapse.access.http.28101 - 450 - INFO - PUT-354700 - 82.132.228.213 - 28101 - {@matthew:matrix.org} Processed request: 0.018sec/0.002sec (0.007sec, 0.001sec) (0.001sec/0.009sec/3) 2B 200 "PUT /_matrix/client/r0/sendToDevice/m.room.encrypted/m1692882576.68 HTTP/1.1" "Element/1.8.25 (iPhone; iOS 15.6; Scale/3.00)" [0 dbevts]

...but jaeger currently can't view the parent span due to matrix-org/synapse#13567. Having found the span based on timeframes, it doesn't tell us much either though, as we don't log which remote devices we're sending to - only local ones. So all you get told is that it's trying to send a to-device msg to vector.modular.im.

There's then no way to correlate that at all to the actual attempt to send the EDU to vector.modular.im with the to-device message in it: the only logs you get are:

fed-sender-14:
2022-08-18 19:41:09,010 - synapse.federation.sender.transaction_manager - 121 - INFO - federation_transaction_transmission_loop-275093- - TX [vector.modular.im] {1660839599478} Sending transaction [1660839599478], (PDUs: 0, EDUs: 1)
2022-08-18 19:41:09,092 - synapse.http.matrixfederationclient - 299 - INFO - federation_transaction_transmission_loop-275093- - {PUT-O-194819} [vector.modular.im] Completed request: 200 OK in 0.08 secs, got 11 bytes - PUT matrix://vector.modular.im/_matrix/federation/v1/send/1660839599478
2022-08-18 19:41:09,092 - synapse.federation.sender.transaction_manager - 174 - INFO - federation_transaction_transmission_loop-275093- - TX [vector.modular.im] {1660839599478} got 200 response

...which as far as I can see doesn't have any IDs at all (other than the local txnid), and doesn't come up in Jaeger, either correlated with the original request, or viewing the outbound traffic at all?

At this point I gave up of trying to track the inbound request on vector.modular.im and confirming that it went down /sync to Rick, as I assume we're missing logging and tracing for that too.

Finally, on js-sdk, I had to completely guess which of the to-device messages was actually the one in question - it will have been one of these (from Rick's side of the logs):

2022-08-19T08:11:27.271Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.582Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.583Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined

If we want to be able to trace UISI root causes we have to fill in the missing observability here, and see which Olm traffic went missing, and where, and then why.

@ara4n
Copy link
Member Author

ara4n commented Aug 19, 2022

Fwiw, my suggestion for fixing this is to get clients & servers to emit structured logging which we can then view via jaeger. The server almost does this already; it just needs IDs and spans to be joined up. The clients meanwhile can emit structured logging datapoints to a little HTTP server sitting alongside Jaeger which can then emit the necessary datapoints into Jaeger. The datapoints would be pretty simple stuff, all keyed by user + traceable ID of whatever type is appropriate for the given req. Obviously we wouldn't include any plaintext or key data, and would only enable it for Element employees. Something like:

  • read/write keys and devices
  • pull the device list
  • sync the device list
  • encrypt/decrypt an event to send over olm with a given session
  • send/receive to-device messages (including who to, both local & remote, and a traceable ID)
  • encrypt/decrypt an event via megolm with a given key (and whether we could decrypt)

...and that's about it.

Currently we have automatic rageshaking on UISI, but this has the big problem that rageshakes don't currently contain enough data to debug easily (hence this bug), hence wanting to get structured observability in so I can hand jaeger a given message ID, or to-device ID, or megolm session, and it can show us the story of its life and what went wrong.

@daniellekirkwood daniellekirkwood added the Z-UISI Unable to decrypt errors label Aug 23, 2022
@kittykat kittykat added A-E2EE S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Enhancement O-Occasional Affects or can be seen by some users regularly or most users rarely labels Sep 2, 2022
@richvdh
Copy link
Member

richvdh commented Nov 17, 2022

Fwiw, my suggestion for fixing this is to get clients & servers to emit structured logging which we can then view via jaeger.

I think we should descope proper structured logging from this issue, and limit this issue to making it easier to trace to-device messages through the system.

Structured logging is tracked at #32.

@richvdh richvdh changed the title Tracing UISIs from client->server->server->client is missing critical debugging info. It is impossible to trace to-device messages from client->server->server->client Nov 17, 2022
@richvdh
Copy link
Member

richvdh commented Nov 17, 2022

As an aside, note that to-device messages have a message_id which are used internally by Synapse and by the federation api; however this isn't exposed to clients. Given the batching mechanism in the client-server api, it wouldn't be entirely trivial to expose it to the sending client.

In any case, ideally we would have a message id that is under control of clients, so that the client can assign the ID well before sending the message to the server.

I'm much of the opinion that we should just stick an io.element.msgid property inside the body of the to-device message.

@ara4n
Copy link
Member Author

ara4n commented Nov 25, 2022

I'm much of the opinion that we should just stick an io.element.msgid property inside the body of the to-device message.

Totally agreed. So.... can we? please? ideally ~2 years ago? :)

@richvdh
Copy link
Member

richvdh commented Nov 25, 2022

Yes, though I've been trying to wrap up a different distraction before I start on this one.

@richvdh
Copy link
Member

richvdh commented Dec 5, 2022

matrix-org/synapse#14598 adds support for improved to-device tracing on the server side, and matrix-org/matrix-js-sdk#2938 adds it for Element Web.

The idea is simple: the sending client adds a field org.matrix.msgid to the body of each to-device message it sends, and makes sure that the value is logged. The format is up to the client: I recommend a UUID, but anything with sufficient randomness to make a clash "unlikely" will be fine.

Then clients should also check for org.matrix.msgid in any to-device message they receive, and again log the value.

@richvdh
Copy link
Member

richvdh commented Dec 13, 2022

Hopefully this is now better, if still not quite as easy as one might hope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Enhancement Z-UISI Unable to decrypt errors
Projects
None yet
Development

No branches or pull requests

4 participants