It is impossible to trace to-device messages from client->server->server->client #558
Comments
Fwiw, my suggestion for fixing this is to get clients & servers to emit structured logging which we can then view via Jaeger. The server almost does this already; it just needs IDs and spans to be joined up. The clients, meanwhile, can emit structured logging datapoints to a little HTTP server sitting alongside Jaeger, which can then emit the necessary datapoints into Jaeger. The datapoints would be pretty simple stuff, all keyed by user + traceable ID of whatever type is appropriate for the given req. Obviously we wouldn't include any plaintext or key data, and would only enable it for Element employees. Something like:
...and that's about it. Currently we have automatic rageshaking on UISI, but rageshakes don't currently contain enough data to debug easily (hence this bug). That's why I want to get structured observability in: so I can hand Jaeger a given message ID, or to-device ID, or megolm session, and it can show us the story of its life and what went wrong.
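A rough sketch of what such a client-side datapoint could look like, keyed by user + traceable ID as described above. Every name here (`make_datapoint`, `encode_datapoint`, the field names, the list of ID types) is an assumption for illustration, not an existing Element or Jaeger API:

```python
import json
import time

# Hypothetical ID types; the real set would depend on what needs tracing.
ALLOWED_ID_TYPES = {"event_id", "to_device_id", "megolm_session_id"}


def make_datapoint(user_id, id_type, traceable_id, stage, now=None):
    """Build one trace datapoint keyed by user + traceable ID.

    Only identifiers and a stage label are accepted, so plaintext and key
    material cannot end up in the record by accident.
    """
    if id_type not in ALLOWED_ID_TYPES:
        raise ValueError(f"unsupported id type: {id_type}")
    return {
        "user_id": user_id,
        "id_type": id_type,
        "id": traceable_id,
        "stage": stage,  # e.g. "encrypted", "sent", "received", "decrypted"
        "ts_ms": int((time.time() if now is None else now) * 1000),
    }


def encode_datapoint(datapoint):
    """Serialise a datapoint as one JSON line, ready to POST to a collector."""
    return json.dumps(datapoint, sort_keys=True)
```

Restricting the record to IDs and a stage label is one way to keep message content and keys out of the trace data by construction, rather than relying on callers to remember.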
I think we should descope proper structured logging from this issue, and limit this issue to making it easier to trace to-device messages through the system. Structured logging is tracked at #32.
As an aside, note that to-device messages have a In any case, ideally we would have a message id that is under control of clients, so that the client can assign the ID well before sending the message to the server. I'm much of the opinion that we should just stick an
Totally agreed. So.... can we? please? ideally ~2 years ago? :)
Yes, though I've been trying to wrap up a different distraction before I start on this one.
matrix-org/synapse#14598 adds support for improved to-device tracing on the server side, and matrix-org/matrix-js-sdk#2938 adds it for Element Web. The idea is simple: the sending client adds a field Then clients should also check for
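A minimal sketch of the client-assigned ID idea discussed above, assuming the usual sendToDevice `messages` shape (user ID -> device ID -> content). The field name `example.message_id` and both helper functions are placeholders for illustration; see the linked PRs for the actual field and implementation:

```python
import copy
import uuid

# Placeholder field name for illustration only.
MESSAGE_ID_FIELD = "example.message_id"


def attach_message_ids(messages):
    """Return (tagged_messages, index) for a sendToDevice `messages` map.

    `messages` maps user ID -> device ID -> message content. Each content
    dict gets a fresh client-generated ID, and `index` lists
    (user_id, device_id, message_id) so the sender can log every recipient
    before the request goes out.
    """
    tagged = copy.deepcopy(messages)  # leave the caller's map untouched
    index = []
    for user_id, devices in tagged.items():
        for device_id, content in devices.items():
            message_id = uuid.uuid4().hex
            content[MESSAGE_ID_FIELD] = message_id
            index.append((user_id, device_id, message_id))
    return tagged, index


def describe_received(content):
    """Log line for the receiving side; tolerates messages without an ID."""
    message_id = content.get(MESSAGE_ID_FIELD, "<none>")
    return f"received to-device message id={message_id}"
```

Because the ID is generated on the client, it can be logged at encryption time, carried through both servers, and logged again on receipt, which gives one string to search for across all four sets of logs.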
Hopefully this is now better, if still not quite as easy as one might hope.
In order to diagnose element-hq/element-web#23113 I just went on a mission to trace the lifetime of to-device messages from me (EI) -> matrix.org -> vector.modular.im -> rick (EW) in order to find out where they got lost. (Turns out they got lost at Rick's end; see the bug for details).
However, the process of tracing the to-device messages is utterly miserable due to insufficient debugging. Looking at the logs, it starts off on Element iOS okay with encrypting the room_key for the appropriate destination device:
...and then it bundles it up with a bunch of others...
...and sends it to the server in a to-device request...:
However, while we have a transaction ID for this set of to-device messages, we don't have any other IDs for tracing them to their various recipients.
We can see the inbound request in synapse's logs on matrix.org:
...but Jaeger currently can't view the parent span due to matrix-org/synapse#13567. Having found the span based on timeframes, it doesn't tell us much either, as we don't log which remote devices we're sending to - only local ones. So all you get told is that it's trying to send a to-device msg to vector.modular.im.
There's then no way to correlate that at all to the actual attempt to send the EDU to vector.modular.im with the to-device message in it: the only logs you get are:
...which as far as I can see doesn't have any IDs at all (other than the local txnid), and doesn't come up in Jaeger, either correlated with the original request, or viewing the outbound traffic at all?
At this point I gave up on trying to track the inbound request on vector.modular.im and confirm that it went down /sync to Rick, as I assume we're missing logging and tracing for that too.
Finally, on js-sdk, I had to completely guess which of the to-device messages was actually the one in question - it will have been one of these (from Rick's side of the logs):
If we want to be able to trace UISI root causes we have to fill in the missing observability here, and see which Olm traffic went missing, and where, and then why.