-
-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pick a unit for describing string length #1001
Comments
It appears that synapse is actually using codepoints rather than bytes for user IDs, so it's already possible to create users with >255 bytes to them. This particular piece would be a spec clarification, I think. |
My client hardcoded the byte limit, since that is what I found in the client side spec, when I asked for it. |
And the database used in some clients is limited 255 UTF-16 code units too: https://github.com/hivedb/hive/blob/master/hive/lib/src/binary/frame.dart#L55 |
Synapse also uses python str PDUs are not affected and are actually limited to 65536 bytes, thanks to Room aliases can also be created with lots of bytes, but I'm not sure if those are actually limited at all (it's not a part of event auth, so it might even allow more than 255 codepoints). |
Only if you patch your Synapse. Unicode characters are not allowed in mxids by default, iirc. |
…epoints rather than bytes (matrix-org/matrix-spec#1001)
…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)
…es rather than codepoints (matrix-org/matrix-spec#1001)
In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec. fixes matrix-org#1001 Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>
On the subject of In favour of bytes:
Against bytes:
Remarks:
|
Please note that this is a highly theoretical thing that i never got a clear answer on when we talked about it so im just putting it here to make sure we dont forget it. There might be the possibility of for Synapse atleast causing a room to be Bricked if we do merge a fix like matrix-org/synapse#13710 because of one simple theoretical attack that i will mention here because of a simple fact. All the details are public and this should be a non issue if we just accept this has to be a room version level thing like String PL. All the 3 implementations that support v10 already supports this behavior and i susspect conduit also supports it but cant state that as fact. You can attack Synapse theoretically by pushing a join event for a overflow user and having another Synapse issue a Control event in close proximity to that overflow join. Because Synapse would fail the correct events because of the invalid parent. Dendrite as far as i understand will chug along happily and not break in this failure mode. Going with the Retroactive fixing path literally opens up an attack on rooms if i am correct and i hope i am wrong. I very much hope so. The attack is probably avoidable if you push your control event onto a state branch that is not the branch that has the in the future possibily invalid event in it. Also i want to add to this that there are concerns that the databases of Catalyst servers in these rooms will be harmed irreparbly as stated by Nyaaori the Catalyst Dev. I cant say if this also applies for Conduit but it would not shock me. As a last point i want to add that Nyaaoris research into when this problem came about has found it literally always existed or it was created literally in the begining days of Synapse. Edit: Clarification. The attack is not technically in need of multiple homeservers its just as an example if we assume that your under attack and your own homeserver is not going to send overflow events. But yes the attack can be executed using a single homeserver if the attack actually is possible. |
Im putting this separately as i dont think its suitable to put this in an edit. Merging a fix to this issue that is not v11 fixes this would set a precedent where Spec clarifications are allowed to Brick rooms. Nyaaoris testing seems to imply that literally every single synapse version shares this flaw. String PL didnt brick rooms by being merged it unbricked rooms for servers who updated to be compliant. But this case we are instead asking to brick rooms that contain events accepted by all Synapse versions as valid. And in the string PLs case the same argument of only a limited amount of rooms affected could be made but we didnt brick those rooms that time. |
How would this brick a room? To brick a room, it would need to affect event authorization. And from what I can tell, this only bricks rooms where a server was intentionally patched to remove client side limits and admin in that room and intentionally abused this. Which is very far in the fuck around and find out territory. In the worst case it makes a few events vanish in rooms for unpatched servers. But let's look deeper into what this could affect. To "brick" a room, such an event would need to be in the auth chain. So what events are in the auth chain? Just a different server sending a control event will not break the room. Yes, it will make the prev events for that event invalid, but prev events don't matter for the auth chain. I don't think it is bad to brick rooms, that intentionally had admins sending spec violating events. If this isn't fixed, then some clients won't be able to interact with those rooms for quite a while, while currently the count of broken rooms with this fixed (in the one server implementation implementing this wrong for federation side checks) it so far only breaks rooms that intentionally messed around with this. On a personal note, I do find it quite challenging how many people advocate for keeping broken behaviour in the spec and increasing its complexity to not break some theoretical rooms, that don't seem to exist in practice, while this bug completely broke at least one client when abused, because it relied on what the spec said and defederated rooms (that were intentionally using broken users) between 3 spec following implementations and synapse on the other hand. Also, the spec says nowhere, that state keys of this size are allowed. The only place where one could argue this is the case, is the identifier grammar, which says 255 characters, but also restricts those characters to ASCII only, so this is still equal to 255 bytes. Sure, there is the historical id section, but that is to grandfather existing users in from old Synapse instances, that never allowed emojis in usernames. |
…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)
Just wanted to provide an update on my own research on this topic so its documented here in case its not already. State keys and State event types that are made up of 242 🐈 emojies are allowed to be sent by a stock Synapse tested via feline.support that is deployed via Ansible playbook found at https://github.com/spantaleev/matrix-docker-ansible-deploy. This playbook as of time of writing deploys the official matrix.org docker image for Synapse 1.67.0. Client utilised in the tests todays copy of Element Nightly Desktop. So the copy from 2022-09-17. I tested sending a state event with the type of 242 🐈 emojies and with the same as the key. And also regular event of the same type. State event was sent in a room created using the standard template for public rooms by me a 100 PL user. I replicated the sending of the non state event version in multiple other rooms during my recent testing without issue and in rooms where i am at default PL. This comment is just intended to document that current synapse does not verify state keys as not containing emoji or event types not containing them at all. Or if this check is done having the homeserver admin flag nullifies it. |
In the interest of moving this forwards, I think we should land this as bytes. Reading through the comments it feels like the main reasons for not using bytes is for backwards compatibility, and in my mind that is not a compelling enough reason to keep that behaviour in the long term. FWIW my view is that using anything other than bytes is fraught with difficulty, e.g. limiting to 255 unicode code points doesn't restrict the length of the rendered string, and many languages don't have in built ways of determining the "true" length of a unicode string. I also don't think its worth keeping the current behaviour for existing rooms, given the failure mode isn't truly terrible and there are very few reported incidents about this. cc. @neilalexander & @timokoesters as other HS developers that may want to chime in. Unless I hear objections soon then the Synapse team is going to merge matrix-org/synapse#13710 |
I'm happy to go with bytes — matrix-org/gomatrixserverlib#338 is ready and waiting. |
Ruma and by extension Conduit is already enforcing 255 bytes. It took a long time until we even learned Synapse did something else, and since it became clear that it's probably feasible for Synapse to do a small breaking change there, I've been pushing back against one person's infrequent asks for Synapse compatibility on this. |
…es rather than codepoints (#338) This effectively reverts the change made in [5f66df0](5f66df0) to bytes instead of codepoints, since Synapse will now enforce the same after matrix-org/synapse#13710. History here is that Synapse originally calculated bytes in Python 2.x, started counting codepoints in Python 3.x pretty much by accident and then the spec was ambiguous after the fact (hence matrix-org/matrix-spec#1001). Rationale is that bytes are probably easier for implementations to manage and less likely to generate huge indexes for client-side databases (especially where limits might exist like LMDB). cc @reivilibre
Here we are two years later, having failed to move this forward in any material way :(. I don't see it mentioned above, but https://spec.matrix.org/v1.10/client-server-api/#size-limits already specfies that the limits for a bunch of identifiers are bytes. IMHO this information should be moved to the appendices. |
…larify that the length is to be measured in bytes Fixes: matrix-org#1826 Relates to: matrix-org#1001 Signed-off-by: Johannes Marbach <n0-0ne+github@mailbox.org>
I took a stab at this in #1850. Are there any places in the spec other than user IDs where a standardization on bytes would be required? The remaining places that use "characters" (e.g. this or this) appear to also restrict the character set to ASCII (or even less than that). So I suppose the term "character" should be unambiguous there? |
We talk about unicode codepoints, bytes, characters, and just the word "length" to describe how long a string is allowed to be. This is confusing at best, so ideally we pick a unit and stick to it.
Some areas may require an MSC to make uniform.
An example of where this is a problem is state keys: They're limited to 255 bytes but user IDs are limited to 255 characters, so by extension 1 character == 1 byte because otherwise membership wouldn't work.
The text was updated successfully, but these errors were encountered: