Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick a unit for describing string length #1001

Closed
turt2live opened this issue Mar 24, 2022 · 16 comments
Closed

Pick a unit for describing string length #1001

turt2live opened this issue Mar 24, 2022 · 16 comments
Labels
clarification An area where the expected behaviour is understood, but the spec could do with being more explicit

Comments

@turt2live
Copy link
Member

We talk about unicode codepoints, bytes, characters, and just the word "length" to describe how long a string is allowed to be. This is confusing at best, so ideally we pick a unit and stick to it.

Some areas may require an MSC to make uniform.

An example of where this is a problem is state keys: They're limited to 255 bytes but user IDs are limited to 255 characters, so by extension 1 character == 1 byte because otherwise membership wouldn't work.

@turt2live turt2live added the clarification An area where the expected behaviour is understood, but the spec could do with being more explicit label Mar 24, 2022
@turt2live
Copy link
Member Author

It appears that synapse is actually using codepoints rather than bytes for user IDs, so it's already possible to create users with >255 bytes to them.

This particular piece would be a spec clarification, I think.

@deepbluev7
Copy link
Contributor

My client hardcoded the byte limit, since that is what I found in the client side spec, when I asked for it.

@deepbluev7
Copy link
Contributor

And the database used in some clients is limited 255 UTF-16 code units too: https://github.com/hivedb/hive/blob/master/hive/lib/src/binary/frame.dart#L55

@tulir
Copy link
Member

tulir commented Aug 28, 2022

Synapse also uses python str len() (i.e. unicode codepoints) for measuring room IDs, state keys and event types: https://github.com/matrix-org/synapse/blob/v1.65.0/synapse/event_auth.py#L327-L339

PDUs are not affected and are actually limited to 65536 bytes, thanks to encode_canonical_json encoding the result as utf-8 and returning bytes instead of str.

Room aliases can also be created with lots of bytes, but I'm not sure if those are actually limited at all (it's not a part of event auth, so it might even allow more than 255 codepoints).

@deepbluev7
Copy link
Contributor

so it's already possible to create users with >255 bytes to them.

Only if you patch your Synapse. Unicode characters are not allowed in mxids by default, iirc.

neilalexander added a commit to matrix-org/gomatrixserverlib that referenced this issue Aug 28, 2022
neilalexander added a commit to matrix-org/dendrite that referenced this issue Aug 28, 2022
neilalexander added a commit to matrix-org/gomatrixserverlib that referenced this issue Sep 9, 2022
deepbluev7 added a commit to deepbluev7/matrix-spec that referenced this issue Sep 9, 2022
In almost all instances we are already talking about UTF-8 encoded
ASCII. As such this is not a behavioural change apart from in theory
legacy identifiers, that never were part of the spec.

fixes matrix-org#1001

Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>
@reivilibre
Copy link
Contributor

reivilibre commented Sep 9, 2022

On the subject of user_id/sender_id, room_id, state_key, type, event_id:

In favour of bytes:

  • Client Size limits mentions bytes.
    • Changing the definition for clients is likely to break clients and it will be difficult to get them all fixed.
      • Strict limits are especially important in Embedded (and/or 'IoT') use cases.
  • Bytes are a less ambiguous term than 'characters'.
  • https://spec.matrix.org/v1.3/appendices/#user-identifiers mentions '255 characters' for user ID, but also imposes the restriction of using ASCII, even in the historical User ID section
  • The number 255 is most sensible in units of bytes, especially for use in clients' storage engines. Database systems don't usually count storage size in codepoints.
    • LMDB, a widely-known embedded database that you may choose to use in your client, has a key size limit of 511 bytes. This means that 255 codepoints can't generally be stored in the key of LMDB; it's an easy footgun since a client may not test such awkward cases. Bytes are simpler here.
  • 'Counting' bytes is easier than counting Unicode codepoints — you don't have to scan through UTF-8
  • Until recently, Dendrite and Conduit applied these limits in bytes.

Against bytes:

  • Synapse's behaviour was speculated to have changed when Python 3 started to be used, but it appears that this isn't true — json.decode returned Python 2 u-strings, whose len counts in codepoints.
    • In effect, it seems like it has historically been counted in codepoints rather than bytes.
    • (Tested back to Synapse 0.21.0/May 2017 by Nyaaori)
    • Edit: as of Fix event size checks synapse#13710, it now uses bytes for most checks.
  • It may be possible to imagine use cases for state keys where having a different length limit depending on whether you use multi-byte 'characters' is awkward for users (but what's a character? Codepoint? Grapheme? I think it would be awkward anyway)

Remarks:

  • 2 rooms have been created which make homeservers diverge in their behaviour and appear 'defederated' from each other, however this was intentionally caused by room admins.
  • It would be good to make sure we don't write any new parts of spec with such loose terminology in the future; especially noting that 'character' is quite widely considered a meaningless or ambiguous term.
  • I think there should be an 'Event Specification' part of the spec rather than defining events in the CS-API or SS-API, especially since we may end up with conflicting definitions. The current split makes it hard to find definitions and easy for things to fall out of sync.

@FSG-Cat
Copy link
Contributor

FSG-Cat commented Sep 11, 2022

Please note that this is a highly theoretical thing that i never got a clear answer on when we talked about it so im just putting it here to make sure we dont forget it.

There might be the possibility of for Synapse atleast causing a room to be Bricked if we do merge a fix like matrix-org/synapse#13710 because of one simple theoretical attack that i will mention here because of a simple fact. All the details are public and this should be a non issue if we just accept this has to be a room version level thing like String PL. All the 3 implementations that support v10 already supports this behavior and i susspect conduit also supports it but cant state that as fact.

You can attack Synapse theoretically by pushing a join event for a overflow user and having another Synapse issue a Control event in close proximity to that overflow join. Because Synapse would fail the correct events because of the invalid parent. Dendrite as far as i understand will chug along happily and not break in this failure mode.

Going with the Retroactive fixing path literally opens up an attack on rooms if i am correct and i hope i am wrong. I very much hope so. The attack is probably avoidable if you push your control event onto a state branch that is not the branch that has the in the future possibily invalid event in it.

Also i want to add to this that there are concerns that the databases of Catalyst servers in these rooms will be harmed irreparbly as stated by Nyaaori the Catalyst Dev. I cant say if this also applies for Conduit but it would not shock me.

As a last point i want to add that Nyaaoris research into when this problem came about has found it literally always existed or it was created literally in the begining days of Synapse.

Edit: Clarification. The attack is not technically in need of multiple homeservers its just as an example if we assume that your under attack and your own homeserver is not going to send overflow events. But yes the attack can be executed using a single homeserver if the attack actually is possible.

@FSG-Cat
Copy link
Contributor

FSG-Cat commented Sep 11, 2022

Im putting this separately as i dont think its suitable to put this in an edit.

Merging a fix to this issue that is not v11 fixes this would set a precedent where Spec clarifications are allowed to Brick rooms. Nyaaoris testing seems to imply that literally every single synapse version shares this flaw. String PL didnt brick rooms by being merged it unbricked rooms for servers who updated to be compliant. But this case we are instead asking to brick rooms that contain events accepted by all Synapse versions as valid. And in the string PLs case the same argument of only a limited amount of rooms affected could be made but we didnt brick those rooms that time.

@deepbluev7
Copy link
Contributor

How would this brick a room? To brick a room, it would need to affect event authorization. And from what I can tell, this only bricks rooms where a server was intentionally patched to remove client side limits and admin in that room and intentionally abused this. Which is very far in the fuck around and find out territory. In the worst case it makes a few events vanish in rooms for unpatched servers.

But let's look deeper into what this could affect. To "brick" a room, such an event would need to be in the auth chain.

So what events are in the auth chain? m.room.create, m.room.power_levels and m.room.member (of the sender of some events). The only way to pull in a member event of a different member B into the auth chain of user A, is when an invite is sent or when a restricted join is happening through that server. To then break a room, member A needs to send a powerlevel event. Any history after that point will be reset, after this is fixed in a server.

Just a different server sending a control event will not break the room. Yes, it will make the prev events for that event invalid, but prev events don't matter for the auth chain.

I don't think it is bad to brick rooms, that intentionally had admins sending spec violating events.

If this isn't fixed, then some clients won't be able to interact with those rooms for quite a while, while currently the count of broken rooms with this fixed (in the one server implementation implementing this wrong for federation side checks) it so far only breaks rooms that intentionally messed around with this.

On a personal note, I do find it quite challenging how many people advocate for keeping broken behaviour in the spec and increasing its complexity to not break some theoretical rooms, that don't seem to exist in practice, while this bug completely broke at least one client when abused, because it relied on what the spec said and defederated rooms (that were intentionally using broken users) between 3 spec following implementations and synapse on the other hand. Also, the spec says nowhere, that state keys of this size are allowed. The only place where one could argue this is the case, is the identifier grammar, which says 255 characters, but also restricts those characters to ASCII only, so this is still equal to 255 bytes. Sure, there is the historical id section, but that is to grandfather existing users in from old Synapse instances, that never allowed emojis in usernames.

@turt2live turt2live added the blocked Something needs to be done before action can be taken on this PR/issue. label Sep 11, 2022
0x1a8510f2 pushed a commit to 0x1a8510f2/dendrite that referenced this issue Sep 12, 2022
@FSG-Cat
Copy link
Contributor

FSG-Cat commented Sep 17, 2022

Just wanted to provide an update on my own research on this topic so its documented here in case its not already.

State keys and State event types that are made up of 242 🐈 emojies are allowed to be sent by a stock Synapse tested via feline.support that is deployed via Ansible playbook found at https://github.com/spantaleev/matrix-docker-ansible-deploy. This playbook as of time of writing deploys the official matrix.org docker image for Synapse 1.67.0. Client utilised in the tests todays copy of Element Nightly Desktop. So the copy from 2022-09-17.

I tested sending a state event with the type of 242 🐈 emojies and with the same as the key. And also regular event of the same type. State event was sent in a room created using the standard template for public rooms by me a 100 PL user. I replicated the sending of the non state event version in multiple other rooms during my recent testing without issue and in rooms where i am at default PL.

This comment is just intended to document that current synapse does not verify state keys as not containing emoji or event types not containing them at all. Or if this check is done having the homeserver admin flag nullifies it.

@erikjohnston
Copy link
Member

In the interest of moving this forwards, I think we should land this as bytes. Reading through the comments it feels like the main reasons for not using bytes is for backwards compatibility, and in my mind that is not a compelling enough reason to keep that behaviour in the long term.

FWIW my view is that using anything other than bytes is fraught with difficulty, e.g. limiting to 255 unicode code points doesn't restrict the length of the rendered string, and many languages don't have in built ways of determining the "true" length of a unicode string.

I also don't think its worth keeping the current behaviour for existing rooms, given the failure mode isn't truly terrible and there are very few reported incidents about this.

cc. @neilalexander & @timokoesters as other HS developers that may want to chime in. Unless I hear objections soon then the Synapse team is going to merge matrix-org/synapse#13710

@neilalexander
Copy link
Contributor

I'm happy to go with bytes — matrix-org/gomatrixserverlib#338 is ready and waiting.

@jplatte
Copy link
Contributor

jplatte commented Oct 21, 2022

Ruma and by extension Conduit is already enforcing 255 bytes. It took a long time until we even learned Synapse did something else, and since it became clear that it's probably feasible for Synapse to do a small breaking change there, I've been pushing back against one person's infrequent asks for Synapse compatibility on this.

neilalexander added a commit to matrix-org/gomatrixserverlib that referenced this issue Oct 21, 2022
…es rather than codepoints (#338)

This effectively reverts the change made in
[5f66df0](5f66df0)
to bytes instead of codepoints, since Synapse will now enforce the same
after matrix-org/synapse#13710.

History here is that Synapse originally calculated bytes in Python 2.x,
started counting codepoints in Python 3.x pretty much by accident and
then the spec was ambiguous after the fact (hence
matrix-org/matrix-spec#1001).

Rationale is that bytes are probably easier for implementations to
manage and less likely to generate huge indexes for client-side
databases (especially where limits might exist like LMDB).

cc @reivilibre
@richvdh richvdh removed the blocked Something needs to be done before action can be taken on this PR/issue. label Feb 27, 2024
@richvdh
Copy link
Member

richvdh commented May 24, 2024

In the interest of moving this forwards, I think we should land this as bytes

Here we are two years later, having failed to move this forward in any material way :(.

I don't see it mentioned above, but https://spec.matrix.org/v1.10/client-server-api/#size-limits already specfies that the limits for a bunch of identifiers are bytes. IMHO this information should be moved to the appendices.

Johennes added a commit to Johennes/matrix-spec that referenced this issue Jun 7, 2024
…larify that the length is to be measured in bytes

Fixes: matrix-org#1826
Relates to: matrix-org#1001
Signed-off-by: Johannes Marbach <n0-0ne+github@mailbox.org>
@Johennes
Copy link
Contributor

Johennes commented Jun 7, 2024

I took a stab at this in #1850.

Are there any places in the spec other than user IDs where a standardization on bytes would be required?

The remaining places that use "characters" (e.g. this or this) appear to also restrict the character set to ASCII (or even less than that). So I suppose the term "character" should be unambiguous there?

richvdh pushed a commit that referenced this issue Jun 12, 2024
…larify that the length is to be measured in bytes (#1850)

Fixes: #1826
Relates to: #1001
Signed-off-by: Johannes Marbach <n0-0ne+github@mailbox.org>
@richvdh
Copy link
Member

richvdh commented Jun 12, 2024

I think this is fixed by #1850 - thanks @Johennes.

There might be other places in the spec that need clarification, but I can't immediately see any. I suggest if anyone knows of any, they open a new issue with a more precise description

@richvdh richvdh closed this as completed Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification An area where the expected behaviour is understood, but the spec could do with being more explicit
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants