Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acceptable characters for MXC opaque identifiers should be more clearly described #503

Open
joepie91 opened this issue Jul 14, 2019 · 13 comments
Labels
A-Media-Repository clarification An area where the expected behaviour is understood, but the spec could do with being more explicit

Comments

@joepie91
Copy link

joepie91 commented Jul 14, 2019

From the spec:

Content locations are represented as Matrix Content (MXC) URIs. They look like:

mxc://<server-name>/<media-id>

<server-name> : The name of the homeserver where this content originated, e.g. matrix.org
<media-id> : An opaque ID which identifies the content.

However, it is not specified anywhere (as far as I can tell) what the valid character range is for these opaque media-ids. In particular, it doesn't specify whether the media-id may contain slashes - a detail that's quite semantically important for a lot of HTTP request routing implementations, which often treat a "URL parameter" as a string of any characters other than a slash, eg. as in /media/:id where /media/foo/bar would not match.

@richvdh
Copy link
Member

richvdh commented Jul 15, 2019

This is mentioned under #174, and an attempt to answer it is in the as-yet-unimplemented MSC1597: https://github.com/matrix-org/matrix-doc/blob/rav/proposals/id_grammar/proposals/1597-id-grammar.md#opaque-ids.

@joepie91
Copy link
Author

Whoops, my bad - I didn't realize that that included MXC identifiers as well. I guess this one can be closed, then?

@richvdh
Copy link
Member

richvdh commented Jul 15, 2019

It's probably useful to keep it open as a specific place to discuss media ids

@turt2live turt2live added clarification An area where the expected behaviour is understood, but the spec could do with being more explicit meta Something that is not a spec change/request and is not related to the build tools labels Jul 16, 2019
@uhoreg
Copy link
Member

uhoreg commented Oct 18, 2019

matrix-org/matrix-spec-proposals#1597 (a.k.a. matrix-org/matrix-spec-proposals#1598) was merged as an MSC, so really this is just a spec-pr-missing

@turt2live turt2live added the spec-omission implemented but not currently specified label May 15, 2020
@richvdh
Copy link
Member

richvdh commented Jan 18, 2021

matrix-org/matrix-spec-proposals#1597 (a.k.a. matrix-org/matrix-spec-proposals#1598) was merged as an MSC, so really this is just a spec-pr-missing

as noted over on matrix-org/matrix-spec-proposals#1598, the merge of that MSC dates to a time where MSCs could be merged before implementation, so although matrix-org/matrix-spec-proposals#1597 presents a proposal which seems to have some support, it's never been adopted by the ecosystem.

@richvdh
Copy link
Member

richvdh commented Jan 18, 2021

@richvdh
Copy link
Member

richvdh commented Jan 19, 2021

@neilalexander reports that Dendrite's media ids are

Hex 64 characters I think

Which I assume means that they consist of the characters [0-9a-f].

It'd be interesting to confirm what other media repo impls (eg Conduit) use for their media IDs, but I'd be a bit surprised if they fell outside the range proposed by MSC1597, which, for the record, is:

must be strings consisting entirely of the characters [0-9a-zA-Z.=_-]. Their length must not exceed 255 characters and they must not be empty.

I wonder what we can do to progress this, or what's actually blocking us fixing it.

  • Is there any reason we can't simply add an explicit grammar for media IDs to the spec, along the lines of MSC1597's proposal? It doesn't seem like it would be a breaking change in practice. If we did so, should we do a new MSC, or are we happy with MSC1597 to stand in for it?

  • If the spec today did give a specific grammar for media IDs, what difference would that make in practice? Would the grammar actually be enforced anywhere, or would it simply be for reference? If it's not enforced, we run the risk of one implementation breaking another by using IDs outside the grammar, but what would enforcement look like?

@KitsuneRal
Copy link
Member

AIUI, enforcement might imply that homeservers reject requests to download media with a non-compliant id. That would make the particular media unreachable through normal means but at least clients would be exempted from checking for stray / or an unescaped #.

Ideally, of course, the grammar should be enforced when uploading media; but I have no good ideas here except client applications nagging users to bug homeserver owners if what the client app received for a media id is not compliant.

@richvdh
Copy link
Member

richvdh commented Jan 20, 2021

per matrix-org/matrix-spec-proposals#1597 (comment), I think allowing = and forbidding ~ is an error.

@turt2live
Copy link
Member

oh, and ftr mmr's IPFS media ID is ipfs:someotherid per matrix-org/matrix-spec-proposals#2706

@turt2live
Copy link
Member

It appears Ruma discovered the text in the spec well before we did:

https://github.com/ruma/ruma/blob/68c9bb0930f2195fa8672fbef9633ef62737df5d/crates/ruma-identifiers-validation/src/mxc_uri.rs#L20-L22

https://spec.matrix.org/v1.9/client-server-api/#security-considerations-5

[...] homeservers MUST sanitise mxc:// URIs by allowing only alphanumeric (A-Za-z0-9), _ and - characters in the server-name and media-id values.

Therefore, this is a clarification issue imo. In practice, Synapse clearly does not apply this sanitization, but aside from edge cases where folks are using |, :, etc characters the vast majority of implementations appear to be aligned on generating compatible IDs.

Extending the character set requires an MSC.

@turt2live turt2live removed the spec-omission implemented but not currently specified label Feb 8, 2024
@turt2live turt2live changed the title Acceptable characters for MXC opaque identifiers not specified Acceptable characters for MXC opaque identifiers should be more clearly described Feb 8, 2024
@morguldir
Copy link

morguldir commented Feb 8, 2024

For the record, original PR and jira issue here:
matrix-org/matrix-spec-proposals#103
https://matrix.org/jira/browse/SPEC-165

also: element-hq/synapse#1323

@richvdh richvdh added A-Media-Repository and removed meta Something that is not a spec change/request and is not related to the build tools labels Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Media-Repository clarification An area where the expected behaviour is understood, but the spec could do with being more explicit
Projects
None yet
Development

No branches or pull requests

6 participants