Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonical JSON is inadequately specified #1232

Closed
neilalexander opened this issue Sep 7, 2022 · 12 comments
Closed

Canonical JSON is inadequately specified #1232

neilalexander opened this issue Sep 7, 2022 · 12 comments
Labels
clarification An area where the expected behaviour is understood, but the spec could do with being more explicit

Comments

@neilalexander
Copy link
Contributor

Link to problem area:

https://spec.matrix.org/v1.3/appendices/#canonical-json

Issue

The spec does not adequately and precisely specify what is required of a Canonical JSON implementation. It is therefore impossible to create fully interoperable implementations for the purposes of event signing or hashing.

For example, when Synapse canonicalises JSON, it round-trips (unmarshals it and then remarshals it) using sort_keys=True. The complete list of behaviours are not documented but this round-tripping can affect values in the JSON in surprising ways, i.e.

  • integers that appear in scientific notation on the wire are reformatted into floats

    • 1e6 -> 1000000.0
    • 1e+2 -> 100.0
  • floats more precise than IEEE 754 can be rounded or reformatted (probably only affects older room versions)

    • 1.0101010101010101010101010101010108 -> 1.0101010101010102
    • 12345678901234567890.0 -> 1.2345678901234567e+19
  • unicode characters are always escaped, even though this is not mandated in JSON except for certain characters

    • 💅 -> \xf0\x9f\x92\x85
    • 世界 -> \xe4\xb8\x96\xe7\x95\x8c
  • numeric values NaN, Infinity and -Infinity are accepted because of IEEE 754 seemingly, but nothing requires a JSON implementation to use IEEE 754 (and indeed many JSON implementations won't allow these values at all)

  • duplicate JSON keys are removed preserving the last value only, although this is an implementation-specific detail (and some JSON implementations will not pick the lexicographically last value)

Dendrite/gomatrixserverlib currently does not touch the values at all — excess whitespace is removed and keys are sorted but the values are not round-tripped during canonicalisation and therefore remain exactly as they appear on the wire (valid or invalid admittedly).

cc @jplatte

@neilalexander neilalexander added the clarification An area where the expected behaviour is understood, but the spec could do with being more explicit label Sep 7, 2022
@clokep
Copy link
Member

clokep commented Sep 7, 2022

  • floats more precise than IEEE 754 can be rounded or reformatted (probably only affects older room versions)
  • numeric values NaN, Infinity and -Infinity are accepted because of IEEE 754 seemingly, but nothing requires a JSON implementation to use IEEE 754 (and indeed many JSON implementations won't allow these values at all)

I believe both of these only apply to room versions < 6 (before strict canonical JSON was enforced in Synapse).

@tulir
Copy link
Member

tulir commented Sep 12, 2022

numeric values NaN, Infinity and -Infinity are accepted because of IEEE 754 seemingly, but nothing requires a JSON implementation to use IEEE 754 (and indeed many JSON implementations won't allow these values at all)

Allowing those used to be a bug in Synapse, but it was fixed already matrix-org/synapse#8106

unicode characters are always escaped, even though this is not mandated in JSON except for certain characters

The spec and synapse both mention/use ensure_ascii=False, which should mean unicode is not escaped unless mandatory. That's also implied in the spec by "shortest UTF-8 JSON encoding" in the spec.

@richvdh
Copy link
Member

richvdh commented Sep 21, 2022

So is the only remaining unspecified thing here the handling of duplicate keys?

@neilalexander if you agree it might be good to close this issue and open a clearer one.

@neilalexander
Copy link
Contributor Author

neilalexander commented Sep 21, 2022

So is the only remaining unspecified thing here the handling of duplicate keys?

The IEEE 754 problems (floating precision, Infinity and NaN) matter for supporting older room versions properly so we can't really omit them. The behaviour around duplicate keys is still unspecified today.

I think the root of the problem here is that round-tripping through arbitrary JSON implementations to canonicalise is inherently dangerous because there might be unpredictable behaviours in different implementations.

For example, if you round-trip in Go and there are duplicate keys, you'll get the last value. If you do it in Swift, you get the first value. If you round-trip your over-precise float in Python, you get scientific notation. If you do it in Go, you don't.

Ultimately we want to strip whitespace and ensure a consistent key ordering, but I'm not convinced there's a good reason to round-trip the actual values at all. Otherwise two implementations that disagree on how something should appear on the wire will not be able to compare signatures or hashes properly against those from another implementation.

The spec and synapse both mention/use ensure_ascii=False, which should mean unicode is not escaped unless mandatory. That's also implied in the spec by "shortest UTF-8 JSON encoding" in the spec.

"Do what Python's sort_keys=True and ensure_ascii=False does" is not adequate for a specification because that's putting everything into the hands of a language-specific implementation. We need to describe behaviours instead.

Not even the Python documentation describes all of those behaviours precisely, so what is anyone implementing in another language supposed to do? What happens if Python changes something?

"Shortest UTF-8 JSON encoding" isn't much more useful either — for example, in the 12345678901234567890.0 vs 1.2345678901234567e+19 example, they are both the same length. The transformed value wasn't any "shorter", it just lost precision. As a result we end up hashing/signing a value that is arguably different to what was on the wire to begin with.

@tulir
Copy link
Member

tulir commented Sep 21, 2022

The IEEE 754 problems (floating precision, Infinity and NaN) matter for supporting older room versions properly so we can't really omit them.

I think Synapse made a breaking change when disabling them, they aren't allowed in old room versions either (unlike floats, which are allowed in old versions)

for example, in the 12345678901234567890.0 vs 1.2345678901234567e+19 example, they are both the same length

Those aren't allowed in canonical JSON, but it's true that the non-canonical JSON in pre-v6 rooms is currently extremely inadequately specified (not specified at all, other than stating it may not be canonical)

I'm not convinced there's a good reason to round-trip the actual values at all.

I don't think there's any way to avoid roundtripping other than inventing a new out-of-band signature mechanism, since you need to remove the signatures and unsigned objects before verifying the signature? An out-of-band signature mechanism would likely involve JSON as base64 within JSON (signatures are used in lots of places, not only federation transactions where the whole request is signed).

@neilalexander
Copy link
Contributor Author

Those aren't allowed in canonical JSON, but it's true that the non-canonical JSON in pre-v6 rooms is currently extremely inadequately specified (not specified at all, other than stating it may not be canonical)

They aren't allowed in "today's canonical JSON", but it's worth noting that there's really two "canonical JSONs" here: the one which applies to room versions <= 5 and the other which applies to room version >= 6. We were using the term long before room version 6 came out.

For what it's worth, federation request signing also suffers from similar problems as the JSON request body needs to be canonicalised in order to be signed too. It's using the exact same mechanism. If we're sending events in federation request bodies, as we do with /send for example, then it helps to be clear on what the expected behaviour is there for both old and new room versions.

Cross-signing is another thing that comes to mind. It matters in any place where we try to sign a blob of JSON in this way.

I don't think there's any way to avoid roundtripping other than inventing a new out-of-band signature mechanism, since you need to remove the signatures and unsigned objects before verifying the signature? An out-of-band signature mechanism would likely involve JSON as base64 within JSON (signatures are used in lots of places, not only federation transactions where the whole request is signed).

At the moment, gomatrixserverlib's canonicalisation works by parsing the structure of the JSON and the key values only — it does not round-trip the values themselves. It's a bit more manual but it is far more precise in its operation. In the absence of out-of-band signing, that seems like probably the best option for the future that we have, since it means there's no room for implementations to disagree on value formatting as long as we specify some simple rules for the keys.

@richvdh
Copy link
Member

richvdh commented Sep 21, 2022

The IEEE 754 problems (floating precision, Infinity and NaN) matter for supporting older room versions properly so we can't really omit them. The behaviour around duplicate keys is still unspecified today.

ok, even if there are outstanding issues here, I think there are at least four different issues being conflated here and I'd really rather we discussed them separately rather than attempting to conflate them all together:

  • "pre-v6 canonical JSON allows IEEE stuff": (true, but we can't really do much about it now other than make sure it is correctly documented).
  • '"Do what Python's sort_keys=True and ensure_ascii=False does" is not adequate for a specification' (true, but it's a simple matter of updating the spec. There is clear precedent here for what the spec ought to say)
  • "post-v6 canonical JSON doesn't specify how to handle duplicate keys": yes that's an issue. Let's open an issue to track it without conflating it with the floats and string encoding stuff.
  • "trying to use JSON for this is stupid anyway": probably right, but a bigger change. Can we track it in a separate issue?

@richvdh
Copy link
Member

richvdh commented Sep 21, 2022

Basically: each of those four things has different solutions. I don't think that conflating them all into one issue is helpful.

@turt2live
Copy link
Member

if we already have solutions for points 1 & 2, then there's not much reason to split 3 out to its own issue: it'd just get fixed as part of the trio anyways. 4 is probably worthy of its own issue, though in fairness it's not likely to ever get looked at given it has such a large scope: it'd be nice to attach a roadmap to that issue if we plan to open it as a separate point.

@richvdh
Copy link
Member

richvdh commented Sep 22, 2022

if we already have solutions for points 1 & 2, then there's not much reason to split 3 out to its own issue: it'd just get fixed as part of the trio anyways

1 and 2 are a matter of clarifying the spec. 3 needs discussion about how best to approach it. So no, I don't agree that 3 will automatically get fixed as part of 1 & 2.

@neilalexander neilalexander closed this as not planned Won't fix, can't repro, duplicate, stale Sep 22, 2022
@richvdh
Copy link
Member

richvdh commented Sep 22, 2022

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification An area where the expected behaviour is understood, but the spec could do with being more explicit
Projects
None yet
Development

No branches or pull requests

5 participants