Behaviours of Canonical JSON not thoroughly documented #1245

Open
neilalexander opened this issue Sep 22, 2022 · 1 comment
Labels
clarification An area where the expected behaviour is understood, but the spec could do with being more explicit

Comments

@neilalexander
Contributor

Right now the spec provides a Python snippet to implement Canonical JSON:

import json

def canonical_json(value):
    return json.dumps(
        value,
        # Encode code-points outside of ASCII as UTF-8 rather than \u escapes
        ensure_ascii=False,
        # Remove unnecessary white space.
        separators=(',',':'),
        # Sort the keys of dictionaries.
        sort_keys=True,
        # Encode the resulting Unicode as UTF-8 bytes.
    ).encode("UTF-8")

This doesn't adequately document the actual behaviours; instead, it has led us into a situation where the Python implementation is the only "correct" one.

The spec needs to clarify at least:

  • which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)
  • upper and lower bounds of all numeric values for both pre- and post-v6 rooms
  • whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms
  • how to handle unicode (escaping, UTF-16 surrogates, etc.)
  • precedence order of duplicate keys

(created from #1232)

@neilalexander neilalexander added the clarification An area where the expected behaviour is understood, but the spec could do with being more explicit label Sep 22, 2022
@richvdh
Member

richvdh commented Sep 22, 2022

Link to the relevant bit of the spec: https://spec.matrix.org/v1.3/appendices/#canonical-json

  • which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)

No. I think this is implied by "Numbers in the JSON must be integers...", and certainly Synapse's behaviour here is consistent, but the spec could be more explicit. PR to clarify this would be appreciated.
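
To illustrate where scientific notation could come from, here is a small sketch of how the spec's Python snippet behaves when handed integers versus floats. It only demonstrates the behaviour of Python's standard `json` module and is not a statement of what the spec requires; the variable names are made up for the example.

```python
import json

def canonical_json(value):
    # Same options as the snippet in the spec.
    return json.dumps(
        value, ensure_ascii=False, separators=(',', ':'), sort_keys=True
    ).encode("UTF-8")

# Integers are serialised as plain decimal digits, whatever their size.
int_output = canonical_json({"n": 10**9})

# Floats go through Python's float repr: a trailing ".0" for moderate
# values, and exponent notation once the magnitude is large enough.
small_float_output = canonical_json({"n": 1e9})
large_float_output = canonical_json({"n": 1e16})

print(int_output, small_float_output, large_float_output)
```

So exponent notation only shows up if a float reaches the encoder in the first place, which is consistent with reading "Numbers in the JSON must be integers" as ruling it out.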

  • upper and lower bounds of all numeric values for both pre- and post-v6 rooms

Pre-v6 is #1244. Post-v6 is, I think, clearly specced by "Numbers in the JSON must be integers in the range [-(2**53)+1, (2**53)-1]."
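
As a rough sketch of what checking that post-v6 rule might look like on an already-decoded value: the helper name `has_only_canonical_numbers` and the two constants are hypothetical, and treating every float (even `1.0`) as invalid is an assumption about how strictly "must be integers" should be read.

```python
MAX_CANONICAL_INT = 2**53 - 1
MIN_CANONICAL_INT = -(2**53) + 1

def has_only_canonical_numbers(value):
    """Return True if every number in a decoded JSON value is an in-range integer."""
    # bool is a subclass of int in Python, but JSON true/false are not numbers.
    if isinstance(value, bool) or value is None or isinstance(value, str):
        return True
    if isinstance(value, int):
        return MIN_CANONICAL_INT <= value <= MAX_CANONICAL_INT
    if isinstance(value, float):
        # Assumption: any float is rejected, including integral ones like 1.0.
        return False
    if isinstance(value, list):
        return all(has_only_canonical_numbers(v) for v in value)
    if isinstance(value, dict):
        return all(has_only_canonical_numbers(v) for v in value.values())
    # Anything else is not something json.loads would produce.
    return False
```

A caller could run this on the result of `json.loads` before signing or hashing an event; it deliberately says nothing about pre-v6 rooms.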

  • whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms

Again, old rooms are #1244.

  • how to handle unicode (escaping, UTF-16 surrogates, etc.)

Yeah, this definitely needs clarifying. The only relevant text at the moment is in the Python snippet: "Encode code-points outside of ASCII as UTF-8 rather than \u escapes". The presence of the Python snippet means that the behaviour is well-defined; it's just not defined in a way that is helpful for anyone not writing Python. So, a PR to fix this would be very helpful.
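
For anyone not writing Python, the following sketch spells out what the snippet actually does with a few kinds of string. It reflects the behaviour of Python's standard `json` module and `str.encode`, not wording from the spec, and the variable names are invented for the example.

```python
import json

def canonical_json(value):
    # Same options as the snippet in the spec.
    return json.dumps(
        value, ensure_ascii=False, separators=(',', ':'), sort_keys=True
    ).encode("UTF-8")

# Non-ASCII code points are emitted as raw UTF-8 bytes, not \u escapes.
non_ascii = canonical_json("\u00e9")          # U+00E9, two UTF-8 bytes

# Code points outside the BMP become one 4-byte UTF-8 sequence,
# not a pair of \u-escaped UTF-16 surrogates.
astral = canonical_json("\U0001F600")

# Control characters are still escaped: short escapes such as \n where
# they exist, \u00XX otherwise.
control = canonical_json("\u0001\n")

# A lone surrogate cannot be encoded as UTF-8, so the final .encode() raises.
try:
    canonical_json("\ud800")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

Whether other implementations should reject lone surrogates or handle them differently is exactly the kind of thing the spec text would need to state.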

  • precedence order of duplicate keys

I think this is #1246?
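
For context on why duplicate keys matter here, this sketch shows what happens when input with a repeated key goes through Python's decoder and then the spec's snippet. The "last key wins" result is Python's `json.loads` behaviour, not something the spec states; `reject_duplicates` is a hypothetical helper showing one way an implementation could refuse such input instead.

```python
import json

def canonical_json(value):
    # Same options as the snippet in the spec.
    return json.dumps(
        value, ensure_ascii=False, separators=(',', ':'), sort_keys=True
    ).encode("UTF-8")

raw = '{"a":1,"b":0,"a":2}'

# Python's decoder silently keeps the last value for a repeated key.
decoded = json.loads(raw)
output = canonical_json(decoded)

def reject_duplicates(pairs):
    # Hypothetical hook: fail loudly rather than pick a winner.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in JSON object")
    return dict(pairs)

try:
    json.loads(raw, object_pairs_hook=reject_duplicates)
    duplicates_rejected = False
except ValueError:
    duplicates_rejected = True
```

Because the canonical form is produced from the decoded value, the duplicate has already been resolved by the time `canonical_json` runs, so two implementations with different decoders could sign different bytes for the same input.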
