Behaviours of Canonical JSON not thoroughly documented #1245

neilalexander · 2022-09-22T08:34:13Z

Right now the spec provides a Python snippet to implement Canonical JSON:

import json

def canonical_json(value):
    return json.dumps(
        value,
        # Encode code-points outside of ASCII as UTF-8 rather than \u escapes
        ensure_ascii=False,
        # Remove unnecessary white space.
        separators=(',',':'),
        # Sort the keys of dictionaries.
        sort_keys=True,
        # Encode the resulting Unicode as UTF-8 bytes.
    ).encode("UTF-8")

This doesn't adequately document the actual behaviours, but instead has led us into a situation where the Python implementation is the only "correct" one.

The spec needs to clarify at least:

- which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)
- upper and lower bounds of all numeric values for both pre- and post-v6 rooms
- whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms
- how to handle Unicode (escaping, UTF-16 surrogates, etc.)
- precedence order of duplicate keys

(created from #1232)


richvdh · 2022-09-22T17:45:15Z

link to the relevant bit of spec: https://spec.matrix.org/v1.3/appendices/#canonical-json

> which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)

No. I think this is implied by "Numbers in the JSON must be integers...", and certainly Synapse's behaviour here is consistent, but the spec could be more explicit. PR to clarify this would be appreciated.
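
For illustration, here is how CPython's json module (the one used by the spec's snippet) writes numbers. This is only a sketch of that one implementation's behaviour, not a normative rule; the variable names are made up for the example. Integers come out as plain decimal digits, while floats are where a trailing ".0" or exponent notation appears:

```python
import json

# Integers are written as plain decimal digits.
int_small = json.dumps(10**9)        # '1000000000'
int_large = json.dumps(2**53 - 1)    # '9007199254740991'

# Floats are written using their repr(), which is where other notations come from.
float_small = json.dumps(1e9)        # '1000000000.0'
float_large = json.dumps(1e16)       # '1e+16'

for text in (int_small, int_large, float_small, float_large):
    print(text)
```

So as long as every number is an integer, the Python snippet never emits scientific notation.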

> upper and lower bounds of all numeric values for both pre- and post-v6 rooms

Pre-v6 is #1244. Post-v6 is, I think, clearly specced by "Numbers in the JSON must be integers in the range [-(2**53)+1, (2**53)-1]."
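
A minimal sketch of what enforcing that sentence could look like for an already-parsed value. The names `check_numbers`, `MIN_INT` and `MAX_INT` are hypothetical and not taken from the spec or from Synapse, and the sketch assumes the value was produced by Python's json module:

```python
MAX_INT = 2**53 - 1
MIN_INT = -(2**53) + 1

def check_numbers(value):
    """Raise ValueError if value contains a float or an out-of-range integer."""
    # bool is a subclass of int in Python, so it has to be handled first.
    if value is None or isinstance(value, (bool, str)):
        return
    if isinstance(value, float):
        raise ValueError(f"floats are not allowed: {value!r}")
    if isinstance(value, int):
        if not MIN_INT <= value <= MAX_INT:
            raise ValueError(f"integer out of range: {value!r}")
        return
    if isinstance(value, dict):
        for item in value.values():
            check_numbers(item)
        return
    if isinstance(value, (list, tuple)):
        for item in value:
            check_numbers(item)
        return
    raise TypeError(f"not a JSON value: {type(value).__name__}")

check_numbers({"depth": 2**53 - 1, "ok": True})  # passes silently
```

Calling it with `2**53` or with `1.0` anywhere in the structure raises ValueError.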

> whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms

Again, old rooms are #1244.

> how to handle Unicode (escaping, UTF-16 surrogates, etc.)

Yeah, this definitely needs clarifying. The only relevant text at the moment is in the Python snippet: "Encode code-points outside of ASCII as UTF-8 rather than \u escapes". The presence of the Python snippet means that the behaviour is well-defined; it's just not defined in a way that is helpful for anyone not writing Python. So, a PR to fix this would be very helpful.
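
To make the implied behaviour visible to non-Python readers, here is a sketch that runs the spec's snippet over a few strings. It only records what CPython's json module does; the variable names are invented for the example, and nothing here says what the spec should require:

```python
import json

def canonical_json(value):
    # Same options as the snippet in the spec.
    return json.dumps(
        value, ensure_ascii=False, separators=(",", ":"), sort_keys=True
    ).encode("UTF-8")

# Non-ASCII code points are emitted as raw UTF-8 bytes, not \u escapes.
non_ascii = canonical_json("é")          # b'"\xc3\xa9"'

# Control characters are still escaped: short forms where JSON has one,
# otherwise a lowercase \u00xx escape.
newline = canonical_json("\n")           # b'"\\n"'
control = canonical_json("\x01")         # b'"\\u0001"'

# A \u-escaped surrogate pair in the input is decoded to one code point,
# then re-encoded as four UTF-8 bytes.
pair = canonical_json(json.loads('"\\ud83d\\ude00"'))  # b'"\xf0\x9f\x98\x80"'

# A lone surrogate is accepted by json.loads but cannot be encoded as UTF-8.
try:
    canonical_json(json.loads('"\\ud83d"'))
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False
```

The last case is the awkward one: with this snippet a lone surrogate is not serialised at all, it raises UnicodeEncodeError.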

> precedence order of duplicate keys

I think this is #1246?
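
For reference, this is what CPython's json module does with duplicate keys today, plus a sketch of how a parser could reject them instead. `reject_duplicates` is a hypothetical helper written for this example; it is not part of the spec or of any implementation mentioned here:

```python
import json

doc = '{"a":1,"a":2}'

# By default the last occurrence of a key wins.
last_wins = json.loads(doc)  # {'a': 2}

def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated keys."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
    duplicates_rejected = False
except ValueError:
    duplicates_rejected = True
```

Whichever rule is chosen, by the time the spec's snippet sees a Python dict the duplicates are already gone, so any precedence is decided by the parser rather than by the serialiser.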

neilalexander added the "clarification" label (An area where the expected behaviour is understood, but the spec could do with being more explicit) on Sep 22, 2022

This was referenced Sep 22, 2022

Multiple "Canonical JSON" definitions not distinct in spec #1247 (Open)

JSON signing not appropriate for interoperability #1248 (Open)

Canonical JSON is inadequately specified #1232 (Closed)

richvdh mentioned this issue Jun 14, 2023

Canonical JSON: 0 vs -0 #1566 (Closed)
