Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON parsing fails for integer values greater than BiggestInt.max #15413

Closed
ErikSchierboom opened this issue Sep 26, 2020 · 29 comments
Closed

JSON parsing fails for integer values greater than BiggestInt.max #15413

ErikSchierboom opened this issue Sep 26, 2020 · 29 comments

Comments

@ErikSchierboom
Copy link

The JSON parsing functions (like parseJson and parseFile) fail when trying to parse an integer which value exceeds the maximum value that BiggestInt can hold, even if that value could be represented by the BiggestUInt type.

Example

import json

let jsonWithBigInteger = "9223372036854775808"

echo parseJson(jsonWithBigInteger)

See https://play.nim-lang.org/#ix=2yON

Current Output

/usercode/in.nim(5)      in
/playground/nim/lib/pure/json.nim(931) parseJson
/playground/nim/lib/pure/json.nim(847) parseJson
/playground/nim/lib/pure/json.nim(783) parseJson
/playground/nim/lib/pure/strutils.nim(1094) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(443) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(423) rawParseInt
/playground/nim/lib/pure/parseutils.nim(397) integerOutOfRangeError
Error: unhandled exception: Parsed integer outside of valid range [ValueError]

Expected Output

9223372036854775808

Possible Solution

Additional Information

$ nim -v
Nim Compiler Version 1.2.6 [MacOSX: amd64]
Compiled at 2020-09-21
Copyright (c) 2006-2020 by Andreas Rumpf

active boot switches: -d:release
@metagn
Copy link
Collaborator

metagn commented Sep 26, 2020

A solution would be to detect if BiggestInt can hold the integer value and if not, use BiggestUInt.

This would need another JsonNodeKind and branch of JsonNode specifically for uints, which would have no other use outside this specific case (integers between [2^63, 2^64)). A simple alternative is parsing to a float, which is technically what JS does. A JBigNum branch that stores a raw string of the integer is also too specialized, since it is only for integers that are too big and not, for example, floats that have more than 18 decimal digits.

@disruptek
Copy link
Contributor

I’ve had this problem for ages. As far as I can tell, JSON doesn’t define a range for integers and unfortunately, we cannot complete a parse successfully when our native type cannot hold the JSON integer value.

The right solution fixes our stdlib, I think, but someone who cares about our hooks system should weigh in with a sympathetic design.

What do our other 3rd-party JSON libraries do?

@Araq
Copy link
Member

Araq commented Sep 27, 2020

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing.

@juancarlospaco
Copy link
Collaborator

Should be doable by fixing #14696 (comment)

@jackhftang
Copy link
Contributor

It will be very tricky for caller to handle logic that normally a JInt is expected, but if the value greater than some value it is a JFloat and lose precision silently. IMHO, throwing an error make sense.

@qbradley
Copy link

qbradley commented Oct 4, 2020

If the library was generic on the integer type then the user could choose appropriate type for their use case including a bigint from nimble. Is it possible to make it generic with a default to BiggestInt without breaking compatibility?

@timotheecour
Copy link
Member

D's std.json (see https://dlang.org/phobos/std_json.html) handles large uint's correctly; nim can handle them correctly too.

import std.stdio;
import std.json;
void main(string[]args){
  string s = "9223372036854775808";
  JSONValue j = parseJSON(s);
  assert(j.toString == "9223372036854775808");
  assert(j.type == JSONType.uinteger);
  assert(j.get!ulong == 9223372036854775808);
}

(note that https://github.com/dlang-community/std_data_json or https://vibed.org/api/vibe.data.json/ is more popular)

@juancarlospaco
Copy link
Collaborator

juancarlospaco commented Oct 4, 2020

@timotheecour Whats JSONType.uinteger a Big Int basically?.
I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

@timotheecour
Copy link
Member

timotheecour commented Oct 4, 2020

Whats JSONType.uinteger a Big Int basically?.

no, it's an enum member encoding the parsed type; see code here: https://github.com/dlang/phobos/blob/master/std/json.d

here are how these map to internal D types:

struct JSONValue
{
    import std.exception : enforce;

    union Store
    {
        string                          str;
        long                            integer;
        ulong                           uinteger;
        double                          floating;
        JSONValue[string]               object;
        JSONValue[]                     array;
    }
    private Store store;

and here's the parsing logic:

                    if (isNegative)
                    {
                        value.store.integer = parse!long(data);
                        value.type_tag = JSONType.integer;
                    }
                    else
                    {
                        // only set the correct union member to not confuse CTFE
                        ulong u = parse!ulong(data);
                        if (u & (1UL << 63))
                        {
                            value.store.uinteger = u;
                            value.type_tag = JSONType.uinteger;
                        }
                        else
                        {
                            value.store.integer = u;
                            value.type_tag = JSONType.integer;
                        }
                    }

which is hopefully self explanatory: it tries to fit into a integer (mapping to D's long, or nim's int) unless it's positive and doesn't fit into an integer, in which case it maps to uinteger (D's ulong, nim's uint)

I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

float is a bad workaround here; and Big Int is not needed to solve this problem. We can adapt the above D solution which (hopefully) results in no breaking change; only parsed integers that don't fit in int would result in being encoded as uint (instead of raising ValueError as they do currently)

@disruptek
Copy link
Contributor

How does that help? Sounds like a terrible product to use.

We need BigInt and I think it’s pretty clear why; the JSON spec demands it. Anything else is just going to require fixing later.

@krux02
Copy link
Contributor

krux02 commented Oct 9, 2020

Json numbers are not BigInt, they are arbitrary precision decimal numbers. The problem here is the same with string and BigInt, they require an additional tiny allocation and pointer indirection making json parsing even slower than it already is. So my recommendation is to not write to the decimal field unless absolutely necessary.

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing

WTF float? convert the number to a lossy type and then back to uint. What could ever go wrong with that one. String would work I agree, but also an additional uint64 field would fix the problem.

@timotheecour
Copy link
Member

timotheecour commented Oct 9, 2020

I have to agree with @krux02 here; Json numbers are not BigInt so using BigInt wouldn't solve the general problem of number, and most importantly would incur overhead in the common case; furthermore there is not ETA on std/bigints see #14696 (essentially, either it's a slow nim implementation/wrapper or it's a fast nim wrapper around GMP but which has LGPL licensing concerns, see #14696 (comment))

my proposal remains:

  • use uint64 exactly as done in D (for uint64 numbers too large to fit in int64) see JSON parsing fails for integer values greater than BiggestInt.max  #15413 (comment)
  • for now users that need larger numbers than uint64/int64 range (ditto with large floats) must encode as string otherwise overflow exception is raised
  • eventually, once std/bigints materialize, the overflow exception would not be thrown for numbers that don't fit in uint64/int64
  • ditto with std/bigfloats (which could also be enabled by a GMP wrapper btw, since GMP implements both)

benefits:

  • can be implemented quickly, without waiting on std/bigints, std/bigfloats
  • no bigint overhead forced upon every number including the 99.99..% common case (in practice) of numbers that fit in int64/uint64
  • no breaking change nor performance breaking change once std/bigints becomes available

@disruptek
Copy link
Contributor

I don't know why @krux02 brought up the number type; this issue is about the integer type.

I'm against changing json twice -- once to accomodate one extra bit, and again when we have a proper solution that meets the spec. If you can provide the API that won't necessitate an additional change later, then I'd support this change.

I'm curious why you think that we'll have bigint overhead for all integers. It seems to me that we'll know how many digits the integer has, so most smaller integers should be quite easy to accomodate natively without any bigint machinery.

@timotheecour
Copy link
Member

timotheecour commented Oct 10, 2020

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

Note: other libraries I found that support (at least) uint64 in json do as I suggest (more research welcome; by no means complete):

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

PR

=> #15545

@disruptek
Copy link
Contributor

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I'm with you so far. 👍

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

I know. But that doesn't mean there's nothing wrong with it, right? It just means you don't see it.

Once you start changing the semantics, you ruin the point of JSON. If you're using it in any kind of way beyond that which it is specified for, then basically you are doing it wrong. The whole idea is that we agree on what we know and don't know about the data.

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

Not my problem. I didn't invent JSON nor did I write our implementation of it. I'm just trying to explain that you're going down the wrong path if you decide to attempt to ascribe more precision to types that are intentionally vague. This isn't adding abstraction, it's removing it.

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

@krux02
Copy link
Contributor

krux02 commented Oct 13, 2020

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

If the new enum value only occur if you encounter values that were impossible to use before, then I don't think it would break your code. As timothee suggested, the uint type should only be used for integer values that cannot be represented with int64.

I don't know why @krux02 brought up the number type; this issue is about the integer type.

Well to be precise, Json only has a one type for numbers. And these are arbitrary precision decimal types (I call it JNumber for now), because that is exactly what you can represent in ASCII. Nim's native types are fixed precision integer and floating point types with up to 64 bits of precision. On top of that Nim's floating point numbers (IEEE 754) also allow NaN and ±∞. This means neithe is JNumber a subset of any of the buildin Nim types, nor is float32/float64 a subset of JNumber and the implementation of Json now has to deal with this mess.

One advantage that we do know is, that almost all numbers that end up in json are serialized from some native number types, not hand written nor from some arbitrary precision library. So it totally makes sens to special case for all these builtin types to support them as good and fast as possible. So even if the Arbitrary precision number type would already be supported, it would still make sense to add this UInt special case for performance reasons.

@disruptek
Copy link
Contributor

First, maybe you can explain what I'm missing in the following exhaustive case statement:

case kind
of JInt: discard
of JFloat: discard
of JString: discard
of JBool: discard
of JObject: discard
of JArray: discard
of JNull: discard

This will result in a compile-time error if you add a new value to JsonNodeKind.

I agree that the implementation is poor. Again, if you want my support, show me how the change does not break existing code. Alternatively, show how the change won't itself need to be changed in some breaking way in the future.

Honestly, I would sooner support a solution that is 100x slower, supports only one JNumber type, and does so for every conceivable JSON encoding. If you're going to break me, break me for the last time.

@krux02
Copy link
Contributor

krux02 commented Oct 13, 2020

Yes, sorry, you are right. For a moment I forgot that in Nim case statements fail to compile when they are not fully covered.

@disruptek
Copy link
Contributor

What about JUndefined? It lets us change the spec just once. You can subsequently parse JUndefined with whatever assumption you wish to bring to bear upon the source.

@Araq
Copy link
Member

Araq commented Oct 14, 2020

JavaScript doesn't have 64 bit unsigned numbers at all, this discussion is silly. If we look at the spec, https://www.json.org/json-en.html it seems clear to me that any number must be accepted, including numbers much larger than high(uint64). I propose we add a new JSON kind, JRawNumber and parse it into a string value. Then clients can use a custom parseNumber proc and their favorite bignum library (or restrict it to uint64).

@disruptek
Copy link
Contributor

Sure, but this is JSON, not JavaScript.

If we break with JUndefined then we may as well also consolidate into a single JNumber kind also.

@timotheecour
Copy link
Member

timotheecour commented Oct 14, 2020

I like JRawNumber if done in addition to JUInt, uint64 is a native type on most platforms and shouldn't incur performance cost

  • JUInt shall still be introduced (see fix #15413: support JSON in int64.high+1..uint64.high as JUint, and beyond as JNumber #15545) and ensure uint64 doesn't incur overhead, since it's arguably much more common than bigint/bigfloat
  • JRawNumber is also added as a fallback to represent all valid json numbers not already covered by the other number kinds (JInt, JUint, JFloat)
  • JRawNumber is represented as a string and json parser only checks that the string represent valid json number, no decoding happens at parse time
  • proc getRawNumberStr(j: JsonNode): string is added to json get APIs (no decoding happens here)
  • user can use parse the output of getRawNumberStr outside of std/json using a separate library for bigint/bigfloat

this gives best of both worlds:

  • performance is not sacrified for the common case (uint64 can be common for some applications eg uuids etc)
  • json spec is followed
  • json isn't bloated by an added dependency on some bigint/bigfloat library
  • only breaks the case statement once (JUInt + JRawNumber can be added in 1 PR or at least within 1 nim point release)

@disruptek
Copy link
Contributor

Not a fan, but I'm thinking I'll be writing my own JSON parser, so don't let me stand in the way of "progress".

@krux02
Copy link
Contributor

krux02 commented Oct 14, 2020

@disruptek well, if you do that, you should probably start here: https://github.com/Araq/packedjson

@dom96
Copy link
Contributor

dom96 commented Oct 18, 2020

Am I understanding correctly that your plan is to break code by introducing JRawNumber? Can we come up with a transition period for this at least?

@Araq
Copy link
Member

Araq commented Oct 19, 2020

In the meantime I found a better solution: A parse flag so that "unparsable" numbers are mapped to JString.

@jackhftang
Copy link
Contributor

but then you cannot distinguish

{ "n": 12345678901234567890 }

or

{ "n": "12345678901234567890" }

one way is to map to a new type JUnknown instead of JString, but I don't think it is a good solution.

timotheecour added a commit to timotheecour/Nim that referenced this issue Oct 19, 2020
@timotheecour
Copy link
Member

timotheecour commented Oct 19, 2020

I've updated #15545 as described in #15413 (comment)

nim> parseJson(""" { "a1": 123456789012345676712345678} """)["a1"].kind
JNumber == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676712} """)["a1"].kind
JUInt == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676} """)["a1"].kind
JInt == type JsonNodeKind

EDIT: and see #15646 for my proposed approach to mitigate impact of enum changes.

timotheecour added a commit to timotheecour/Nim that referenced this issue Oct 20, 2020
timotheecour added a commit to timotheecour/Nim that referenced this issue Oct 20, 2020
@krux02
Copy link
Contributor

krux02 commented Oct 21, 2020

but then you cannot distinguish

Yes true, but not really worth to worry about. "JString" is bad, but floating point numbers need to fall back to a string already anyway if they want to serialize infinity somehow. So I think "numbers might be in a string literal" is something I think is acceptable.

timotheecour added a commit to timotheecour/Nim that referenced this issue Oct 21, 2020
Araq added a commit that referenced this issue Oct 29, 2020
@Araq Araq closed this as completed in 87a60c1 Oct 29, 2020
timotheecour added a commit to timotheecour/Nim that referenced this issue Oct 29, 2020
narimiran pushed a commit that referenced this issue Nov 5, 2020
* fixes #15413

* better hide it properly

* see if this makes our list of important packages happy

(cherry picked from commit 87a60c1)
PMunch pushed a commit to PMunch/Nim that referenced this issue Jan 6, 2021
* fixes nim-lang#15413

* better hide it properly

* see if this makes our list of important packages happy
mildred pushed a commit to mildred/Nim that referenced this issue Jan 11, 2021
* fixes nim-lang#15413

* better hide it properly

* see if this makes our list of important packages happy
irdassis pushed a commit to irdassis/Nim that referenced this issue Mar 16, 2021
* fixes nim-lang#15413

* better hide it properly

* see if this makes our list of important packages happy
ardek66 pushed a commit to ardek66/Nim that referenced this issue Mar 26, 2021
* fixes nim-lang#15413

* better hide it properly

* see if this makes our list of important packages happy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
10 participants