JSON does allow better than IEEE 754 numbers #143
Some points to consider:
Sounds like you want to subscribe to json@ietf.org (https://www.ietf.org/mailman/listinfo/json) :) Also, no, not all JSON implementations support IEEE 754 doubles. Heimdal's, for example, only supports C ints. This point has been made by others on the json@ietf.org list: implementations generally use whatever representation their environment most natively provides. An interoperable subset would be as hard to nail down as water. The ship has sailed and all that.

As for 64-bit integers, they have a significant advantage over doubles: exactitude for the whole range of 2^64 integers that fit in 64 bits. It's quite common on the SQLite3 list, for example, to see recommendations that monetary values be expressed as a whole number of cents or fractions of cents to avoid problems with IEEE 754 doubles. (Of course, one then has to express the shift factor somewhere, such as in a schema, and then one has to be prepared to change that when dealing with rapidly-inflating currencies, and so 64-bit integers aren't exactly great either, but they are better than doubles for many uses.)
Thanks for the link, I'll join that mailing list. I wasn't aware of Heimdal's implementation. Looking at the code, it seems to be unable to parse "[42.0]", so I'm not sure it's right to call it a "JSON parser". The printer does seem to output valid JSON, but all of the values it outputs are representable exactly as doubles (by virtue of being 32-bit ints).

Doubles can do perfectly exact integer arithmetic on all integers from -2^53 to +2^53. This range is enough to encode the world GDP in US$ cents, or a Unix timestamp in microseconds. When you step outside that range (e.g. with a rapidly-inflating currency), doubles lose information at the 16th significant figure, while 64-bit ints lose information at the first. Still not seeing the uses where 64-bit ints win :)
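The 2^53 boundary is easy to probe from the shell. A quick sketch, assuming a jq built on IEEE 754 doubles (the case for the jq of this thread; newer jq releases preserve number literals on output):

$ jq -n '9007199254740993'      # 2^53 + 1 rounds to the nearest double
9007199254740992
$ jq -n '9007199254740992 == 9007199254740993'   # both literals parse to the same double
true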
Why not express it in the top 11 bits? :-)
On Wed, Jun 5, 2013 at 9:47 AM, Stephen Dolan notifications@github.com wrote:

Good! Review the archives (it's a new list; there's not a lot there). http://datatracker.ietf.org/wg/json/charter/ The charter is always subject to change via the normal IETF process.

Indeed, it only parses integers. And the parser is not online (it's

Think of totaling up all liabilities in USD. I hear that altogether Some people dealing with financial and math applications routinely
Right, but if we're working with $100e12 in fractions of a Yen, we can hit 2^63 (depending on how small "fractions" are). And if we go past the limit in doubles, we keep doing integer arithmetic in units of 2, 4, 8, ... fractions of a cent. If we go past the limit in 64-bit ints, we get nonsense.

I don't think there are financial reasons to care about more than 16 significant digits. I don't believe there is financial data that has more than about 10. There are lots of mathy reasons to want large integers, but they aren't interesting if 2^64 is an upper limit. Crypto's probably the main use case, but JSON representations of that stuff tend to be base64-encoded binary data rather than numbers anyway.

I see the point of having high-precision numbers in JSON. I'd argue that the advantages of trusting that your data can pass through any other JSON-compatible system unaffected outweigh the pain of having to encode really big numbers as something other than JSON numbers, but I definitely see the counter-argument.

I don't see the point of having 64-bit integers, though. There seems to be no real use case where they're better than doubles - they have a larger arbitrary limit but much, much worse behaviour when you hit it. I still can't think of any application where you need to do numeric calculations, your numbers might be bigger than 2^53, you can guarantee that they will be less than 2^63, and you need to represent all integers less than 2^63 and no floating-point numbers.
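The "lose information at the 16th significant figure" claim is easy to see in practice. A sketch, again assuming a double-based jq:

$ jq -n '9999999999999999'    # 16 nines, just past 2^53: off by one in the 16th digit
10000000000000000

A 64-bit signed int would still hold this particular value exactly, but once it overflows at 2^63 the result is garbage rather than a nearby approximation, which is the trade-off being argued here.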
I won't try to convince you. I'll stop at pointing out that this
Particularly for a filter. For a C libjv I might not care: chances are I'm NOT using bignums.
If you'd add bignums, I agree. If not, it's just 10 more bits of precision.
Yeah, this is definitely getting into a long silly argument.

Re SQLite: you should definitely represent monetary values as integers, in cents or in hundredths-of-a-cent, etc. That way, you're doing arithmetic using the same units as everyone else, since 0.01 is not exactly representable in binary. My point was just that IEEE 754 doubles are a pretty good integer format: if you're storing integer numeric data, I think 53-bit ints and sensible overflow behaviour is usually a better choice than 63-bit ints and catastrophic overflow behaviour.

Also, agreed: arbitrary-precision reals (maybe implemented as rationals given by a pair of bigints?) would be useful, much more so than 64-bit ints.
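The 0.01 problem is a one-liner to reproduce; jq's own arithmetic (doubles underneath) shows why whole cents beat fractional dollars:

$ jq -n '0.1 + 0.2'     # fractional currency amounts drift
0.30000000000000004
$ jq -n '10 + 20'       # the same sum expressed in cents stays exact
30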
I wasn't arguing for anything in particular. Re: 64-bit ints I was
Agreed. I think arbitrary precision reals are probably easier to
tl;dr - rounding 64-bit ints is often unacceptable behavior for people who use JSON but not Javascript.

So I just came across jq, and I've been playing around with it a bit. I'm working with logs that have been serialized with JSON, and so I often find myself having to write short scripts to convert my data from one form to another for various data processing tasks, which is what I was considering using jq for, and it seems like it would work great except (you guessed it) it doesn't handle 64-bit ints.

Naturally, my data has some 64-bit integers (unsigned, even). For the most part, they're used as ids, but I don't exactly have the option of changing them to strings since there's already years of code that expects these things to be ints when they come out of the serialization, and actually uses the int-ness of them for useful purposes (e.g. sharding based on the high order bits, A/B testing based on the low order bits). Javascript, as you can probably imagine, isn't part of my data processing toolchain, so this doesn't really cause a problem for me normally.

My point here isn't that 64-bit ints are necessarily better than doubles or strings in any important way, but that there are a lot of languages that support them, and as a result a lot of software that uses them and also uses JSON as a serialization format for data that includes them. People who have to deal with 64-bit ints may not be able to use jq, or it may create some subtle bugs for them if they do. Since there's probably a decent overlap between people who use JSON but not Javascript, and people who would find jq useful, you might want to reconsider adding 64-bit int support.

If I find myself with some unexpected free time I might dig into your code and submit a pull request, but for now I'll content myself with leaving this comment and sticking with my hand-rolled Python.
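The subtle-bugs risk here is concrete: two IDs that differ only in their low bits can collapse to the same double. A sketch, assuming a double-based jq:

$ jq -n '10000000000000000001 == 10000000000000000002'   # two distinct 64-bit-range ids
true

Both literals round to the same IEEE 754 double (exactly 1e19), so any sharding or A/B decision keyed on the low-order bits silently goes wrong.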
tl;dr @stedolan I just discovered
Simply put, this isn't what I expect from a Unix filter. My expectation was that a filter only parses and validates the input, and otherwise passes content through as-is.
In all the above cases there's no need to try to put some semantics on the input content. Only syntax is involved (for validation).
For what it's worth, it is an integer, but that's not really important.

More generally speaking, the numbers that can be expressed in JSON / ECMA-404 all have the semantics of "(some integer) times (ten to the power of some integer)", because that's all the syntax allows. [The syntax allows for things like arbitrarily long digit strings; nothing in the grammar bounds numbers to any machine type.]

If the integer in "ten-to-the-power-of some integer" is greater than or equal to zero, then the overall number is what is called an integer, but in terms of arithmetic operations inside jq everything collapses to an IEEE 754 double anyway.

I'm happy to look into integrating calc's (or another library's) bignum implementation into jq.
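Today those decimal semantics collapse to a double the moment jq reads the number. Two quick probes, assuming a double-based jq:

$ jq -n '1e-1000000000'    # a tiny but perfectly legal JSON number underflows to zero
0
$ jq -n '100000000000000000000000000000001 == 1e32'   # the long digit string rounds to the nearest double
true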
jq does deal with more than rational numbers at the moment: there's a sqrt builtin.

Arbitrary-precision integers would be nice. Arbitrary-precision rationals would also be nice but can lead to unexpected performance (even a small number can take an arbitrary amount of memory). Still, they're probably a good idea.

I'm not sure I like preserving the exact string syntax of an input number. That would mean that we could have two numbers that compare equal but print differently.

Also, while we're being pedantic, the JSON syntax doesn't allow infinite digit strings, since the * in [0-9]* is the Kleene star :)
Oops, missed that. It deserves to be documented on http://stedolan.github.io/jq/manual/
Wasn't trying to be, sorry if I came across that way. :)
Good point. There's probably sense in letting the user choose which semantics they want to use ("as-is or IEEE754", "IEEE754 always", "as-is or arbitrary-precision", "arbitrary-precision always"). Well, time for me to stop talking and instead try and write some code.
I actually started using jq again (it's so useful!) and ended up adding a patch that leaves numbers as-is until some operation acts on them, at which point it goes back to doubles. It's on my forked copy: https://github.com/airfrog/jq

I agree that it's not a great solution since it has some frustrating edge cases. I might go back and do it right with arbitrary-precision integers, but for now this is working.

Has anyone found a bigint/bignum library that would work well for jq? I poked around a bit but I haven't really researched it, and ideally I don't want to add any dependencies to the library.
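For illustration, the behaviour described would look something like this (a hypothetical session: outputs assume the patch above plus shortest-round-trip double printing):

$ echo 5093397704957986680 | jq '.'      # untouched value passes through verbatim
5093397704957986680
$ echo 5093397704957986680 | jq '. + 0'  # arithmetic forces conversion to a double
5093397704957987000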
jq is really a great tool / filter! But I do agree with airfrog that it should not touch the representation of a number unless jq has to (e.g. because a computation was done).

My use case is processing huge logs having POSIX timestamps in microseconds. They can be represented by doubles, so there is no problem. But still (with current HEAD):

% echo 1384975447132984 | jq '.'
1.384975447132984e+15

So my integer suddenly is converted to floating point representation, and the UNIX filters down the pipe suddenly have to deal with float numbers too. So I think the right way would be not to change the representation of a number if the number is just piped through.
First of all, awesome tool, thank you. Also... wow, I just lost an evening trying to debug my JsonWriter class when it was jq that was truncating my 64-bit IDs (63 actually). If it had converted to a scientific notation I would have known, but not passing through a clean int64 makes this a much less useful pipeline tool :(

echo "5093397704957986680" | jq '.'
Concrete values that only make sense as a number, like Unix epoch nanotime, already exceed the resolution available in the implementations that use floats for large integers. This is genuinely annoying, as the Javascript implementations quietly truncate the values. (just adding)
I just tripped myself up using jq with nanosecond-scale Unix timestamps. I appreciate that there are some real headaches around using different number implementations according to the value. But there are also headaches around silently changing the values in the data. If support for numbers not exactly representable as 64-bit floats isn't on the cards, how about an error exit rather than approximation?
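Until something like that lands, a filter can enforce the policy itself. A minimal sketch of the idea (my own suggestion, not a jq feature); note the rounding has already happened at parse time, so this can only flag the unsafe range, not recover the lost digits:

$ echo 1510254914123456789 | jq 'if type == "number" and (. > 9007199254740991 or . < -9007199254740991) then error("number outside exact integer range") else . end'
jq: error (at <stdin>:1): number outside exact integer range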
Just got bitten by this today. The API we use at work associates each entity with a 64-bit integer ID. I started to question the output of jq when some of these IDs were showing up as equal. |
Please try this branch #1752 |
The issue was raised at the IETF JSON WG this week: does JSON permit arbitrary precision numbers? The syntax certainly does. Neither RFC4627 nor ECMAScript 6 (draft) preclude it, and many implementations use whatever the native language run-time, OS, and/or machine architecture provide -- often not limited to IEEE 754 64-bit real numbers.
jq currently only supports IEEE 754 64-bit numbers. It could probably do better: it could also support 64-bit signed integers using C's int64_t (perhaps with much care regarding overflows, or perhaps not). It could even use bignum libraries. It's all probably too much, but I'm filing (and closing) this issue just to record this.