
Parse JSON number to double in full-precision. #137

Merged (40 commits into master, Nov 30, 2014)

Conversation

@miloyip (Collaborator) commented Sep 6, 2014

Use kParseFullPrecision to turn on this option at compile time. The implementation of the new option should have no performance impact when the flag is not used.
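For illustration, a minimal usage sketch is shown below; it assumes the option is exposed as a compile-time parse flag in the usual RapidJSON way (in released RapidJSON the flag is spelled kParseFullPrecisionFlag):

#include "rapidjson/document.h"
#include <cstdio>

int main() {
    const char json[] = "[0.35]";

    rapidjson::Document d;
    d.Parse<rapidjson::kParseFullPrecisionFlag>(json);  // opt in at compile time

    const rapidjson::Value& first = *d.Begin();          // the only array element
    std::printf("%.17g\n", first.GetDouble());           // nearest double to 0.35
    return 0;
}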

Implementation details

The full-precision path tries the fast path if possible. If the fast-path criteria cannot be met, it falls back to strtod() to convert the string.
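As a rough illustration of the kind of fast path meant here (a sketch, not the PR's code): when the decimal significand fits exactly in a double and the power of ten is itself exactly representable, a single multiplication or division gives the correctly rounded result and no strtod() call is needed.

#include <cstdint>
#include <cstdio>

// Powers of ten up to 1e22 are exactly representable as doubles.
static const double kPow10[] = {
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
};

// Returns true and writes the exact result if the fast path applies:
// the significand must fit in 53 bits and |exp10| must not exceed 22.
static bool FastPathStrtod(uint64_t significand, int exp10, double* out) {
    if (significand > (1ull << 53)) return false;
    if (exp10 < -22 || exp10 > 22)  return false;
    double d = static_cast<double>(significand);       // exact conversion
    *out = exp10 >= 0 ? d * kPow10[exp10] : d / kPow10[-exp10];
    return true;                                        // only one rounding occurred
}

int main() {
    double d;
    if (FastPathStrtod(43420273, -6, &d))               // i.e. "43.420273"
        std::printf("%.17g\n", d);
    return 0;
}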

Note that the parser still verifies the JSON number syntax, as in the normal-precision path.

To fulfill the above requirement, the parser needs to back up the characters it has already parsed from the stream (since some streams cannot be read backwards). A helper class template, GenericReader::NumberStream, is designed for this. When full precision is enabled, backup is required, and a specialized NumberStream copies each character consumed by NumberStream::Take() into GenericReader::stack_. That stack_ was previously used only for storing the decoded characters in ParseString().
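A simplified sketch of that idea follows (the class name and the std::vector buffer are stand-ins; the real GenericReader::NumberStream is an internal class template that writes into the reader's stack_):

#include <vector>

// Wraps an input stream and records every character consumed, so the verified
// digits can later be handed to strtod() in the full-precision path.
template <typename InputStream>
class BackingNumberStream {
public:
    explicit BackingNumberStream(InputStream& is) : is_(is) {}

    typename InputStream::Ch Peek() const { return is_.Peek(); }

    typename InputStream::Ch Take() {
        typename InputStream::Ch c = is_.Take();
        buffer_.push_back(c);            // back up the character (stands in for stack_)
        return c;
    }

    // Null-terminated copy of everything consumed so far.
    const typename InputStream::Ch* Pop() {
        buffer_.push_back('\0');
        return &buffer_[0];
    }

private:
    InputStream& is_;
    std::vector<typename InputStream::Ch> buffer_;
};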

Unit test

Added random numbers to test more cases for integer types and double.

The experimental results show that full-precision parsing produces the exact representation (no error), while normal-precision parsing has a maximum error of 3 ULP.

Denormal numbers are not tested, as their handling varies among platforms. Implementations of strtod() in standard libraries may also simply flush denormals to zero.
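For reference, the ULP error figures above can be measured by reinterpreting both doubles as ordered 64-bit integers and taking the difference; a sketch (not the PR's test code):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Map a double's bit pattern to an integer that increases monotonically with
// the double's value, so that subtracting two mapped values counts ULPs.
static uint64_t ToOrderedBits(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return (u & 0x8000000000000000ULL) ? ~u                          // negative values
                                       : u | 0x8000000000000000ULL;  // non-negative values
}

static uint64_t UlpDistance(double a, double b) {
    const uint64_t ua = ToOrderedBits(a), ub = ToOrderedBits(b);
    return ua > ub ? ua - ub : ub - ua;
}

int main() {
    const double exact  = 0.1;                        // correctly rounded parse of "0.1"
    const double offby1 = std::nextafter(exact, 1.0); // exactly 1 ULP above
    std::printf("%llu ULP apart\n",
                static_cast<unsigned long long>(UlpDistance(exact, offby1)));
    return 0;
}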

Fix #120

@Kosta-Github (Contributor)

Feedback: I was also running into the precision issues lately (I need to ensure roundtrip behavior for double values).

Your PR seems to fix the precision issues but has a severe performance impact: reading of my test datasets (~2800 files, >500MB) went up from 3 seconds (default precision) to >13 seconds (full precision)... :-(

@miloyip (Collaborator, Author) commented Sep 9, 2014

May I know which platform/compiler?
I will try to see if it is possible to make a small, fast custom strtod().

@Kosta-Github (Contributor)

Windows 7; Visual Studio 2012 with the latest service pack; 64-bit mode

@pah (Contributor) commented Sep 9, 2014

I have confirmed the 3x performance loss on Linux (32-bit, Clang 3.6) by running the nativejson-benchmark against the issue120floatprecision branch. Fortunately, there is no significant difference for the JSON files that do not contain double values, but canada.json is 3.6x slower.

How do other parsers handle the precision corner cases? Most parsers I've seen so far go for a simple digit-based approach and should suffer from the same problems, right?

@miloyip (Collaborator, Author) commented Sep 9, 2014

I suspect that canada.json had some issues when it was generated.

// ...
"geometry": {"type":"Polygon","coordinates":[[[-65.613616999999977,43.420273000000009],
// ...

For example, I think the first two numbers should actually have been output as -65.613617 and 43.420273. If so, the fast path could handle them (currently these numbers have more than 15 digits).

I suggest replacing canada.json with some JSON files containing "more normal" numbers.

To @pah: I have not investigated the details of other parsers. Some of them seem to use even less precise conversion (well, actually RapidJSON previously did not use a proper fast path either). Some of them use strtod() or a similar C library function to do the conversion. I have thought of testing each library for conformance and precision; it would just need nativejson-benchmark's current Parse() and Stringify() interfaces and some JSON files. But I think I will do that in the longer term.

To @Kosta-Github: I am investigating strtod() in double-conversion and some related information. Implementing a custom strtod() can save some time by avoiding duplicated work, but I am not sure how much can be saved, and doing so will increase the code size.

@Kosta-Github (Contributor)

@miloyip have a look here regarding the number of digits required for round-trip behavior: http://en.cppreference.com/w/cpp/types/numeric_limits/max_digits10. For double values it is 17...
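A small demonstration of that point (not from this PR): 15 significant digits are not always enough to round-trip a double through text, while max_digits10 (17 for double) is.

#include <cstdio>
#include <cstdlib>

int main() {
    const double x = 0.1 + 0.2;                 // 0.30000000000000004...
    char buf[64];

    std::snprintf(buf, sizeof buf, "%.15g", x); // 15 digits -> "0.3"
    std::printf("15 digits: %s (round-trips: %s)\n",
                buf, std::strtod(buf, 0) == x ? "yes" : "no");

    std::snprintf(buf, sizeof buf, "%.17g", x); // 17 digits == max_digits10
    std::printf("17 digits: %s (round-trips: %s)\n",
                buf, std::strtod(buf, 0) == x ? "yes" : "no");
    return 0;
}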

@miloyip (Collaborator, Author) commented Sep 14, 2014

Intermediate Results

After working on it for a few days, I have implemented a custom strtod() that correctly parses all doubles, including subnormals (strtod() in the VC CRT does not handle subnormals properly), in https://github.com/miloyip/rapidjson/tree/issue120floatprecision_customstrtod. Currently it only works on VC2013 x64 and has not yet been optimized.

The following results are generated by nativejson-benchmark.

Normal Precision

          Parse canada.json          ...  9.664 ms  222.132 MB/s
          Parse citm_catalog.json    ...  5.317 ms  318.868 MB/s
          Parse twitter.json         ...  3.030 ms  203.652 MB/s

Full Precision

          Parse canada.json          ... 90.939 ms  23.607 MB/s
          Parse citm_catalog.json    ...  5.522 ms  307.010 MB/s
          Parse twitter.json         ...  3.087 ms  199.886 MB/s

Full Precision, custom strtod

          Parse canada.json          ... 38.407 ms  55.895 MB/s
          Parse citm_catalog.json    ...  5.545 ms  305.737 MB/s
          Parse twitter.json         ...  3.032 ms  203.498 MB/s

There is a performance improvement compared with the CRT's strtod(), but it is still about 4x slower than normal precision. I will first make the code compatible with gcc/clang, and then look for optimization opportunities.

@miloyip (Collaborator, Author) commented Nov 23, 2014

This is the latest result. I added another method (DiyFp) to try to parse the number; if it cannot handle the number correctly, it falls back to the BigInteger method. So in full-precision mode, the parser tries FastPath -> DiyFp -> BigInteger. This should give better performance on average, at the cost of some extra code size.
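A control-flow sketch of that fallback chain is shown below; the stage bodies are stubs for illustration only, not RapidJSON's actual implementation.

#include <cstdlib>
#include <string>

// Illustrative stubs only: each stage either produces a result it can prove is
// correctly rounded, or reports failure so the next, more expensive stage runs.
static bool TryFastPath(const std::string&, double*) { return false; }
static bool TryDiyFp(const std::string&, double*)    { return false; }
static double StrtodBigInteger(const std::string& s) {
    return std::strtod(s.c_str(), 0);   // stand-in for the exact big-integer path
}

// Full-precision conversion order: FastPath -> DiyFp -> BigInteger.
double FullPrecisionStrtod(const std::string& number) {
    double d;
    if (TryFastPath(number, &d)) return d;
    if (TryDiyFp(number, &d))    return d;
    return StrtodBigInteger(number);
}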

Normal precision
          Parse canada.json          ... 10.726 ms  200.145 MB/s
          Parse citm_catalog.json    ...  5.137 ms  330.021 MB/s
          Parse twitter.json         ...  3.013 ms  204.792 MB/s

Full Precision, custom strtod with DiyFp and BigInteger
          Parse canada.json          ... 23.561 ms  91.118 MB/s
          Parse citm_catalog.json    ...  5.626 ms  301.323 MB/s
          Parse twitter.json         ...  3.119 ms  197.848 MB/s

I hope to resolve this #120 "bug" and continue to work on a 1.0 RC.

@pah (Contributor) commented Nov 28, 2014

The approach in issue120floatprecision_customstrtod and its performance look good to me. 👍

Can you update the issue120floatprecision branch in this pull request to this improved implementation? I think only the selection of the default (see this comment) is not done yet, right?

miloyip added a commit that referenced this pull request Nov 30, 2014
Parse JSON number to double in full-precision with custom strtod. Fix #120
@miloyip miloyip merged commit 454146b into master Nov 30, 2014
@miloyip miloyip deleted the issue120floatprecision branch December 8, 2014 02:34
Successfully merging this pull request may close these issues.

Float point reading is lossy.