Improve dump_integer performance #1411

Merged: 3 commits, Jan 13, 2019


nickaein
Contributor

@nickaein nickaein commented Jan 1, 2019

This PR improves the performance of dump_integer using a tuned int2ascii implementation. The technique is discussed in the Fastware talk by Andrei Alexandrescu and has already been incorporated into other well-known libraries such as fmtlib.

In this pull request:

  1. A fast int2ascii implementation is added (adapted from the fmtlib implementation).
  2. A benchmark for small integers is added. The previous integer benchmarks are biased toward very large integers. In the original talk, Andrei also argues that the performance gain over a naive implementation is larger (~2x) for smaller integers. I therefore believe it is worth covering small integers in a separate benchmark, since in most real-world use cases the integers are not that large. This lets us keep track of the performance for both small and large integer values.
  3. I couldn't find unit tests for integer formatting that check different numbers of digits. To bring the code coverage to 100%, I have added some unit tests too.
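
For readers unfamiliar with the technique, here is a minimal sketch of the two-digits-at-a-time idea. It is not the code from this PR or from fmtlib; the names `count_digits_sketch` and `int_to_ascii_sketch` are hypothetical, and the 100-entry table is built at first use only to keep the example short.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: number of decimal digits in x (at least 1).
inline unsigned count_digits_sketch(std::uint64_t x) noexcept
{
    unsigned n = 1;
    while (x >= 10)
    {
        x /= 10;
        ++n;
    }
    return n;
}

// Hypothetical sketch: convert an integer to decimal text, emitting two
// digits per division by looking them up in a table of "00".."99".
inline std::string int_to_ascii_sketch(std::int64_t value)
{
    // table[2*i] and table[2*i + 1] hold the tens and ones digit of i.
    static const std::array<char, 200> table = []
    {
        std::array<char, 200> t{};
        for (int i = 0; i < 100; ++i)
        {
            t[2 * i] = static_cast<char>('0' + i / 10);
            t[2 * i + 1] = static_cast<char>('0' + i % 10);
        }
        return t;
    }();

    const bool negative = value < 0;
    // Negate in unsigned arithmetic so the most negative value does not overflow.
    std::uint64_t abs_value = negative
        ? 0 - static_cast<std::uint64_t>(value)
        : static_cast<std::uint64_t>(value);

    const unsigned n = count_digits_sketch(abs_value);
    // Pre-fill with '-' so the sign is already in place at index 0.
    std::string out(n + (negative ? 1u : 0u), '-');
    std::size_t pos = out.size();

    // Write from the end of the buffer backwards, two digits per iteration.
    while (abs_value >= 100)
    {
        const unsigned idx = static_cast<unsigned>(abs_value % 100) * 2;
        abs_value /= 100;
        out[--pos] = table[idx + 1];
        out[--pos] = table[idx];
    }
    if (abs_value >= 10)
    {
        const unsigned idx = static_cast<unsigned>(abs_value) * 2;
        out[--pos] = table[idx + 1];
        out[--pos] = table[idx];
    }
    else
    {
        out[--pos] = static_cast<char>('0' + abs_value);
    }
    return out;
}
```

Halving the number of divisions is where the gain comes from. Knowing the digit count up front lets the digits be written straight into their final positions, with no reversal step.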

The benchmarks show a considerable performance improvement (~1.5x) on my Linux VM. This could be different (and probably higher) in a non-VM environment. Here is the original performance of the Dump* operations (Ubuntu 18.04 x64, Clang 6):

Run on (4 X 2601 MHz CPU s)
CPU Caches:
  L1 Data 32K (x1)
  L1 Instruction 32K (x1)
  L2 Unified 256K (x1)
  L3 Unified 6144K (x1)
Dump/jeopardy / -           310113722 ns  308222627 ns         45   162.468MB/s
Dump/jeopardy / 4           367662837 ns  367510814 ns         38   181.292MB/s
Dump/canada / -              13027324 ns   13027200 ns       1096   153.024MB/s
Dump/canada / 4              16393044 ns   16328034 ns        858   473.766MB/s
Dump/citm_catalog / -         1956277 ns    1951371 ns       7267   244.506MB/s
Dump/citm_catalog / 4         2648083 ns    2647870 ns       5007   622.081MB/s
Dump/twitter / -              1912959 ns    1912949 ns       7319    232.77MB/s
Dump/twitter / 4              2181181 ns    2171835 ns       6523   336.927MB/s
Dump/floats / -             107072183 ns  106987779 ns        128    166.43MB/s
Dump/floats / 4             117135279 ns  117133075 ns        119   192.724MB/s
Dump/signed_ints / -         72700522 ns   72323217 ns        192   268.724MB/s
Dump/signed_ints / 4         80542202 ns   80504633 ns        173   300.645MB/s
Dump/unsigned_ints / -       61854487 ns   61853677 ns        230   314.503MB/s
Dump/unsigned_ints / 4       69747684 ns   69491170 ns        202   348.555MB/s
Dump/small_signed_ints / -   31521619 ns   31516332 ns        438   223.579MB/s
Dump/small_signed_ints / 4   39854903 ns   39854301 ns        349   296.449MB/s

And this is the performance with the fast int2ascii method (note the last six benchmarks):

Dump/jeopardy / -           286181485 ns  286013895 ns         48   175.083MB/s
Dump/jeopardy / 4           349974284 ns  349965305 ns         40   190.381MB/s
Dump/canada / -              13124489 ns   13047201 ns       1069   152.789MB/s
Dump/canada / 4              17334759 ns   17313192 ns        876   446.808MB/s
Dump/citm_catalog / -         1871325 ns    1871283 ns       7435   254.971MB/s
Dump/citm_catalog / 4         2462448 ns    2462047 ns       5452   669.033MB/s
Dump/twitter / -              1825330 ns    1817293 ns       7690   245.022MB/s
Dump/twitter / 4              2058264 ns    2057411 ns       6820   355.666MB/s
Dump/floats / -             107943518 ns  107941378 ns        129    164.96MB/s
Dump/floats / 4             118865187 ns  118097256 ns        117   191.151MB/s
Dump/signed_ints / -         42482922 ns   42401566 ns        329   458.355MB/s
Dump/signed_ints / 4         49180866 ns   49179644 ns        285   492.142MB/s
Dump/unsigned_ints / -       41778353 ns   41575626 ns        333   467.898MB/s
Dump/unsigned_ints / 4       48426047 ns   48365002 ns        288   500.807MB/s
Dump/small_signed_ints / -   25048480 ns   25048284 ns        554   281.313MB/s
Dump/small_signed_ints / 4   34388632 ns   34332533 ns        401   344.128MB/s
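
The per-benchmark speedup can be read off the two tables by dividing the "before" time by the "after" time. The sketch below does that for the six integer benchmarks. It assumes the first numeric column is wall-clock time in ns, as in Google Benchmark's usual layout; the header row is not shown above. `BenchRow`, `kIntegerRows`, `speedup` and `print_speedups` are hypothetical names.

```cpp
#include <cstdio>

// One benchmark row: time before and after the change, in nanoseconds.
struct BenchRow
{
    const char* name;
    double before_ns;
    double after_ns;
};

// Values copied from the first numeric column of the two tables above.
static const BenchRow kIntegerRows[] = {
    {"signed_ints / -", 72700522.0, 42482922.0},
    {"signed_ints / 4", 80542202.0, 49180866.0},
    {"unsigned_ints / -", 61854487.0, 41778353.0},
    {"unsigned_ints / 4", 69747684.0, 48426047.0},
    {"small_signed_ints / -", 31521619.0, 25048480.0},
    {"small_signed_ints / 4", 39854903.0, 34388632.0},
};

// Speedup factor: values greater than 1 mean the new code is faster.
inline double speedup(const BenchRow& row)
{
    return row.before_ns / row.after_ns;
}

// Print one line per benchmark with its speedup factor.
inline void print_speedups()
{
    for (const BenchRow& row : kIntegerRows)
    {
        std::printf("%-24s %.2fx\n", row.name, speedup(row));
    }
}
```

On these numbers the ratios run from roughly 1.2x for small_signed_ints to about 1.7x for signed_ints.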

More detailed benchmark results (including gcc and msvc) are available here: https://docs.google.com/spreadsheets/d/1sGE3UvLeLT_KrnC_Ujvtl_NFXPIp28xFOe9u9-6GSU0/edit?usp=sharing

This is my first contribution so any feedback is welcome.


Pull request checklist

Read the Contribution Guidelines for detailed information.

  • Changes are described in the pull request, or an existing issue is referenced.
  • The test suite compiles and runs without error.
  • Code coverage is 100%. Test cases can be added by editing the test suite.
  • The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.

Please don't

  • The C++11 support varies between different compilers and versions. Please note the list of supported compilers. Some compilers like GCC 4.7 (and earlier), Clang 3.3 (and earlier), or Microsoft Visual Studio 13.0 and earlier are known not to work due to missing or incomplete C++11 support. Please refrain from proposing changes that work around these compiler's limitations with #ifdefs or other means.
  • Specifically, I am aware of compilation problems with Microsoft Visual Studio (there even is an issue label for these kind of bugs). I understand that even in 2016, complete C++11 support isn't there yet. But please also understand that I do not want to drop features or uglify the code just to make Microsoft's sub-standard compiler happy. The past has shown that there are ways to express the functionality such that the code compiles with the most recent MSVC - unfortunately, this is not the main objective of the project.
  • Please refrain from proposing changes that would break JSON conformance. If you propose a conformant extension of JSON to be supported by the library, please motivate this extension.
  • Please do not open pull requests that address multiple issues.

@coveralls

coveralls commented Jan 2, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling c9dd260 on nickaein:develop into b39f34e on nlohmann:develop.

@nickaein
Contributor Author

nickaein commented Jan 2, 2019

AppVeyor is failing with the following message:

Cannot assign job to clouds/groups 'pro, pro-backup'. All clouds within the group are either failing, or offline, or busy.

Apparently they are experiencing an outage right now. Is there any way to re-run the AppVeyor CI for this PR after they resolve the issue?

The latest status is reported here: https://status.appveyor.com/

@nlohmann
Owner

nlohmann commented Jan 2, 2019

Will do. Thanks for the PR! I am looking forward to having this performance improvement in the library!

@jaredgrubb
Contributor

This patch looks pretty good and seems reasonable.

My only concern is whether this patch would have any licensing implications. Was this patch developed from scratch or copied from somewhere? If it is copied from the video, was there anything in the video about the license that the code is released under? Just because it's on youtube doesn't necessarily make it public domain.

@nickaein
Contributor Author

nickaein commented Jan 3, 2019

The YouTube video was the motivation for incorporating this technique, but I didn't use the implementation from the video. As I mentioned, this patch is adapted from the implementation in fmtlib. It is licensed under BSD, and I am not sure about the extent of the implications. If there are any, I can rewrite the implementation from scratch.

@nickaein
Contributor Author

nickaein commented Jan 3, 2019

I have rewritten the implementation with some optimizations along the way. The new implementation benchmarks as follows with the clang compiler:

Dump/signed_ints / -         39328352 ns   39117135 ns        365   496.841MB/s
Dump/signed_ints / 4         46257733 ns   45992578 ns        299   526.245MB/s
Dump/unsigned_ints / -       36924171 ns   36911475 ns        385   527.022MB/s
Dump/unsigned_ints / 4       46389870 ns   45955745 ns        304   527.062MB/s
Dump/small_signed_ints / -   23804581 ns   23831036 ns        585   295.682MB/s
Dump/small_signed_ints / 4   34276080 ns   34128556 ns        414   346.184MB/s

Compared to the implementation adapted from fmtlib, there is a ~5% performance improvement on average across the benchmarks.

I've also tried this implementation with a larger lookup table of 1000 precomputed numbers (zero to 999) instead of 100 numbers (zero to 99), and it led to a speedup of ~9% compared to the fmtlib implementation.

Though I haven't run the benchmarks on other compilers, I hope there will at least be no performance drop on them.

We have three options to proceed:

  1. Use the current implementation in the PR as it is. If there is a copyright concern about adapting code from fmtlib, which has a BSD license, this option is off the table.
  2. Use the rewritten implementation, which is similar to the fmtlib implementation but gives a ~5% performance boost on clang. I still have to test this implementation more to make sure it doesn't drop performance on other compilers and that it passes CI.
  3. Use the rewritten implementation with the bigger (1000 numbers) lookup table, which gives a ~9% performance boost compared to fmtlib on clang. The tests required for option 2 must be done in this case too. More importantly, we also have to consider whether the larger lookup table (which leads to a ~1KB increase in the object binary) is worth its performance boost.

@nlohmann
Owner

nlohmann commented Jan 3, 2019

As for the copyright issue, I would like to watch the video myself. I hope to find time for this on the weekend.

@nlohmann
Owner

nlohmann commented Jan 3, 2019

(Any idea why the AppVeyor tests fail?)

@nickaein
Contributor Author

nickaein commented Jan 3, 2019

@nlohmann Thanks. I will wait for your response on how to proceed.

As for AppVeyor, I mentioned earlier that their servers were down when I submitted the PR, hence the failure. More details here: https://status.appveyor.com/incidents/39tq9r92flzh

Apparently there is no facility that automatically re-runs the CI. I wonder if it can be done manually? As a workaround, I can also submit a "noop" commit to trigger the CI.

@nlohmann
Owner

nlohmann commented Jan 3, 2019

In fact I restarted the jobs and now there are actual failures, see https://ci.appveyor.com/project/nlohmann/json/builds/21351818/job/4y0907w9i0ri4dxv.

@nlohmann
Owner

nlohmann commented Jan 3, 2019

(And BSD and MIT should be no problem IIRC)

@nickaein
Contributor Author

nickaein commented Jan 3, 2019

> In fact I restarted the jobs and now there are actual failures, see https://ci.appveyor.com/project/nlohmann/json/builds/21351818/job/4y0907w9i0ri4dxv.

Sorry, I missed that. It seems there is an overflow issue with the unit tests I have added. For instance:

test/src/unit-to_chars.cpp:502:53: warning: overflow in implicit constant conversion [-Woverflow]
         check_integer(-10000000000LL, "-10000000000");

causes its corresponding test to fail:

test/src/unit-to_chars.cpp:480: FAILED:
    CHECK( j.dump() == expected )
with expansion:
    "-1410065408" == "-10000000000"

It is failing on the MinGW-W64 7.3.0 compiler. It might be due to x86 vs. x64 or an invalid use of the LL literal. I will look into that.
It's weird that the Travis CI tests are passing, though. Maybe Travis doesn't run the tests on such a compiler?
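
The failing value is consistent with the literal being narrowed to a 32-bit integer somewhere along the way. That is one plausible reading and the thread does not confirm the exact cause. The sketch below keeps only the low 32 bits of -10000000000 and reinterprets them as a signed value; `narrow_to_32_bits` is a hypothetical helper written for illustration.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: keep only the low 32 bits of a 64-bit value and
// interpret them as a two's-complement signed 32-bit integer.
inline std::int32_t narrow_to_32_bits(std::int64_t v) noexcept
{
    // Conversion to an unsigned type is well-defined: the result is v modulo 2^32.
    const std::uint32_t low = static_cast<std::uint32_t>(v);

    // Map the unsigned bit pattern back to a signed value without relying on
    // implementation-defined narrowing conversions.
    if (low <= 0x7FFFFFFFu)
    {
        return static_cast<std::int32_t>(low);
    }
    return static_cast<std::int32_t>(low - 0x80000000u)
           + std::numeric_limits<std::int32_t>::min();
}
```

With this helper, `narrow_to_32_bits(-10000000000LL)` evaluates to -1410065408, which is the value in the failed CHECK above.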

@nlohmann
Owner

nlohmann commented Jan 6, 2019

I have now watched the video, and I think there is no copyright issue. The video presents ideas and exemplifies them with code snippets. However, I did not find a source saying that it is OK to re-license BSD-licensed code under an MIT license.

@nickaein
Contributor Author

nickaein commented Jan 6, 2019

I didn't find any sources either. In cases where two open-source projects have different (and perhaps incompatible) licenses, the safest approach seems to be incorporating the source license into the destination project. However, that's too much of a burden for us in this case.

As mentioned earlier, I've already re-implemented the technique from scratch, and the result actually has a slight performance improvement.

Is it fine to add this implementation as a commit on top of the previous commits in this pull request? Or is it better to open a new pull request that drops from the history the commit that partly contains code from fmtlib?

@theodelrieu
Contributor

theodelrieu commented Jan 6, 2019

You can just rebase and fixup your commits, no need to open a new PR

@nickaein
Contributor Author

nickaein commented Jan 6, 2019

I've pushed the new changes, which are in this commit: 7a6102f. The other commits are untouched and the same as before.

@justinasvd

justinasvd commented Jan 10, 2019

I like the algorithmic improvement, which is pretty much in line with Andrei Alexandrescu's work at Facebook. It's always good to squeeze a couple of instructions out of simple operations.

My concern, however, is not about the quality of the submission, but the structure of code. Notably, the duplication of code, count_digits being a notable example. Why is that so? Why the function in serializer is not reused?

@nickaein
Contributor Author

> Why is that so? Why the function in serializer is not reused?

The only reason could be my limited knowledge of the codebase. Can you point out where that implementation is? I couldn't find such a method in serializer.hpp other than mine. Also, since these methods are tuned, we might get inferior performance from an equivalent but slower implementation.

BTW thank you for the feedback! Please let me know if you have other suggestions.

@nlohmann
Owner

@justinasvd Could you please provide a link to the duplicated code? I can't find it myself, but it's been a while since I worked on the serializer.

@nlohmann
Owner

I just ran the benchmarks myself and can confirm a significant performance improvement. Well done!

For the licensing part: We must not take code with a non-MIT license. This would make this library a time-bomb for future use. As long as the code was only motivated by the YouTube video, I see no problems. But anything from fmtlib should go out.

@nickaein Can you realize this?

@nickaein
Contributor Author

You're welcome! Thank you for this great library! I'm aware that speed is not the top design goal, so there may be ways to improve performance without affecting the public API or having a major impact on the library design.

As for licensing, I understand the significance of a possible license violation. The initial implementation was partially adapted from fmtlib (e.g. the static const digits string). This choice was made by chance and for convenience, not because fmtlib is the only or first implementer of this technique or because theirs has any particular significance.

Nevertheless, I have re-written the implementation. The current code isn't derived from fmtlib in any way (neither in its design nor in implementation details).

…2ascii

This commit implements a faster int2ascii inspired by the "Fastware" talk given
by Andrei Alexandrescu.
See: https://www.youtube.com/watch?v=o4-CwDo2zpg

This benchmark is a sample of 1 million "small" integers
in the range [-1000000, 1000000) sampled from a uniform distribution.

Add some unit tests for formatting integers
to keep code coverage as before.
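
As an illustration of the benchmark input described above, the following sketch draws integers uniformly from [-1000000, 1000000). It is not the generator used for the benchmark data in this PR; `make_small_ints_sketch`, its parameters, and the choice of `std::mt19937` are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical generator: `count` integers sampled uniformly from the
// half-open range [-1000000, 1000000), reproducible for a given seed.
inline std::vector<std::int64_t> make_small_ints_sketch(std::size_t count,
                                                        std::uint32_t seed)
{
    std::mt19937 gen(seed);
    // uniform_int_distribution uses a closed range, so the upper bound
    // is 999999 to exclude 1000000.
    std::uniform_int_distribution<std::int64_t> dist(-1000000, 999999);

    std::vector<std::int64_t> values;
    values.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        values.push_back(dist(gen));
    }
    return values;
}
```

Calling `make_small_ints_sketch(1000000, 42)` would produce a sample of the size mentioned in the commit message. Values in this range have at most seven digits, which is the case the small-integer benchmark is meant to cover.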
@jaredgrubb
Contributor

Looks good to me!

@nlohmann nlohmann self-assigned this Jan 13, 2019
@nlohmann nlohmann added this to the Release 3.5.1 milestone Jan 13, 2019
@nlohmann nlohmann merged commit daeb48b into nlohmann:develop Jan 13, 2019
@nlohmann
Owner

Thanks a lot!
