-
Notifications
You must be signed in to change notification settings - Fork 473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perf improvements for floating point math #852
Conversation
- Convert all the remaining "long double" literals to "double". - Define new literals for some inverse values, and use them to change divide operations into multiply operations, since that is generally faster for most CPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very surprised that modern C compilers aren't making these optimizations by default with the performance impact you mentioned, but very excited at improving the performance of key functions in H3. :)
I've ported the Edit: cannot repro with the benchmark of this repo either. Must be HW dependent then. |
I wasn't able to reproduce quite the reported performance improvements on Linux x64 w/ GCC, but I'm happy to retest on ARM later. edit: I see performance improving by more around 10~15% Before
After
|
@isaacbrodsky your benchmark does show an improvement on |
Sorry, I was imprecise. I did see performance improvements in many benchmarks, but more on the order of 10~15% rather than the 30% reported. |
The benefit is definitely microarchitecture specific based on how the FPU is implemented, and latency and throughput of individual operations. Also, most CPUs implement "early-out" divides, so if the computation is like {N/1, 0/N, N/N, N<<2, etc} then it doesn't incur the full latency (e.g. if unit tests have zero dividend there will be no perf benefit) . I just ran "make benchmarks" and pulled a few which looked significant:
That's {1.4x, 2.7x, 1.5x}, as measured on my Ampere AltraMax. The 1.3x I cited was from our SPEC CPU input. Thanks for considering this PR. |
I get similar or even better (40% on cellToLatLng) performance improvements when I test on Linux ARM: Before
After
|
@heshpdx Thanks for improving the performance here! |
@@ -66,7 +66,7 @@ void _hex2dToCoordIJK(const Vec2d *v, CoordIJK *h) { | |||
a2 = fabsl(v->y); | |||
|
|||
// first do a reverse conversion | |||
x2 = a2 / M_SIN60; | |||
x2 = a2 * M_RSIN60; | |||
x1 = a1 + x2 / 2.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just spotted this. I'm not sure if it matters since powers of two are quick anyway, but I figured I would document that we could change it to x1 = a1 + x2 * 0.5;
Or since this is the only usage of M_RSIN60
, just craft a M_RSIN60_DIV_BY_2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that the same assembly is produced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the same assembly is produced it sounds like it's fine to leave as-is because compiler optimizations take care of it for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, any compiler at -O1 or higher opt figures it out.
* Further performance improvements for FP math More FDIV->FMUL opportunities unlocked, following in the spirit of #852 * Formatting fix * Update src/h3lib/lib/localij.c Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com> * Add #905 to CHANGELOG.md * Save one fdiv and maybe a cosine --------- Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com>
This completes the work from #790, where we started the removal of "long double" types.
Additionally, there is a easy performance improvement opportunity through changing some FDIV's into FMUL's. In modern CPUs, divides usually takes 3 to 4 times as long to complete compared to multiply, so we can convert the high impact divide operations by defining literals where the inverse is pre-computed. Removing divides from loops has a big impact. I measured a 30% speedup in
cellToLatLng
andcellToBoundary
on my machine. Please see what you can achieve on yours. Thank you!