Potential floating point optimizations #592

Merged
merged 4 commits into from
May 28, 2022

Conversation

axic
Member

@axic axic commented Oct 9, 2020

No description provided.


inline bool signbit(float value) noexcept
{
return (bit_cast<uint32_t>(value) & F32SignMask) != 0;
Member Author

This generates two very simple instructions with GCC, but Clang detects the pattern and replaces it with an SSE instruction.
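For readers without the surrounding code, here is a minimal sketch of the bit-level sign test under review. The mask value and the memcpy-based `to_bits` helper are assumptions standing in for the project's `F32SignMask` and `bit_cast`; the snippet is illustrative, not the merged implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch
{
// Assumed value of F32SignMask: the top bit of an IEEE 754 binary32.
constexpr std::uint32_t F32SignMask = 0x80000000u;

// Hypothetical stand-in for the project's bit_cast<uint32_t>(float).
inline std::uint32_t to_bits(float value) noexcept
{
    static_assert(sizeof(std::uint32_t) == sizeof(float), "float must be 32 bits wide");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Same shape as the reviewed function: test the sign bit directly.
inline bool signbit(float value) noexcept
{
    return (to_bits(value) & F32SignMask) != 0;
}

// Small usage example.
inline void demo()
{
    assert(signbit(-0.0f));   // negative zero carries the sign bit
    assert(!signbit(0.0f));   // positive zero does not
}
}  // namespace sketch
```

Whether a compiler turns this into integer instructions or a single vector instruction depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming.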

@@ -356,7 +388,7 @@ inline T ffloor(T value) noexcept
// the __builtin_floor() outputs -0 where it should +0.
// The following workarounds the issue by using the fact that the sign of
// the output must always match the sign of the input value.
-    return std::copysign(result, value);
+    return fcopysign(result, value);
Member Author

This change, like the other cases, may turn out faster or slower depending on whether the rest of the function already uses SSE registers.
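A rough sketch of what an integer-mask `fcopysign` and its use in `ffloor` could look like. The mask values, the `F32AbsMask` name and the `to_bits`/`from_bits` helpers are assumptions for illustration; the actual code in the PR may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

namespace sketch
{
constexpr std::uint32_t F32SignMask = 0x80000000u;  // assumed: sign bit of binary32
constexpr std::uint32_t F32AbsMask = 0x7fffffffu;   // hypothetical: everything but the sign

// Hypothetical stand-ins for the project's bit_cast helper.
inline std::uint32_t to_bits(float value) noexcept
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

inline float from_bits(std::uint32_t bits) noexcept
{
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Magnitude of a combined with the sign of b, using integer operations only.
inline float fcopysign(float a, float b) noexcept
{
    return from_bits((to_bits(a) & F32AbsMask) | (to_bits(b) & F32SignMask));
}

// The sign of floor(x) always matches the sign of x, so copying it is safe
// and repairs a zero result that came out with the wrong sign.
inline float ffloor(float value) noexcept
{
    const float result = std::floor(value);
    return fcopysign(result, value);
}

// Small usage example.
inline void demo()
{
    assert(fcopysign(1.5f, -2.0f) == -1.5f);
    assert(std::signbit(ffloor(-0.0f)));
}
}  // namespace sketch
```

The integer version moves values between floating-point and general-purpose registers, which is one plausible reason the effect on speed depends on the surrounding code.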

@@ -408,7 +440,7 @@ inline T fmin(T a, T b) noexcept
if (std::isnan(a) || std::isnan(b))
return std::numeric_limits<T>::quiet_NaN(); // Positive canonical NaN.

if (a == 0 && b == 0 && (std::signbit(a) == 1 || std::signbit(b) == 1))
Member Author

@chfast std::signbit actually returns a bool, why did you use integer literals here and below?

Collaborator

I don't remember. Probably because "signbit is 1" reads more clearly, while the meaning of a "true" bit is confusing.
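To make the question concrete: `std::signbit` returns `bool`, so `std::signbit(a) == 1` compiles only because the `bool` is promoted to `int`, and a plain boolean test is equivalent. The sketch below uses the boolean form. Only the first two conditions appear in the diff; the two return statements after them are assumed continuations based on WebAssembly's `min` semantics.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace sketch
{
template <typename T>
inline T fmin(T a, T b) noexcept
{
    if (std::isnan(a) || std::isnan(b))
        return std::numeric_limits<T>::quiet_NaN();  // Positive canonical NaN.

    // std::signbit already yields bool, so no comparison against 1 is needed.
    if (a == 0 && b == 0 && (std::signbit(a) || std::signbit(b)))
        return -T{0};  // Assumed continuation: a negative zero wins over a positive one.

    return b < a ? b : a;  // Assumed continuation for the ordinary case.
}

// Small usage example.
inline void demo()
{
    assert(std::signbit(sketch::fmin(0.0, -0.0)));
    assert(sketch::fmin(1.0, 2.0) == 1.0);
}
}  // namespace sketch
```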

@axic axic added the optimization Performance optimization label Oct 9, 2020
@axic axic force-pushed the fp-optim branch 2 times, most recently from e574e0e to 532dbba Compare October 14, 2020 22:33
@codecov

codecov bot commented Oct 14, 2020

Codecov Report

Merging #592 (c5ae9aa) into master (556a526) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #592   +/-   ##
=======================================
  Coverage   99.27%   99.27%           
=======================================
  Files          88       88           
  Lines       13154    13158    +4     
=======================================
+ Hits        13058    13062    +4     
  Misses         96       96           
Flag Coverage Δ
rust 98.47% <ø> (ø)
spectests 90.00% <100.00%> (+0.01%) ⬆️
unittests 99.21% <100.00%> (+<0.01%) ⬆️
unittests-32 99.31% <100.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
lib/fizzy/execute.cpp 99.29% <100.00%> (+<0.01%) ⬆️

@axic
Member Author

axic commented Dec 29, 2020

I think this is ready; it just needs benchmarking to show which changes are useful.

@axic axic marked this pull request as ready for review May 23, 2022 14:38
@axic
Member Author

axic commented May 23, 2022

Rebased.

Collaborator

@chfast chfast left a comment

The benchmarking situation does not look good. I suspect the performance difference is related to the code layout in the binary. I can "stabilize" it in Clang with -mbranches-within-32B-boundaries, but not in GCC.

Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
fizzy/execute/blake2b/512_bytes_rounds_1_mean                     +0.2276         +0.2276            68            83            68            83
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    +0.2006         +0.2006          1022          1228          1022          1228
fizzy/execute/ecpairing/onepoint_mean                             +0.2205         +0.2205        328759        401248        328762        401251
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   +0.4017         +0.4017            74           103            74           103
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.3864         +0.3864          1073          1488          1073          1488
fizzy/execute/memset/256_bytes_mean                               +0.0524         +0.0524             6             6             6             6
fizzy/execute/memset/60000_bytes_mean                             +0.0534         +0.0534          1279          1347          1279          1347
fizzy/execute/mul256_opt0/input1_mean                             +0.0047         +0.0047            25            25            25            25
fizzy/execute/ramanujan_pi/33_runs_mean                           +0.0093         +0.0093           101           102           101           102
fizzy/execute/sha1/512_bytes_rounds_1_mean                        +0.0147         +0.0147            76            77            76            77
fizzy/execute/sha1/512_bytes_rounds_16_mean                       +0.0149         +0.0149          1061          1076          1061          1076
fizzy/execute/sha256/512_bytes_rounds_1_mean                      +0.0016         +0.0016            73            73            73            73
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0024         -0.0024          1004          1002          1004          1002
fizzy/execute/taylor_pi/pi_1000000_runs_mean                      +0.0001         +0.0001         38213         38216         38213         38216
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.0059         -0.0059             4             4             4             4
fizzy/execute/micro/factorial/20_mean                             -0.0023         -0.0023             1             1             1             1
fizzy/execute/micro/fibonacci/24_mean                             +0.0091         +0.0091          4378          4417          4378          4417
fizzy/execute/micro/host_adler32/1_mean                           +0.0121         +0.0121             0             0             0             0
fizzy/execute/micro/host_adler32/1000_mean                        -0.0041         -0.0041            26            26            26            26
fizzy/execute/micro/icall_hash/1000_steps_mean                    +0.0374         +0.0374            61            63            61            63
fizzy/execute/micro/spinner/1_mean                                +0.0155         +0.0155             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             +0.0334         +0.0334             8             8             8             8
OVERALL_GEOMEAN                                                   +0.0691         +0.0691             0             0             0             0

template <typename T>
T fcopysign(T a, T b) noexcept = delete;

template <>
Collaborator

These should not be template specializations. Function overloading is good enough.

Member Author

I didn't add this, I only moved the function. I can remove it.

Collaborator

Still a good time to improve it.

Collaborator

> These should not be template specializations. Function overloading is good enough.

Actually, I've learned this is not enough because of the way this is used. Ignore this request.

Member Author

Yes, I learned that yesterday too, but was too lazy to close the PR 😓

@@ -362,6 +375,25 @@ inline double fneg(double value) noexcept
return bit_cast<double>(bit_cast<uint64_t>(value) ^ F64SignMask);
}

template <typename T>
T fcopysign(T a, T b) noexcept = delete;
Collaborator

To make it more effective, you should define it for two different types: template <typename T, typename U>.

Member Author

Why? It should be the same type we're copying between?

Collaborator

If the types are different, the template will not match. E.g. fcopysign(0.0, 0.0f) will use fcopysign(double, double), but we want to be very restrictive about which overloads are provided.

Member Author

fdiv, fmin, fmax have the same issue, plus all the basic add/mul/... ones. Should those be changed?

Member Author

Moved these to #826 where we can clean all of them up.
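A minimal sketch of the two-type catch-all suggested in this thread. It uses plain overloads rather than the template specializations in the PR, so it illustrates the idea only; the `is_callable` detection helper is hypothetical and exists just to check the behaviour at compile time.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>
#include <utility>

namespace sketch
{
// Catch-all: deduces both argument types exactly, so it beats any overload
// that would need a promotion or conversion, and the call is rejected.
template <typename T, typename U>
void fcopysign(T, U) noexcept = delete;

// Allowed overloads: on an exact match a non-template function is preferred.
inline float fcopysign(float a, float b) noexcept { return std::copysign(a, b); }
inline double fcopysign(double a, double b) noexcept { return std::copysign(a, b); }

// Hypothetical detection helper: true when fcopysign(A, B) is a usable call.
template <typename A, typename B, typename = void>
struct is_callable : std::false_type
{};

template <typename A, typename B>
struct is_callable<A, B, decltype(void(fcopysign(std::declval<A>(), std::declval<B>())))>
  : std::true_type
{};

static_assert(is_callable<float, float>::value, "float pair is allowed");
static_assert(is_callable<double, double>::value, "double pair is allowed");
static_assert(!is_callable<double, float>::value, "mixed types are rejected");
static_assert(!is_callable<int, int>::value, "integers are rejected");

// Small usage example.
inline void demo()
{
    assert(fcopysign(1.0f, -2.0f) == -1.0f);
    assert(fcopysign(3.0, 1.0) == 3.0);
}
}  // namespace sketch
```

With a single-type deleted template `fcopysign(T, T)`, deduction fails for mixed arguments and the deleted template drops out of the candidate set, which is the gap the two-parameter form closes.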

T fcopysign(T a, T b) noexcept = delete;

template <>
inline float fcopysign(float a, float b) noexcept
Collaborator

Here we have a small issue with the names: fcopysign vs std::copysign and signbit vs std::signbit.

Member Author

I can rename fcopysign to copysign; it was an existing function that I just moved. If it is 100% compatible with std::copysign, then it should be renamed.

Member Author

Hmm, actually https://en.cppreference.com/w/cpp/numeric/math/copysign has an interesting note:

float copysign ( float mag, float sgn );

If mag is NaN, then NaN with the sign of sgn is returned.
If sgn is -0, the result is only negative if the implementation supports the signed zero consistently in arithmetic operations.

Collaborator

I think it is fine to rename to copysign and follow C std names (which are bad by themselves).

The C standard does not require IEEE 754, but WebAssembly does. So we must use compilers that support IEEE 754. E.g. GCC supports signed zeros by default, but this can be disabled with -fno-signed-zeros. https://gcc.gnu.org/wiki/FloatingPointMath
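One way to state the IEEE 754 assumption in code is sketched below. The compile-time checks use `std::numeric_limits<T>::is_iec559`; the runtime probe `signed_zero_preserved` is a hypothetical helper that may catch a build where signed zeros are not honoured, but it is not guaranteed to detect every such configuration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace sketch
{
// Compile-time: the floating-point types claim IEC 559 (IEEE 754) conformance.
static_assert(std::numeric_limits<float>::is_iec559, "float must be IEEE 754 binary32");
static_assert(std::numeric_limits<double>::is_iec559, "double must be IEEE 754 binary64");

// Runtime probe (hypothetical helper): negating +0.0 should give a zero
// that still compares equal to 0.0 but has its sign bit set.
inline bool signed_zero_preserved() noexcept
{
    volatile double zero = 0.0;  // volatile discourages constant folding
    const double negated = -zero;
    return negated == 0.0 && std::signbit(negated);
}

// Small usage example.
inline void demo()
{
    assert(signed_zero_preserved());
}
}  // namespace sketch
```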

template <typename T>
T signbit(T value) noexcept = delete;

inline bool signbit(float value) noexcept
Member Author

We may get a proper constexpr signbit in C++23: https://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p0533r5.pdf

@axic axic merged commit 90ac6e0 into master May 28, 2022
@axic axic deleted the fp-optim branch May 28, 2022 16:26