
float rounding is slow #55107

Open
raphlinus opened this issue Oct 16, 2018 · 26 comments
Labels
A-floating-point  Area: Floating point numbers and arithmetic
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-libs-api  Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@raphlinus
Contributor

The scalar fallback for the sinewave benchmark in fearless_simd is very slow as of the current commit, and the reason is the f32::round() operation. When that's changed to (x + 0.5).floor() it goes from 1622ns to 347ns, and 205ns with target_cpu=haswell. With default x86_64 cpu, floorf() is a function call, but it's an efficient one. The asm of roundf() that I looked at was very unoptimized (it moved the float value into int registers and did bit fiddling there). In addition, round() doesn't get auto-vectorized, but floor() does.

I think there's a rich and sordid history behind this. The C standard library has 3 different functions for rounding: round, rint, and nearbyint. Of these, the first rounds values with a 0.5 fraction away from zero, and the other two use the stateful rounding direction mode. This last is arguably a wart on C and it's a good thing the idea doesn't exist in Rust. In any case, the default value is FE_TONEAREST, which rounds these values to the nearest even integer (see Gnu libc documentation and Wikipedia; the latter does a reasonably good job of motivating why you'd want to do this, the tl;dr is that it avoids some biases).

The implementation of f32::floor is usually intrinsics::floorf32 (but it's intrinsics::floorf64 on msvc, for reasons described there). That in turn is llvm.floor.f32. Generally the other rounding functions are similar until they get to llvm. Inside llvm, one piece of evidence that "round" is special is that it's not listed in the list of intrinsics that get auto-vectorized.

Neither the C standard library nor llvm intrinsics have a function that rounds with "round half to even" behavior. This is arguably a misfeature. A case can be made that Rust should have this function; in cases where a recent Intel CPU is set as target_cpu or target_feature, it compiles to roundps $8 (analogous to $9 and $a for floor and ceil, respectively), and in compatibility mode the asm shouldn't be any slower than the existing code. I haven't investigated non-x86 architectures though.

For signal processing (the main use case of fearless_simd) I don't care much about the details of rounding of exactly 0.5 fraction values, and just want rounding to be fast. Thus, I think I'll use the _mm_round intrinsics in simd mode (with round half to even behavior) and (x + 0.5).floor() in fallback mode (with round half up behavior). It's not the case now (where I call f32::round) that the rounding behavior matches the SIMD case anyway. If there were a function with "round half to even" behavior, it would match the SIMD, would auto-vectorize well, and would have dramatically better performance with modern target_cpu.
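
For concreteness, a rough sketch of the SIMD side of that plan, written against today's core::arch API (the wrapper name and the array plumbing are mine; the constants select round-to-nearest-even and suppress the inexact exception):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn round_nearest4(v: [f32; 4]) -> [f32; 4] {
    use core::arch::x86_64::*;
    // Compiles to a single roundps with imm8 = 8 (nearest, ties to even, no exceptions).
    let x = _mm_loadu_ps(v.as_ptr());
    let r = _mm_round_ps::<{ _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC }>(x);
    let mut out = [0.0f32; 4];
    _mm_storeu_ps(out.as_mut_ptr(), r);
    out
}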

@m1el
Contributor

m1el commented Oct 16, 2018

Edit: reworded. If you use (x+0.5).floor() as a replacement for the round function, it will produce incorrect results, and not only for exact 0.5 fraction values.

You can test this by enumerating all f32 values (the same works for f64, but not as easily enumerable): https://gist.github.com/m1el/dc18a85b82357589825d9ffa4532b0a6

For example, 0.49999997 + 0.5 == 1.0 and 8388609.0 + 0.5 == 8388610.0, due to loss of precision.
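
A minimal Rust check of those two counterexamples, for anyone who wants to see the failure without running the exhaustive gist:

fn main() {
    for x in [0.49999997_f32, 8388609.0] {
        // round() gives 0 and 8388609; the "+ 0.5 then floor" version gives 1 and 8388610.
        println!("{x}: round() = {}, (x + 0.5).floor() = {}", x.round(), (x + 0.5).floor());
    }
}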

bstrie added the I-slow and T-libs-api labels Oct 16, 2018
@bstrie
Contributor

bstrie commented Oct 16, 2018

@raphlinus , I'm actually surprised to discover that round doesn't do half-to-even, I guess this is a case of LLVM's choices inadvertently dictating Rust's semantics (though, reading LLVM's docs, it seems that they chose away-from-zero in order to have identical output to libm (path dependence!)). It's probably too late to change the semantics of round now (though it seems minor enough that if there were very vocal community demand, people might consider it a bugfix), which is a shame since e.g. Python's stdlib rounds half-to-even and I've never heard of anyone complaining about it, and the bias prevention is certainly nice. And if we did have a hypothetical round_unbiased (or w/e, let the bikeshed begin!), it would be a shame to effectively penalize people for reaching for the slower-and-more-obvious round (maybe we ought to consider deprecating round for round_away or smth).

At the same time, people may also find this a good opportunity to implement the rest of the five rounding modes specified by IEEE 754: https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules . Would be ideal if LLVM could be the one handling all of this.

@bstrie
Contributor

bstrie commented Oct 16, 2018

@m1el I'm afraid I don't see how that's relevant to this issue. Please file a separate bug if you think that the behavior of floor is incorrect.

@m1el
Contributor

m1el commented Oct 16, 2018

@bstrie as far as I can tell, the behavior of floor is correct. My point is that the behavior of (x+0.5).floor() as a round function is not.

@raphlinus
Contributor Author

raphlinus commented Oct 16, 2018

@bstrie Right. I do think there are strong benefits to llvm implementing this; it's a lot easier to make auto-vectorization work, and then all of (round, floor, ceil) will be similar.

@m1el I do appreciate the heads-up; I hadn't considered the fact that the addition would round up on precision loss. I agree it's technically not relevant to this issue, but it's a good warning for people who come across it. It's also a signal of the importance of fixing this problem, as the obvious workaround is flawed.

Personally I would be fine with the rounding-on-half behavior changing. The number of people who will be hit by the performance cost and the inconsistency with SIMD is large. The number of people who actively depend on 0.5f32.round() == 1.0 is small; I can imagine it maybe breaking a golden master in one test or two. But I understand any non-backwards-compatible change can be a sensitive point.

@bstrie
Contributor

bstrie commented Oct 16, 2018

Poking around, there may be some tentative LLVM support for this after all, though it's marked "experimental":

https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics

https://llvm.org/docs/LangRef.html#id1888

https://llvm.org/docs/LangRef.html#id1939

@raphlinus , I don't know a lot about how rounding modes work at a low-level, but does that look like it might be a promising start?

@raphlinus
Contributor Author

@bstrie At first glance, these don't seem to be what we're looking for; it seems these experimental intrinsics affect the rounding mode used for precision loss in other floating-point operations. Also, I didn't specifically see round-half-to-even behavior listed.

One thing we should probably do before nailing down new semantics for rounding is research what other chips implement performantly (I'm personally mostly concerned about arm). Want me to look into that?

@varkor
Member

varkor commented Oct 16, 2018

At first glance, these don't seem to be what we're looking for; it seems these experimental intrinsics affect the rounding mode used for precision loss in other floating-point operations.

llvm.experimental.constrained.rint seems to be the appropriate intrinsic. The LLVM "round.tonearest" mirrors C's FE_TONEAREST, but this appears not to specify behaviour under ties (though most comments I found indicated it was round-to-even). It should be possible to make this explicit in LLVM if we wanted.

@raphlinus
Contributor Author

I'm still not convinced the llvm.experimental.constrained.rint intrinsic is the one we want. Reading the docs, I believe it still compiles to roundps $4, which means it gets the rounding mode from the environment. What I think the extra arguments do is let llvm do optimizations assuming that the environment matches those arguments. I see the point in that but it's a very thin win. I will experiment to make sure though. The phabricator issue introducing these intrinsics makes very good reading.

I verified with llc and hand-editing IR code that llvm.experimental.constrained.rint compiles to roundps $4 and llvm.experimental.constrained.nearbyint compiles to roundps $12. (These are decimal, I had hex above for ceil, apologies.)

We already have intrinsics::nearbyintf32 which generates roundps $12 (or call nearbyintf in fallback mode, which is ok). Btw, I do think nearbyint is the better choice than rint here; as I read it, the only difference is that the latter generates exceptions, and I'm pretty sure we don't want the exceptions.

So it looks like the way forward is to push on llvm for a new intrinsic. Having started reading their stuff, this is pretty deep water; it has to be supported through many optimization and codegen components, and there needs to be a fallback implementation for all targets, because this is not something that's represented anywhere in libm.

Another possibility to explore is to expose nearbyint, with the understanding that it has an implicit dependence on the FP state. This is certainly not much work, but it might be a bad idea.

Lastly, I checked availability of these operations on arm64, and it's much nicer than Intel - the FRINTN instruction exists in both scalar and vector, with "ties to even" behavior, as do instructions for all the libm rounding functions. So I believe this is a small amount of additional support for the idea that it should be exposed as an intrinsic in llvm.

estebank added the A-LLVM label Oct 16, 2018
@raphlinus
Contributor Author

raphlinus commented Oct 17, 2018

I looked also at wasm; here's what I found. Short story is that current codegen is both wrong (for emscripten) and inefficient.

The code generated for wasm32-unknown-emscripten for round is effectively if x >= 0 { (x + 0.5).floor() } else { (x - 0.5).ceil() }. For f32, the value is converted to f64 then back to f32, and this is correct but slow. For f64, this code will produce incorrect values for 0.49999999999999994 (correct 0.0, actual 1.0) and 4503599627370497.0 (correct same as input, actual 4503599627370498.0) among others. I think the source of this code may be src/library.js from kripken/emscripten, but I'm not 100% sure.
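
A small test sketch of those two f64 cases against the branchy emulation described above (the helper name is mine):

#[test]
fn emscripten_style_round_is_wrong() {
    fn emulated(x: f64) -> f64 {
        if x >= 0.0 { (x + 0.5).floor() } else { (x - 0.5).ceil() }
    }
    assert_eq!(0.49999999999999994_f64.round(), 0.0);
    assert_eq!(emulated(0.49999999999999994), 1.0); // off by one
    assert_eq!(4503599627370497.0_f64.round(), 4503599627370497.0);
    assert_eq!(emulated(4503599627370497.0), 4503599627370498.0); // off by one
}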

(Note: edited, I thought I had a clever fix, but it was a testing error).

For wasm32-unknown-unknown, the generated code seems to match japaric/libm. I haven't checked this code for correctness, but it looks pretty carefully written and very very slow (it's something like 92 wasm instructions and lots of branches). (Edit: I did verify that this produces identical results as libm, with the exception that std produces negative zero for negative inputs near zero and japaric/libm produces positive zero).

In any case, that's the current situation. The designers of wasm are also fans of round-half-to-even behavior, and so wasm includes the nearest instruction in both f32 and f64 flavors for that. They don't have an instruction for round-away-from-zero behavior.

Lastly, I found that there is an architecture-specific intrinsic for this already in llvm: int_aarch64_neon_frintn. Based on what I've found, I think the argument that this should be promoted into an llvm-wide intrinsic is strong.

@raphlinus
Contributor Author

raphlinus commented Oct 17, 2018

I filed separate bugs against Emscripten and japaric/libm, referenced above.

I have also developed a workaround. I believe this is the best implementation of round on x86 and wasm (on aarch64, the FRINTA instruction is better):

fn round(x: f32) -> f32 {
    ((x.abs() + (0.25 - 0.5 * f32::EPSILON)) + (0.25 + 0.5 * f32::EPSILON)).floor().copysign(x)
}

However, it requires exposing the copysign intrinsic in the Rust std library. Aside from the use here, I believe there's a good case to be made for it in general, and the request has come up before. There's nothing inherently difficult about copysign (it's already in core::intrinsics), other than the fact that it will need to be implemented in japaric/libm. Does this seem reasonable, and should I file a separate issue for that?

A significant advantage of the above implementation (with copysign) is that it should be auto-vectorizable, which the current llvm.round.f32 intrinsic is not. Arguably this codegen choice should be made in llvm rather than rust, considering that arm needs to be special-cased.

I also plan to file an issue against llvm asking that the FRINTN (aarch64) / nearby (wasm) intrinsic be added.

Lastly, I'd like to nip the naming bikeshed in the bud by declaring that the name for this function in the Rust standard library be "propinquitous".

@raphlinus
Contributor Author

Also note that my idea (propagated by Twitter) is being reviewed in Julia: JuliaLang/julia#29700

@simonbyrne

simonbyrne commented Oct 18, 2018

In case you're interested, we had a long argument in Julia about the specific behaviour of round, and settled on ties-to-even behaviour since (a) it's the most efficient on modern hardware, and (b) it is conceptually more elegant in that it matches normal floating point behaviour.

As you've discussed, on x86 SSE4.1 hardware (basically anything since 2011) there is actually an instruction for this, but LLVM doesn't expose an intrinsic. We settled on just using rint since LLVM doesn't really support changing the rounding mode anyway. There is also a C technical specification (TS 18661-1) which proposes a roundeven function, so it may make it into the C spec at some point in future, in which case LLVM may get around to it (the rust devs may also be able to apply some pressure here?).

If you really want to keep C behaviour, then some options (in Julia notation, but hopefully it should be clear what they mean) are:

trunc(x + copysign(prevfloat(0.5), x))

or

y = trunc(x)
ifelse(x==y, y, trunc(2*x-y))

I think we settled on the latter because it sets the floating point status flags correctly, but if you're not worried about that, I suspect the first will be faster.

@raphlinus
Contributor Author

@simonbyrne Thanks, that is interesting for sure, as is the pointer to the C proposal. I think that makes the case for LLVM implementing this even stronger.

The Rust translation of the first of the two code snippets is:

    (x + (0.5 - 0.25 * f32::EPSILON).copysign(x)).trunc()

This is, I think, the best code for C-semantics round compatibility.
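
For completeness, a sketch of the second snippet in the same style (f32 assumed, and the function name is just for illustration; note that Rust's if/else is a branch, unlike Julia's branchless ifelse, so the status-flag argument for that variant may not carry over):

fn round_away(x: f32) -> f32 {
    let y = x.trunc();
    if x == y { y } else { (2.0 * x - y).trunc() }
}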

I was wondering how I missed this, because I do remember testing this. I think I've reconstructed my thinking - I was originally looking for an implementation of round up on tie break, and floor(x + prevfloat(0.5)) is close but misses -0.5 (it rounds that down). The one with the two near-0.25 values does capture this (and thus is arguably a good candidate to fix Java's behavior without needing a branch). But that case is not relevant when we're using copysign and trunc to handle the negative branch for "away" semantics.

My personal preference is still changing the semantics of round to even, because I think "round" is much more discoverable than whatever other name we come up with. But we'd want to get a good handle on how much breakage that would cause before making such a change.

@simonbyrne

simonbyrne commented Oct 18, 2018 via email

@m1el
Contributor

m1el commented Oct 18, 2018

I decided to bench and test different implementations of round. Here's the program I used: https://gist.github.com/m1el/873de570c1af6131f707a850108c6ced

The results on my machine (Core i7-5820K@3.0Ghz):

std::f32::round: 45.6942782s
round via builtin trunc: 39.8599112s
round reimplementation: 10.9416779s
round reimplementation unwrapped: 7.1464196s
std::f32::trunc: 37.1004949s
trunc reimplementation: 7.1815313s

Please note that using f32::trunc was measured to be slow as well.

All functions were tested to produce bit-equivalent results with the standard library.

@raphlinus
Contributor Author

@m1el brings up a good point which I didn't clarify above. I don't have time today to dig very deeply, but there are three cases we have to consider:

  1. Native intrinsics are not available. This is the case for target_cpu=x86_64 and also low-power embedded platforms. We have to do the best we can with standard floats and bit manipulation. Calling into floor or trunc might be very expensive. Codegen should probably be a call to a function, but inlining it might be viable if we can get it very concise. I think choosing the fastest and most concise of @m1el 's snippets (or polishing further) is most appropriate.

  2. Native intrinsics are not available for round-away, but are available for floor, trunc, copysign, etc. This is the case for any x86_64 target_cpu within the last 7 or 8 years and (importantly, I think) wasm. Codegen should be, I think, the snippet I posted above. This can generate short, fast, inline code, and can also auto-vectorize.

  3. Native intrinsics are available for round-away. This is the case for aarch64 with neon. Obviously we should generate the llvm round intrinsic, and let llvm generate the perfect one-instruction code.

Whose responsibility is this, ours or llvm's? A case can be made for either, but a problem with the latter is that llvm is constrained to respect C semantics for things like status flags for inexact rounding, and (I believe) we're not. So we might be able to generate better code based on explicit casing of target_cpu.

@hanna-kruppe
Contributor

hanna-kruppe commented Oct 18, 2018

LLVM is not constrained by floating point exception flags, or even settings for dynamic rounding modes. In fact it's the opposite: LLVM has always gladly ignored these issues and is only now slowly growing an additional set of intrinsics that do respect those things. The long-existing LLVM math intrinsics (e.g. @llvm.sqrt) specifically ignore the floating point environment as well as C's errno; they just happen to sometimes be codegen'd as libm calls for various other reasons. So I think there shouldn't be any problem with proposing new intrinsics that suit our needs; someone just has to put the work in (writing a proposal, gathering consensus for it, writing patches, getting those patches upstream).

kennytm pushed a commit to kennytm/rust that referenced this issue Oct 19, 2018
This patch adds a `copysign` function to the float primitive types.
It is an exceptionally useful function for writing efficient numeric
code, as it often avoids branches, is auto-vectorizable, and there
are efficient intrinsics for most platforms.

I think this might work as-is, as the relevant `copysign` intrinsic
is already used internally for the implementation of `signum`. It's
possible that an implementation might be needed in japaric/libm for
portability across all platforms, in which case I'll do that also.

Part of the work towards rust-lang#55107
@simonbyrne

simonbyrne commented Oct 19, 2018

Please note that using f32::trunc was measured to be slow as well.

I'm surprised this isn't using roundps/roundss. What target did you use?

One way to get round-to-even without native intrinsics (again, apologies for Julia code, but hopefully it's reasonably obvious):

k = 1/eps(Float32)
a = abs(x)
a < k ? copysign((a+k)-k,x) : x

Of course, this assumes no x87 excess precision business, but in that case you should just use frndint.
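
A Rust rendering of that trick (the function name is just for illustration), assuming f32, the default ties-to-even rounding mode, and no x87 excess precision:

fn round_ties_even_magic(x: f32) -> f32 {
    const K: f32 = 8388608.0; // 2^23 == 1.0 / f32::EPSILON, i.e. 1/eps(Float32)
    let a = x.abs();
    // Adding and then subtracting 2^23 forces rounding to an integer in the
    // current rounding mode; values at or above 2^23 are already integers.
    if a < K { ((a + K) - K).copysign(x) } else { x }
}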

kennytm added a commit to kennytm/rust that referenced this issue Oct 19, 2018
Add a `copysign` function to f32 and f64

This patch adds a `copysign` function to the float primitive types. It is an exceptionally useful function for writing efficient numeric code, as it often avoids branches, is auto-vectorizable, and there are efficient intrinsics for most platforms.

I think this might work as-is, as the relevant `copysign` intrinsic is already used internally for the implementation of `signum`. It's possible that an implementation might be needed in japaric/libm for portability across all platforms, in which case I'll do that also.

Part of the work towards rust-lang#55107
ischeinkman pushed a commit to ischeinkman/libnx-rs-std that referenced this issue Dec 20, 2018
@jrus

jrus commented Nov 7, 2019

@simonbyrne It doesn’t seem like this can overflow, and as far as I can tell it keeps giving correct results for arbitrarily large floats, so you can skip your conditional check and just do

k = 1/eps(Float32)
copysign((abs(x)+k)-k, x)

https://observablehq.com/d/db47dc21f691d5e9

The only slight subtlety is deciding what the correct behavior should be for values in the range −0.5..−0. Should these round to −0 or +0?

@simonbyrne

That doesn't work for values in the binade where eps(x) == 1, since they will be rounded to the nearest multiple of two, e.g.

julia> x = 1/eps(Float32)+1
8.388609f6

julia> isinteger(x)
true

julia> k = 1/eps(Float32)
8.388608f6

julia> copysign((abs(x)+k)-k, x)
8.388608f6

The only slight subtlety is deciding what the correct behavior should be for values in the range −0.5..−0. Should these round to −0 or +0?

I think the convention is the sign should always match.

@jrus

jrus commented Nov 8, 2019

@simonbyrne How about this one?

k = 2/eps(Float32)
copysign((abs(x)-k)+k, x)

@simonbyrne

simonbyrne commented Nov 8, 2019

The great thing about Float32 is you can test them all:

function testround()
  for u = typemin(UInt32):typemax(UInt32)
    x = reinterpret(Float32, u)
    k = 2/eps(Float32)
    y = copysign((abs(x)-k)+k, x)
    if !isequal(round(x), y)
        @show x
        return false
    end
  end
  return true
end
julia> testround()
x = 2.81475f14
false

(I apologize again for spamming this thread with Julia code...)
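
The same exhaustive check ports directly to Rust; a sketch, where round_candidate stands in for whichever implementation is being tested and bit-pattern comparison plays the role of isequal (distinguishing signed zeros, with NaNs special-cased):

fn exhaustive_round_test(round_candidate: fn(f32) -> f32) -> bool {
    for bits in 0..=u32::MAX {
        let x = f32::from_bits(bits);
        let (got, want) = (round_candidate(x), x.round());
        if got.to_bits() != want.to_bits() && !(got.is_nan() && want.is_nan()) {
            println!("mismatch at x = {x}");
            return false;
        }
    }
    true
}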

@jrus

jrus commented Nov 8, 2019

I’m a little confused what that code is doing. It doesn’t look like f1(x) is defined anywhere, or like copysign((abs(x)-k)+k, x) gets used in the test.

But okay, darn. I wonder if there’s anything else we can do.

(Still, if you need the conditional the subtractive version is probably still better to use, since the other branch should be pretty rare.)

@simonbyrne

Good point, now updated.

workingjubilee added the A-floating-point label Apr 9, 2021
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 7, 2023
…=pnkfelix,m-ou-se,scottmcm

Add `round_ties_even` to `f32` and `f64`

Tracking issue: rust-lang#96710

Redux of rust-lang#82273. See also rust-lang#55107

Adds a new method, `round_ties_even`, to `f32` and `f64`, that rounds the float to the nearest integer, rounding halfway cases to the number with an even least significant bit. Uses the `roundeven` LLVM intrinsic to do this.

Of the five IEEE 754 rounding modes, this is the only one that doesn't already have a round-to-integer function exposed by Rust (others are `round`, `floor`, `ceil`, and `trunc`). Ties-to-even is also the rounding mode used for int-to-float and float-to-float `as` casts, as well as float arithmetic operations. So not having an explicit rounding method for it seems like an oversight.

Bikeshed: this PR currently uses `round_ties_even` for the name of the method. But maybe `round_ties_to_even` is better, or `round_even`, or `round_to_even`?
bjorn3 pushed a commit to bjorn3/rust that referenced this issue Mar 15, 2023
antoyo pushed a commit to antoyo/rust that referenced this issue Jun 19, 2023
@RalfJung
Member

RalfJung commented Jan 6, 2024

#96710 seems to largely close this issue: round_ties_even exists now (not yet stable though), which hopefully should be faster.

Changing the semantics of the existing round is a tall order. It subtly changes the behavior of existing code, and it's bad news for floating-point crates that modeled their API after the std names. This would almost surely require an RFC and is not the kind of proposal we usually keep an issue open for.
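
For reference, a minimal comparison of the two tie-breaking behaviors using that method (round_ties_even was unstable when this was written and has since been stabilized):

fn main() {
    for x in [0.5_f32, 1.5, 2.5, -0.5] {
        // round() breaks ties away from zero; round_ties_even() breaks them toward even.
        println!("{x}: round = {}, round_ties_even = {}", x.round(), x.round_ties_even());
    }
}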
