
float rounding is slow #55107

Open
raphlinus opened this issue Oct 16, 2018 · 26 comments
Labels
A-floating-point  Area: Floating point numbers and arithmetic
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-libs-api  Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@raphlinus
Contributor

The scalar fallback for the sinewave benchmark in fearless_simd is very slow as of the current commit, and the reason is the f32::round() operation. When that's changed to (x + 0.5).floor() it goes from 1622ns to 347ns, and 205ns with target_cpu=haswell. With default x86_64 cpu, floorf() is a function call, but it's an efficient one. The asm of roundf() that I looked at was very unoptimized (it moved the float value into int registers and did bit fiddling there). In addition, round() doesn't get auto-vectorized, but floor() does.

I think there's a rich and sordid history behind this. The C standard library has 3 different functions for rounding: round, rint, and nearbyint. Of these, the first rounds values with a 0.5 fraction away from zero, and the other two use the stateful rounding direction mode. This last is arguably a wart on C and it's a good thing the idea doesn't exist in Rust. In any case, the default value is FE_TONEAREST, which rounds these values to the nearest even integer (see Gnu libc documentation and Wikipedia; the latter does a reasonably good job of motivating why you'd want to do this, the tl;dr is that it avoids some biases).

The implementation of f32::floor is usually intrinsics::floorf32 (but it's intrinsics::floorf64 on msvc, for reasons described there). That in turn is llvm.floor.f32. Generally the other rounding functions are similar until they get to llvm. Inside llvm, one piece of evidence that "round" is special is that it's not listed in the list of intrinsics that get auto-vectorized.

Neither the C standard library nor llvm intrinsics have a function that rounds with "round half to even" behavior. This is arguably a misfeature. A case can be made that Rust should have this function; in cases where a recent Intel CPU is set as target_cpu or target_feature, it compiles to roundps $8 (analogous to $9 and $a for floor and ceil, respectively), and in compatibility mode the asm shouldn't be any slower than the existing code. I haven't investigated non-x86 architectures though.

For signal processing (the main use case of fearless_simd) I don't care much about the details of rounding of exactly 0.5 fraction values, and just want rounding to be fast. Thus, I think I'll use the _mm_round intrinsics in simd mode (with round half to even behavior) and (x + 0.5).floor() in fallback mode (with round half up behavior). It's not the case now (where I call f32::round) that the rounding behavior matches the SIMD case anyway. If there were a function with "round half to even" behavior, it would match the SIMD, would auto-vectorize well, and would have dramatically better performance with modern target_cpu.
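
For concreteness, a rough sketch of the SIMD side of that plan, written against today's core::arch API (the wrapper name and the array plumbing are mine; the constants select round-to-nearest-even and suppress the inexact exception):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn round_nearest4(v: [f32; 4]) -> [f32; 4] {
    use core::arch::x86_64::*;
    // Compiles to a single roundps with imm8 = 8 (nearest, ties to even, no exceptions).
    let x = _mm_loadu_ps(v.as_ptr());
    let r = _mm_round_ps::<{ _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC }>(x);
    let mut out = [0.0f32; 4];
    _mm_storeu_ps(out.as_mut_ptr(), r);
    out
}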

@m1el
Contributor

m1el commented Oct 16, 2018

Edit: reworded. If you use (x+0.5).floor() as a replacement for the round function, it will produce incorrect results, and not only for exact 0.5 fraction values.

You can test this by enumerating all f32 values (the same works for f64, but not as easily enumerable): https://gist.github.com/m1el/dc18a85b82357589825d9ffa4532b0a6

For example, 0.49999997 + 0.5 == 1.0 and 8388609.0 + 0.5 == 8388610.0, due to loss of precision.
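
A minimal Rust check of those two counterexamples, for anyone who wants to see the failure without running the exhaustive gist:

fn main() {
    for x in [0.49999997_f32, 8388609.0] {
        // round() gives 0 and 8388609; the "+ 0.5 then floor" version gives 1 and 8388610.
        println!("{x}: round() = {}, (x + 0.5).floor() = {}", x.round(), (x + 0.5).floor());
    }
}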

bstrie added the I-slow and T-libs-api labels Oct 16, 2018
@bstrie
Contributor

bstrie commented Oct 16, 2018

@raphlinus , I'm actually surprised to discover that round doesn't do half-to-even, I guess this is a case of LLVM's choices inadvertently dictating Rust's semantics (though, reading LLVM's docs, it seems that they chose away-from-zero in order to have identical output to libm (path dependence!)). It's probably too late to change the semantics of round now (though it seems minor enough that if there were very vocal community demand, people might consider it a bugfix), which is a shame since e.g. Python's stdlib rounds half-to-even and I've never heard of anyone complaining about it, and the bias prevention is certainly nice. And if we did have a hypothetical round_unbiased (or w/e, let the bikeshed begin!), it would be a shame to effectively penalize people for reaching for the slower-and-more-obvious round (maybe we ought to consider deprecating round for round_away or smth).

At the same time, people may also find this a good opportunity to implement the rest of the five rounding modes specified by IEEE 754: https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules . Would be ideal if LLVM could be the one handling all of this.

@bstrie
Contributor

bstrie commented Oct 16, 2018

@m1el I'm afraid I don't see how that's relevant to this issue. Please file a separate bug if you think that the behavior of floor is incorrect.

@m1el
Contributor

m1el commented Oct 16, 2018

@bstrie as far as I can tell, the behavior of floor is correct. My point is that the behavior of (x+0.5).floor() as a round function is not.

@raphlinus
Contributor Author

raphlinus commented Oct 16, 2018

@bstrie Right. I do think there are strong benefits to llvm implementing this; it's a lot easier to make auto-vectorization work, and then all of (round, floor, ceil) will be similar.

@m1el I do appreciate the heads-up; I hadn't considered the fact that the addition would round up on precision loss. I agree it's technically not relevant to this issue, but it's a good warning for people who come across it. It's also a signal of the importance of fixing this problem, as the obvious workaround is flawed.

Personally I would be fine with the rounding-on-half behavior changing. The number of people who will be hit by the performance cost and the inconsistency with SIMD is large. The number of people who actively depend on 0.5f32.round() == 1.0 is small; I can imagine it maybe breaking a golden master in one test or two. But I understand any non-backwards-compatible change can be a sensitive point.

@bstrie
Contributor

bstrie commented Oct 16, 2018

Poking around, there may be some tentative LLVM support for this after all, though it's marked "experimental":

https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics

https://llvm.org/docs/LangRef.html#id1888

https://llvm.org/docs/LangRef.html#id1939

@raphlinus , I don't know a lot about how rounding modes work at a low-level, but does that look like it might be a promising start?

@raphlinus
Contributor Author

@bstrie At first glance, these don't seem to be what we're looking for; it seems these experimental intrinsics affect the rounding mode used for precision loss in other floating-point operations. Also, I didn't specifically see round-half-to-even behavior listed.

One thing we should probably do before nailing down new semantics for rounding is research what other chips implement performantly (I'm personally mostly concerned about arm). Want me to look into that?

@varkor
Member

varkor commented Oct 16, 2018

At first glance, these don't seem to be what we're looking for; it seems these experimental intrinsics affect the rounding mode used for precision loss in other floating-point operations.

llvm.experimental.constrained.rint seems to be the appropriate intrinsic. The LLVM "round.tonearest" mirrors C's FE_TONEAREST, but this appears not to specify behaviour under ties (though most comments I found indicated it was round-to-even). It should be possible to make this explicit in LLVM if we wanted.

@raphlinus
Contributor Author

I'm still not convinced the llvm.experimental.constrained.rint intrinsic is the one we want. Reading the docs, I believe it still compiles to roundps $4, which means it gets the rounding mode from the environment. What I think the extra arguments do is let llvm do optimizations assuming that the environment matches those arguments. I see the point in that but it's a very thin win. I will experiment to make sure though. The phabricator issue introducing these intrinsics makes very good reading.

I verified with llc and hand-editing IR code that llvm.experimental.constrained.rint compiles to roundps $4 and llvm.experimental.constrained.nearbyint compiles to roundps $12. (These are decimal, I had hex above for ceil, apologies.)

We already have intrinsics::nearbyintf32 which generates roundps $12 (or call nearbyintf in fallback mode, which is ok). Btw, I do think nearbyint is the better choice than rint here; as I read it, the only difference is that the latter generates exceptions, and I'm pretty sure we don't want the exceptions.

So it looks like the way forward is to push on llvm for a new intrinsic. Having started reading their stuff, this is pretty deep water; it has to be supported through many optimization and codegen components, and there needs to be a fallback implementation for all targets, because this is not something that's represented anywhere in libm.

Another possibility to explore is to expose nearbyint, with the understanding that it has an implicit dependence on the FP state. This is certainly not much work, but it might be a bad idea.

Lastly, I checked availability of these operations on arm64, and it's much nicer than Intel - the FRINTN instruction exists in both scalar and vector, with "ties to even" behavior, as do instructions for all the libm rounding functions. So I believe this is a small amount of additional support for the idea that it should be exposed as an intrinsic in llvm.

estebank added the A-LLVM label Oct 16, 2018
@raphlinus
Contributor Author

raphlinus commented Oct 17, 2018

I looked also at wasm; here's what I found. Short story is that current codegen is both wrong (for emscripten) and inefficient.

The code generated for wasm32-unknown-emscripten for round is effectively if x >= 0 { (x + 0.5).floor() } else { (x - 0.5).ceil() }. For f32, the value is converted to f64 then back to f32, and this is correct but slow. For f64, this code will produce incorrect values for 0.49999999999999994 (correct 0.0, actual 1.0) and 4503599627370497.0 (correct same as input, actual 4503599627370498.0) among others. I think the source of this code may be src/library.js from kripken/emscripten, but I'm not 100% sure.
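
A small test sketch of those two f64 cases against the branchy emulation described above (the helper name is mine):

#[test]
fn emscripten_style_round_is_wrong() {
    fn emulated(x: f64) -> f64 {
        if x >= 0.0 { (x + 0.5).floor() } else { (x - 0.5).ceil() }
    }
    assert_eq!(0.49999999999999994_f64.round(), 0.0);
    assert_eq!(emulated(0.49999999999999994), 1.0); // off by one
    assert_eq!(4503599627370497.0_f64.round(), 4503599627370497.0);
    assert_eq!(emulated(4503599627370497.0), 4503599627370498.0); // off by one
}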

(Note: edited, I thought I had a clever fix, but it was a testing error).

For wasm32-unknown-unknown, the generated code seems to match japaric/libm. I haven't checked this code for correctness, but it looks pretty carefully written and very very slow (it's something like 92 wasm instructions and lots of branches). (Edit: I did verify that this produces identical results as libm, with the exception that std produces negative zero for negative inputs near zero and japaric/libm produces positive zero).

In any case, that's the current situation. The designers of wasm are also fans of round-half-to-even behavior, and so wasm includes the nearest instruction in both f32 and f64 flavors for that. They don't have an instruction for round-away-from-zero behavior.

Lastly, I found that there is an architecture-specific intrinsic for this already in llvm: int_aarch64_neon_frintn. Based on what I've found, I think the argument that this should be promoted into an llvm-wide intrinsic is strong.

@raphlinus
Contributor Author

raphlinus commented Oct 17, 2018

I filed separate bugs against Emscripten and japaric/libm, referenced above.

I have also developed a workaround. I believe this is the best implementation of round on x86 and wasm (on aarch64, the FRINTA instruction is better):

fn round(x: f32) -> f32 {
    ((x.abs() + (0.25 - 0.5 * f32::EPSILON)) + (0.25 + 0.5 * f32::EPSILON)).floor().copysign(x)
}

However, it requires exposing the copysign intrinsic in the Rust std library. Aside from the use here, I believe there's a good case to be made for it in general, and the request has come up before. There's nothing inherently difficult about copysign (it's already in core::intrinsics), other than the fact that it will need to be implemented in japaric/libm. Does this seem reasonable, and should I file a separate issue for that?

A significant advantage of the above implementation (with copysign) is that it should be auto-vectorizable, which the current llvm.round.f32 intrinsic is not. Arguably this codegen choice should be made in llvm rather than rust, considering that arm needs to be special-cased.

I also plan to file an issue against llvm asking that the FRINTN (aarch64) / nearby (wasm) intrinsic be added.

Lastly, I'd like to nip the naming bikeshed in the bud by declaring that the name for this function in the Rust standard library be "propinquitous".

@raphlinus
Contributor Author

Also note that my idea (propagated by Twitter) is being reviewed in Julia: JuliaLang/julia#29700

@simonbyrne

simonbyrne commented Oct 18, 2018

In case you're interested, we had a long argument in Julia about the specific behaviour of round, and settled on ties-to-even behaviour since (a) it's the most efficient on modern hardware, and (b) it is conceptually more elegant in that it matches normal floating point behaviour.

As you've discussed, on x86 SSE4.1 hardware (basically anything since 2011) there is actually an instruction for this, but LLVM doesn't expose an intrinsic. We settled on just using rint since LLVM doesn't really support changing the rounding mode anyway. There is also a C technical specification (TS 18661-1) which proposes a roundeven function, so it may make it into the C spec at some point in future, in which case LLVM may get around to it (the rust devs may also be able to apply some pressure here?).

If you really want to keep C behaviour, then some options (in Julia notation, but hopefully it should be clear what they mean) are:

trunc(x + copysign(prevfloat(0.5), x))

or

y = trunc(x)
ifelse(x==y, y, trunc(2*x-y))

I think we settled on the latter because it sets the floating point status flags correctly, but if you're not worried about that, I suspect the first will be faster.

@raphlinus
Contributor Author

@simonbyrne Thanks, that is interesting for sure, as is the pointer to the C proposal. I think that makes the case for LLVM implementing this even stronger.

The Rust translation of the first of the two code snippets is:

    (x + (0.5 - 0.25 * f32::EPSILON).copysign(x)).trunc()

This is, I think, the best code for C-semantics round compatibility.
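
For completeness, a sketch of the second snippet in the same style (f32 assumed, and the function name is just for illustration; note that Rust's if/else is a branch, unlike Julia's branchless ifelse, so the status-flag argument for that variant may not carry over):

fn round_away(x: f32) -> f32 {
    let y = x.trunc();
    if x == y { y } else { (2.0 * x - y).trunc() }
}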

I was wondering how I missed this, because I do remember testing this. I think I've reconstructed my thinking - I was originally looking for an implementation of round up on tie break, and floor(x + prevfloat(0.5)) is close but misses -0.5 (it rounds that down). The one with the two near-0.25 values does capture this (and thus is arguably a good candidate to fix Java's behavior without needing a branch). But that case is not relevant when we're using copysign and trunc to handle the negative branch for "away" semantics.

My personal preference is still changing the semantics of round to even, because I think "round" is much more discoverable than whatever other name we come up with. But we'd want to get a good handle on how much breakage that would cause before making such a change.

@simonbyrne

simonbyrne commented Oct 18, 2018 via email

@m1el
Contributor

m1el commented Oct 18, 2018

I decided to bench and test different implementations of round. Here's the program I used: https://gist.github.com/m1el/873de570c1af6131f707a850108c6ced

The results on my machine (Core i7-5820K@3.0Ghz):

std::f32::round: 45.6942782s
round via builtin trunc: 39.8599112s
round reimplementation: 10.9416779s
round reimplementation unwrapped: 7.1464196s
std::f32::trunc: 37.1004949s
trunc reimplementation: 7.1815313s

Please note that using f32::trunc was measured to be slow as well.

All functions were tested to produce bit-equivalent results with the standard library.

@raphlinus
Contributor Author

@m1el brings up a good point which I didn't clarify above. I don't have time today to dig very deeply, but there are three cases we have to consider:

  1. Native intrinsics are not available. This is the case for target_cpu=x86_64 and also low-power embedded platforms. We have to do the best we can with standard floats and bit manipulation. Calling into floor or trunc might be very expensive. Codegen should probably be a call to a function, but inlining it might be viable if we can get it very concise. I think choosing the fastest and most concise of @m1el 's snippets (or polishing further) is most appropriate.

  2. Native intrinsics are not available for round-away, but are available for floor, trunc, copysign, etc. This is the case for any x86_64 target_cpu within the last 7 or 8 years and (importantly, I think) wasm. Codegen should be, I think, the snippet I posted above. This can generate short, fast, inline code, and can also auto-vectorize.

  3. Native intrinsics are available for round-away. This is the case for aarch64 with neon. Obviously we should generate the llvm round intrinsic, and let llvm generate the perfect one-instruction code.

Whose responsibility is this, ours or llvm's? A case can be made for either, but a problem with the latter is that llvm is constrained to respect C semantics for things like status flags for inexact rounding, and (I believe) we're not. So we might be able to generate better code based on explicit casing of target_cpu.

@hanna-kruppe
Contributor

hanna-kruppe commented Oct 18, 2018

LLVM is not constrained by floating point exception flags, or even settings for dynamic rounding modes. In fact it's the opposite: LLVM has always gladly ignored these issues and is only now slowly growing an additional set of intrinsics that do respect those things. The long-existing LLVM math intrinsics (e.g. @llvm.sqrt) specifically ignore the floating point environment as well as C's errno; they just happen to sometimes be codegen'd as libm calls for various other reasons. So I think there shouldn't be any problem with proposing new intrinsics that suit our needs; someone just has to put the work in (writing a proposal, gathering consensus for it, writing patches, getting those patches upstream).

kennytm pushed a commit to kennytm/rust that referenced this issue Oct 19, 2018
This patch adds a `copysign` function to the float primitive types.
It is an exceptionally useful function for writing efficient numeric
code, as it often avoids branches, is auto-vectorizable, and there
are efficient intrinsics for most platforms.

I think this might work as-is, as the relevant `copysign` intrinsic
is already used internally for the implementation of `signum`. It's
possible that an implementation might be needed in japaric/libm for
portability across all platforms, in which case I'll do that also.

Part of the work towards rust-lang#55107
@simonbyrne

simonbyrne commented Oct 19, 2018

Please note that using f32::trunc was measured to be slow as well.

I'm surprised this isn't using roundps/roundss. What target did you use?

One way to get round-to-even without native intrinsics (again, apologies for Julia code, but hopefully it's reasonably obvious):

k = 1/eps(Float32)
a = abs(x)
a < k ? copysign((a+k)-k,x) : x

Of course, this assumes no x87 excess precision business, but in that case you should just use frndint.
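
A Rust rendering of that trick (the function name is just for illustration), assuming f32, the default ties-to-even rounding mode, and no x87 excess precision:

fn round_ties_even_magic(x: f32) -> f32 {
    const K: f32 = 8388608.0; // 2^23 == 1.0 / f32::EPSILON, i.e. 1/eps(Float32)
    let a = x.abs();
    // Adding and then subtracting 2^23 forces rounding to an integer in the
    // current rounding mode; values at or above 2^23 are already integers.
    if a < K { ((a + K) - K).copysign(x) } else { x }
}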

kennytm added a commit to kennytm/rust that referenced this issue Oct 19, 2018
Add a `copysign` function to f32 and f64

This patch adds a `copysign` function to the float primitive types. It is an exceptionally useful function for writing efficient numeric code, as it often avoids branches, is auto-vectorizable, and there are efficient intrinsics for most platforms.

I think this might work as-is, as the relevant `copysign` intrinsic is already used internally for the implementation of `signum`. It's possible that an implementation might be needed in japaric/libm for portability across all platforms, in which case I'll do that also.

Part of the work towards rust-lang#55107
ischeinkman pushed a commit to ischeinkman/libnx-rs-std that referenced this issue Dec 20, 2018
@jrus

jrus commented Nov 7, 2019

@simonbyrne It doesn’t seem like this can overflow, and as far as I can tell it keeps giving correct results for arbitrarily large floats, so you can skip your conditional check and just do

k = 1/eps(Float32)
copysign((abs(x)+k)-k, x)

https://observablehq.com/d/db47dc21f691d5e9

The only slight subtlety is deciding what the correct behavior should be for values in the range −0.5..−0. Should these round to −0 or +0?

@simonbyrne

That doesn't work for values in the binade where eps(x) == 1, since they will be rounded to the nearest multiple of two, e.g.

julia> x = 1/eps(Float32)+1
8.388609f6

julia> isinteger(x)
true

julia> k = 1/eps(Float32)
8.388608f6

julia> copysign((abs(x)+k)-k, x)
8.388608f6

The only slight subtlety is deciding what the correct behavior should be for values in the range −0.5..−0. Should these round to −0 or +0?

I think the convention is the sign should always match.

@jrus

jrus commented Nov 8, 2019

@simonbyrne How about this one?

k = 2/eps(Float32)
copysign((abs(x)-k)+k, x)

@simonbyrne

simonbyrne commented Nov 8, 2019

The great thing about Float32 is you can test them all:

function testround()
  for u = typemin(UInt32):typemax(UInt32)
    x = reinterpret(Float32, u)
    k = 2/eps(Float32)
    y = copysign((abs(x)-k)+k, x)
    if !isequal(round(x), y)
        @show x
        return false
    end
  end
  return true
end
julia> testround()
x = 2.81475f14
false

(I apologize again for spamming this thread with Julia code...)
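
The same exhaustive check ports directly to Rust; a sketch, where round_candidate stands in for whichever implementation is being tested and bit-pattern comparison plays the role of isequal (distinguishing signed zeros, with NaNs special-cased):

fn exhaustive_round_test(round_candidate: fn(f32) -> f32) -> bool {
    for bits in 0..=u32::MAX {
        let x = f32::from_bits(bits);
        let (got, want) = (round_candidate(x), x.round());
        if got.to_bits() != want.to_bits() && !(got.is_nan() && want.is_nan()) {
            println!("mismatch at x = {x}");
            return false;
        }
    }
    true
}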

@jrus

jrus commented Nov 8, 2019

I’m a little confused what that code is doing. It doesn’t look like f1(x) is defined anywhere, or like copysign((abs(x)-k)+k, x) gets used in the test.

But okay, darn. I wonder if there’s anything else we can do.

(Still, if you need the conditional the subtractive version is probably still better to use, since the other branch should be pretty rare.)

@simonbyrne

Good point, now updated.

workingjubilee added the A-floating-point label Apr 9, 2021
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 7, 2023
…=pnkfelix,m-ou-se,scottmcm

Add `round_ties_even` to `f32` and `f64`

Tracking issue: rust-lang#96710

Redux of rust-lang#82273. See also rust-lang#55107

Adds a new method, `round_ties_even`, to `f32` and `f64`, that rounds the float to the nearest integer, rounding halfway cases to the number with an even least significant bit. Uses the `roundeven` LLVM intrinsic to do this.

Of the five IEEE 754 rounding modes, this is the only one that doesn't already have a round-to-integer function exposed by Rust (others are `round`, `floor`, `ceil`, and `trunc`). Ties-to-even is also the rounding mode used for int-to-float and float-to-float `as` casts, as well as float arithmetic operations. So not having an explicit rounding method for it seems like an oversight.

Bikeshed: this PR currently uses `round_ties_even` for the name of the method. But maybe `round_ties_to_even` is better, or `round_even`, or `round_to_even`?
bjorn3 pushed a commit to bjorn3/rust that referenced this issue Mar 15, 2023
antoyo pushed a commit to antoyo/rust that referenced this issue Jun 19, 2023
@RalfJung
Member

RalfJung commented Jan 6, 2024

#96710 seems to largely close this issue: round_ties_even exists now (not yet stable though), which hopefully should be faster.

Changing the semantics of the existing round is a tall order. It subtly changes the behavior of existing code, and it's bad news for floating-point crates that modeled their API after the std names. This would almost surely require an RFC and is not the kind of proposal we usually keep an issue open for.
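
For reference, a minimal comparison of the two tie-breaking behaviors using that method (round_ties_even was unstable when this was written and has since been stabilized):

fn main() {
    for x in [0.5_f32, 1.5, 2.5, -0.5] {
        // round() breaks ties away from zero; round_ties_even() breaks them toward even.
        println!("{x}: round = {}, round_ties_even = {}", x.round(), x.round_ties_even());
    }
}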
