
Tracking Issue for algebraic floating point methods #136469


Open
2 of 6 tasks
calder opened this issue Feb 3, 2025 · 61 comments
Labels
- A-floating-point: Area: Floating point numbers and arithmetic
- C-tracking-issue: Category: An issue tracking the progress of sth. like the implementation of an RFC
- T-lang: Relevant to the language team
- T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@calder
Contributor

calder commented Feb 3, 2025

Feature gate: #![feature(float_algebraic)]

This is a tracking issue for exposing core::intrinsics::f*_algebraic in stable Rust, including in const.

Public API

// core::num::f16

impl f16 {
    pub const fn algebraic_add(self, rhs: f16) -> f16;
    pub const fn algebraic_sub(self, rhs: f16) -> f16;
    pub const fn algebraic_mul(self, rhs: f16) -> f16;
    pub const fn algebraic_div(self, rhs: f16) -> f16;
    pub const fn algebraic_rem(self, rhs: f16) -> f16;
}
// core::num::f32

impl f32 {
    pub const fn algebraic_add(self, rhs: f32) -> f32;
    pub const fn algebraic_sub(self, rhs: f32) -> f32;
    pub const fn algebraic_mul(self, rhs: f32) -> f32;
    pub const fn algebraic_div(self, rhs: f32) -> f32;
    pub const fn algebraic_rem(self, rhs: f32) -> f32;
}
// core::num::f64

impl f64 {
    pub const fn algebraic_add(self, rhs: f64) -> f64;
    pub const fn algebraic_sub(self, rhs: f64) -> f64;
    pub const fn algebraic_mul(self, rhs: f64) -> f64;
    pub const fn algebraic_div(self, rhs: f64) -> f64;
    pub const fn algebraic_rem(self, rhs: f64) -> f64;
}
// core::num::f128

impl f128 {
    pub const fn algebraic_add(self, rhs: f128) -> f128;
    pub const fn algebraic_sub(self, rhs: f128) -> f128;
    pub const fn algebraic_mul(self, rhs: f128) -> f128;
    pub const fn algebraic_div(self, rhs: f128) -> f128;
    pub const fn algebraic_rem(self, rhs: f128) -> f128;
}
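The methods above are semantically non-deterministic: the compiler may reassociate, and reassociating IEEE additions changes rounding. Since `algebraic_*` is unstable, here is a sketch using only stable ops of two groupings the optimizer would be allowed to pick between (the values are chosen to make the difference visible):

```rust
fn main() {
    let (a, b, c) = (1.0e8f32, -1.0e8f32, 1.0f32);
    // (a + b) + c: a + b is exactly 0.0, so the sum is 1.0.
    assert_eq!((a + b) + c, 1.0);
    // a + (b + c): b + c rounds back to -1.0e8 (an f32 ulp near 1e8 is 8.0),
    // so the sum is 0.0.
    assert_eq!(a + (b + c), 0.0);
}
```

Either result is a permitted outcome of a chain of `algebraic_add`s, which is why these operations cannot guarantee bit-for-bit reproducibility.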

Steps / History

Unresolved Questions

References

cc @rust-lang/lang @rust-lang/libs-api

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@calder calder added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 3, 2025
@Noratrieb Noratrieb marked this as a duplicate of #136468 Feb 3, 2025
@jieyouxu jieyouxu added the A-floating-point Area: Floating point numbers and arithmetic label Feb 3, 2025
@tgross35 tgross35 added the T-lang Relevant to the language team label Feb 5, 2025
@tgross35
Contributor

tgross35 commented Feb 5, 2025

Added lang as requested in rust-lang/libs-team#532 (comment). That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

@RalfJung
Member

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic and then actually it was not.

@RalfJung
Member

RalfJung commented Feb 17, 2025

That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

We also have the "may or may not fuse" intrinsics added in #124874, which so far have not been exposed in any way that has a path to stabilization. Would algebraic_mul_add use those intrinsics, or would it replace them, since we likely want some of the algebraic fast-math flags on that operation as well (i.e., we want to give the compiler more freedom than just "fuse or don't fuse")?
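For context, stable `f32::mul_add` is documented to fuse with a single rounding, while `a * b + c` rounds twice; the #124874 intrinsics leave that choice to the compiler, so either result is possible. A small stable-Rust sketch of the observable gap (the concrete value of `x` is just an illustrative choice where `x * x` is inexact):

```rust
fn main() {
    // x*x is inexact in f32, so a fused multiply-add of (x, x, -(x*x))
    // recovers the rounding error of x*x exactly, while the two-rounding
    // form (x * x - x * x) is identically zero.
    let x = 1.0f32 + 2.0f32.powi(-12);
    let two_roundings = x * x - x * x; // 0.0
    let fused = x.mul_add(x, -(x * x)); // the (nonzero) rounding error of x*x
    assert_eq!(two_roundings, 0.0);
    assert!(fused != 0.0);
}
```

An operation that "may or may not fuse" could legitimately return either `0.0` or the nonzero residual here, which is the extra freedom being discussed.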

@nikic
Contributor

nikic commented Feb 17, 2025

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic and then actually it was not.

Are these "algebraic" in the sense that they have reassoc FMF? If so, then yes, SE should be treating them as non-deterministic already: https://github.com/llvm/llvm-project/blob/6684a5970e74b8b4c0c83361a90e25dae9646db0/llvm/lib/Analysis/ConstantFolding.cpp#L1437-L1444

@RalfJung
Member

reassoc and a few more, yeah. Seems like that is accounted for, thanks. :)

@tgross35 tgross35 changed the title Tracking Issue for exposing algebraic floating point intrinsics Tracking Issue for algebraic floating point methods Feb 17, 2025
@tgross35
Contributor

On that note, I added an unresolved question for naming, since algebraic isn't the clearest indicator of what is going on.

@tgross35
Contributor

In theory the flags could also be represented via const generics, which would allow more fine-tuned control and more flags without a method explosion. Something like:

#[derive(Clone, Copy, Debug, Default, ...)]
#[non_exhaustive]
struct FpArithOps {
    reassociate: bool,
    contract: bool,
    reciprocal: bool,
    no_signed_zeros: bool,
    ftz: bool,
    daz: bool,
    poison_nan: bool,
    poison_inf: bool,
}

impl FpArithOps {
    // Current `algebraic_*` flags (all others false)
    const ALGEBRAIC: Self = Self {
        reassociate: true,
        contract: true,
        reciprocal: true,
        no_signed_zeros: true,
        ftz: false,
        daz: false,
        poison_nan: false,
        poison_inf: false,
    };
}

// Panics if `poison_*` is set
fn add_with_ops<const OPS: FpArithOps>(self, y: Self) -> Self;

// Allows `poison_*`
unsafe fn add_with_ops_unchecked<const OPS: FpArithOps>(self, y: Self) -> Self;

// Same as `f32::algebraic_div`
let x = 1.0f32.div_with_ops::<{ FpArithOps::ALGEBRAIC }>(y);

(Using a struct needs the next step of const generic support, or some way to bless types in std)

@hanna-kruppe
Contributor

As I wrote in the ACP:

While this goes in the direction of having multiple combinations of LLVM “fast math flags” exposed, I don’t think that there’s more than two or three sets that are broadly useful enough and well-behaved enough to be worth exposing. And one of those is just “let the hardware do whatever is fast w.r.t. subnormals”, which is something that really wants to be applied to entire regions of code and not to individual operations, so it may need an entirely different design. (It’s also a function attribute rather than an instruction flag in LLVM.)

I don’t think that we should expose the power set of LLVM’s flags, nor so many ad-hoc combinations that “method explosion” becomes a realistic problem. I’m not even sure it’s a good idea to ever expose any of the FMFs that can make an operation return poison (has anyone ever used those soundly while still getting a speedup?). And FTZ/DAZ shouldn’t be modeled as per-operation flags, because you don’t want to toggle the CPU control registers for that constantly.

Zalathar added a commit to Zalathar/rust that referenced this issue Feb 18, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
@zroug

zroug commented Feb 18, 2025

Should @llvm.arithmetic.fence be exposed? Sometimes it would be nice to reason about the effects of algebraic operations locally. Currently, there is no clean way to prevent reassociation, etc. with operations outside your local context, because you can't know if the caller is using algebraic operations as well.
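Rust has no dedicated arithmetic-fence API today; the closest stable stopgap is `std::hint::black_box`, an opaque identity function the optimizer may not see through. A sketch of using it as a crude fence (this blocks far more than reassociation, so it is an over-approximation of what `@llvm.arithmetic.fence` would provide):

```rust
use std::hint::black_box;

fn dot_fenced(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for i in 0..a.len().min(b.len()) {
        sum += a[i] * b[i];
    }
    // black_box is an opaque identity: the optimizer must treat the result
    // as arbitrary, so algebraic rewrites in a caller cannot reach back
    // into this function's arithmetic.
    black_box(sum)
}

fn main() {
    assert_eq!(dot_fenced(&[1.0, 2.0], &[3.0, 4.0]), 11.0);
}
```

Note that `black_box`'s documentation describes it as best-effort for benchmarking, not a semantic guarantee, which is part of the motivation for a real fence.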

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 19, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

fmease added a commit to fmease/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Mar 11, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Mar 11, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Mar 12, 2025
Expose algebraic floating point intrinsics

@orlp
Contributor

orlp commented Mar 16, 2025

On that note I added an unresolved question for naming since algebraic isn't the most clear indicator of what is going on.

I think it is fairly clear. The operations allow algebraically justified optimizations, as if the arithmetic were real arithmetic.

I don't think you're going to find a clearer name, but feel free to provide suggestions. One alternative to consider is real_add, real_sub, etc.

bors added a commit to rust-lang-ci/rust that referenced this issue Apr 2, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
bors added a commit to rust-lang-ci/rust that referenced this issue Apr 3, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Apr 4, 2025
Expose algebraic floating point intrinsics

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented May 13, 2025

@bjoernager The “fast math” terminology (also -Ofast) has proven to be a huge disservice to users and the Rust project should not propagate it further. If you call something the “fast” version, everyone’s like “of course I want my code to be fast” and that’s not something to encourage even with how relatively tame these operations are compared to the “funsafe-math” chimera.

@bjoernager
Copy link
Contributor

bjoernager commented May 13, 2025

I don't see why "fast" must inherently also equate to "exact" or "precise." The whole point of these functions is to optimise for execution speed, not accuracy. Just like I can run "fast" with a basket but might drop some stuff from it, I can also take it slow and ensure that its contents are kept intact. Anyone who reads the docs would know these complications of fast arithmetic, and I don't see a reason to idiot-proof the language for people who won't bother... After all, there is a plethora of other pitfalls that can lead to junk logic or calculations etc. (which in and of itself does not break any safety guarantees).

@tczajka
Copy link

tczajka commented May 13, 2025

I agree that fast is a better name than algebraic.

"Algebraic" can mean many things. "Algebraic numbers" are roots of polynomials, so √2 is algebraic and π is not an algebraic number. There is also a related concept of "algebraic function". It's obviously not what is meant here.

The name "algebraic" tries to suggest what kind of optimizations are possible, but names shouldn't concentrate on what optimizations are allowed; they should talk about the semantics of the operation, and the optimizations follow from that.

It's hard for me to interpret "algebraic" in any semantic way. real would be even worse, because that's just semantically wrong. The optimizer might be "assuming" these are exact operations on real numbers, but that's just its own internal incorrect assumption, no reason to expose that in the names.

I still think approximate is even better. fast concentrates on performance, approximate concentrates on the semantics that implicitly enables better performance.

I think the root of the naming issue is that there is no clearly defined semantics: "The exact set of optimizations is unspecified". So it's really kind of best_effort_add.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented May 13, 2025

@bjoernager Naming shouldn’t be constrained by the worst hypothetical misunderstanding someone can invent. But with this exact name, in this exact domain, we have decades of experience showing that it leads to misunderstandings and misuse by many users. And I don’t want to just blame users for it: it’s a maximally misleading name because it focuses on the aspect that needs the least attention. Sure, you can carefully interpret it as “fast as a trade-off against some other desirable property” but that’s not where most people’s minds go first. Especially not in Rust, which has “efficiency” right in its tagline and prides itself on not trading off reliability for it.

@tgross35
Copy link
Contributor

I believe fast has been discussed in a number of threads and the point that kept coming up is what Hanna mentioned, that we don't want to direct users toward thinking "if these are fast, everything else must be slow; why would I use them?".

How about approximate_add, approximate_sub?

I strongly disagree with this naming. With the few exceptions of exactly representable input/output pairs, almost all floating point math is already an approximation. And it would imply optimizations that aren't permissible.

The response in #136469 (comment) resonates with me as well here. "approximate" would be relative to the default operations, which are as exact as possible given finite precision. I don't think there is any risk of a confusing implication that default operations act with infinite precision. That said, the logic holds less well if we wind up with something like approximate_sin since libm functions like sin tend to be inexact even within the available precision.

Personally I do like inexact_* a bit more than approximate_*, since IMO it connotes a smaller tolerance. We could also go the direction of naming based on what the compiler is allowed to do, e.g. foldable_*.

@tczajka
Copy link

tczajka commented May 13, 2025

Personally I do like inexact_* a bit more than approximate_*, since IMO it connotes a smaller tolerance.

inexact is also reasonable.

But I'm not sure how much tolerance is reasonable to imply, because even a simple re-association such as `(a+b)+c => a+(b+c)` can in some cases cause 100% precision loss. It's essentially impossible to build anything reliable on these functions without any bounds in the spec on the rewritings that can happen.
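That re-association hazard is easy to reproduce on stable Rust with ordinary (non-algebraic) operations, simply by writing the two groupings explicitly:

```rust
fn main() {
    let (a, b, c) = (1.0_f64, 1.0e30, -1.0e30);
    // (a + b) + c: the 1.0 is absorbed into 1e30 by rounding, then cancelled away.
    let grouped_left = (a + b) + c;
    // a + (b + c): b and c cancel exactly, so the 1.0 survives.
    let grouped_right = a + (b + c);
    assert_eq!(grouped_left, 0.0);
    assert_eq!(grouped_right, 1.0);
    println!("{grouped_left} vs {grouped_right}");
}
```

An algebraic add is allowed to rewrite one grouping into the other, which is exactly the 100% loss described above.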

@scottmcm
Copy link
Member

scottmcm commented May 16, 2025

Given this

We're not turning on:

* `afn` - Approximate functions - Allow substitution of approximate calculations for functions (`sin`, `log`, `sqrt`, etc). See floating-point intrinsic definitions for places where this can apply to LLVM’s intrinsic math functions.

I don't think "inexact" makes sense, because I'd expect the "inexact" ones to be the ones that are allowed to give less precision from inverse square roots and such.


100% in agreement that we shouldn't propagate the "fast" name for these.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented May 16, 2025

FWIW, at least some reciprocal square root optimizations are enabled by algebraic_div + normal square root: https://play.rust-lang.org/?version=nightly&mode=release&edition=2024&gist=0c8197e47894238719a5a1098621f419

@traviscross
Copy link
Contributor

traviscross commented May 16, 2025

Interesting. Specifically, it's the arcp flag. Compare:

define float @alg_rsqrt(float %x) {
  %1 = call float @llvm.sqrt.f32(float %x)
  %2 = fdiv arcp float 1., %1
  ret float %2
}

define float @rsqrt(float %x) {
  %1 = call float @llvm.sqrt.f32(float %x)
  %2 = fdiv float 1., %1
  ret float %2
}
.LCPI0_0:
        .long   0xc0400000                      # float -3
.LCPI0_1:
        .long   0xbf000000                      # float -0.5
alg_rsqrt:                              # @alg_rsqrt
        rsqrtss xmm1, xmm0
        mulss   xmm0, xmm1
        mulss   xmm0, xmm1
        addss   xmm0, dword ptr [rip + .LCPI0_0]
        mulss   xmm1, dword ptr [rip + .LCPI0_1]
        mulss   xmm0, xmm1
        ret
.LCPI1_0:
        .long   0x3f800000                      # float 1
rsqrt:                                  # @rsqrt
        sqrtss  xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_0] # xmm0 = [1.0E+0,0.0E+0,0.0E+0,0.0E+0]
        divss   xmm0, xmm1
        ret

(The mul, mul, add, mul, mul there is a Newton-Raphson iteration to partially bring the accuracy back up.)

Godbolt link

I wonder if that's intentional. It's not clear to me how to justify that behavior on the basis of their documentation.
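For readers unfamiliar with the refinement step: the `-3` / `-0.5` constant sequence above computes one Newton-Raphson iteration, `y1 = 0.5 * y0 * (3 - x * y0 * y0)`. A scalar model of that formula in plain Rust (this is an illustration of the math, not the actual codegen; the real `rsqrtss` initial estimate only has about 12 bits of precision, whereas here the exact value is used for clarity):

```rust
fn refined_rsqrt(x: f32) -> f32 {
    // Initial estimate y0 ~= 1/sqrt(x). Hardware rsqrtss would give a
    // low-precision estimate; we use the exact value to focus on the formula.
    let y0 = 1.0 / x.sqrt();
    // One Newton-Raphson step, written to match the mul, mul, add, mul, mul
    // sequence above, with the signs folded into the -3.0 and -0.5 constants:
    (x * y0 * y0 - 3.0) * (y0 * -0.5)
}

fn main() {
    let x = 2.0_f32;
    let exact = 1.0 / x.sqrt();
    assert!((refined_rsqrt(x) - exact).abs() < 1e-6);
}
```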

@tczajka
Copy link

tczajka commented May 16, 2025

I don't think "inexact" makes sense. Because to me, I'd expect the "inexact" ones to be the ones that are allowed to give less precision from inverse square roots and such.

Is there a qualitative difference between losing precision from approximate inverse square root algorithms and losing precision from applying various algebraic manipulations designed for infinite precision numbers to finite precision numbers? Can the distinction be defined?

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented May 16, 2025

I don’t know of a fundamental reason why we couldn’t include afn in the “algebraic” operations (they’re safe in the Rust sense at least, but we may no longer want to call them “algebraic” then). It’s not applicable to the current basic set of operations (add, sub, mul, div) but we could have algebraic_ or inexact_ variants of sqrt and other functions if there’s sufficient motivation. The main concern is quality of implementation: do the optimizations LLVM actually makes based on afn have significant benefit and little downside in some common cases, so that it’s worth exposing this to users who may benefit from it?

@selendym
Copy link

Given the previous arguments, an intentionally opaque name would probably be best, since a full description in the name is unlikely to work. An opaque name would force users to read the docs and then make an informed decision about whether to use these or not.

I'm personally fine with the current naming, since it's quite ambiguous and thus opaque (which is good).

But for the sake of some bikeshedding, how about fuzzy_*? It's short, opaque, and implies that things might not be exactly as the code would describe. When people reach for these, they probably just need a result that is correct enough without caring about the exact sequence of operations - fuzzy, indeed.

@jmillikin
Copy link
Contributor

jmillikin commented May 31, 2025

Adding another datapoint to the naming discussion: WebAssembly refers to such operations as "relaxed", reflected in the core::arch intrinsics such as core::arch::wasm32::f32x4_relaxed_madd. Their Relaxed SIMD proposal goes into more detail.


I have a question about the combination of non-deterministic results and const. Consider this snippet:

const fn fadd_a(x: f32, y: f32, z: f32) -> f32 {
   x.algebraic_add(y).algebraic_add(z)
}

const fn fadd_b(x: f32, y: f32, z: f32) -> f32 {
   x.algebraic_add(y).algebraic_add(z)
}

const fn fadd_c(x: f32, y: f32, z: f32) -> f32 {
   x.algebraic_add(z).algebraic_add(y)
}

const X: f32 = ... ;
const Y: f32 = ... ;
const Z: f32 = ... ;

const A: f32 = fadd_a(X, Y, Z);
const B: f32 = fadd_b(X, Y, Z);
const C: f32 = fadd_c(X, Y, Z);

const A_2: f32 = fadd_a(X, Y, Z);

Are A and B allowed to differ? The implementations of fadd_a and fadd_b are identical. The runtime behavior might be different (the compiler is allowed to emit the adds in either order), but does that extend to const?

Are A and A_2 allowed to differ? In other words, is the behavior in a const context required to be consistent for the same function, as if it had been emitted to some deterministic-but-unspecified instruction sequence?

If A and C are the same for a given binary, is that guaranteed to be true for future compilations of the same source code with the same version of rustc? If yes, does that guarantee hold regardless of which architecture is running rustc?

I'm pretty nervous about the prospect of non-deterministic const behavior generating non-reproducible binaries.
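For contrast with the questions above: under today's default (non-algebraic) semantics, repeated evaluation of the same expression is bitwise stable, while a different operand grouping is a different computation and can legitimately produce a different value. A quick stable-Rust demonstration (the function names mirror the snippet above; the open question is whether `algebraic_*` lets the compiler pick either grouping for any of them):

```rust
fn fadd_a(x: f32, y: f32, z: f32) -> f32 {
    (x + y) + z
}

fn fadd_c(x: f32, y: f32, z: f32) -> f32 {
    (x + z) + y
}

fn main() {
    let (x, y, z) = (1.0e30_f32, -1.0e30, 1.0);
    // Default IEEE 754 semantics: the same expression always yields the same bits.
    assert_eq!(fadd_a(x, y, z).to_bits(), fadd_a(x, y, z).to_bits());
    // But a different grouping gives a different result:
    assert_eq!(fadd_a(x, y, z), 1.0); // (1e30 + -1e30) + 1.0
    assert_eq!(fadd_c(x, y, z), 0.0); // (1e30 + 1.0) + -1e30; the 1.0 is absorbed
}
```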

@hanna-kruppe
Copy link
Contributor

Wasm relaxed SIMD instructions are not really the same as what we’re discussing here because they only allow very bounded non-determinism of individual results. For example, relaxed FMA allows small differences in how each FMA is evaluated, but it doesn’t allow any rewrites of multiple adjacent FMAs that can result in arbitrarily large differences (as discussed above). Other relaxed operations are even more well-behaved, e.g., many nontrivial uses of relaxed swizzles are fully deterministic because only out-of-bounds lane indices have implementation-defined behavior.

@bjoernager
Copy link
Contributor

bjoernager commented May 31, 2025

Are A and B allowed to differ?

I would say so. The compiler is allowed to optimise the two expressions differently, for whatever reason it likes. An optimisation in this case would be allowed to change the result. See also the next answer.

Are A and A_2 allowed to differ?

I believe so. The methods do not necessarily have to lower to some fixed, deterministic instruction sequence (e.g. a function could compile to hardware-accelerated, non-deterministic algebraic instructions). Furthermore, the compiler is expected to interpret constant expressions as if running on the target, so portable determinism is not possible in this case at all.

If A and C are the same for a given binary, is that guaranteed to be true for future compilations of the same source code with the same version of rustc?

No. I don't believe we have any guarantees at all about the exact representation of the control flow beyond what is specified via inline assembly and a few attributes.

Note that in any case, as the docs mention, these functions may be non-deterministic without any apparent reason, meaning that there pedantically speaking really aren't any guarantees about the precision other than it will be "mostly good enough."

@programmerjake
Copy link
Member

programmerjake commented May 31, 2025

I'd expect rustc to be deterministic in practice, so if you compile the exact same source (this includes the code generated by proc-macros and build scripts) using the exact same build options using the exact same compiler starting from the same incremental compilation state (basically the contents of the target directory), rustc will generate the same compiled output.

rustc doesn't intentionally inject randomness (struct layout randomness iirc is deterministic based on the random seed passed on the command line) and iirc goes to great lengths to ensure multithreaded compilation's exact order that operations were run on different cores or scheduling choices by the os doesn't affect the output.

There may be bugs that inject randomness, though, e.g. depending on pointer addresses from the OS (which likes the pointers returned by mmap and friends, and the addresses code is loaded at, to be random for security), or depending on the iteration order of a hashmap using std's RandomState.

However, it can be non-deterministic in the sense that if any of the source changes in any way (including changes to unrelated code), or if you change the compiler, the build options, or the incremental compilation state in any way, then it may or may not generate different output.

@jmillikin
Copy link
Contributor

jmillikin commented May 31, 2025

Sacrificing the determinism of const seems like a big deal, because if two executions of the same const fn in the same compiler can produce different object code then rustc becomes incompatible with projects such as https://reproducible-builds.org/ and checksum-based build systems such as Bazel.

Short-term, this means that algebraic_* should not be stabilized as const fn.

Medium-term, the const definition of each algebraic_* could be made identical to the non-algebraic version:

* It is allowed for `x.algebraic_add(y).algebraic_add(z)` to be compiled to the same instructions as `(x + y) + z`.
* A floating-point value produced by a `const fn` is allowed to differ between CTFE and runtime.
* Together, those rules imply it's valid for `const fn` to produce deterministic output (within the limits of IEEE 754's determinism) even if the same function gets auto-vectorized or rearranged by LLVM.

Long-term it would be nice if all behavior of rustc_apfloat could be specified, including NaNs and such, but that's a big field to plough. And it's arguably not really important for algebraic_* to be const, assuming folks aren't running complex numeric simulations in their const definitions.

@programmerjake
Copy link
Member

Sacrificing the determinism of const seems like a big deal, because if two executions of the same const fn in the same compiler can produce different object code then rustc becomes incompatible with projects such as https://reproducible-builds.org/ and checksum-based build systems such as Bazel.

the point of #136469 (comment) is that it should be suitable for reproducible builds (though rustc may have bugs as mentioned), though it can still change for reasons you wouldn't expect e.g. changes to a different unrelated function in the same file or incremental compilation state.

@jmillikin
Copy link
Contributor

Sacrificing the determinism of const seems like a big deal, because if two executions of the same const fn in the same compiler can produce different object code then rustc becomes incompatible with projects such as https://reproducible-builds.org/ and checksum-based build systems such as Bazel.

the point of #136469 (comment) is that it should be suitable for reproducible builds (though rustc may have bugs as mentioned), though it can still change for reasons you wouldn't expect e.g. changes to a different unrelated function in the same file or incremental compilation state.

If the compiler flags or source file change then it's expected and normal for the output to change. I'm not worried about rustc producing different output if given different inputs.

If const fn is allowed to be non-deterministic, then two invocations of the same rustc binary on the same input sources with the same flags could produce different object code.

I think that should not be allowed, especially not quietly tacked on to an otherwise unrelated codegen optimization hint.

@programmerjake
Copy link
Member

I expect the algebraic functions have this already-approved style of non-determinism.

@jmillikin
Copy link
Contributor

jmillikin commented May 31, 2025

Yes, I am aware of the current behavior. My point is that the semantics proposed by @bjoernager for const fn algebraic_* are different from the current guarantees around const floating-point values because they remove the guarantee that the compile-time interpreter is deterministic.

edit: To be more concrete, given this code (adapted from #136469 (comment)):

This effectively runs `for _ in 0..100000000 { ... }` at compile time:
const fn compute_algebraic(count: i32, delta: f32) -> f32 {
    let mut x = 0.0f32;
    let mut ii = 0;
    while ii < count {
        x = x.algebraic_add(delta);
        ii += 1;
    }
    x
}

const fn compute_plain(count: i32, delta: f32) -> f32 {
    let mut x = 0.0f32;
    let mut ii = 0;
    while ii < count {
        x += delta;
        ii += 1;
    }
    x
}

const RESULT_ALGEBRAIC: f32 = compute_algebraic(100000000, 0.00000001);
const RESULT_PLAIN: f32 = compute_plain(100000000, 0.00000001);

fn main() {
    println!("RESULT_ALGEBRAIC: {:?}", RESULT_ALGEBRAIC);
    println!("RESULT_PLAIN: {:?}", RESULT_PLAIN);
}

Then use the same source code with the same compiler to produce two binaries:

$ rustc -o loop1 loop.rs
$ rustc -o loop2 loop.rs

If const fn algebraic_* is allowed to be non-deterministic during compilation then it would not be considered a rustc bug if the binaries loop1 and loop2 contained different content and wrote different output when executed.

@RalfJung
Copy link
Member

There's nothing new wrt non-determinism for algebraic operations, so there's no specific concern for const here. Even "normal" float operations are already non-deterministic when they produce a NaN, and we have those in const today. So any concern you may have about reproducibility already applies today; but for the reasons others explained above, there is no concern with reproducibility here. Algebraic operations are "more" non-deterministic, but that's merely a quantitative difference and does not change anything fundamental.
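The existing NaN non-determinism is observable on stable today: which NaN bit pattern an operation produces is not specified, only that the result is some NaN. For example:

```rust
fn main() {
    let nan = 0.0_f32 / 0.0;
    // Guaranteed: the result is a NaN.
    assert!(nan.is_nan());
    // Not guaranteed: the exact bit pattern (sign and payload bits may vary
    // by platform and optimization), so we only print it rather than assert on it.
    println!("{:#010x}", nan.to_bits());
}
```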

@bjoernager
Copy link
Contributor

My point is that the semantics proposed by bjoernager for const fn algebraic_* are different from the current guarantees around const floating-point values because they remove the guarantee that the compile-time interpreter is deterministic.

I was simply trying to deduce what the current semantics roughly were according to the reference book:

const-eval.const-fn.const-context:

When called from a const context, the function is interpreted by the compiler at compile time. The interpretation happens in the environment of the compilation target and not the host.

@RalfJung
Copy link
Member

because they remove the guarantee that the compile-time interpreter is deterministic.

They don't, though. They can't. The docs of a library function cannot and do not alter the fundamental properties of core language components. (In fact, the compile-time interpreter being non-deterministic would imply that Rust is unsound.)

The operation is non-deterministic (both existing float ops, and the algebraic ops, in fundamentally exactly the same way). The compile-time interpreter implements the operation. The compile-time interpreter is deterministic. All of these things are true at the same time. All it means is that we make no guarantee about which deterministic refinement of the non-deterministic Abstract Machine the compile-time interpreter implements, and we reserve the right to change that choice in the future.

@jmillikin
Copy link
Contributor

@RalfJung it sounds like your answers to #136469 (comment) would be different from the reply posted earlier.

I'd be happy to hear that (for example) A and A_2 are guaranteed to be equal, or that if A.to_bits() == C.to_bits() in one compilation then the same is true for other compilations with the same compiler and sources.

They don't, though. They can't. The docs of a library function cannot and do not alter the fundamental properties of core language components. (In fact, the compile-time interpreter being non-deterministic would imply that Rust is unsound.)

I'm less worried about whatever implementation of CTFE currently exists, and more about what guarantees are provided for future versions. If the behavior of a const fn is explicitly documented to be non-deterministic even within the same rustc process (!) then that has negative downstream effects on the build ecosystem.

The operation is non-deterministic (both existing float ops, and the algebraic ops, in fundamentally exactly the same way).

Floating-point addition with a specified sequence of operations and with NaNs excluded (through analysis, or just `if isnan { panic }`) is deterministic. The issue of NaN bit patterns is well-known and small enough in scope that it doesn't really matter for the places where floating-point math happens in a const context.

The algebraic_* functions allow the compiler to change the sequence of operations, which means the result is non-deterministic but potentially can be computed much faster. That's excellent for runtime, but in a const context it means the interpreter is allowed to evaluate a.algebraic_add(b).algebraic_add(c) as either (a + b) + c or a + (b + c) even if those expressions produce different results.

The compile-time interpreter implements the operation. The compile-time interpreter is deterministic. All of these things are true at the same time. All it means is that we make no guarantee about which deterministic refinement of the non-deterministic Abstract Machine the compile-time interpreter implements, and we reserve the right to change that choice in the future.

I understand that is the situation now, my worries are about what the modified guarantees would allow the compiler to do in the future should const fn algebraic_* be stabilized with the currently described semantics.

@RalfJung
Copy link
Member

RalfJung commented Jun 1, 2025

My main point is that your questions are off-topic here. We already have non-determinism in const fn and we have a sound way of dealing with it that does not jeopardize reproducibility. Feel free to open a thread on IRLO or Zulip if you want to understand better how that works. There's nothing in the algebraic operations that would cause any new issues here.

Are A and B allowed to differ?

yes.

Are A and A_2 allowed to differ?

yes

If A and C are the same for a given binary, is that guaranteed to be true for future compilations of the same source code with the same version of rustc?

The entire input has to be the same: source code, dependencies, rustc version, compiler flags, target triple, compiler-relevant environment variables, everything. But assuming that, then yes: const-eval is deterministic, as I and others said above.

If yes, does that guarantee hold regardless of which architecture is running rustc?

yes (assuming an explicit --target to make the target triple host-independent)

I understand that is the situation now, my worries are about what the modified guarantees would allow the compiler to do in the future should const fn algebraic_* be stabilized with the currently described semantics.

The semantics are described in the same terms as e.g. the general float semantics at https://doc.rust-lang.org/nightly/std/primitive.f32.html: we just say that the operations are non-deterministic. How non-determinism behaves in const is largely an implementation detail. If you think that should be better documented somewhere, please file a new issue, but note that this documentation would not live with the individual const fn, it would be somewhere in the Reference as it is a general point about all const-eval, not specific to any particular operation.

And now please let's stop derailing this thread with questions we have already resolved in previous const-eval stabilizations, questions which are not even specific to the new operations at all but already affect stable rustc today. A tracking issue is the wrong place to catch up with the status quo of how Rust works before the new feature is even added. Please use Zulip or IRLO for that.

@RalfJung
Copy link
Member

RalfJung commented Jun 1, 2025

To answer to the other part of your post, which is thankfully on-topic:

Adding another datapoint to the naming discussion: WebAssembly refers to such operations as "relaxed", reflected in the core::arch intrinsics such as core::arch::wasm32::f32x4_relaxed_madd. Their Relaxed SIMD proposal goes into more detail.

My understanding is that wasm relaxed SIMD operations are required to behave the same way when the same op is executed multiple times in the same program. So, there's a single non-deterministic choice at program startup, but then behavior must be consistent.

algebraic operations cannot and do not provide any guarantee like that.

@traviscross
Copy link
Contributor

Setting aside its use in WASM and different semantics there, and considering it just on first principles, relaxed_ isn't the worst name for this. It's shorter than algebraic_, and the operations are opting in to relaxed ordering, among other things.

@RalfJung
Copy link
Member

RalfJung commented Jun 1, 2025

operations are opting in to relaxed ordering

Not to be confused with atomic::Ordering::Relaxed, though...

@traviscross
Copy link
Contributor

Not to be confused with atomic::Ordering::Relaxed, though...

Not to be confused, but it is spiritually similar. We're giving the compiler more room to reorder things.
