Expose algebraic floating point intrinsics #136457

Open

calder wants to merge 1 commit into master

Conversation

calder

@calder calder commented Feb 3, 2025

Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

C++: 10us ✅

With Clang 18.1.3 and -O2 -march=haswell:

C++ source (assembly screenshot omitted):
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust source (assembly screenshot omitted):
// Nightly-only: requires #![feature(core_intrinsics)].
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}

Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust source (assembly screenshot omitted):
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

Proposed Change

Add core::intrinsics::f*_algebraic wrappers to f16, f32, f64, and f128 gated on a new float_algebraic feature.
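For illustration, a minimal sketch (not the actual PR implementation) of how such a wrapper can be expressed in terms of the existing nightly intrinsic; in the PR these become inherent methods on the float types themselves, and the `_demo` trait below exists only so the sketch compiles on its own:

```
#![feature(core_intrinsics)]
#![allow(internal_features)]

use std::intrinsics;

// Stand-in for the proposed inherent method (`f32::algebraic_add` after the
// rename discussed below in this thread).
trait AlgebraicDemo {
    fn algebraic_add_demo(self, rhs: Self) -> Self;
}

impl AlgebraicDemo for f32 {
    #[inline]
    fn algebraic_add_demo(self, rhs: Self) -> Self {
        // Lets the optimizer reassociate and contract this addition with
        // neighbouring algebraic operations (e.g. fusing a multiply-add into an FMA).
        intrinsics::fadd_algebraic(self, rhs)
    }
}

fn main() {
    let x: f32 = 2.0;
    // The result may differ from `x + 3.0` by rounding error, hence the approximate check.
    assert!((x.algebraic_add_demo(3.0) - 5.0).abs() < 1e-3);
}
```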

Alternatives Considered

#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

References

  • rust-lang#21690
  • rust-lang/libs-team#532
  • rust-lang#136469
  • https://github.com/calder/dot-bench
  • https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux

@rustbot
Collaborator

rustbot commented Feb 3, 2025

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @thomcc (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) and T-libs (Relevant to the library team, which will review and decide on the PR/issue) labels on Feb 3, 2025
@rustbot
Collaborator

rustbot commented Feb 3, 2025

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter gets adapted for the changes, if necessary.

cc @rust-lang/miri, @rust-lang/wg-const-eval

@calder
Author

calder commented Feb 3, 2025

Thanks for the speedy review @saethlin! I had a couple questions:

  • Should we add these methods to f16 and f128 as well?
  • Are intrinsic wrappers tested explicitly? I see tests for the intrinsics themselves but not for other wrappers.

@RalfJung
Member

RalfJung commented Feb 3, 2025 via email

@RalfJung
Member

RalfJung commented Feb 3, 2025 via email

@calder
Author

calder commented Feb 3, 2025

@RalfJung: Thanks for the quick response!

@RalfJung
Member

RalfJung commented Feb 3, 2025

https://doc.rust-lang.org/nightly/std/primitive.f32.html seems like a good place; that's where we already document everything special around NaNs. So a new section on algebraic operations there would probably be a good fit.

@calder
Author

calder commented Feb 4, 2025

Changes

  • Added f16 and f128 methods as requested by @tgross35 here.
  • Renamed to algebraic_*() to match checked_*() as requested by @tgross35 here.
  • Added central documentation with verbiage suggested by @RalfJung here; a rough sketch of such a section is included below.
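For illustration only, a rough sketch of the shape such a documentation section could take, written here as an ordinary comment; this is an assumed wording, not the text actually added in the PR:

```
// Sketch of an "Algebraic operators" section for the float primitive docs
// (illustrative wording only):
//
// # Algebraic operators
//
// The `algebraic_add`, `algebraic_sub`, `algebraic_mul`, `algebraic_div`, and
// `algebraic_rem` methods compute the same result as the corresponding
// operators, except that the compiler is allowed to treat the operations as
// algebraically exact: it may reassociate and contract chains of such
// operations (for example, fusing a multiply and an add into a single FMA).
// Consequently, the exact floating point result may differ between targets,
// optimization levels, and compiler versions, and code must not rely on any
// particular rounding behaviour beyond approximate accuracy.
```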

calder added a commit to calder/rust that referenced this pull request Feb 4, 2025
@tgross35
Contributor

tgross35 commented Feb 4, 2025

I don't see an existing codegen test for the intrinsics, so these should probably get one. https://github.com/rust-lang/rust/tree/3f33b30e19b7597a3acbca19e46d9e308865a0fe/tests/codegen/float would be a reasonable home.

For reference, this would just be a file containing functions like this for each of the new methods, in order to verify the flags that we expect are getting set:

// CHECK-LABEL: float @f32_algebraic_add(
#[no_mangle]
pub fn f32_algebraic_add(a: f32, b: f32) -> f32 {
    // CHECK: fadd reassoc nsz arcp contract float %a, %b
    a.algebraic_add(b)
}

@rustbot
Collaborator

rustbot commented Feb 5, 2025

The Miri subtree was changed

cc @rust-lang/miri

// CHECK-LABEL: fp128 @f128_algebraic_add(
#[no_mangle]
pub fn f128_algebraic_add(a: f128, b: f128) -> f128 {
    // CHECK: fadd reassoc nsz arcp contract fp128 {{(%a, %b)|(%b, %a)}}
    a.algebraic_add(b)
}

Author

The addition and multiplication cases both end up as %b, %a rather than %a, %b which surprised me but isn't incorrect. I opted to allow either in case behavior changes in the future.

@tgross35
Contributor

tgross35 commented Feb 5, 2025

This looks pretty reasonable to me but all the public functions should get some examples. I think it may also be good to give a small demo of how this may work at the end of the new "Algebraic operators" section, e.g.:

For example, the below:

```
x = x.algebraic_add(a.algebraic_mul(b));
```

May be rewritten as either of the following:

```
x = x + (a * b); // As written
x = (a * b) + x; // Reordered to allow using a single `fma`
```

@calder
Author

calder commented Feb 5, 2025

Per-function examples

Did you have specific examples in mind? I'm struggling to think of ones that aren't repetitive / low signal-to-noise (i.e. assert this algebraic add is approximately equal to a normal add). Should we instead link to the central documentation and test approximate equality of each of the ops in normal tests like this so they don't clutter the documentation?

Central example

Made a couple small edits for brevity. How's this look?


@bors
Contributor

bors commented Feb 26, 2025

💔 Test failed - checks-actions

@tgross35
Contributor

@calder would you be able to update this? The thresholds likely need to be increased

@calder
Author

calder commented Mar 10, 2025

@tgross35: Fixed tolerances, sorry for the delay!

@tgross35
Contributor

@bors try

It looks like this PR includes an accidental backtrace downgrade, not sure why rustbot didn't flag that but it should be dropped.

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Expose algebraic floating point intrinsics

@bors
Contributor

bors commented Mar 11, 2025

⌛ Trying commit acc3d85 with merge 55ea7cb...


@bors
Contributor

bors commented Mar 11, 2025

💔 Test failed - checks-actions

calder force-pushed the master branch 2 times, most recently from b703ec1 to b3f4720, on March 11, 2025 at 18:47
@calder
Author

calder commented Mar 11, 2025

@bors try

Tweaked tolerances some more. Is there an easy way to run these Miri tests locally?

@bors
Contributor

bors commented Mar 11, 2025

@calder: 🔑 Insufficient privileges: not in try users

@tgross35
Contributor

@bors try

I think you should be able to run `./x miri library`.

@bors
Contributor

bors commented Mar 11, 2025

⌛ Trying commit 0151a01 with merge dabe80a...

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Expose algebraic floating point intrinsics


@bors
Contributor

bors commented Mar 11, 2025

💔 Test failed - checks-actions

@tgross35
Contributor

@bors delegate+ for @bors try, just don't merge it (I'll take a final look once things pass)

@tgross35
Contributor

@bors try

It's a bit better to use --keep-base when you rebase locally to squash; that way GH shows a diff of only the relevant changes. Unfortunately GH has the limitation that if the push includes an actual rebase onto the latest base (sometimes unavoidable, of course), then the "compare" link is pretty unhelpful, e.g. https://github.com/rust-lang/rust/compare/53944e53dca5f93f1559b3914470c61dda2c350c..dfd4b6cddf5b2a3936492fe5c7cb97198fec3761

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 12, 2025
Expose algebraic floating point intrinsics

@bors
Contributor

bors commented Mar 12, 2025

⌛ Trying commit dfd4b6c with merge 2d8ed6b...

Comment on lines +919 to +929
#[test]
fn test_algebraic() {
    let a: f32 = 123.0;
    let b: f32 = 456.0;

    assert_approx_eq!(a.algebraic_add(b), a + b, 1e-2);
    assert_approx_eq!(a.algebraic_sub(b), a - b, 1e-2);
    assert_approx_eq!(a.algebraic_mul(b), a * b, 1e-1);
    assert_approx_eq!(a.algebraic_div(b), a / b, 1e-5);
    assert_approx_eq!(a.algebraic_rem(b), a % b, 1e-2);
}
Contributor

What is actually going on that causes these tests to fail? a + b should be 579.0, but the result is 579.0011. Afaict none of the algebraic effects can come into play here, so shouldn't the result be exact?

@bors
Contributor

bors commented Mar 12, 2025

☀️ Try build successful - checks-actions
Build commit: 2d8ed6b (2d8ed6b4bf07905ce9e8b131b366708f273717c0)

@leonardo-m

leonardo-m commented Mar 13, 2025

By the way, in my codebase I use something minimal like this:

// Nightly-only: the `f*_algebraic` intrinsics require `#![feature(core_intrinsics)]`.
#![feature(core_intrinsics)]
#![allow(internal_features)]

use std::intrinsics::{fadd_algebraic, fdiv_algebraic, fmul_algebraic, frem_algebraic, fsub_algebraic};
use std::ops::{Add, Div, Mul, Rem, Sub};

#[derive(Copy, Clone, Default)]
#[repr(transparent)]
struct Af64(pub f64);

impl Af64 {
    const ZERO: Af64 = Af64(0.0);
    const ONE: Af64 = Af64(1.0);
    fn from(x: i32) -> Self { Self(f64::from(x)) }
}

impl Add for Af64 {
    type Output = Self;

    #[inline]
    fn add(self, other: Self) -> Self {
        Self(fadd_algebraic(self.0, other.0))
    }
}

impl Sub for Af64 {
    type Output = Self;

    #[inline]
    fn sub(self, other: Self) -> Self {
        Self(fsub_algebraic(self.0, other.0))
    }
}

impl Mul for Af64 {
    type Output = Self;

    #[inline]
    fn mul(self, other: Self) -> Self {
        Self(fmul_algebraic(self.0, other.0))
    }
}

impl Div for Af64 {
    type Output = Self;

    #[inline]
    fn div(self, other: Self) -> Self {
        Self(fdiv_algebraic(self.0, other.0))
    }
}

impl Rem for Af64 {
    type Output = Self;

    #[inline]
    fn rem(self, other: Self) -> Self {
        Self(frem_algebraic(self.0, other.0))
    }
}
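For instance, such a newtype lets a dot product be written with ordinary operators while still allowing vectorization; a sketch reusing the `Af64` wrapper and `Af64::ZERO` defined above:

```
fn dot(a: &[f64], b: &[f64]) -> f64 {
    // Because Add and Mul go through the algebraic intrinsics, LLVM may
    // reassociate this reduction and contract the multiply-adds into FMAs.
    let mut sum = Af64::ZERO;
    for i in 0..a.len().min(b.len()) {
        sum = sum + Af64(a[i]) * Af64(b[i]);
    }
    sum.0
}
```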
