Expose algebraic floating point intrinsics #136457
base: master
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @thomcc (or someone else) some time within the next two weeks. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (`S-waiting-on-review` and `S-waiting-on-author`) stays updated, invoking these commands when appropriate.
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter gets adapted for the changes, if necessary.

cc @rust-lang/miri, @rust-lang/wg-const-eval
Thanks for the speedy review @saethlin! I had a couple questions:
Thanks for the PR!
Going through the usual process, the first step would be a t-libs-api ACP to gauge the team's thinking on whether and how this should be exposed. This PR cannot land before a corresponding FCP has been accepted.
In terms of tests, usually we have doc tests. Also, given that the semantics are far from obvious (these operations are nondeterministic!), they need to be documented more carefully - probably in some central location, which is then referenced from everywhere.
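To make the nondeterminism point concrete, here is a stable-Rust illustration of why reassociation is observable (plain `+` is used here, standing in for the freedom that `fadd_algebraic` grants the compiler):

```rust
fn main() {
    // f32 addition is not associative, so a compiler that is allowed
    // to reassociate (as the algebraic intrinsics permit) can change
    // the numeric result.
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    let left = (a + b) + c; // 0.0 + 1.0
    let right = a + (b + c); // b + c rounds back to -1.0e8
    assert_eq!(left, 1.0);
    assert_eq!(right, 0.0);
    println!("left = {left}, right = {right}");
}
```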
@RalfJung: Thanks for the quick response!
https://doc.rust-lang.org/nightly/std/primitive.f32.html seems like a good place, that's where we already document everything special around NaNs. So a new section on algebraic operations there would probably be a good fit.
I don't see an existing codegen test for the intrinsics so these should probably get one. https://github.com/rust-lang/rust/tree/3f33b30e19b7597a3acbca19e46d9e308865a0fe/tests/codegen/float would be a reasonable home. For reference, this would just be a file containing functions like this for each of the new methods, in order to verify the flags that we expect are getting set:

```rust
// CHECK-LABEL: float @f32_algebraic_add(
#[no_mangle]
pub fn f32_algebraic_add(a: f32, b: f32) -> f32 {
    // CHECK: fadd reassoc nsz arcp contract float %a, %b
    a.algebraic_add(b)
}
```
The Miri subtree was changed cc @rust-lang/miri
```rust
// CHECK-LABEL: fp128 @f128_algebraic_add(
#[no_mangle]
pub fn f128_algebraic_add(a: f128, b: f128) -> f128 {
    // CHECK: fadd reassoc nsz arcp contract fp128 {{(%a, %b)|(%b, %a)}}
    a.algebraic_add(b)
}
```
The addition and multiplication cases both end up as `%b, %a` rather than `%a, %b`, which surprised me but isn't incorrect. I opted to allow either in case behavior changes in the future.
This looks pretty reasonable to me but all the public functions should get some examples. I think it may also be good to give a small demo of how this may work at the end of the new "Algebraic operators" section. For example, the following:
```
x = x.algebraic_add(a.algebraic_mul(b));
```
May be rewritten as either of the following:
```
x = x + (a * b); // As written
x = (a * b) + x; // Reordered to allow using a single `fma`
```
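To make the `fma` point concrete, contraction is observable on stable Rust today: `f32::mul_add` rounds once where `*` followed by `+` rounds twice (this demo uses the stable `mul_add`, not the new algebraic methods):

```rust
fn main() {
    let a = 1.0_f32 + f32::EPSILON;       // 1 + 2^-23
    let m = 1.0_f32 + 2.0 * f32::EPSILON; // exactly representable
    // a * a = 1 + 2^-22 + 2^-46; the 2^-46 term is lost when the
    // product is rounded before the subtraction...
    let separate = a * a - m;
    // ...but survives when multiply and add are fused (one rounding).
    let fused = a.mul_add(a, -m);
    assert_eq!(separate, 0.0);
    assert_eq!(fused, f32::EPSILON * f32::EPSILON);
}
```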
**Per-function examples**: Did you have specific examples in mind? I'm struggling to think of ones that aren't repetitive / low signal-to-noise (i.e. assert this algebraic add is approximately equal to a normal add). Should we instead link to the central documentation and test approximate equality of each of the ops in normal tests like this so they don't clutter the documentation?

**Central example**: Made a couple small edits for brevity. How's this look?
💔 Test failed - checks-actions
@calder would you be able to update this? The thresholds likely need to be increased.
@tgross35: Fixed tolerances, sorry for the delay!
@bors try

It looks like this PR includes an accidental backtrace downgrade; not sure why rustbot didn't flag that, but it should be dropped.
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:

```cc
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

(assembly: https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68)

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
```

(assembly: https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9)

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

(assembly: https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca)

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
💔 Test failed - checks-actions
Force-pushed from b703ec1 to b3f4720 (compare).
@bors try

Tweaked tolerances some more. Is there an easy way to run these Miri tests locally?
@calder: 🔑 Insufficient privileges: not in try users
@bors try

I think you should be able to run
💔 Test failed - checks-actions
@bors delegate+ for
@bors try

It's a bit better to use
```rust
#[test]
fn test_algebraic() {
    let a: f32 = 123.0;
    let b: f32 = 456.0;

    assert_approx_eq!(a.algebraic_add(b), a + b, 1e-2);
    assert_approx_eq!(a.algebraic_sub(b), a - b, 1e-2);
    assert_approx_eq!(a.algebraic_mul(b), a * b, 1e-1);
    assert_approx_eq!(a.algebraic_div(b), a / b, 1e-5);
    assert_approx_eq!(a.algebraic_rem(b), a % b, 1e-2);
}
```
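For context on the threshold churn: `assert_approx_eq!` checks an absolute difference, which makes the tolerances sensitive to the magnitude of the operands. A relative check (a hypothetical helper, not the macro used above) sidesteps that:

```rust
// Hypothetical relative-tolerance comparison; the `assert_approx_eq!`
// macro in the test above uses an absolute threshold instead.
fn approx_eq(x: f64, y: f64, rel_tol: f64) -> bool {
    (x - y).abs() <= rel_tol * x.abs().max(y.abs()).max(1.0)
}

fn main() {
    // 579.0011 is within a 1e-5 relative tolerance of 579.0...
    assert!(approx_eq(579.0011, 579.0, 1e-5));
    // ...but 579.1 is not.
    assert!(!approx_eq(579.1, 579.0, 1e-5));
}
```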
What is actually going on that causes these tests to fail? `a + b` should be 579.0, but the result is 579.0011. Afaict none of the algebraic effects can come into play here, so shouldn't the result be exact?
☀️ Try build successful - checks-actions
By the way, in my codebase I use something minimal like this:

```rust
// Requires nightly: #![feature(core_intrinsics)]
use std::intrinsics::{
    fadd_algebraic, fdiv_algebraic, fmul_algebraic, frem_algebraic, fsub_algebraic,
};
use std::ops::{Add, Div, Mul, Rem, Sub};

#[derive(Copy, Clone, Default)]
#[repr(transparent)]
struct Af64(pub f64);

impl Af64 {
    const ZERO: Af64 = Af64(0.0);
    const ONE: Af64 = Af64(1.0);

    fn from(x: i32) -> Self {
        Self(f64::from(x))
    }
}

impl Add for Af64 {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        Self(fadd_algebraic(self.0, other.0))
    }
}

impl Sub for Af64 {
    type Output = Self;
    #[inline]
    fn sub(self, other: Self) -> Self {
        Self(fsub_algebraic(self.0, other.0))
    }
}

impl Mul for Af64 {
    type Output = Self;
    #[inline]
    fn mul(self, other: Self) -> Self {
        Self(fmul_algebraic(self.0, other.0))
    }
}

impl Div for Af64 {
    type Output = Self;
    #[inline]
    fn div(self, other: Self) -> Self {
        Self(fdiv_algebraic(self.0, other.0))
    }
}

impl Rem for Af64 {
    type Output = Self;
    #[inline]
    fn rem(self, other: Self) -> Self {
        Self(frem_algebraic(self.0, other.0))
    }
}
```
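The appeal of a wrapper like that is that numeric code keeps its natural operator syntax. A stable-toolchain sketch of how it composes (using a hypothetical `W` newtype with plain ops standing in for the nightly algebraic intrinsics, so it compiles without `#![feature(core_intrinsics)]`):

```rust
use std::ops::Add;

// `W` is a stand-in name; a real wrapper would route these ops
// through the algebraic intrinsics instead of plain arithmetic.
#[derive(Copy, Clone, Default)]
#[repr(transparent)]
struct W(f64);

impl Add for W {
    type Output = Self;
    fn add(self, other: Self) -> Self {
        Self(self.0 + other.0)
    }
}

// With the operator impl in place, a reduction reads naturally
// while every `+` goes through the wrapper type.
fn dot(a: &[W], b: &[W]) -> W {
    a.iter().zip(b).fold(W(0.0), |s, (&x, &y)| s + W(x.0 * y.0))
}

fn main() {
    let a = [W(1.0), W(2.0), W(3.0)];
    let b = [W(4.0), W(5.0), W(6.0)];
    assert_eq!(dot(&a, &b).0, 32.0); // 1*4 + 2*5 + 3*6
}
```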