LLVM is allowed to... be creative... with NANs according to Rust float semantics #134417

Closed
01mf02 opened this issue Dec 17, 2024 · 12 comments
Labels
A-floating-point Area: Floating point numbers and arithmetic C-discussion Category: Discussion or questions that doesn't represent real issues.

@01mf02

01mf02 commented Dec 17, 2024

We have recently observed some very strange behaviour in float handling that seems to be linked to float parsing (01mf02/jaq#243).

Consider the following program:

fn main() {
    let t: f64 = "2.0".parse().unwrap();
    let z: f64 = "0.0".parse().unwrap();
    std::dbg!(t.to_bits(), z.to_bits());
    std::dbg!((t / z).total_cmp(&(z / z)));

    let t: f64 = 2.0;
    let z: f64 = 0.0;
    std::dbg!(t.to_bits(), z.to_bits());
    std::dbg!((t / z).total_cmp(&(z / z)));
}

This yields:

[src/main.rs:4:5] t.to_bits() = 4611686018427387904
[src/main.rs:4:5] z.to_bits() = 0
[src/main.rs:5:5] (t / z).total_cmp(&(z / z)) = Greater
[src/main.rs:9:5] t.to_bits() = 4611686018427387904
[src/main.rs:9:5] z.to_bits() = 0
[src/main.rs:10:5] (t / z).total_cmp(&(z / z)) = Less

Given that the bits of t and z are the same in both cases, I expected total_cmp to yield the same output; however, one call yields Greater and the other yields Less!

I then refactored the program to an (IMO) equivalent one:

fn test(t: f64, z: f64) {
    std::dbg!(t.to_bits(), z.to_bits());
    std::dbg!((t / z).total_cmp(&(z / z)));
}

fn main() {
    test("2.0".parse().unwrap(), "0.0".parse().unwrap());
    test(2.0, 0.0);
}

Now, the output of total_cmp is the same!

[src/main.rs:2:5] t.to_bits() = 4611686018427387904
[src/main.rs:2:5] z.to_bits() = 0
[src/main.rs:3:5] (t / z).total_cmp(&(z / z)) = Greater
[src/main.rs:2:5] t.to_bits() = 4611686018427387904
[src/main.rs:2:5] z.to_bits() = 0
[src/main.rs:3:5] (t / z).total_cmp(&(z / z)) = Greater

This occurs both with cargo run and cargo run --release.

Could this be a compiler bug?

Meta

rustc --version --verbose:

rustc 1.83.0 (90b35a623 2024-11-26)
binary: rustc
commit-hash: 90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf
commit-date: 2024-11-26
host: x86_64-unknown-linux-gnu
release: 1.83.0
LLVM version: 19.1.1
@01mf02 01mf02 added the C-bug Category: This is a bug. label Dec 17, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 17, 2024
@theemathas
Contributor

I believe this is expected behavior. The documentation for f32 says:

Rust does not currently guarantee that the bit patterns of NaN are preserved over arithmetic operations, and they are not guaranteed to be portable or even fully deterministic! This means that there may be some surprising results upon inspecting the bit patterns, as the same calculations might produce NaNs with different bit patterns.

@Urgau
Member

Urgau commented Dec 17, 2024

They are not equal: both z / z expressions return NaN, but with different signs, and total_cmp takes the sign into account, even for NaNs.
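To illustrate the point about sign, here is a minimal sketch that builds a positive and a negative quiet NaN from explicit bit patterns (so it does not depend on what any division happens to produce) and compares each against +infinity. The constant names and the `compare_to_infinity` helper are made up for this illustration; the two bit patterns are the usual quiet-NaN encodings with the sign bit clear and set.

```rust
use std::cmp::Ordering;

// Quiet NaN bit patterns chosen by hand for this sketch.
const POS_NAN_BITS: u64 = 0x7FF8_0000_0000_0000; // sign bit clear
const NEG_NAN_BITS: u64 = 0xFFF8_0000_0000_0000; // sign bit set

/// Hypothetical helper: compares +infinity against the NaN with the given bits.
fn compare_to_infinity(nan_bits: u64) -> Ordering {
    let nan = f64::from_bits(nan_bits);
    assert!(nan.is_nan());
    f64::INFINITY.total_cmp(&nan)
}

fn main() {
    // In the total order, +infinity sorts below a positive NaN ...
    assert_eq!(compare_to_infinity(POS_NAN_BITS), Ordering::Less);
    // ... but above a negative NaN.
    assert_eq!(compare_to_infinity(NEG_NAN_BITS), Ordering::Greater);
    println!("ok");
}
```

Since 2.0 / 0.0 is +infinity, this is consistent with the Greater and Less results in the original report: the only thing that changes is the sign of the NaN on the right-hand side.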

@jieyouxu jieyouxu added A-floating-point Area: Floating point numbers and arithmetic C-discussion Category: Discussion or questions that doesn't represent real issues. and removed C-bug Category: This is a bug. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 17, 2024
@wader

wader commented Dec 17, 2024

@Urgau I'm curious why parsing versus using a float literal would affect the sign of z / z. How are the two z values different?

@rawler
Contributor

rawler commented Dec 17, 2024

Related:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d388a8acac4b0c15d03d2dc54dabe324

    let a : f64 = "0.0".parse().unwrap();
    let b : f64 = 0.0;
    dbg!(a.is_sign_negative(), b.is_sign_negative());
    
    let x = a/a;
    let y = b/b;
    dbg!(x.is_sign_negative(), y.is_sign_negative());

Output:

[src/main.rs:4:5] a.is_sign_negative() = false
[src/main.rs:4:5] b.is_sign_negative() = false
[src/main.rs:8:5] x.is_sign_negative() = true
[src/main.rs:8:5] y.is_sign_negative() = false

The sign is the same for the parsed and the literal variable, but the two divisions still produce NaNs with different signs.

@01mf02
Author

01mf02 commented Dec 17, 2024

Even if we consider that "0.0".parse().unwrap() is not the same as 0.0, how is it then possible that my refactored program (which should perform the same operations as the original program) yields different output (not only for the bit patterns, but for total_cmp)?

@saethlin
Member

The initial example can be further reduced to

fn main() {
    let z: f64 = std::hint::black_box(0.0);
    dbg!((z / z).to_bits());

    let z: f64 = 0.0;
    dbg!((z / z).to_bits());
}

Which on my machine prints this:

[src/main.rs:3:5] (z / z).to_bits() = 18444492273895866368
[src/main.rs:6:5] (z / z).to_bits() = 9221120237041090560

LLVM is doing some kind of optimization here that we do not have a mechanism to turn off, which is itself disturbing. Clang does the same transformation, which breaks (depending on your definition) some parts of musl's libm that try to do the equivalent of black_box(0.0 / 0.0) to raise a floating-point exception. gcc will actually do that division at runtime, but clang will use constant propagation to compute some NaN bit pattern at compile time, thus not raising an exception.

Quite vexing.
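For readers wondering what the two printed integers mean, the following sketch splits them into the IEEE 754 binary64 fields. The `decode` helper is a hypothetical name introduced here; the two input values are copied from the output above.

```rust
/// Hypothetical helper: splits an f64 bit pattern into
/// (sign, biased exponent, mantissa) fields.
fn decode(bits: u64) -> (u64, u64, u64) {
    let sign = bits >> 63; // 1 bit
    let exponent = (bits >> 52) & 0x7FF; // 11 bits
    let mantissa = bits & ((1u64 << 52) - 1); // 52 bits
    (sign, exponent, mantissa)
}

fn main() {
    // The two values printed in the reduced example above.
    for bits in [18444492273895866368u64, 9221120237041090560u64] {
        let (sign, exponent, mantissa) = decode(bits);
        println!("{bits:#018x}: sign={sign} exponent={exponent:#x} mantissa={mantissa:#x}");
    }
}
```

Both values have an all-ones exponent and only the top mantissa bit set, so both are quiet NaNs with an empty payload; they differ only in the sign bit.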

@workingjubilee workingjubilee changed the title Equal floats yield different results LLVM is allowed to fuck up NANs according to Rust float semantics Dec 17, 2024
@workingjubilee workingjubilee changed the title LLVM is allowed to fuck up NANs according to Rust float semantics LLVM is allowed to... be creative... with NANs according to Rust float semantics Dec 17, 2024
@workingjubilee
Member

The ability to use total_cmp to order NaNs should not be considered as guaranteeing a NaN bitpattern on its creation.

Even if we consider that "0.0".parse().unwrap() is not the same as 0.0, how is it then possible that my refactored program (which should perform the same operations as the original program) yields different output (not only for the bit patterns, but for total_cmp)?

...you did the same operations, and the operations had inconsistent results, so the output was similarly inconsistent?

@wader

wader commented Dec 18, 2024

Could one usability improvement be to give different NaNs different string representations? Right now they all print as "NaN".

@saethlin
Member

saethlin commented Dec 18, 2024

Is there any precedent for doing that in other languages? At a glance, in Python:

>>> float('-nan')
nan

But in any case, I think this whole issue went off in the wrong direction. The root cause here is that jaq's comparison seems to have been written without the knowledge that in Rust, NaN generation is nondeterministic. This nondeterminism has technically been around for a long time due to the LLVM codegen oddity I pointed out, but it's really quite hard to run into cases where it becomes visible, and only rather recently have we documented that NaN generation is nondeterministic.

@wader

wader commented Dec 18, 2024

@saethlin No precedent that I know of, and now that I think about it, different strings could be confusing in some other contexts. And yep, I guess jaq will have to handle this explicitly using is_nan etc.
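One way such an explicit approach could look is sketched below. This is not jaq's actual code: `cmp_nan_insensitive` is a hypothetical helper, and the decision to sort every NaN below all other numbers is an arbitrary choice made for the illustration. The idea is to check is_nan first, so that the sign and payload of a NaN never reach total_cmp.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: a total order in which all NaNs compare equal to
/// each other and sort below every non-NaN value (an arbitrary choice).
fn cmp_nan_insensitive(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Less,
        (false, true) => Ordering::Greater,
        // No NaN involved: total_cmp is deterministic for these inputs.
        (false, false) => a.total_cmp(&b),
    }
}

fn main() {
    // NaNs with hand-picked bit patterns, differing only in sign.
    let pos_nan = f64::from_bits(0x7FF8_0000_0000_0000);
    let neg_nan = f64::from_bits(0xFFF8_0000_0000_0000);

    assert_eq!(cmp_nan_insensitive(pos_nan, neg_nan), Ordering::Equal);
    assert_eq!(cmp_nan_insensitive(f64::INFINITY, pos_nan), Ordering::Greater);
    assert_eq!(cmp_nan_insensitive(f64::INFINITY, neg_nan), Ordering::Greater);
    assert_eq!(cmp_nan_insensitive(1.0, 2.0), Ordering::Less);
    println!("ok");
}
```

Note that total_cmp still distinguishes -0.0 from +0.0, so a caller that wants those to compare equal would need an additional case.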

@jhorstmann
Contributor

Regarding the title, I would rather say it is x86 that is a bit creative, by setting the sign bit of the NaN if none of the inputs to the operation are themselves already NaN. This seems to be documented; for example, search for "QNaN floating-point indefinite" in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (found via Stack Overflow). LLVM is of course allowed to use a different NaN representation when constant folding during optimizations, and users should not rely on a specific bit representation.

@01mf02
Author

01mf02 commented Dec 19, 2024

I think that for me, all is said in the following paragraph:

The non-deterministic choice happens when the operation is executed; i.e., the result of a NaN-producing floating-point operation is a stable bit pattern (looking at these bits multiple times will yield consistent results), but running the same operation twice with the same inputs can produce different results.

This is, to quote @saethlin, greatly disturbing, but at least it's documented. I suppose that we must make do with this.

I'm closing this issue; may it serve as guidance for those who tread the same path of misery as we did.

@01mf02 01mf02 closed this as completed Dec 19, 2024