
Proposal: Make floats non-NaN by default #11234

Open
topolarity opened this issue Mar 19, 2022 · 21 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.

@topolarity (Contributor) commented Mar 19, 2022

Introduction

The idea behind this proposal comes from observing that NaN has some surprising commonalities with null pointers:

  1. It represents an invalid value using a specialized bit sequence
  2. Some functions expect to receive this invalid value, others assume they do not (for optimal performance)

In combination with the arithmetic and comparison behavior of NaN, these similarities lead to a number of footguns in real-life code.

Examples include:

  • sorting algorithms failing on NaN data
  • every NaN colliding in a hash map
  • NaNs propagating virally through streaming outputs
  • invalid image filtering when operating on NaNs
  • NaN values persisting after filtering
  • parsers failing on NaN inputs
  • formatting/display unintentionally exposing NaN to the user

Footguns abound when there is disagreement about whether NaN needs to be handled correctly.
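The comparison semantics behind these footguns can be demonstrated in a few lines of Python (whose float is an IEEE 754 double, so the same rules apply as for f32):

```python
import math

nan = float("nan")

# Every ordered comparison involving NaN is false...
print(nan > 1.0)   # False
print(nan < 1.0)   # False
print(nan == nan)  # False
# ...except !=, which is true. This breaks any code that infers
# "a <= b" from "not (a > b)", e.g. comparison-based sorting.
print(nan != nan)  # True
print(math.isnan(nan))  # True
```

This is exactly the disagreement the proposal is about: functions that never check isnan silently assume no NaN can reach these comparisons.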

Proposal

Option A: Replace f32 with error{NaN}!f32

  • All floating point operations (+-*/%) yield error{NaN}!f32
  • Arithmetic is overloaded on error{NaN}!f32
  • error{NaN}!f32 can be unwrapped with try, catch, and if like any other error union
  • Comparison of f32 yields bool. Comparison of error{NaN}!f32 yields error{NaNOperand}!bool

Other error unions, such as error{Foo}!f32, are not treated specially (no arithmetic, no special layout, etc.).

"NaN-boxing" is to be supported via getNaNPayload and setNaNPayload

Option B: Make comparisons of floats return error{NaNOperand}!bool

This is a minimal change to the language that would force users to explicitly account for NaN in floating point comparisons, which is the central oversight in the bugs mentioned above.

API Impacts

This means that "nan-safe" functions can be given a type that reflects their special handling of NaN. Meanwhile, highly-optimized routines that don't handle NaN correctly can be given a type that reflects their assumptions:

// Returns median, ignoring any NaN values
pub fn median(in: []const error{NaN}!f32) f32 { ... }
// Assumes that inputs do not include NaN
pub fn convolve(a: []const f32, b: []const f32,  out: []f32) { ... }

Example

/// Insertion sort. NaN values are sorted to the end
fn sort_inplace(vec: []error{NaN}!f32) void {
    for (vec) |maybe_key, i| {
        // If maybe_key is NaN, treat it as greater than everything (i.e. don't move it)
        if (maybe_key) |key| {
            var j = i;
            // `error{NaN}!f32` forces us to explicitly handle the NaN case here:
            // if vec[j - 1] is NaN, `catch true` treats it as greater
            while (j > 0 and ((vec[j - 1] > key) catch true)) {
                vec[j] = vec[j - 1];
                j -= 1;
            }
            vec[j] = key;
        } else |_| {}
    }
}

Meanwhile, the code for a non-NaN-safe version of this function would look exactly like it does today.

Supplemental Ideas

These related ideas can be accepted/rejected separately from the main proposal:

  1. Size Optimization for ?f32: Define ?f32 to be stored in an ordinary float, reserving a NaN payload with a special value to encode null. This is similar to R's "NA" value, except that ?f32 would not support arithmetic or comparison (except with null), so NA/NaN propagation is not an issue. It behaves like any other optional.

  2. @assertFinite/@assertNonNaN built-ins: The UB-introducing @setFloatMode(.Optimized) assumptions are that inputs/outputs are non-Inf and non-NaN. All other fast-math optimization flags make a different performance/accuracy trade-off, but do not directly introduce poison/undefined into the program. @assertFinite would allow the programmer to make these dangerous assumptions explicit in their code, where it's obvious exactly which operands are affected.

(1) can be particularly important for performance when operating on large, structured data, since it affects how many values fit into a cache line. This is why the technique is common in statistical software, including R and Pandas.
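The encoding in (1) can be sketched as follows. This is a hypothetical scheme (the specific sentinel bit pattern is an arbitrary choice for illustration), shown for 64-bit floats in Python:

```python
import struct

# One reserved quiet-NaN bit pattern acts as `null`; every other
# bit pattern, including ordinary NaNs, means "a float is present".
NULL_BITS = 0x7FF8_0000_0000_0001  # hypothetical reserved payload

def wrap(x):
    # Build a "?f64": None becomes the sentinel NaN, floats pass through.
    if x is None:
        return struct.unpack("<d", struct.pack("<Q", NULL_BITS))[0]
    return x

def unwrap(x):
    # Recover float-or-None by comparing raw bits against the sentinel.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return None if bits == NULL_BITS else x

print(unwrap(wrap(None)))     # None
print(unwrap(wrap(3.5)))      # 3.5
print(struct.calcsize("<d"))  # 8: the optional costs no extra bytes
```

The size win is the last line: the optional occupies exactly one float, instead of a float plus a tag (plus padding).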

Edit: Updated 4/11 to use error{NaN}!f32 instead of ?f32 + add supplemental ideas

@andrewrk added the proposal label on Mar 19, 2022
@andrewrk added this to the 0.10.0 milestone on Mar 19, 2022
@andrewrk (Member)

Can you address the following operations of IEEE 754 floats?

  • -Infinity + Infinity produces NaN
  • Infinity - Infinity produces NaN
  • Infinity * 0 produces NaN
  • 0 / 0 produces NaN

(Thanks @rtfeldman for pointing these out to me.) How do these operations fit into your vision?
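For reference, the first three of these are easy to reproduce in Python (IEEE 754 doubles); the last differs only because Python raises ZeroDivisionError for 0.0 / 0.0 instead of returning the NaN that hardware produces:

```python
import math

inf = math.inf

print(math.isnan(-inf + inf))  # True
print(math.isnan(inf - inf))   # True
print(math.isnan(inf * 0.0))   # True
# 0.0 / 0.0 also yields NaN under IEEE 754 default exception handling,
# though Python itself raises ZeroDivisionError rather than returning it.
```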

It looks like you are saying:

Safety-checked UB to observe or produce a NaN

What might that look like in machine code for each of these operations? Can we estimate whether such safety checks in debug builds will be reasonable, as they are with overflow arithmetic, or whether they might be debilitatingly slow?

@mrakh (Contributor) commented Mar 20, 2022

Can you address the following operations of IEEE 754 floats?

  • -Infinity + Infinity produces NaN
  • Infinity - Infinity produces NaN
  • Infinity * 0 produces NaN
  • 0 / 0 produces NaN

(Thanks @rtfeldman for pointing these out to me.) How do these operations fit into your vision?

One possibility would be to always return an optional float for all floating-point operations, whether the operands themselves are optional or not. Then, a coercion from ?fN to fN could invoke safety-checked undefined behavior when the value is NaN.

It looks like you are saying:

Safety-checked UB to observe or produce a NaN

What might that look like in machine code for each of these operations? Can we estimate whether such safety checks in debug builds will be reasonable, as they are with overflow arithmetic, or whether they might be debilitatingly slow?

x64 and arm64 will appropriately set a flag register if a floating-point compare operand is NaN. From there, you should be able to conditionally branch on that flag, for a total of two instructions for a NaN check. It looks like x64's ucomiss has ~7 clock cycles of latency on modern AMD chips and ~3 on modern Intel chips, and that arm64's fcmp has 2 clock cycles of latency on the Apple M1 Firestorm. My guess is that Broadcom ARM chips will have similar latencies.

@MasonRemaley (Contributor) commented Mar 20, 2022

I like the idea of encoding NaNs in the type system!

I have a reservation about the UB: in something very dynamic like a physics engine for a game, it's very tough to be 100% certain that there's no way to push the system past its limits and get a NaN (likely by first getting an infinity). Because of how NaNs propagate, this kind of glitch tends to be contained to a single object in the game, but if unexpectedly getting a NaN is UB, it becomes a much more serious concern.

I'd propose having debug builds check fXs for NaNs, but release builds treat fX and ?fX identically.

@topolarity (Contributor, Author) commented Mar 20, 2022

Can you address the following operations of IEEE 754 floats?

  • -Infinity + Infinity produces NaN
  • Infinity - Infinity produces NaN
  • Infinity * 0 produces NaN
  • 0 / 0 produces NaN

(Thanks @rtfeldman for pointing these out to me.) How do these operations fit into your vision?

Great question.

My original thinking was that "responsible code" would check for these cases pre-emptively. However, I'm starting to think that panicking upon producing NaN is the wrong way to go.

Here's why: it can be more performant to propagate invalid values through a computation and check the result at the end, rather than to pre-emptively avoid generating them. For example:

fn sum(vec: []const f32) f32 {
    var accum = vec[0];
    for (vec[1..]) |x| {
        accum += x;
    }
    return accum;
}

According to the original proposal, this would panic if vec contained an Infinity followed by a -Infinity. Adding an if (...) inside the hot loop to avoid generating a NaN would hurt performance in the well-behaved case, but leaving it out means letting NaN leak into non-NaN f32 values. The user is also unlikely to ever hit this panic during testing.

The right choice is to make accum an ?f32 and then handle the invalid value at the end (or return the ?f32 directly), and I believe this is the pattern intended by IEEE 754.
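The pattern can be sketched in Python, with a hypothetical sum_floats standing in for the Zig sum above and the deferred NaN check made explicit (under the amendment, the return type ?f32 would force this check on the caller instead):

```python
import math

def sum_floats(vec):
    # Accumulate with no per-element checks; a NaN produced mid-loop
    # (e.g. by Inf + -Inf) simply propagates to the final result.
    accum = vec[0]
    for x in vec[1:]:
        accum += x
    # The single "?f32"-style check, done once at the end.
    return None if math.isnan(accum) else accum

print(sum_floats([1.0, 2.0, 3.0]))             # 6.0
print(sum_floats([1.0, math.inf, -math.inf]))  # None: Inf + -Inf produced NaN
```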

Amendment

I'd like to amend the proposal so that all floating point operations return ?f32, as @mrakh suggested. For ergonomics, we'd also want to consider supporting the standard comparison operators on ?f32, which would return a ?bool (null when either argument is null). Finally, conversion from ?f32 to f32 should be explicit, like any other optional.

This would achieve what the original proposal intended: it prevents invalid comparisons with NaN and unintentional storage of NaN, which are the primary footguns responsible for the unexpected behavior mentioned in the introduction.

Finally, it's still worth adding safety panics upon generating NaN or Inf in blocks with @setFloatMode(.Optimized), since this mode implies -ffinite-math-only. This can be done by checking the FPU status bits as described above, or alternatively by enabling traps for floating point exceptions (these are typically masked by default).

@topolarity (Contributor, Author)

@MasonRemaley I believe your concern is addressed by the latest amendment, since floating point optionals would have the same safety checks as other optionals in Zig (only .? panics, and only in safe builds). Let me know if you have any concerns about the new design.

@topolarity (Contributor, Author) commented Mar 21, 2022

Another thought occurred to me: .? coercions can be used to mark operations as nnan for LLVM optimizations, indicating that the operation can be assumed not to yield NaN. If we add another built-in @assertFinite(?f32) f32, this can be used to mark operations as both ninf and nnan in Release builds.

For example:

var x: f32 = 1.5;
var accum: f32 = 0;
accum = (accum + x).?; // implies `nnan`
accum = @assertFinite(accum + x); // implies `nnan` and `ninf`

could be lowered to the following LLVM IR:

%1 = load float, float* %accum, align 4, !dbg !2882
%2 = load float, float* %x, align 4, !dbg !2883
%3 = fadd nnan float %1, %2, !dbg !2884      ; optimizations enabled by nnan
...
%6 = fadd nnan ninf float %4, %5, !dbg !2887 ; optimizations enabled by nnan and ninf
store float %6, float* %accum, align 4, !dbg !2887

With these assumptions made explicit, @setFloatMode could be used just to enable/disable the other fast-math flags, which do not make assumptions about the range of operands or results or introduce LLVM poison to the program.

Existing -ffast-math-like optimizations would be achieved by liberal use of @assertFinite and @setFloatMode(.Optimized).

@SpexGuy (Contributor) commented Mar 21, 2022

I think encoding nan-ability in the type system is an interesting idea, but making it the null value of an optional feels very wrong to me. For one, there are many NaN values, and IEEE 754 defines rules for how the bits which can vary propagate through operations. This can be used in a technique called "NaN packing" or "NaN boxing" to associate extra data with NaN values, which can be used to indicate where they were first discovered or other attributes. Additionally, many CPUs designate one of these bits to indicate "signaling NaN" values, which will generate a CPU exception when used. None of that fits into this model.

So that would instead leave us with the fXX family of IEEE 754 floats, and an additional family of eXX or xXX "fast finite floats". Those would obey the following rules:

  • eXX values cannot take on NaN
  • If an eXX value is created from a representation that would give it a non-finite value, it instead becomes undefined
  • eXX coerces to fXX
  • fXX can be converted to eXX with the @assertFinite builtin

@rohlem (Contributor) commented Mar 21, 2022

I think the original post gives good arguments why NaN should be considered in function signatures.
I also think we should consider the full potential scope of this.

Maybe obvious/unnecessary, but to distinguish floating-point number types, here's everything one _could_ expect from them:
  • representing actual numbers:
    • normalized representation
    • denormalized representation around 0
    • representing 0 (optionally with sign attached)
  • representing signed infinities
  • representing non-number bit patterns in NaN (one bit meaning quiet vs signalling)

I think it's reasonable to understand a floating-point number as a union of these states (let me know if there's more still). Constructing and destructuring could be provided by a library, though since floats are language-provided, std or compiler builtins would also be reasonable places imo.

Edit: Adding this paragraph because I'm unsure whether my main point came across well enough:
I'd advocate to have distinct floating-point types that:

  • may (not) represent infinity,
  • may (not) represent NaN (with/-out payload),
  • may (not) represent signed zero,
  • (potentially?) may (not) be subnormal,
  • (potentially?) may (not) be normal.

If we expose all of these combinations of options in std.builtin.TypeInfo.Float (for querying via @typeInfo and construction via @Type), users can already make full use of them - as arguments to basic operators, and for asserting properties via @floatCast to the respective type. (-> std.meta/math.assertFinite is implementable via @floatCast.)

Using NaN as a null-encoding is an opportune size optimization. If the compiler understands floating-point types well enough, any unused bit pattern is fit for this purpose: if the value is always a number, use a NaN representation; if it is always finite but may be NaN, use an infinity; and so on.


Except for ergonomics, the first half would already make the language feature-complete in this regard.
If that is an acceptable design, the remaining questions are:

  1. Which subset should we map to the fXX-family?
  2. Which subsets should we support via primitive-type syntax? (rXX/eXX)
  3. Which subset(s) should we map to the default arithmetic operators? (+, -, *, /)
  4. Which subsets should we support via other arithmetic operators? (new families like +%, +|)
(my thoughts)
  1. If based on hardware-support, the most-developed-for systems (afaik?) by default support all float states from above, including +/- infinities, +/- 0, and NaN with payload.
    (Maybe obvious, I don't like the idea of changing this depending on target platform.)
  2. IMHO using a standard function const r32 = std.math.FiniteRealNumber(32); is also readable, so it's really about which are fundamentally important enough to reserve letter-prefix syntax for.
    Note that it does help discoverability a bit, but then again this is a rather advanced topic already.
  3. I personally don't like the idea of changing their behaviour based on input types. Again, based on hardware, most operations can yield +/- inf or NaN in some cases, so even a/b in finite numbers needs to either yield a "full" float, or imply an assertion.
  4. If we want both full-IEEE float and finite number-only floats (and maybe some other combinations), I think having new operator families for these would be a perfect fit, if we can find agreeable mnemonic syntax.
    • +| for saturating finite float addition - if we want to support that - would be perfectly intuitive imo.
    • Maybe +< would signal assert finite, as in "has a limit"?
    • Or maybe single-letter suffixes end up more readable.
      (A bit iffy with single-letter-identifiers, but at least it would be a parsing error if you forget or add a space.)

@topolarity (Contributor, Author)

I appreciate the detailed thoughts! Your alternative sounds a lot like tracking general bounds for floats (similar to what was proposed for integers in #3806), and I absolutely agree that would solve this problem and potentially bring other benefits, if the ergonomic and other challenges involved can be resolved 🙂

FWIW, Inf-able and NaN-able are the subsets critical for safety and optimization. These are the only subsets that interact with -ffast-math to introduce undefined behavior, and NaN is the only value that is generally dangerous/incorrect to allow in comparisons (<, >, ==, !=).

@gwenzek (Contributor) commented Mar 22, 2022

So type systems help prevent bugs.
What kind of bug is this proposal preventing?

Looking at an example average function:

var avg1 = (x + y) / 2;
var avg2 = 0.5 * x + 0.5 * y;

The first implementation can overflow while the second can't. I think your proposal will force me to add explicit NaN checking to both implementations, but it doesn't help me write the correct one.
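The difference between the two implementations is easy to observe in Python (doubles rather than f32, but the overflow behavior is analogous):

```python
import math

x = y = 1e308  # near the maximum finite double (~1.8e308)

avg1 = (x + y) / 2        # x + y overflows to +Inf before the divide
avg2 = 0.5 * x + 0.5 * y  # each term stays finite

print(avg1)              # inf
print(avg2)              # 1e+308
print(math.isinf(avg1))  # True
```

Note that overflow yields an infinity rather than a NaN, which is relevant to how the proposal would classify this case.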

@rohlem (Contributor) commented Mar 22, 2022

@gwenzek I think "[helping someone] write the correct implementation" is a rather general statement of rather high expectations. Does an easy way to add an assertion meet your criterion?

I think your proposition will force me to add explicit Nan checking

The original post proposes overloading the arithmetic operators for ?f32, meaning you need no checks around operations where you want to accept NaN as arguments and results, just as in the status quo.
You would only check when converting from NaN-able ?f32 to non-NaN-able f32.


Your particular example is about overflow though, which doesn't intersect with the original proposal at all.
If you use my comment above as basis, since overflow results in positive infinity, this would be your transformed example:

const finite32 = std.math.FiniteFloatNumber(32); //this type never represents NaN nor infinities
var avg1 = @floatCast(finite32, (x + y) / 2); //asserts no overflow => panics if infinity is reached
var avg2 = @floatCast(finite32, 0.5 * x + 0.5 * y); //asserts no overflow (which passes if x and y are finite)

If we added an operator family, say +<, -<, *<, /< to represent "float arithmetic that asserts results to be finite", then you could write it as:

var avg1 = (x +< y) /< 2; //asserts no overflow => panics if infinity is reached
var avg2 = 0.5 *< x +< 0.5 *< y; //asserts no overflow (which passes if x and y are finite)

Edit: Also note that while avg1 may overflow, avg2 is more likely to involve subnormal numbers (a performance penalty), and truncate precision of smallest-possible epsilon numbers towards 0. Neither approach is perfect for all scenarios.

@InKryption (Contributor)

(Tried stitching together some thoughts into a coherent set of paragraphs, apologies if this comes across like rambling, feel free to dismiss it if you believe it lacks substance).

I think this proposal follows a trend I've seen in a couple of proposals that tend to run counter to the actual momentum of zig's design, in particular #7512 which, while not rejected, is pretty barren in terms of discussion, and has a comment under it by Andrew stating that it is unlikely to be accepted.

The actual direction Zig seems to be going in doesn't necessarily reject the introduction of new types for particular use cases, but it appears to prefer adding new operations on existing types to achieve particular semantics (wrapping and saturating arithmetic operators; @divFloor vs @divTrunc vs @divExact). This thought process even extends to it apparently being reasonable to propose removing volatile from the type system and replacing it with special operations.

This is all to say, I think that it would make more sense to add operators/builtins that allow one to produce the desired behavior, as is suggested by @rohlem, rather than overloading the usage of ?fXX or adding new type variants of fXX with specific behaviors.

I would loosely draw a comparison between this, and the usages of RAII (Rust, C++, ...) vs defer (Zig, Go, ...): where the former requires a program's state to operate and evolve in terms of its types (and the operations that are associated with them), the latter allows a program's state to operate almost exclusively in terms of the desired behaviors. Not saying that either is better or worse, but I would argue that the latter is quite a bit more aligned with the philosophy of Zig.

@topolarity (Contributor, Author) commented Mar 22, 2022

So type systems help prevent bugs. What kind of bug is this proposal preventing?

Comparison with NaN is the bug I'm trying to avoid:

// insertion sort -- can you spot the bug?
fn sort_inplace(vec: []f32) void {
    for (vec) |key, i| {
        var j = i;
        while (j > 0 and vec[j - 1] > key) {
            vec[j] = vec[j - 1];
            j = j - 1;
        }
        vec[j] = key;
    }
}

When given "{ 1.5, 0.5, 3.5, nan, 0.25, 6.0, 1.5 }" this returns "{ 0.5, 1.5, 3.5, nan, 0.25, 1.5, 6 }"
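The same algorithm, transliterated to Python (where comparisons follow the same IEEE 754 rules), reproduces the bug exactly:

```python
import math

def sort_inplace(vec):
    # Same insertion sort: every comparison against NaN is False,
    # so the NaN never moves and later elements stop shifting at it.
    for i, key in enumerate(vec):
        j = i
        while j > 0 and vec[j - 1] > key:
            vec[j] = vec[j - 1]
            j -= 1
        vec[j] = key

v = [1.5, 0.5, 3.5, float("nan"), 0.25, 6.0, 1.5]
sort_inplace(v)
print(v)  # [0.5, 1.5, 3.5, nan, 0.25, 1.5, 6.0]: not sorted past the NaN
```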

Under this proposal, the function signature of sort_inplace already makes it clear that this function is not written to handle NaN correctly. If you wanted it to handle NaN, you'd write it the way the proposal's error{NaN}!f32 example does.

This also intends to improve support for generic implementations.

A function that accepts ?T will generally correctly handle ?f32 too. This extends to hash maps: AutoHashMap currently refuses to accept f32 as a key, because it might be NaN. But with this change, it can support f32 and even ?f32 correctly.
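Python's dict, which does accept float keys, shows the hazard AutoHashMap is guarding against (hashing works, but NaN's self-inequality breaks key lookup):

```python
d = {}
d[float("nan")] = "first"
d[float("nan")] = "second"  # a *different* NaN object: nan != nan, so no match

print(len(d))              # 2: two "equal-looking" keys coexist
print(float("nan") in d)   # False: a fresh NaN never matches a stored key
```

Under the proposal, a non-NaN f32 key type would rule this out statically, which is what lets a generic hash map accept it.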

@topolarity (Contributor, Author) commented Mar 22, 2022

Perhaps in line with @InKryption's thinking, there's a pared-down version of this proposal where we make comparisons between floats return ?bool or error{NaNOperand}!bool and leave the rest as the status quo.

The downsides are that (1) if a function asserts a non-NaN comparison internally, that won't be clear from its signature, and (2) generic functions will have to special-case this unusual comparison result type.

That might be worth it to avoid complicating the types, though.

@MichaelByrneAU (Contributor)

Whilst I like the idea of controlling the handling of NaNs better, I feel this proposal overloads the concept of the option type in a distinctly unintuitive way from a language design standpoint.

There has been a fair amount of discussion on the representation of a NaN and some of the optimisations this proposal can facilitate, but fundamentally I feel that an option represents nullability whilst a NaN is something completely orthogonal to this semantically.

Phrased another way, I expect ?f32 to represent a float that may simply not exist, whilst a NaN has a distinctly different semantic meaning (e.g. the result of an ambiguous calculation). Further, I would want to preserve the ability to represent an ?f32 that itself could contain a NaN.

@topolarity (Contributor, Author) commented Mar 23, 2022

That's a very compelling argument

If ?f32 implies the wrong semantics, we could potentially use error{NaN}!f32 instead.

FWIW, the R statistical software uses NaN for missing values as an important optimization, but it also distinguishes between NA (missing value) and NaN (invalid result) in the way you describe. NA gets a special NaN payload, with the caveat that R cannot guarantee any consistent answer for NA * NaN, since payload propagation is under-specified by IEEE 754.

I'd also be curious to know what you think of the reduced change mentioned above: to simply make comparison between floats yield error{NaNOperand}!bool

@ominitay (Contributor)

Uhh, doesn't changing floats from pretending to be an optional to pretending to be an error union undermine your key advantage here of improved compatibility with generics?

@topolarity (Contributor, Author)

Uhh, doesn't changing floats from pretending to be an optional to pretending to be an error union undermine your key advantage here of improved compatibility with generics?

Hashmap would work just fine, for example. Can you think of an example that's broken?

@ominitay (Contributor)

Oh... I think I see what you mean now.

@topolarity (Contributor, Author)

Updated proposal to use error{NaN}!fX instead of ?fX.

Also factored out two ideas that don't depend on the main proposal:

  • @assertFinite/@assertNonNaN built-ins (for explicit fastmath assumptions)
  • ?f32 encoded using NaN payloads, as a size/performance optimization

@andrewrk modified the milestones: 0.10.0 to 0.11.0 on Apr 16, 2022
@dvmason commented Aug 30, 2022

I'm using NaN boxing in a (dynamic) language runtime I'm building, but I wouldn't let the floating-point unit get hold of any NaN values that I use to encode other things, so I don't think this proposal has any impact on NaN boxing, or at least I can't see how it would.
