3453-f16-and-f128.md

Summary

This RFC proposes adding new IEEE 754-compliant floating point types, f16 and f128, to the core language and standard library. We will provide a soft-float implementation for all targets and use hardware support where possible.

Motivation

The IEEE 754 standard defines many binary floating point formats. The most common of these are binary32 and binary64, available in Rust as f32 and f64. However, other formats are useful in less common scenarios. The binary16 format is useful where storage compactness is important and low precision is acceptable, such as HDR images, mesh quantization, and AI neural networks [1]. The binary128 format is useful where high precision is needed, such as in scientific computing.

This RFC proposes adding f16 and f128 primitive types in Rust to represent IEEE 754 binary16 and binary128, respectively. Having f16 and f128 types in the Rust language would allow Rust to better support the above mentioned use cases, allowing for optimizations and native support that may not be possible in a third party crate. Additionally, providing a single canonical data type for these floating point representations will make it easier to exchange data between libraries.

This RFC does not aim to cover the entire IEEE 754 standard: it includes neither f256 nor the decimal float types. Nor does it aim to add existing platform-specific float types such as x86's 80-bit double-extended precision. This RFC makes no judgement on whether those types should be added in the future; that discussion is left to future RFCs.

Guide-level explanation

f16 and f128 are primitive floating point types that can be used like f32 or f64. They always conform to the binary16 and binary128 formats defined in the IEEE 754 standard: the size of f16 is always 16 bits, the size of f128 is always 128 bits, the number of exponent and mantissa bits follows the standard, and all operations are IEEE 754-compliant. Float literals of these sizes take the f16 and f128 suffixes, respectively.

let val1 = 1.0; // Default type is still f64
let val2: f128 = 1.0; // Explicit f128 type
let val3: f16 = 1.0; // Explicit f16 type
let val4 = 1.0f128; // Suffix of f128 literal
let val5 = 1.0f16; // Suffix of f16 literal

println!("Size of f128 in bytes: {}", std::mem::size_of_val(&val2)); // 16
println!("Size of f16 in bytes: {}", std::mem::size_of_val(&val3)); // 2

Every target should support f16 and f128, either in hardware or software. Most platforms do not have hardware support and therefore will need to use a software implementation.

All operators, constants, and math functions defined for f32 and f64 in core must also be defined for f16 and f128 in core. Similarly, all functionality defined for f32 and f64 in std must also be defined for f16 and f128.
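
As a sketch of that requirement, the snippet below shows the mirrored API surface. It is hypothetical until the RFC is implemented; the specific constants and methods are assumed to carry over from f32/f64 unchanged, and current nightly compilers gate the types behind the f16 and f128 features.

#![feature(f16, f128)] // nightly-only feature gates at the time of writing

fn main() {
    // Associated constants mirror f32::EPSILON, f32::MANTISSA_DIGITS, etc.
    assert!(f16::EPSILON > 0.0);
    assert_eq!(f128::MANTISSA_DIGITS, 113); // 112 stored bits + 1 implicit bit

    // Operators behave exactly as they do for f32 and f64.
    assert_eq!(1.0f16 + 2.0f16, 3.0f16);

    // Math functions in std mirror the f32/f64 inherent methods.
    assert_eq!(4.0f128.sqrt(), 2.0f128);
}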

Reference-level explanation

f16 type

f16 consists of 1 sign bit, 5 exponent bits, and 10 mantissa bits. It is exactly equivalent to the 16-bit IEEE 754 binary16 half-precision floating-point format.
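
For illustration, this layout can be decoded from a binary16 bit pattern with plain integer operations on stable Rust; the constant and masks below are worked out by hand, not taken from the RFC.

fn main() {
    let bits: u16 = 0x3E00; // 1.5 in binary16: 0 01111 1000000000
    let sign = bits >> 15;              // 1 sign bit
    let exponent = (bits >> 10) & 0x1F; // 5 exponent bits, biased by 15
    let mantissa = bits & 0x3FF;        // 10 mantissa bits
    assert_eq!((sign, exponent, mantissa), (0, 15, 0x200));
}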

The following traits will be implemented for conversion between f16 and other types:

impl From<f16> for f32 { /* ... */ }
impl From<f16> for f64 { /* ... */ }
impl From<bool> for f16 { /* ... */ }
impl From<u8> for f16 { /* ... */ }
impl From<i8> for f16 { /* ... */ }

Conversions to f16 will also be available with as casts, which allow for truncated conversions.
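
A sketch of the two conversion paths, assuming the impls listed above (and, on today's nightly, the f16 feature gate):

#![feature(f16)] // nightly-only feature gate at the time of writing

fn main() {
    // Lossless widening goes through From/Into.
    let half: f16 = 1.5;
    let single: f32 = half.into();
    assert_eq!(single, 1.5f32);

    // Narrowing requires an explicit, potentially lossy `as` cast:
    // f32::MAX is out of binary16 range and becomes infinity.
    assert!((f32::MAX as f16).is_infinite());
}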

f16 will generate the half type in LLVM IR. It is also equivalent to C++ std::float16_t, C _Float16, and GCC __fp16. f16 is ABI-compatible with all of these. f16 values must be aligned in memory on a multiple of 16 bits, or 2 bytes.
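
To illustrate the ABI claim, a hedged sketch: because f16 matches C's _Float16, values can cross an extern "C" boundary directly. Note that half_add is a hypothetical C function (_Float16 half_add(_Float16, _Float16);), not part of any real library.

#![feature(f16)] // nightly-only feature gate at the time of writing

extern "C" {
    fn half_add(a: f16, b: f16) -> f16; // provided by hypothetical C code
}

fn main() {
    let sum = unsafe { half_add(1.0f16, 2.0f16) };
    assert_eq!(sum, 3.0f16);
}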

On the hardware level, f16 can be accelerated on RISC-V via the Zfh or Zfhmin extensions, on x86 with AVX-512 via the FP16 instruction set, on Arm via the FEAT_FP16 extension, and on PowerISA via VSX on PowerISA v3.1B and later. Most platforms do not have hardware support and therefore will need to use a software implementation.

f128 type

f128 consists of 1 sign bit, 15 exponent bits, and 112 mantissa bits. It is exactly equivalent to the 128-bit IEEE 754 binary128 quadruple-precision floating-point format.

The following traits will be implemented for conversion between f128 and other types:

impl From<f16> for f128 { /* ... */ }
impl From<f32> for f128 { /* ... */ }
impl From<f64> for f128 { /* ... */ }
impl From<bool> for f128 { /* ... */ }
impl From<u8> for f128 { /* ... */ }
impl From<i8> for f128 { /* ... */ }
impl From<u16> for f128 { /* ... */ }
impl From<i16> for f128 { /* ... */ }
impl From<u32> for f128 { /* ... */ }
impl From<i32> for f128 { /* ... */ }
impl From<u64> for f128 { /* ... */ }
impl From<i64> for f128 { /* ... */ }

Conversions from i128/u128 to f128 will also be available with as casts, which allow for truncated conversions.
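
The reason these conversions use as casts rather than From impls can be shown directly: binary128 carries 113 bits of precision, so not every 128-bit integer is representable and the cast may round. A sketch, again assuming the unstable f128 feature:

#![feature(f128)] // nightly-only feature gate at the time of writing

fn main() {
    // 2^127 is exactly representable in binary128 ...
    let exact = (1u128 << 127) as f128;
    // ... but i128::MAX = 2^127 - 1 needs 127 significant bits, which
    // exceeds binary128's 113-bit precision, so the cast rounds up to 2^127.
    assert_eq!(i128::MAX as f128, exact);
}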

f128 will generate the fp128 type in LLVM IR. It is also equivalent to C++ std::float128_t, C _Float128, and GCC __float128. f128 is ABI-compatible with all of these. f128 values must be aligned in memory on a multiple of 128 bits, or 16 bytes. LLVM provides support for 128-bit float math operations.
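
The stated layout translates into simple assertions (a sketch assuming the unstable f128 feature):

#![feature(f128)] // nightly-only feature gate at the time of writing

fn main() {
    assert_eq!(std::mem::size_of::<f128>(), 16);  // 128 bits
    assert_eq!(std::mem::align_of::<f128>(), 16); // 16-byte alignment
}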

On the hardware level, f128 can be accelerated on RISC-V via the Q extension, on IBM S/390x G5 and later, and on PowerISA via BFP128, an optional part of PowerISA v3.0C and later. Most platforms do not have hardware support and therefore will need to use a software implementation.

Drawbacks

While f32 and f64 enjoy very broad hardware support, hardware support for f16 and f128 is more niche, and on most systems software emulation will be required. The main drawback is therefore implementation difficulty.

Rationale and alternatives

There are some crates aiming for similar functionality:

  • f128 provides bindings to the __float128 type in GCC.
  • half provides an implementation of binary16 and bfloat16 types.

However, besides the inconsistency of mixing primitive types with crate-provided types, those bindings come with issues of their own.

The ability to accelerate additional float types depends heavily on the CPU, OS, ABI, and available features of each target. The evolution of LLVM may unlock acceleration of these types on new targets. Implementing them in the compiler allows it to perform optimizations for hardware with native support for these types.

Crates may define their types on top of C bindings, but extended float type definitions in C are complex and confusing, and the meaning of a C type may vary by target and/or compiler options. Implementing f16 and f128 in the Rust compiler maintains a stable codegen interface and ensures that all users share one canonical definition of the 16-bit and 128-bit float types, making it easier to exchange data between crates and languages.

Prior art

As noted above, the half and f128 crates already provide these types. Another relevant piece of prior art is RFC 1504, which added the 128-bit integer types.

Many other languages and compilers have support for these proposed float types. As mentioned above, C has _Float16 and _Float128 (IEC 60559 WG 14 N2601), and C++ has std::float16_t and std::float128_t (P1467R9). Glibc supports 128-bit floats in software on many architectures. GCC also provides the libquadmath library for 128-bit float math operations.

This RFC was split from RFC 3451, which proposed adding a variety of float types beyond those in this RFC, including interoperability types like c_longdouble. The remaining portions of RFC 3451 have since developed into RFC 3456.

Both this RFC and RFC 3451 are built upon the discussion in issue 2629.

The main consensus of the discussion thus far is that more float types would be useful, especially the IEEE 754 types proposed in this RFC as f16 and f128. Other types can be discussed in a future RFC.

Unresolved questions

The main unresolved parts of this RFC are the implementation details in the context of the Rust compiler and standard library. The behavior of f16 and f128 is well-defined by the IEEE 754 standard, and is not up for debate. Whether these types should be included in the language is the main question of this RFC, which will be resolved when this RFC is accepted.

Several future questions are intentionally left unresolved and should be handled by other RFCs. This RFC does not aim to cover the entire IEEE 754 standard: it includes neither f256 nor the decimal float types. Nor does it aim to add existing platform-specific float types such as x86's 80-bit double-extended precision.

Future possibilities

See RFC 3456 for discussion about adding more float types including f80, bf16, and c_longdouble, which is an extension of the discussion in RFC 3451.

Footnotes

  1. Existing AI neural networks often use the 16-bit brain float format (bfloat16), a truncated version of 32-bit single precision, rather than 16-bit half precision. This allows operations to be performed with 32-bit floats and results to be quickly converted to 16 bits for storage.
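
As a stable-Rust sketch of the truncation the footnote describes: bfloat16 keeps f32's sign bit, all 8 exponent bits, and the top 7 mantissa bits, so a minimal storage conversion simply drops the low 16 bits of the f32 representation (plain truncation; production implementations usually round to nearest instead). The helper names are illustrative only.

fn f32_to_bf16_bits(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16 // keep sign, exponent, and top 7 mantissa bits
}

fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16) // low mantissa bits come back as zeros
}

fn main() {
    let x = 3.14159_f32;
    let restored = bf16_bits_to_f32(f32_to_bf16_bits(x));
    assert!((x - restored).abs() < 0.01); // only ~2-3 decimal digits survive
}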