Add BFloat16 #98643
Note: This serves as a reminder that when your PR modifies a ref *.cs file and adds or modifies public APIs, the API implementation in the src *.cs file should be documented with triple-slash comments, so the PR reviewers can sign off on that change.
Tagging subscribers to this area: @dotnet/area-system-numerics
@huoyaoyuan, I believe this should be unblocked now.
// Exponent displacement #1
const ulong Exponent942 = 0x3ae0_0000_0000_0000u;
/cc @MineCake147E I figured out the value of this magic number for double->BFloat16 by debugging, but can't give an expression to calculate it. Do you have any information around this?
Exponent45 reflects the difference in the number of fraction bits: bfloat16 has 7 fraction bits, while double has 52, so 52 - 7 = 45.
Exponent942 is a little more complicated.
We want 1.0 to be converted to 1.0(bf16).
The line:
value += BitConverter.UInt64BitsToDouble(exponentOffset0);
transforms 0b0_011_1111_1111_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000 into 0b0_100_0010_1100_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_1000_0000.
If we skip subtracting Exponent942, then ulong newExponent = bitValue >> 45; will be 0b0010_0001_0110_0000_0000.
Adding that to the unshifted bitValue, we get 0b010000101100000000000000000000000000000000000010_0_001_0110_1_000_0000.
The internal representation of 1.0(bf16) is 0b0_011_1111_1_000_0000.
0b010000101100000000000000000000000000000000000010_0_001_0110_1_000_0000 - 0b0_011_1111_1_000_0000 = 0b010000101100000000000000000000000000000000000_0011_1010_1110_0000_000
If we shift this value 45 bits left (the upper bits overflow and are discarded), we get 0b0011_1010_1110_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000, which is 0x3AE0_0000_0000_0000.
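The walk-through above for 1.0 can be reproduced with a few lines of C#. This is only a minimal sketch, not the code in the PR: the class name, the helper name `ToBFloat16Bits`, and the literal exponent mask are assumptions made for illustration, and the sign, NaN, infinity, and subnormal handling of a real conversion is left out, so it is only meaningful for positive, normal, in-range inputs.

```csharp
using System;

static class Bf16RoundingSketch
{
    const ulong Exponent45 = 45UL << 52;    // 0x02D0_0000_0000_0000
    const ulong Exponent942 = 942UL << 52;  // 0x3AE0_0000_0000_0000
    const ulong DoubleExponentMask = 0x7FF0_0000_0000_0000UL; // assumed literal mask

    // Hypothetical helper: positive, normal, in-range inputs only.
    public static ushort ToBFloat16Bits(double value)
    {
        // Build 2^(e + 45), where e is the unbiased exponent of value.
        ulong exponentOffset0 =
            (BitConverter.DoubleToUInt64Bits(value) & DoubleExponentMask) + Exponent45;

        // The addition pushes every bit below bfloat16 precision out of the
        // double, so the hardware does the round-to-nearest-even.
        value += BitConverter.UInt64BitsToDouble(exponentOffset0);

        // Rebase the exponent, then fold it down next to the 7 fraction bits.
        ulong bitValue = BitConverter.DoubleToUInt64Bits(value) - Exponent942;
        ulong newExponent = bitValue >> 45;

        // The low 16 bits now hold exponent + fraction of the bfloat16.
        return (ushort)(bitValue + newExponent);
    }

    public static void Main()
    {
        Console.WriteLine($"0x{ToBFloat16Bits(1.0):X4}"); // prints 0x3F80
    }
}
```

For 1.0 the intermediate `bitValue` is 0x07E0_0000_0000_0080 and `newExponent` is 0x3F00, so the low 16 bits of the sum are 0x3F80, the bfloat16 encoding of 1.0.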
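In other words, the displacement is (source exponent bias - target exponent bias) + (source fraction bits - target fraction bits) + 1.
For single -> half, (127 - 15) + (23 - 10) + 1 = 126.
For double -> bfloat16, (1023 - 127) + (52 - 7) + 1 = 942.
We can test this formula by applying it to double -> single, where (1023 - 127) + (52 - 23) + 1 = 926: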
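// Mask for the exponent bits of double (double.BiasedExponentMask inside the runtime).
const ulong DoubleBiasedExponentMask = 0x7FF0_0000_0000_0000u;
// Exponent displacement #1: 926 << 52
const ulong Exponent926 = 0x39e0_0000_0000_0000u;
// Exponent displacement #2: 29 << 52
const ulong Exponent29 = 0x01D0_0000_0000_0000u;
// Round Math.PI to single precision by adding 2^(e + 29), then rebase the exponent.
var q = BitConverter.DoubleToUInt64Bits(Math.PI + BitConverter.UInt64BitsToDouble(Exponent29 +
    (BitConverter.DoubleToUInt64Bits(Math.PI) & DoubleBiasedExponentMask))) - Exponent926;
// Fold the exponent into the low 32 bits and reinterpret them as a float.
float result = BitConverter.UInt32BitsToSingle((uint)(q + (q >> 29)));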
And we get 3.1415927, which matches (float)Math.PI.
I hope it helps.
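One way to express the displacement as a formula, consistent with the constants 942 and 926 quoted above, is the bias difference plus the fraction-width difference plus one. The sketch below only checks that arithmetic. The class and method names are hypothetical, and the explanation of the `+ 1` term (the significand's implicit leading 1 lands on the lowest exponent bit of the target and carries into it) is an interpretation rather than something stated in the thread.

```csharp
using System;

static class DisplacementSketch
{
    // Hypothetical helper: exponent displacement #1 as a plain exponent count,
    // before it is shifted into the exponent field of the source format.
    public static int Displacement(int srcBias, int srcFractionBits,
                                   int dstBias, int dstFractionBits)
        => (srcBias - dstBias) + (srcFractionBits - dstFractionBits) + 1;

    public static void Main()
    {
        Console.WriteLine(Displacement(127, 23, 15, 10));    // single -> half: 126
        Console.WriteLine(Displacement(1023, 52, 127, 7));   // double -> bfloat16: 942
        Console.WriteLine(Displacement(1023, 52, 127, 23));  // double -> single: 926

        // Shifted into the exponent field of a double (52 fraction bits).
        ulong exponent942 = (ulong)Displacement(1023, 52, 127, 7) << 52;
        Console.WriteLine($"0x{exponent942:X16}");           // 0x3AE0000000000000
    }
}
```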
Now every test is passing locally. Marking this as ready for review.
Closes #96295.
Delegating all functional members to float, since the upcast is trivial. The only logic is rounding from double.
Tests are borrowed from Half.
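To illustrate why the upcast to float is trivial: bfloat16 has the same sign and exponent layout as single, with only the top 7 fraction bits kept, so widening amounts to placing the 16 bits in the upper half of a 32-bit pattern. The following is a minimal sketch under that assumption. The class and helper names are hypothetical and this is not necessarily how the PR implements the conversion.

```csharp
using System;

static class UpcastSketch
{
    // Hypothetical helper: widen raw bfloat16 bits to float by shifting them
    // into the upper 16 bits; the lower 16 fraction bits become zero.
    public static float BFloat16BitsToSingle(ushort bits)
        => BitConverter.UInt32BitsToSingle((uint)bits << 16);

    public static void Main()
    {
        Console.WriteLine(BFloat16BitsToSingle(0x3F80)); // 1
        Console.WriteLine(BFloat16BitsToSingle(0x4049)); // 3.140625
    }
}
```

Because the widening is exact, operations delegated to float see precisely the value the bfloat16 encodes; rounding is only needed on the way back down.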