Add BFloat16 #98643

Open
wants to merge 62 commits into
base: main

Conversation

huoyaoyuan
Member

Closes #96295.

All functional members delegate to float, since the upcast is trivial. The only nontrivial logic is rounding from double.
Tests are borrowed from Half.
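
To illustrate why the upcast is trivial: a bfloat16 value has the same layout as the upper 16 bits of a float, so widening is a 16-bit shift with no rounding. A minimal sketch follows; `BFloat16Sketch`, `ToSingle`, and `Sqrt` are hypothetical names used for illustration, not the members added by this PR.

```csharp
using System;

static class BFloat16Sketch
{
    // Hypothetical helper: widen raw bfloat16 bits to a float.
    // The 16 bits become the upper half of the 32-bit float pattern;
    // the lower 16 fraction bits are zero, so no rounding is needed.
    public static float ToSingle(ushort bits) =>
        BitConverter.UInt32BitsToSingle((uint)bits << 16);

    // Hypothetical example of delegating an operation to float.
    // The result would still have to be rounded back to bfloat16.
    public static float Sqrt(ushort bits) => MathF.Sqrt(ToSingle(bits));
}
```

For example, `BFloat16Sketch.ToSingle(0x3F80)` yields `1.0f`, since `0x3F80` is the bfloat16 encoding of 1.0.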

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost commented Feb 19, 2024

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Author: huoyaoyuan
Assignees: -
Labels: area-System.Numerics, new-api-needs-documentation

Milestone: -

huoyaoyuan added the community-contribution label (indicates that the PR has been added by a community member) on Feb 19, 2024
Comment on lines +464 to +465
// Exponent displacement #1
const ulong Exponent942 = 0x3ae0_0000_0000_0000u;
Member Author

/cc @MineCake147E I figured out the value of this magic number for double->BFloat16 by debugging, but can't give an expression to calculate it. Do you have any information around this?

Contributor

@MineCake147E MineCake147E Jun 10, 2024

Exponent45 reflects the difference of the number of fraction bits. bfloat16 has 7 fraction bits, while double has 52. So, 52 - 7 = 45.
Exponent942 is a little more complicated.
We want 1.0 to be converted to 1.0(bf16), so the constant can be derived from that case.

The line:

value += BitConverter.UInt64BitsToDouble(exponentOffset0);

transforms the bits of 1.0,
0b0_011_1111_1111_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000, into 0b0_100_0010_1100_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_1000_0000 (that is, 2^45 + 1).
If we skip subtracting Exponent942, ulong newExponent = bitValue >> 45; will be 0b0010_0001_0110_0000_0000.
Adding newExponent to the unshifted bitValue gives 0b010000101100000000000000000000000000000000000010_0_001_0110_1_000_0000.
The internal representation of 1.0(bf16) is 0b0_011_1111_1_000_0000.
0b010000101100000000000000000000000000000000000010_0_001_0110_1_000_0000 - 0b0_011_1111_1_000_0000 = 0b010000101100000000000000000000000000000000000_0011_1010_1110_0000_000
Shifting this value left by 45 bits (the upper bits overflow out of the 64-bit word) gives 0b0011_1010_1110_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000, which is 0x3AE0_0000_0000_0000.
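
The steps above can be sketched as a small C# function. This is a reduced illustration, assuming positive, finite inputs within bfloat16's normal range; the sign, NaN, and range handling that a full conversion needs are left out. `Exponent942Trace` and `RoundToBFloat16Bits` are hypothetical names, and the mask value is the standard exponent mask of a double.

```csharp
using System;

static class Exponent942Trace
{
    const ulong DoubleExponentMask = 0x7FF0_0000_0000_0000; // exponent field of a double
    const ulong Exponent45 = 0x02D0_0000_0000_0000;         // 45 << 52
    const ulong Exponent942 = 0x3AE0_0000_0000_0000;        // 942 << 52

    // Assumption: value is positive, finite, and in bfloat16's normal range.
    public static ushort RoundToBFloat16Bits(double value)
    {
        // Build 2^(e + 45), where e is the exponent of value.
        ulong exponentOffset0 =
            (BitConverter.DoubleToUInt64Bits(value) & DoubleExponentMask) + Exponent45;
        // The addition rounds value to 8 significant bits (round to nearest even).
        value += BitConverter.UInt64BitsToDouble(exponentOffset0);
        // Rebase the exponent so that it lands correctly after the shift.
        ulong bitValue = BitConverter.DoubleToUInt64Bits(value) - Exponent942;
        ulong newExponent = bitValue >> 45;
        // The low 16 bits now hold the bfloat16 exponent and fraction.
        return unchecked((ushort)(bitValue + newExponent));
    }
}
```

With these constants, `RoundToBFloat16Bits(1.0)` returns `0x3F80`, the bfloat16 encoding of 1.0 that the walkthrough targets.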
In other words,
$\mathrm{ExponentDisplacement}_1 = E_{\mathrm{source}}(1.0) - E_{\mathrm{result}}(1.0) + (\mathrm{FractionBits}_{\mathrm{source}} - \mathrm{FractionBits}_{\mathrm{result}}) + 1$
where $E(1.0)$ is the biased exponent of 1.0 in each format, i.e. the format's exponent bias.
For single -> half, $127 - 15 + (23 - 10) + 1 = 113 + 13 = 126$.
For double -> bfloat16, $1023 - 127 + (52 - 7) + 1 = 897 + 45 = 942$.
We can test this formula by applying it to double -> single: $1023 - 127 + (52 - 23) + 1 = 897 + 29 = 926$.
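
The formula can also be checked mechanically. The sketch below uses a hypothetical helper, `Displacement1`, that evaluates it and shifts the result into the source format's exponent field; the bias and fraction-bit counts passed in are the standard values for each format.

```csharp
using System;

static class DisplacementFormula
{
    // Hypothetical helper: evaluates the displacement formula and places the
    // result in the exponent field of the source format.
    public static ulong Displacement1(
        int sourceBias, int resultBias, int sourceFractionBits, int resultFractionBits)
    {
        int exponent = sourceBias - resultBias
            + (sourceFractionBits - resultFractionBits) + 1;
        return (ulong)exponent << sourceFractionBits;
    }
}
```

Under these assumptions, `Displacement1(1023, 127, 52, 7)` gives `0x3AE0_0000_0000_0000` (Exponent942), `Displacement1(1023, 127, 52, 23)` gives `0x39E0_0000_0000_0000` (Exponent926), and `Displacement1(127, 15, 23, 10)` gives `0x3F00_0000` (126 in the exponent field of a float).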

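// The value of double.BiasedExponentMask is spelled out so the snippet also runs outside the runtime sources.
const ulong DoubleBiasedExponentMask = 0x7FF0_0000_0000_0000;
const ulong Exponent926 = 0x39E0_0000_0000_0000;
const ulong Exponent29 = 0x01D0_0000_0000_0000;
// Round Math.PI to float precision by adding 2^(e + 29), then rebase the exponent.
ulong q = BitConverter.DoubleToUInt64Bits(Math.PI + BitConverter.UInt64BitsToDouble(Exponent29 +
    (BitConverter.DoubleToUInt64Bits(Math.PI) & DoubleBiasedExponentMask))) - Exponent926;
// Fold the rebased exponent onto the fraction and keep the low 32 bits.
float result = BitConverter.UInt32BitsToSingle((uint)(q + (q >> 29)));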

And we get 3.1415927 which matches (float)Math.PI.

I hope it helps.

Member

probably should be a comment in the code?

Member Author

It's really verbose and I haven't found the best way to comment on it. Presumably I could simplify it into a simple formula, like the IBinaryFloatParseAndFormatInfo numbers.

@huoyaoyuan
Member Author

Now every test is passing locally. Marking this as ready for review.

@huoyaoyuan huoyaoyuan marked this pull request as ready for review June 10, 2024 16:55
Successfully merging this pull request may close these issues.

[API Proposal]: BFloat16
6 participants