Add support for float16 (half-precision floats) and related operations such as hgemm() #234

Open
jacobgorm opened this issue Jul 16, 2018 · 28 comments

Comments

@jacobgorm

I am using BLIS for neural networks on embedded platforms (mostly ARMv8a), and I would like to reap the potential memory savings as well as possibly some speedups from running with half-precision floats. Are there any plans to support these in BLIS?

@fgvanzee
Member

@jacobgorm Thanks for the suggestion. This is something that is in our medium-range plans. Of course, as you probably already know, the complicating factor is that there is no standard C language support for a float16 datatype, so any solution would necessarily not be portable. (In principle, we can add float16 operations, but it would take a non-trivial amount of work. Also, we would need to design things so that the user could disable the system-specific float16 support if it were not available.)

@fgvanzee changed the title from "It would be great if BLIS had HGEMM() using half-precision floats." to "Add support for float16 (half-precision floats) and related operations such as hgemm()" on Jul 16, 2018
@jeffhammond
Member

Some useful information from mpi-forum/mpi-issues#65:

@fgvanzee
Member

@jeffhammond Thank you for taking the time to rustle up these links, Jeff. This will surely prove very useful.

@jeffhammond
Member

I recommend that BLIS support bfloat16 rather than float16. The latest research in machine learning suggests that float16 is inferior to bfloat16 for training because of the software and processing overheads of handling the limited numerical range of a 5-bit exponent.

In any case, implementing both float16 and bfloat16 on hardware that doesn't have native support is relatively easy. In both cases, you use float32 compute. For float16, you can use the F16C instruction vcvtph2ps to convert from float16 storage to float32 storage (vcvtps2ph converts in the other direction) and then do the compute as you would for float32 (the latency is 4-7 cycles in the documentation I've found online). For bfloat16, the conversion is trivial, because you just copy the bfloat16 data into the upper half of a float32 register and proceed as before.
It might be possible to reuse the float32 microkernel.
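A minimal portable sketch of the two widening conversions described above, assuming 16-bit values are stored as raw `uint16_t` bit patterns. The function names are hypothetical and not part of BLIS; on x86 with F16C the float16 path would normally be a single hardware conversion instead of this software fallback.

```c
#include <stdint.h>
#include <string.h>

/* bfloat16 is the upper 16 bits of an IEEE float32, so widening is a shift. */
static inline float sketch_bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float32 to bfloat16 with round-to-nearest-even. */
static inline uint16_t sketch_f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)      /* NaN: keep it a NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7FFFu + ((bits >> 16) & 1u);       /* round half to even */
    return (uint16_t)(bits >> 16);
}

/* IEEE float16 to float32 in software (no F16C required). */
static inline float sketch_f16_to_f32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                  /* Inf or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {               /* normal: rebias from 15 to 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {               /* signed zero */
        bits = sign;
    } else {                             /* subnormal: normalize first */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; ++shift; }
        man &= 0x3FFu;
        bits = sign | ((113u - shift) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The asymmetry is visible in the code: the bfloat16 path needs no rebiasing because it shares float32's 8-bit exponent, while the float16 path has to handle a different bias and its own subnormal range.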

Google recommends the use of bfloat16 with TensorFlow and it is relatively straightforward to understand that it is a better use of bits to have an 8-bit exponent like float32 than the 5-bit exponent used by IEEE float16.

Intel's public statement on bfloat16 is:

Over time, Intel will be extending bfloat16 support across our AI product lines, including Intel Xeon processors and Intel FPGAs. This is part of a cohesive and comprehensive strategy to bring leading AI training capabilities to our silicon portfolio.

Disclaimer: I work for Intel.

Additional references:

@fgvanzee
Member

@jeffhammond Once again, this was very helpful Jeff. Thank you.

I had never even heard of bfloat16 before today. I can see why it would be preferable (especially for ML/AI applications) given the trade-off between exponent and mantissa.

@poulson

poulson commented Jul 18, 2018

Yes, bfloat16 is all the rage for inference right now for deciding which bucket to put something in. It's also worth mentioning the 8-bit integer quantization approach taken by https://github.com/google/gemmlowp.
Disclaimer: I sit next to the author at work.
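For readers unfamiliar with 8-bit quantization, here is a rough sketch of an affine scheme in which a real value is approximated as `scale * (q - zero_point)` and products are accumulated in 32-bit integers. The type and function names are hypothetical; this is not gemmlowp's API, and it ignores NaN inputs and the fixed-point output stage that a real library would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-tensor quantization parameters. */
typedef struct {
    float   scale;       /* real-valued step between adjacent codes */
    int32_t zero_point;  /* code that represents real 0.0 */
} quant_params_t;

/* Map a real value to an unsigned 8-bit code, saturating at the ends. */
static inline uint8_t quantize_u8(float x, quant_params_t q)
{
    float v = x / q.scale + (float)q.zero_point;
    if (v < 0.0f)   v = 0.0f;
    if (v > 255.0f) v = 255.0f;
    return (uint8_t)(v + 0.5f);          /* round to nearest code */
}

/* Map an 8-bit code back to its real value. */
static inline float dequantize_u8(uint8_t v, quant_params_t q)
{
    return q.scale * (float)((int32_t)v - q.zero_point);
}

/* Dot product on quantized data: integer accumulate, one final rescale. */
static float quantized_dot_u8(size_t n,
                              const uint8_t *a, quant_params_t qa,
                              const uint8_t *b, quant_params_t qb)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += ((int32_t)a[i] - qa.zero_point) *
               ((int32_t)b[i] - qb.zero_point);
    return qa.scale * qb.scale * (float)acc;
}
```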

@jeffhammond
Member

jeffhammond commented Jul 18, 2018 via email

@Maratyszcza
Contributor

ARMv8.2 defined instructions for FP16 (IEEE format) computations. These are natively supported in Cortex-A55 and Cortex-A75 cores, e.g. Snapdragon 845, with the same per-instruction throughput and 2x FLOPS of FP32 computations.

@jacobgorm
Author

Hi again. Are you guys still considering adding half-precision support to BLIS? FWIW, there does seem to be a bit of a hole in the market for a portable LA library that supports this. I know of FBGEMM from Facebook, but it is x86-only and uses a scary JIT, and the last time I tested the ARM Compute Library's GEMM it was really slow compared to BLIS. CLBlast is nice, but only works with OpenCL.

@jeffhammond
Member

https://arxiv.org/pdf/1904.06376.pdf ("Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations") is relevant reading for anyone following this thread.

@jeffhammond
Member

@jacobgorm I have spoken to @dnparikh and @fgvanzee about this on a number of occasions and I am confident that this is a priority for them.

@jeffhammond
Member

@fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.

@jeffhammond
Member

Intel published the BF16 ISA in the April 2019 update (319433-036) of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference.

For those who don't want to search the 149-page PDF, there is an unofficial synopsis on AnandTech.

@fgvanzee
Member

fgvanzee commented May 2, 2019

> @fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.

I'm trying to imagine what could have changed (what observations you could have made) that would flip the polarity on this issue. (You need those extra three bits of mantissa after all?)

@rvdg
Collaborator

rvdg commented May 2, 2019 via email

@jeffhammond
Member

jeffhammond commented May 2, 2019 via email

@fgvanzee
Member

fgvanzee commented May 2, 2019

> We don’t need the exponent bits so why not use for mantissa?

Touche. Anyhow, I'm less concerned with what people want than I am with whether there is basic support for the datatype in either the compiler or the ISA (or both).

@jacobgorm
Author

Clang now has experimental _Float16 support, but only on ARM: https://clang.llvm.org/docs/LanguageExtensions.html

@rvdg
Collaborator

rvdg commented May 2, 2019 via email

@jeffhammond
Member

@jacobgorm https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point also says

__fp16 is supported on every target, as it is purely a storage format; see below.

and

__fp16 is a storage and interchange format only. This means that values of __fp16 are immediately promoted to (at least) float when used in arithmetic operations...

I would argue that BLIS should use typedef to support either format as input data.
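One way the typedef idea could look, as a sketch only. The macro names `SKETCH_USE_FLOAT16` and `SKETCH_USE_FP16` and the type name `sketch_f16_t` are invented for illustration and are not BLIS configuration options; the fallback treats half-precision data as an opaque 16-bit storage type when the compiler offers neither extension.

```c
#include <stdint.h>

/* Pick a 16-bit element type for half-precision input data.
 * Both macros below are hypothetical build-time switches. */
#if defined(SKETCH_USE_FLOAT16)
typedef _Float16 sketch_f16_t;   /* arithmetic type, where the compiler has it */
#elif defined(SKETCH_USE_FP16)
typedef __fp16 sketch_f16_t;     /* storage-only type, promoted to float in math */
#else
typedef uint16_t sketch_f16_t;   /* opaque bits; convert explicitly before use */
#endif

/* Whichever type is chosen, the memory layout seen by callers is the same. */
_Static_assert(sizeof(sketch_f16_t) == 2, "half-precision elements must be 16 bits");
```

Because all three choices occupy two bytes, the external interface can stay the same while the internal arithmetic path differs per platform.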

@jacobgorm
Author

@jeffhammond The advantage to the library developer of having _Float16 in the compiler is that it does not promote to float, which should make initial development easier. I agree that the external interface could just as well be __fp16.

@jeffhammond
Member

@jacobgorm Yes, of course, but since I work for Intel, I have an interest in implementing something that is not restricted to ARM architectures 😃 In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.
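To illustrate why compiler promotion matters little when the kernel does its own conversions, here is a scalar reference sketch of a small rank-k update with bfloat16 inputs and float32 accumulation. The 4x4 tile size, the packing layout, and the function names are assumptions made for this example; this is not the BLIS microkernel interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SK_MR = 4, SK_NR = 4 };   /* hypothetical register tile size */

/* Widen one bfloat16 bit pattern to float32. */
static inline float sk_bf16_widen(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* C (float32, row-major, leading dimension ldc) += A * B, where
 * a holds k column slices of SK_MR bfloat16 values and
 * b holds k row slices of SK_NR bfloat16 values (packed panels). */
static void sk_bf16_ukr_ref(size_t k, const uint16_t *a, const uint16_t *b,
                            float *c, size_t ldc)
{
    float acc[SK_MR][SK_NR] = { { 0.0f } };

    for (size_t p = 0; p < k; ++p) {
        float av[SK_MR], bv[SK_NR];
        /* All conversions are explicit; the compiler never sees a half type. */
        for (size_t i = 0; i < SK_MR; ++i) av[i] = sk_bf16_widen(a[p * SK_MR + i]);
        for (size_t j = 0; j < SK_NR; ++j) bv[j] = sk_bf16_widen(b[p * SK_NR + j]);
        for (size_t i = 0; i < SK_MR; ++i)
            for (size_t j = 0; j < SK_NR; ++j)
                acc[i][j] += av[i] * bv[j];
    }
    for (size_t i = 0; i < SK_MR; ++i)
        for (size_t j = 0; j < SK_NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```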

@fgvanzee
Member

fgvanzee commented May 3, 2019

> In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.

Let's all remember that BLIS allows the user to do more than level-3 operations! My goal is for full operation support for float16 (or bfloat16), even if the implementation is sub-optimal. So the issues around float16 and the compiler are very much important to me (even if efficiency is not).
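As an illustration of what a sub-optimal but complete level-1 operation might look like, the sketch below implements `y := alpha * x + y` on bfloat16 storage with float32 arithmetic. The names are hypothetical and the unit-stride signature is a simplification, not the BLIS axpyv interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen a bfloat16 bit pattern to float32. */
static inline float ax_bf16_load(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float32 to bfloat16, rounding to nearest even. */
static inline uint16_t ax_bf16_store(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)      /* NaN stays NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Reference axpy: every element is widened, updated in float32,
 * and rounded back to bfloat16 once. */
static void ax_bf16_axpy_ref(size_t n, float alpha,
                             const uint16_t *x, uint16_t *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = ax_bf16_store(alpha * ax_bf16_load(x[i]) + ax_bf16_load(y[i]));
}
```

Unlike the level-3 case, the output here is itself a 16-bit vector, so each result is rounded on store and the choice of rounding mode becomes part of the operation's definition.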

@jhogg41

jhogg41 commented May 12, 2019

So far as I'm aware, there isn't a standardized calling convention for _Float16 on Intel, or at least if there is, my version of clang doesn't have it yet. As such, we can't pass data by value, which makes things a little messy (and using __fp16 would imply we worked as __fp16 rather than as _Float16).

@amirgholami

I would also like to request reduced-precision support. I think it would be valuable to add both IEEE 754 FP16 and bfloat16, as the former has major issues for ML training.

P.S: There is also a new TF32 format from Nvidia:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

@jeffhammond
Member

@amirgholami BLIS doesn't support GPUs, but TF32 is just a 19-bit floating-point format stored in 32 bits of data. In the absence of hardware support, there is no upside versus SGEMM. In the presence of hardware support, the implementation is going to be the same as SGEMM but with a different microkernel, except for the loss of accuracy in the results, of course.
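The "19 bits in a 32-bit container" point can be made concrete with a small sketch that keeps the sign, the 8 exponent bits, and the top 10 mantissa bits of a float32 and clears the remaining 13. The function name is hypothetical, and plain truncation is an assumption made here for simplicity; it is not a statement about how any particular hardware rounds.

```c
#include <stdint.h>
#include <string.h>

/* Emulate TF32 precision by clearing the low 13 mantissa bits of a float32.
 * 1 sign + 8 exponent + 10 mantissa = 19 significant bits remain. */
static inline float sketch_f32_truncate_to_tf32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits &= 0xFFFFE000u;          /* keep the top 19 bits, zero the low 13 */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Since the result is still an ordinary float32, feeding such values to SGEMM gains nothing without hardware that exploits the shorter mantissa.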

@amirgholami

amirgholami commented Aug 19, 2020

Hey @jeffhammond

Yes, I am aware that TF32 is supported on the Ampere architecture. I mentioned it as evidence that there is still a lot of active research on low-precision arithmetic. On that note, I should also add MSFP8 and MSFP11, which are from Microsoft and are being used in their Brainwave FPGA project.

Aside from the relatively new formats above, many LA algorithms have already incorporated FP16 or bfloat16 (for example, as preconditioners), and it would be great if BLIS supported them.

P.S.: Regarding hardware support, Intel Cooper Lake, which was announced last month, supports bfloat16 arithmetic.

@AngryLoki
Contributor

The amd/blis fork adds an aocl_gemm addon, which provides bf16 support for gemm on BF16-capable CPUs and a set of functions for s8-u8 gemm on VNNI-capable CPUs. It also adds support for ReLU/GeLU/Downscale/CLIP post-ops.

Merge of amd/blis changes is discussed in #770.

10 participants