-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for float16 (half-precision floats) and related operations such as hgemm() #234
Comments
@jacobgorm Thanks for the suggestion. This is something that is in our medium-range plans. Of course, as you probably already know, the complicating factor is that there is no standard C language support for a |
Some useful information from mpi-forum/mpi-issues#65:
|
@jeffhammond Thank you for taking the time to rustle up these links, Jeff. This will surely prove very useful. |
I recommend that BLIS not support float16 but rather bfloat16. The latest research in machine learning suggests that float16 is inferior to bfloat16 for training because of the software and processing overheads associated with handling the limited numerical range associated with a 5-bit exponent. In any case, implementing both float16 and bfloat16 on hardware that doesn't have native support is relatively easy. In both cases, you use float32 compute. For float16, you can AVX Google recommends the use of bfloat16 with TensorFlow and it is relatively straightforward to understand that it is a better use of bits to have an 8-bit exponent like float32 than the 5-bit exponent used by IEEE float16. Intel's public statement on bfloat16 is:
Disclaimer: I work for Intel. Additional references: |
@jeffhammond Once again, this was very helpful Jeff. Thank you. I had never even heard of |
Yes, |
int8 and int16 are usually employed for inference although I’m aware of
some efforts to use in training. Not sure if worth the software pain though.
|
ARMv8.2 defined instructions for FP16 (IEEE format) computations. These are natively supported in Cortex-A55 and Cortex-A75 cores, e.g. Snapdragon 845, with the same per-instruction throughput and 2x FLOPS of FP32 computations. |
hi again. Are you guys still considering adding half-precision support to BLIS? FWIW there does seem to be a bit of a hole in the market for portable LA library that supports this. I know of FBGEMM from Facebook but it is x86-only and uses a scary JIT, and last I tested the ARM Compute Library's GEMM it was really slow compared to BLIS. CLBlast is nice, but only works with OpenCL. |
https://arxiv.org/pdf/1904.06376.pdf ("Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations") is relevant reading for anyone following this thread. |
@jacobgorm I have spoken to @dnparikh and @fgvanzee about this on a number of occasions and I am confident that this is a priority for them. |
@fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments. |
Intel published the BF16 ISA in the April 2019 update (319433-036) of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference. There is an unofficial synopsis for those who don't want to search the 149-page PDF on Anandtech. |
I'm trying to imagine what could have changed (what observations you could have made) that would flip the polarity on this issue. (You need those extra three bits of mantissa after all?) |
Jacob,
Investigating bfloat16 is on our priority list. We are waiting for word on funding from a sponsor, which may bump it higher on the priority list.
Robert
|
I'm trying to imagine what could have changed (what observations you could
have made) that would flip the polarity on this issue. (You need those
extra three bits of mantissa after all?)
We don’t need the exponent bits so why not use for mantissa?
|
Touche. Anyhow, I'm less concerned with what people want than I am with whether there is basic support for the datatype in either the compiler or the ISA (or both). |
Clang now has experimental _Float16 support, but only on ARM : https://clang.llvm.org/docs/LanguageExtensions.html . |
Sounds like ARM should sponsor this effort, so we can bump it up on our priority list! :-).
Thank you for sharing.
|
@jacobgorm https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point also says
and
I would argue that BLIS should use |
@jeffhammond the advantage as the library developer to having _Float16 in the compiler is that it does not promote to float, which should make initial development easier. I agree that the external interface could just as well be __fp16. |
@jacobgorm Yes, of course, but since I work for Intel, I have an interest in implementing something that is not restricted to ARM architectures 😃 In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue. |
Let's all remember that BLIS allows the user to do more than level-3 operations! My goal is for full operation support for float16 (or bfloat16), even if the implementation is sub-optimal. So the issues around float16 and the compiler are very much important to me (even if efficiency is not). |
So far as I'm aware, there isn't a standardized calling convention for _Float16 on intel, or at least if there is, my version of clang doesn't have it yet. As such we can't pass data by value, which makes things a little messy (and using __fp16 would imply we worked as __fp16 rather than as _Float16). |
I also wanted to request support for reduced precision support. I think it would be valuable to add both IEEE 754's FP16 as well as Bfloat16 as the former has major issues for training ML. P.S: There is also a new TF32 format from Nvidia: |
@amirgholami BLIS doesn't support GPUs but TF32 is just a form of 19-bit floating-point with 32b data. In the absence of hardware support, there is no upside versus SGEMM. In the presence of SGEMM, the implementation is going to be the same as SGEMM but with a different microkernel, except for the loss of accuracy in the results, of course. |
Hey @jeffhammond Yes I am aware of the fact that TF32 is supported on Ampere architecture. I mentioned it as evidence that there is still a lot of active research on low precision arithmetics. On that note I should also add MSFP8 and MSFP11 which are from Microsoft and being used in their brainwave fpga project. Aside from the above formats, which are relatively newer formats, there are a lot of different LA algorithms that have already incorporated FP16 or BFloat16 (for example as preconditioners), and it would be great if bliss would support them. P.S: Regarding hardware support, Intel CooperLake that was announced last month has support for bfloat16 arithmetics. |
amd/blis fork adds aocl_gemm addon, that adds bf16 support to gemm for BF16-capable CPUs and sequence of functions for s8-u8 gemm for VNNI-capable CPUs. Additionally it adds support of ReLU/GeLU/Downscale/CLIP post-ops. Merge of amd/blis changes is discussed in #770. |
I am using BLIS for neural networks on embedded platforms (mostly ARMv8a), and I would like to reap the potential memory savings as well as possibly some speedups from running with half-precision floats. Are there any plans to support these in BLIS?
The text was updated successfully, but these errors were encountered: