
[SYCL] Add bfloat16 utils based on libdevice bfloat16 support. #7503


Closed
wants to merge 17 commits

Conversation

jinge90
Contributor

@jinge90 jinge90 commented Nov 23, 2022

Signed-off-by: jinge90 ge.jin@intel.com

@jinge90 jinge90 requested a review from a team as a code owner November 23, 2022 08:02
@jinge90 jinge90 marked this pull request as draft November 23, 2022 08:02
Contributor

@aelovikov-intel aelovikov-intel left a comment


Do you know somebody who is familiar with math/half/bfloat/etc to review this instead of me? I'm afraid I lack knowledge in the area.

Comment on lines 33 to 37
// Need to ensure that sycl bfloat16 defined in bfloat16.hpp is compatible
// with uint16_t in layout.
#if __cplusplus >= 201703L
static_assert(sizeof(sycl_bfloat16) == sizeof(_iml_bfloat16_internal),
"sycl bfloat16 is not compatible with _iml_bfloat16_internal.");
Contributor


Why do we need a distinct alias at all?

Contributor Author


Hi, @aelovikov-intel
Which distinct alias are you referring to?
Thanks very much.

Contributor


Why can't we use the same type in all places? Why do we need both sycl_bfloat16 and _iml_bfloat16_internal?

Contributor Author


Hi, @aelovikov-intel
The functions defined in the sycl::ext::intel::math:: namespace, such as hge, hgt..., are C++ wrappers for C functions provided in libdevice. All of these C++ functions call the corresponding C functions, and sycl bfloat16 users only work with the C++ functions. sycl bfloat16 is a C++ class defined in https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/bfloat16.hpp and its current implementation is based on the "uint16_t" type. So our C++ functions such as hge, hgt... can only accept the user-visible sycl bfloat16. However, the C functions implemented in libdevice can't use the C++ sycl bfloat16 type; they can only accept native C types, so we need to use the native C uint16_t type for them. In the future, we may use a native bfloat16 type instead of emulation based on uint16_t.
Thanks very much.
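For illustration, here is a minimal sketch of that wrapper pattern, assuming a hypothetical libdevice entry point __imf_hge_sketch (the real libdevice symbols and signatures are the ones in the PR, not these):

#include <cstdint>
#include <sycl/sycl.hpp>

// Hypothetical C entry point implemented in SYCL libdevice; it only sees
// the native uint16_t representation of bfloat16.
extern "C" int __imf_hge_sketch(uint16_t a, uint16_t b);

// User-facing C++ wrapper: accepts sycl bfloat16 and forwards the raw bits.
// This relies on bfloat16 being layout-compatible with uint16_t, which the
// static_assert in this PR checks.
inline bool hge_sketch(sycl::ext::oneapi::bfloat16 a,
                       sycl::ext::oneapi::bfloat16 b) {
  return __imf_hge_sketch(sycl::bit_cast<uint16_t>(a),
                          sycl::bit_cast<uint16_t>(b)) != 0;
}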

@jinge90
Contributor Author

jinge90 commented Nov 24, 2022

Hi, @aelovikov-intel
We are working on providing bfloat16 utils which are similar to the CUDA bfloat16 math APIs. The current implementation is based on fp32 emulation and SYCL libdevice: we use "uint16_t" to represent bfloat16 and implement conversion functions between bfloat16 and fp32, and these conversion functions are implemented in SYCL libdevice. For other utils such as comparison, arithmetic, etc., we convert bfloat16 to fp32, do the work in fp32, then convert the fp32 result back to bfloat16 and return it.
The PR is still in progress; once I finish it and the related tests, I will add the numerics team to the reviewers.
Thanks very much for your quick response.
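As a rough sketch of that emulation pattern (the name emulated_hadd is illustrative only, and it uses the bfloat16 class's own float conversions rather than the libdevice conversion routines the PR calls):

#include <sycl/ext/oneapi/bfloat16.hpp>

using bf16 = sycl::ext::oneapi::bfloat16;

inline bf16 emulated_hadd(bf16 a, bf16 b) {
  // Widen both operands to fp32, do the work in fp32, then narrow the
  // result back to bfloat16.
  float fa = static_cast<float>(a);
  float fb = static_cast<float>(b);
  return bf16(fa + fb);
}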

@JackAKirk
Contributor

Some of the comparison functions appear to be duplicates of those already defined in the bfloat16 class: https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/include/sycl/ext/oneapi/bfloat16.hpp

Also, some of the math functions (max/min, for example) are duplicates of those already added in this oneapi extension: https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc. I think the idea was to add to these existing implementations, with Intel builtins or generic implementations if they don't exist, and then add more math functions progressively, mainly through generic implementations that do the operations using float, as here.

Generally the code looks target agnostic: can this not be added to the oneapi extension directly, rather than the Intel namespace?

@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

Some of the comparison functions appear to be duplicates of those already defined in the bfloat16 class: https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/include/sycl/ext/oneapi/bfloat16.hpp

Also, some of the math functions (max/min, for example) are duplicates of those already added in this oneapi extension: https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc. I think the idea was to add to these existing implementations, with Intel builtins or generic implementations if they don't exist, and then add more math functions progressively, mainly through generic implementations that do the operations using float, as here.

Generally the code looks target agnostic: can this not be added to the oneapi extension directly, rather than the Intel namespace?

Hi, @JackAKirk
The intention of this PR is not to add basic bfloat16 functionality. This PR is part of an effort to provide bfloat16 APIs which are similar to the CUDA bfloat16 math APIs; this will make work easier for users who need to port CUDA code to SYCL.
You are correct that some of these functions are duplicates of those already added to sycl bfloat16, but the current sycl bfloat16 comparison functions don't provide all the functionality we need. For example, we need to provide both the ordered and unordered comparison functions that CUDA provides, and users have also told us that they prefer the function names to be similar to the CUDA math APIs.
This is why I put them in the intel ext instead of the oneapi ext.
Thanks very much.
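To illustrate the ordered/unordered distinction mentioned above (the names hge_ordered/hge_unordered are illustrative; CUDA spells this pair __hge and __hgeu):

#include <cmath>
#include <sycl/ext/oneapi/bfloat16.hpp>

using bf16 = sycl::ext::oneapi::bfloat16;

// Ordered compare: returns false whenever either operand is NaN.
inline bool hge_ordered(bf16 a, bf16 b) {
  float fa = static_cast<float>(a), fb = static_cast<float>(b);
  return !std::isnan(fa) && !std::isnan(fb) && fa >= fb;
}

// Unordered compare: returns true whenever either operand is NaN.
inline bool hge_unordered(bf16 a, bf16 b) {
  float fa = static_cast<float>(a), fb = static_cast<float>(b);
  return std::isnan(fa) || std::isnan(fb) || fa >= fb;
}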

@jinge90 jinge90 marked this pull request as ready for review December 6, 2022 13:44
@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

Hi, @aelovikov-intel
Could you help take a look?
Thanks very much.

@JackAKirk
Contributor


Hi, @JackAKirk The intention of this PR is not to add basic bfloat16 functionality. This PR is part of an effort to provide bfloat16 APIs which are similar to the CUDA bfloat16 math APIs; this will make work easier for users who need to port CUDA code to SYCL. You are correct that some of these functions are duplicates of those already added to sycl bfloat16, but the current sycl bfloat16 comparison functions don't provide all the functionality we need. For example, we need to provide both the ordered and unordered comparison functions that CUDA provides, and users have also told us that they prefer the function names to be similar to the CUDA math APIs. This is why I put them in the intel ext instead of the oneapi ext. Thanks very much.

Thanks for the explanation. For the hmin/hmax functions that are duplicates of the min/max functions from https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc, could we consider just renaming min -> hmin etc. if users prefer this established CUDA naming convention, and having a single min function in the oneapi namespace? Or do we want to keep these h*-named functions in a separate namespace and keep the duplicates? @gmlueck, what do you think? I suppose one issue is that the h* naming convention clashes with the naming convention of e.g. min used for floating-point types in the SYCL 2020 spec.

Note that we planned on adding many more bfloat16 math functions to https://github.com/intel/llvm/blob/37b05f03cb91b1ee04efe41bd5895a9476a2e793/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc eventually.

@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

Hi, @JackAKirk
For hmin/hmax, I prefer keeping the names for three reasons:

  1. the feedback we got from users who need to port CUDA code to SYCL
  2. CUDA bfloat16 math has hmax and hmax_nan, which appear as a pair; the difference between their semantics is how they handle NaN input: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__COMPARISON.html#group__CUDA__MATH____BFLOAT16__COMPARISON_1ga02516645529fe52f80c3341d8376150
  3. The semantics of the functions implemented in this PR align with CUDA math instead of the sycl oneapi spec.
    Regarding point 3: even the hmax implemented in this PR has a minor difference from sycl bfloat16 max. The sycl oneapi spec description is:
    "Returns y if x < y, otherwise it returns x. If one argument is a NaN, fmax() returns the other argument. If both arguments are NaNs, fmax() returns a NaN."
    And the CUDA math description is:

"Calculates nv_bfloat16 max(a, b) defined as (a > b) ? a : b.
If either of inputs is NaN, the other input is returned.
If both inputs are NaNs, then canonical NaN is returned.
If values of both inputs are 0.0, then +0.0 > -0.0"

Thanks very much.
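A sketch contrasting the two NaN policies described above (the names are illustrative, and the fp32 round trip stands in for the PR's actual implementation):

#include <cmath>
#include <limits>
#include <sycl/ext/oneapi/bfloat16.hpp>

using bf16 = sycl::ext::oneapi::bfloat16;

// CUDA __hmax-style: if exactly one input is NaN, return the other input;
// if both are NaN, return a canonical NaN. (The CUDA rule that +0.0 > -0.0
// is omitted here for brevity.)
inline bf16 hmax_like(bf16 a, bf16 b) {
  float fa = static_cast<float>(a), fb = static_cast<float>(b);
  if (std::isnan(fa) && std::isnan(fb))
    return bf16(std::numeric_limits<float>::quiet_NaN());
  if (std::isnan(fa)) return b;
  if (std::isnan(fb)) return a;
  return (fa > fb) ? a : b;
}

// CUDA __hmax_nan-style: if either input is NaN, return a NaN.
inline bf16 hmax_nan_like(bf16 a, bf16 b) {
  float fa = static_cast<float>(a), fb = static_cast<float>(b);
  if (std::isnan(fa) || std::isnan(fb))
    return bf16(std::numeric_limits<float>::quiet_NaN());
  return (fa > fb) ? a : b;
}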

@JackAKirk
Contributor

JackAKirk commented Dec 6, 2022

3. The semantics of the functions implemented in this PR align with CUDA math instead of the sycl oneapi spec.
   Regarding point 3: even the hmax implemented in this PR has a minor difference from sycl bfloat16 max. The sycl oneapi spec description is:

If values of both inputs are 0.0, then +0.0 > -0.0"

That was my mistake: I wrote the SYCL definition and I should have added this missing information, which is the sole discrepancy between the definitions. However, the key point is that these bfloat16 functions, hmin in CUDA and min in the bfloat16 math function oneapi extension, are intentionally identical: the only existing implementation of min(bfloat16, bfloat16) intentionally uses the exact same ptx instruction as the CUDA function, min.bf16. So this definition discrepancy (which is a documentation mistake only) is not a reason for discounting the possibility of removing the duplication of oneapi::min(bfloat16, bfloat16) and intel::hmin(bfloat16, bfloat16).

Note also that we have a similar situation in this extension: #7397, where we adopted the cuda naming of __ldg. This raises another question: here you have transformed __hmin (CUDA naming) to hmin. For consistency this would mean we should also transform __ldg to ldg.

Whatever the solution we should probably decide on a standard set of naming conventions to avoid duplications and confusion.

We also planned to introduce e.g. hmax_nan(bfloat16, bfloat16) in the oneapi namespace (or __hmax_nan / max_nan) for the same reasons you give.

@JackAKirk
Contributor

JackAKirk commented Dec 6, 2022

btw I'm curious whether there is a reason to prioritize all the bfloat16 comparison functions ahead of the math functions: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__FUNCTIONS.html#group__CUDA__MATH____BFLOAT16__FUNCTIONS ?

@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

btw I'm curious whether there is a reason to prioritize all the bfloat16 comparison functions ahead of the math functions: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__FUNCTIONS.html#group__CUDA__MATH____BFLOAT16__FUNCTIONS ?

Hi, @JackAKirk
We received an urgent request from internal users to provide corresponding functions for those CUDA bfloat16 APIs, so we did this work first.
I totally agree with your point that we should avoid duplicates. When the sycl bfloat16 functions are ready, let's review which functions in the intel::math:: namespace can be removed, is that OK?
Thanks very much.

@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

Hi, @JackAKirk
Another question, not related to this PR: do we have a plan to add a "bfloat162" type to the bfloat16 spec? For half, sycl has half and half2; shall we provide "bfloat162" similarly? CUDA math supports it.

Thanks very much.

@JackAKirk
Contributor

JackAKirk commented Dec 6, 2022

Hi, @JackAKirk We received an urgent request from internal users to provide corresponding functions for those CUDA bfloat16 APIs, so we did this work first. I totally agree with your point that we should avoid duplicates. When the sycl bfloat16 functions are ready, let's review which functions in the intel::math:: namespace can be removed, is that OK? Thanks very much.

All the bfloat16 math functions mentioned in the extension doc are already implemented in the ext_oneapi_cuda backend: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/bfloat16_math.hpp. They include min and max. If we can resolve the naming issues it would be good to add your general implementations for other backends to the existing bfloat16_math.hpp header.

@JackAKirk
Contributor

JackAKirk commented Dec 6, 2022

Hi, @JackAKirk Another question not related to this PR, do we have plan to add "bfloat162" type in bfloat16 spec? For half, sycl has half and half2, shall we provide "bfloat162" similarly? CUDA math supports it.

Thanks very much.

It was decided that __nv_bfloat162 should map to sycl::marray<bfloat16, 2>; note that the math functions are also supported for the general sycl::marray<bfloat16, N> case. It is unclear whether we will also add support for sycl::vec<bfloat16, N>: this is still an open question afaik, and any input is welcomed. The thinking was that we didn't want to add duplicates if the vec impl wasn't required, since the marray implementation should be equivalent but more general.

Note also that Nvidia libraries use an analogue of sycl::marray<bfloat16, N>; such an analogue doesn't exist in the CUDA runtime API.
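For example, where CUDA code uses __nv_bfloat162, the mapping described above looks roughly like this (element-wise add shown purely for illustration):

#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/bfloat16.hpp>

using bf16 = sycl::ext::oneapi::bfloat16;
using bf16x2 = sycl::marray<bf16, 2>; // plays the role of __nv_bfloat162

inline bf16x2 add2(bf16x2 a, bf16x2 b) {
  // Element-wise arithmetic on the two packed bfloat16 values.
  return {a[0] + b[0], a[1] + b[1]};
}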

@jinge90
Contributor Author

jinge90 commented Dec 6, 2022

All the bfloat16 math functions mentioned in the extension doc are already implemented in the ext_oneapi_cuda backend: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/bfloat16_math.hpp. They include min and max. If we can resolve the naming issues it would be good to add your general implementations for other backends to the existing bfloat16_math.hpp header.

Hi, @JackAKirk
Thanks for sharing the latest status of sycl bfloat16. I will continue with this PR to add more bfloat16 functions, and let's review them together to see what sycl bfloat16 can use; if any function can be used, moving it to sycl bfloat16 is a good choice.
Besides this PR, we will cover other bfloat16 functions which CUDA provides, such as the type conversion utils; I will sync with you to see if anything can be useful to sycl bfloat16.
Thanks very much.

@JackAKirk
Contributor

Hi, @JackAKirk Thanks for sharing the latest status of sycl bfloat16. I will continue with this PR to add more bfloat16 functions, and let's review them together to see what sycl bfloat16 can use; if any function can be used, moving it to sycl bfloat16 is a good choice. Besides this PR, we will cover other bfloat16 functions which CUDA provides, such as the type conversion utils; I will sync with you to see if anything can be useful to sycl bfloat16. Thanks very much.

So I think it would make sense for the sycl::ext::oneapi bfloat16 math extension (I think the convert functions can also be covered in this extension; perhaps it should be renamed to a bfloat16 builtins extension) to support all the functions that CUDA supports.

Contributor

@aelovikov-intel aelovikov-intel left a comment


Hi, @aelovikov-intel
Could you help take a look?

I agree with @JackAKirk 's concerns regarding duplicated functionality and would like to hear from @gmlueck as well.

Also, why do we have an extension implementation without the extension specification?


// Need to ensure that sycl bfloat16 defined in bfloat16.hpp is compatible
// with uint16_t in layout.
#if __cplusplus >= 201703L
Contributor


No need to check this - C++17 is the minimal supported version.
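That is, the check from the snippet above can simply be kept unconditionally:

static_assert(sizeof(sycl_bfloat16) == sizeof(_iml_bfloat16_internal),
              "sycl bfloat16 is not compatible with _iml_bfloat16_internal.");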

@gmlueck
Contributor

gmlueck commented Dec 6, 2022

There should definitely be an extension specification whenever we add extended APIs to SYCL. However, rather than creating a new extension, can we augment the existing sycl_ext_oneapi_bfloat16_math_functions?

One thing that's not clear to me is whether the math functions in this PR are available on all devices, or if they are only available on Nvidia devices. I think the ones in sycl_ext_oneapi_bfloat16_math_functions are only available on Nvidia?

We should also pay attention to the namespace. The functions in sycl_ext_oneapi_bfloat16_math_functions live in sycl::ext::oneapi::experimental while the ones in this PR live in sycl::ext::intel::math. Is there a good reason to have different namespaces or can we use the same namespace?

@jinge90 mentions above that we might decide to remove some of these math functions later. If this is our intention, the functions in this PR must be an "experimental" extension, and they must reside in the experimental namespace. Once we add a "supported" function, we are committed to supporting it into the foreseeable future unless we go through a lengthy deprecation process.

@JackAKirk
Contributor

JackAKirk commented Dec 7, 2022

One thing that's not clear to me is whether the math functions in this PR are available on all devices, or if they are only available on Nvidia devices. I think the ones in sycl_ext_oneapi_bfloat16_math_functions are only available on Nvidia?

Currently the ones in sycl_ext_oneapi_bfloat16_math_functions are only implemented for Nvidia, here: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/bfloat16_math.hpp, but as discussed below this was just because we used Nvidia-specific builtins for those cases.
For the bfloat16 min function we can straightforwardly combine the generic implementation added here with the Nvidia-specific builtin (see https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/bfloat16_math.hpp) as:

template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fmin(T x, T y) {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  oneapi::detail::Bfloat16StorageT XBits = oneapi::detail::bfloat16ToBits(x);
  oneapi::detail::Bfloat16StorageT YBits = oneapi::detail::bfloat16ToBits(y);
  return oneapi::detail::bitsToBfloat16(__clc_fmin(XBits, YBits));
#else
  // Generic fallback; hisnan/hlt are the helpers added in this PR.
  uint16_t canonical_nan = 0x7FC0;
  uint16_t xa = __builtin_bit_cast(uint16_t, x);
  uint16_t ya = __builtin_bit_cast(uint16_t, y);
  if (hisnan(x) && hisnan(y))
    return __builtin_bit_cast(sycl::ext::oneapi::bfloat16, canonical_nan);
  if (hisnan(x))
    return y;
  else if (hisnan(y))
    return x;
  // fmin(+0.0, -0.0) and fmin(-0.0, +0.0) must return -0.0.
  else if (((xa | ya) == 0x8000) && ((xa & ya) == 0x0))
    return __builtin_bit_cast(sycl::ext::oneapi::bfloat16,
                              static_cast<uint16_t>(0x8000));
  else {
    return (hlt(x, y) ? x : y);
  }
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
}

The only contentious point is that we need to decide the naming convention for these functions as discussed in earlier messages.

The intention was always that the bfloat16 functions we have added so far would also be implemented on other backends (hence the oneapi namespace), via backend-specific builtins or a generic implementation. For the ext_oneapi_cuda backend we started with the bfloat16 functions that had nvptx builtins, and hence we had no reason to write generic software impls that other backends could also use, at least not before those backends add their own builtins etc. We have already implemented all the cases where special Nvidia nvptx builtins exist. This means that in the ext_oneapi_cuda backend all new bfloat16 functions will just use a generic software implementation.

}

sycl::ext::oneapi::bfloat16 float2bfloat16(float f) {
return __builtin_bit_cast(sycl::ext::oneapi::bfloat16,
Contributor


I think it might be recommended to use oneapi::detail::bitsToBfloat16 here and in other places instead of __builtin_bit_cast?

Contributor Author


Hi, @JackAKirk
I didn't realize sycl bfloat16 already has bitsToBfloat16; I will update the code if we move the generic implementation to bfloat16_math.hpp.
Thanks very much.

@gmlueck
Contributor

gmlueck commented Dec 7, 2022

For the bfloat16 min function we can straightforwardly combine the generic implementation added here with the Nvidia specific builtin

This makes sense to me. Is our plan, then, to change this PR to do this?

The only contentious point is that we need to decide the naming convention for these functions as discussed in earlier messages.

Some input from my side:

  • I think we should avoid adding functions with leading double underbar. (I'd be happy to change __ldg to ldg for consistency.)

  • SYCL seems to use the "f" prefix for all the floating point math functions, even the overloads that operate on the half type. I think we should maintain this for consistency.

  • I think we do not have a precedent for expressing rounding mode yet. Using a suffix on the function name is OK with me.

@jinge90
Contributor Author

jinge90 commented Dec 8, 2022

Hi, @gmlueck and @JackAKirk
The background of this PR is that we are required to provide equivalent functions to the CUDA math APIs, and the bfloat16 functions are one important piece of the whole work: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__BFLOAT16.html#group__CUDA__MATH__INTRINSIC__BFLOAT16
Our implementation aims to be tested and to work on Intel CPU, GPU, and ACC platforms, and also to be available on host. These functions can be split into 2 categories:

  1. "inline" functions whose implementation is fully located in the header file
  2. "wrapper" functions whose "real" implementation is provided in SYCL libdevice and the function definition in header file just calls those "real" implementation

For 1, the examples are hmax, hmin.... and for 2, the examples are type convert functions with different rounding mode. The reason why we implement some functions in libdevice is those functions's real implementation is more complicated and not suitable in a hpp file.
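As an illustration of category 2, here is a generic sketch of a round-to-nearest-even float-to-bfloat16 conversion of the kind such libdevice routines perform (this is not the PR's actual libdevice code, and the function name is a placeholder):

#include <cstdint>
#include <cstring>

uint16_t float_to_bfloat16_rne_sketch(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits)); // reinterpret the float's bit pattern
  // NaN input: return a canonical bfloat16 quiet NaN.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
    return 0x7FC0;
  // Round to nearest even: add 0x7FFF plus the bit that will become the
  // result's lsb, then keep the upper 16 bits.
  uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return static_cast<uint16_t>(bits >> 16);
}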

Our bfloat16 utils aim to be tested and to work on all platforms which the SYCL compiler supports (CPU, Intel GPU, ACC) except the NV backend, and we also have corresponding lit tests: intel/llvm-test-suite#1364
As we discussed above, the functions we provide will overlap with what SYCL bfloat16 will provide in the future, and I agree that we should avoid code duplication, so I suggest:

  1. We will not be in a hurry to merge the current PR, but we have more bfloat16 utils implemented locally and will upload all of them to the current PR.
  2. We can review all the utils implemented and check whether any function will be included in SYCL bfloat16.
  3. If some of the functions can be included in SYCL bfloat16, we can move their implementation to bfloat16_math.hpp as a generic implementation which targets host and devices other than NV, and remove them from this PR.

We can try to move max and min to bfloat16_math.hpp as a first step. Does the approach make sense to you?
Thanks very much.

@jinge90 jinge90 changed the title [SYCL] Add bfloat16 comparison utils based on libdevice bfloat16 support. [SYCL] Add bfloat16 utils based on libdevice bfloat16 support. Dec 8, 2022
@gmlueck
Contributor

gmlueck commented Dec 8, 2022

Our bfloat16 utils aim to be tested and to work on all platforms which the SYCL compiler supports (CPU, Intel GPU, ACC) except the NV backend

Why wouldn't we want these bfloat16 utils to work on NV backend?

we will not be in a hurry to merge current PR ...

If you are not in a hurry, I don't mind waiting until we have a better idea of the other math functions we want to add.

If some of the functions can be included in SYCL bfloat16, we can move their implementation to bfloat16_math.hpp as a generic implementation which targets host and devices other than NV, and remove them from this PR.

Why wouldn't we move all of the bfloat16 math functions to the sycl_ext_oneapi_bfloat16_math_functions extension? Isn't it better to have them all in one place?

@jinge90
Contributor Author

jinge90 commented Dec 9, 2022


Hi, @gmlueck

"Why wouldn't we want these bfloat16 utils to work on NV backend?"
Technically, those generic bfloat16 util implementations can work on NV too, but we have better ways to implement them for the NV backend, such as "clc" builtins, which may have better perf.

"Why wouldn't we move all of the bfloat16 math functions to the sycl_ext_oneapi_bfloat16_math_functions extension? Isn't it better to have them all in one place?"
It is better to have them all in one place, but I am not sure whether all of the bfloat16 utils included in the CUDA math APIs will be included in the sycl_ext_oneapi_bfloat16_math_functions extension as well. For any bfloat16 function covered in the CUDA math API which is also accepted by the sycl bfloat16 ext, we will move it to sycl bfloat16 math.

Thanks very much.

@JackAKirk
Contributor

JackAKirk commented Dec 9, 2022

"Why wouldn't we want these bfloat16 utils to work on NV backend?" Technically, those generic bfloat16 util implementation can work on NV too but we have better way to implement them for NV backend such as "clc" builtins which may have better perf.

You don't have to worry about this. It is true that in many cases we will eventually want impls that

  1. Convert bfloat16 inputs to float
  2. Calculate operations using float nvptx builtins
  3. Convert back to bfloat16 for the result

However, this is not true for all cases (using a target-specific builtin), and there is no reason not to let the Nvidia backend (or any other backend IMO) use generic implementations when they are added. We would prefer a not perfectly optimal implementation rather than no implementation. There was a similar situation here: #6038. We implemented marray math functions using the simplest solution from the set of optimal algorithms for the Nvidia backend, but we also switched on the functions for other backends; in the future these other backends could switch to a more optimal implementation if one exists for them.
Of course, if your implementations for Intel hardware use Intel target-specific builtins then it won't be possible to allow other backends to use the functions, but I don't see any such target-specific builtins in the function impls you've added so far.

"Why wouldn't we move all of the bfloat16 math functions to the sycl_ext_oneapi_bfloat16_math_functions extension? Isn't it better to have them all in one place?" It is better to have them all in one place but I am not sure whether all of these bfloat16 utils included in CUDA math APIs will be included in sycl_ext_oneapi_bfloat16_math_functions extension as well. For any bfloat16 function covered in CUDA math API which is also accepted by sycl bfloat16 ext, we will move it to sycl bfloat16 math.

I imagine it is safe to assume that if there is already a given sycl builtin for the float type, we will want to add the corresponding bfloat16 function in the bfloat16 extension. These are all very standard, widely used math functions and I think it would be strange if they didn't have bfloat16 support. This is the case for all the "math" bfloat16 variants from the CUDA library.
For some of the other categories, e.g. the conversion bfloat16 functions where there isn't a corresponding float or half function already in SYCL, I think the key question to consider first is how important the function is in deep learning applications, which is the principal application domain of the bfloat16 type. If a function is widely used then I think it will have to be in the extension, since no one will use the programming model if they can't write the applications they need. If you learn about the importance/priority of a given function then please tell us! This is a young and very active field, so it is very useful to share knowledge when it has relevance for our task.

I think there would only be some argument for not adopting a builtin generally across backends in the case that it is for a very niche application. I am guessing this is going to be quite rare.

@jinge90 jinge90 marked this pull request as draft December 12, 2022 01:47

@xtian-github xtian-github left a comment


LGTM

@jinge90 jinge90 closed this Aug 15, 2023