[SYCL] Add fma_relu extension #5749
Conversation
Added bfloat16 in oneapi experimental namespace. Signed-off-by: jack.kirk <[email protected]>
…_BF16_CONVERSION.asciidoc
libclc changes look good to me.
@hdelan, please resolve merge conflicts and pre-commit failures.
Should be fixed now.
@intel/dpcpp-specification-reviewers, ping.
```c++
template <typename T>
T fma_relu(T a, T b, T c);
}
```
Wouldn't it make more sense for this function to take the `bfloat16` or `bfloat16x2` types themselves rather than `uint16_t` and `uint32_t`?

Also a nit about the organization of this spec: the "Specification" section below is the formal specification of your extension. The description of the `fma_relu` function should be there, not in the "Overview" section.
Wouldn't it make more sense for this function to take the bfloat16 or bfloat16x2 types themselves rather than uint16_t and uint32_t?
I am following the convention used by all of these bfloat16 PRs: #5748 #5724, which use `uint16_t` and `uint32_t` as storage types. Perhaps this mention of storage types doesn't belong in this document. Should I remove it?
The description of the fma_relu function should be there, not in the "Overview" section.
Thanks, I have moved that into the "Specification" section.
I talked with @dkhaldi about the Matrix API, and she says they will add APIs that take the `bfloat16` type soon, but they will keep the `uint16_t` versions also for a transition period. Does it make sense to add `bfloat16` versions of `fma_relu` to this PR, or will you do that in a subsequent one?
Good point, cc @hdelan, we should be able to add bfloat16 implementations of the fma_relu functions in this PR provided that #5393 is merged. We do want the bfloat16x2 cases too, but this will require the definition of a bfloat16x2 class / extension doc update first, analogous to bfloat16 in #5393, so the corresponding bfloat16x2 impls will probably be done in a separate PR from this one. For the joint_matrix API and other bfloat16 math builtins (fabs, fma, fmin, fmax), the uint16_t implementations are already merged and we are already working on follow-up PRs for the corresponding bfloat16 implementations.
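For orientation, a rough sketch of where this leaves the API. The template and the allowed storage types follow the extension document quoted later in this thread; the `sycl::ext::oneapi::experimental` namespace and the comments about the class-based types are assumptions, not text from the PR:

```c++
namespace sycl::ext::oneapi::experimental {

// Current form in the extension document: T is sycl::half, uint16_t (the
// bfloat16 storage type) or uint32_t (the bfloat16x2 storage type).
template <typename T>
T fma_relu(T a, T b, T c);

// Once the bfloat16 class from #5393 is merged, T = bfloat16 could be allowed
// here as well; a class-based bfloat16x2 variant would have to wait for a
// bfloat16x2 class / extension doc, as discussed above.

} // namespace sycl::ext::oneapi::experimental
```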
Removed aspect reference: can be added once the ext_oneapi_bfloat16 aspect is merged.
libclc changes look good to me.
```asciidoc
== Overview

This extension introduces the `fma_relu` function for datatypes `sycl::half`,
`bfloat16` and `bfloat16x2`. `bfloat16` and `bfloat16x2` refer to the bfloat16
```
I think this came up in another review also, but I forget which one. There is no `bfloat16x2` type currently defined in sycl_ext_*_bf16_conversion (soon to be renamed sycl_ext_oneapi_bfloat16).
These changes have been made to the doc. `fma_relu` now accepts the bfloat16 class instead of `uint16_t`. The `bfloat16x2` version still takes `uint32_t` as a storage type, but the doc explains that this will change once the `bfloat16x2` class has been implemented as an extension.
…idoc Co-authored-by: JackAKirk <[email protected]>
```c++
// Available when T is sycl::half, uint16_t (bfloat16) or uint32_t (bfloat16x2)
template <typename T>
T fma_relu(T a, T b, T c);
```
As part of extending the math functions, you are already adding bfloat16/half variants of fma, fmax, etc.
What would be the benefit of adding a specific fma_relu over doing an fma followed by fmax(res, 0)?
This fma_relu extension introduces two big "new" territories to DPC++:
1. Introducing ML activation functions to DPC++: the issue is that these functions are numerous, both the ones we know of and the ones we don't know about yet. Is the expectation to keep adding them as free functions in DPC++? relu is an easy one that can be written using max. What about the others? Why is relu so special here?
2. Introducing fusions to DPC++: fma_relu tells the compiler that these two functions can be fused. While this can be important in libraries, is it really necessary for DPC++? DPC++ has a compiler that can detect that a relu (or some other function) follows an fma and can trigger the fusion the user intended.
One other open question: if we end up deciding to have these very ML-specific functions in DPC++, what should be the objects that use them? Scalar, vector, marray? Why is the only vector type here bfloat16x2? Should this be put under the joint_matrix umbrella as another potential tensor-hardware-accelerated function?
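For reference, a minimal sketch of the non-fused alternative raised in the first question above (an fma followed by a clamp against zero), using `sycl::half` purely for illustration:

```c++
#include <sycl/sycl.hpp>

// relu(fma(a, b, c)) spelled with the standard SYCL math functions: do the
// fused multiply-add, then clamp negative results to zero.
inline sycl::half fma_then_relu(sycl::half a, sycl::half b, sycl::half c) {
  sycl::half r = sycl::fma(a, b, c);
  return sycl::fmax(r, sycl::half{0.0f});
}
```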
These are valid points.
The primary benefit of this sort of extension is that it allows users to concisely target builtins specific to a particular backend. Since the fma_relu function is in the CUDA math headers, we think it is appropriate to have it in DPC++ as well, for ease of porting code, etc. It is our feeling that since this extension targets just the CUDA backend, it will always remain an extension and will not enter the core spec libraries. A DPC++ extension should (as much as possible) give users access to all of the functionality of the backend API, but not necessarily more. Therefore we do not need to be concerned about making fma_relu work for other backends (unless they also have a similar builtin to target).
The question of fusions is an interesting one, and something we will discuss a bit internally. Perhaps in the long run that is the approach that will be used in some instances.
The objects that use the function should be scalar and vector. The reason that bfloat16 has not been vectorized is that the vector types for the bfloat16 class have not been implemented yet. Once they are, we will add the bfloat16 vector versions of this function. bfloat16x2 is vectorized since we are relying on an older implementation of bf16x2 which uses uint32_t as a storage type.
However, for the time being we are interested in representing backend-specific features in DPC++, and since this feature is exposed to the user as a free function in the CUDA headers, we think that is reason enough to bring it into DPC++ as an extension.
Can you share a link to the CUDA math headers that contain the full list of math/ML functions?
I can't find a link to the headers online, but you can find `__hfma_relu` in any regular install of CUDA 11.6:

```
/usr/local/cuda-11.6 $ grep "fma_relu" * -HnrI
include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
targets/x86_64-linux/include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
targets/x86_64-linux/include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
targets/x86_64-linux/include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
targets/x86_64-linux/include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
```
Sorry, you can actually find it all here:
https://docs.nvidia.com/cuda/cuda-math-api/index.html
What do you think the approach should be with these functions? Should we:

- Implement `bfloat16` versions of the core SYCL math functions. In the meantime we could make sure that when, for instance, `exp(my_bf16)` is called, the argument is cast to a `float` to ensure correctness, until the `bfloat16` specialization of `exp` is fully implemented. The problem is that there are a lot of core math functions, and maybe there wouldn't be a clear distinction between those that have a native bf16 version and those that rely on casting. Functions that are not in the core SYCL math library could be added as extensions, as is the case for `fma_relu`.
- Do what CUDA does and add new free functions catering specifically to bf16, like `hexp` for instance (CUDA uses the same functions for bf16 and half). This involves introducing more and more functions, and the list is likely to get longer.
- Not try to add support for these bf16 functions at all.

It is worth noting that not all the functions listed above have their own builtins, but it seems that all of them produce far less PTX than their `float` implementation counterparts, so it would be worthwhile exposing these special bf16 functions in some way.
The reason we have added `fma_relu` is so that users can target the PTX builtin relating to fma_relu. We did this relatively blindly because we thought it was a good idea to have access to all PTX builtins, which we still consider correct.
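A minimal sketch of the cast-to-float fallback described in the first option above, assuming the `bfloat16` class from the bfloat16 extension is available and convertible to and from `float`; the helper name `bf16_exp` is purely illustrative:

```c++
#include <sycl/sycl.hpp>
// Assumes the bfloat16 class from the sycl_ext_oneapi_bfloat16 extension is
// available and included; its header path is not spelled out here since it is
// still in flux at the time of this PR.

namespace oneapi_exp = sycl::ext::oneapi::experimental;

// Hypothetical fallback: until a native bfloat16 specialization of exp exists,
// round-trip through float so the result stays correct (at float precision).
inline oneapi_exp::bfloat16 bf16_exp(oneapi_exp::bfloat16 x) {
  float xf = static_cast<float>(x);           // bfloat16 -> float
  return oneapi_exp::bfloat16{sycl::exp(xf)}; // compute in float, convert back
}
```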
This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using math functions that take marrays. bfloat16 element-wise operations support is also included, and this PR adds bfloat16 scalar/marray impls replacing the existing uint16_t "storage type" implementations for the fma, fmax, fmin, and fabs math functions. The bfloat16 fma_relu function impl has now been added directly in #5749. The existing temporary uint16_t implementations (introduced in #5748 with unmerged tests intel/llvm-test-suite#897) have been removed, since these bfloat16 implementations replace them. Signed-off-by: jack.kirk <[email protected]>
This extension adds `fma_relu`, a fused multiply-add operation that returns `a * b + c > 0 ? a * b + c : 0`. `fma_relu` is implemented here only for datatypes `sycl::half`, `bfloat16` (using storage type `uint16_t`) and `bfloat16x2` (using storage type `uint32_t`).

This PR depends on:
- Intel PR: #5724
- Upstream patch: https://reviews.llvm.org/D118977
- Upstream patch: https://reviews.llvm.org/D116673
- Merged extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc
- Tests PR: intel/llvm-test-suite#898
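A minimal usage sketch of the extension described above, assuming the free function ends up in `sycl::ext::oneapi::experimental` and that the relevant headers are pulled in by `<sycl/sycl.hpp>` (both are assumptions here):

```c++
#include <sycl/sycl.hpp>

namespace oneapi_exp = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  // Shared USM allocation so the host can read the device result directly.
  sycl::half *out = sycl::malloc_shared<sycl::half>(1, q);

  q.single_task([=] {
    sycl::half a = 2.0f, b = -3.0f, c = 1.0f;
    // a * b + c = -5.0, so the fused relu clamps the result to 0.
    out[0] = oneapi_exp::fma_relu(a, b, c);
  }).wait();

  sycl::free(out, q);
  return 0;
}
```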