[SYCL] Add fma_relu extension #5749
Conversation
Added bfloat16 in oneapi experimental namespace. Signed-off-by: jack.kirk <[email protected]>
…_BF16_CONVERSION.asciidoc
libclc changes look good to me.
@hdelan, please resolve merge conflicts and pre-commit failures.
Should be fixed now.
@intel/dpcpp-specification-reviewers, ping.
```c++
template <typename T>
T fma_relu(T a, T b, T c);
}
```
Wouldn't it make more sense for this function to take the `bfloat16` or `bfloat16x2` types themselves rather than `uint16_t` and `uint32_t`?

Also a nit about the organization of this spec: the "Specification" section below is the formal specification of your extension. The description of the `fma_relu` function should be there, not in the "Overview" section.
Wouldn't it make more sense for this function to take the bfloat16 or bfloat16x2 types themselves rather than uint16_t and uint32_t?
I am following the convention used by all of these bfloat16 PRs: #5748 #5724, which use `uint16_t` and `uint32_t` as storage types. Perhaps this mention of storage types doesn't belong in this document. Should I remove it?
The description of the fma_relu function should be there, not in the "Overview" section.
Thanks, I have moved that into the "Specification" section.
I talked with @dkhaldi about the Matrix API, and she says they will add APIs that take the `bfloat16` type soon, but they will keep the `uint16_t` versions also for a transition period. Does it make sense to add `bfloat16` versions of `fma_relu` to this PR, or will you do that in a subsequent one?
Good point, cc @hdelan, we should be able to add bfloat16 implementations of the fma_relu functions in this PR provided that #5393 is merged. We do want the bfloat16x2 cases too, but this will require the definition of a bfloat16x2 class / extension doc update first, analogous to bfloat16 in #5393, so the corresponding bfloat16x2 impls will probably be done in a separate PR from this one. For the joint_matrix API and other bfloat16 math builtins (fabs, fma, fmin, fmax), the uint16_t implementations are already merged and we are already working on follow-up PRs for the corresponding bfloat16 implementations.
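For orientation, a rough sketch of where this leaves the API. The template and the allowed storage types follow the extension document quoted later in this thread; the `sycl::ext::oneapi::experimental` namespace and the comments about the class-based types are assumptions, not text from the PR:

```c++
namespace sycl::ext::oneapi::experimental {

// Current form in the extension document: T is sycl::half, uint16_t (the
// bfloat16 storage type) or uint32_t (the bfloat16x2 storage type).
template <typename T>
T fma_relu(T a, T b, T c);

// Once the bfloat16 class from #5393 is merged, T = bfloat16 could be allowed
// here as well; a class-based bfloat16x2 variant would have to wait for a
// bfloat16x2 class / extension doc, as discussed above.

} // namespace sycl::ext::oneapi::experimental
```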
Removed aspect reference: can be added once the ext_oneapi_bfloat16 aspect is merged.
libclc changes look good to me.
```asciidoc
== Overview

This extension introduces the `fma_relu` function for datatypes `sycl::half`,
`bfloat16` and `bfloat16x2`. `bfloat16` and `bfloat16x2` refer to the bfloat16
```
I think this came up in another review also, but I forget which one. There is no `bfloat16x2` type currently defined in sycl_ext_*_bf16_conversion (soon to be renamed sycl_ext_oneapi_bfloat16).
These changes have been made to the doc. `fma_relu` now accepts the bfloat16 class instead of `uint16_t`. The `bfloat16x2` version still takes `uint32_t` as a storage type, but the doc explains that this will change once the `bfloat16x2` class has been implemented as an extension.
…idoc Co-authored-by: JackAKirk <[email protected]>
```c++
// Available when T is sycl::half, uint16_t (bfloat16) or uint32_t (bfloat16x2)
template <typename T>
T fma_relu(T a, T b, T c);
```
As part of extending the math functions, you are already adding bfloat16/half variants of fma, fmax, etc.
What would be the benefit of adding a specific fma_relu over doing an fma followed by fmax(res, 0)?
This fma_relu extension introduces two big "new" territories to DPC++:
1. Introducing ML activation functions to DPC++: the issue is that these functions are numerous, both the ones we know of and the ones we don't know about yet. Is the expectation to keep adding them as free functions in DPC++? relu is an easy one that can be written using max. What about the others? Why is relu so special here?
2. Introducing fusions to DPC++: fma_relu tells the compiler that these two functions can be fused. While this can be important in libraries, is it really necessary for DPC++? DPC++ has a compiler that can detect that a relu (or some other function) follows an fma and can trigger the fusion the user intended.
One other open question: if we end up deciding to have these very ML-specific functions in DPC++, what should be the objects that use them? Scalar, vector, marray? Why is the only vector type here bfloat16x2? Should this be put under the joint_matrix umbrella as another potential tensor-hardware-accelerated function?
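For reference, a minimal sketch of the non-fused alternative raised in the first question above (an fma followed by a clamp against zero), using `sycl::half` purely for illustration:

```c++
#include <sycl/sycl.hpp>

// relu(fma(a, b, c)) spelled with the standard SYCL math functions: do the
// fused multiply-add, then clamp negative results to zero.
inline sycl::half fma_then_relu(sycl::half a, sycl::half b, sycl::half c) {
  sycl::half r = sycl::fma(a, b, c);
  return sycl::fmax(r, sycl::half{0.0f});
}
```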
These are valid points.
The primary benefit of this sort of extension is that it allows users to concisely target builtins specific to a particular backend. Since the fma_relu function is in the CUDA math headers, we think it is appropriate to have it in DPC++ as well, for ease of porting code, etc. It is our feeling that since this extension targets just the CUDA backend, it will always remain an extension and will not enter the core spec libraries. A DPC++ extension should (as much as possible) give users access to all of the functionality of the backend API, but not necessarily more. Therefore we do not need to be concerned about making fma_relu work for other backends (unless they also have a similar builtin to target).
The question of fusions is an interesting one, and something we will discuss a bit internally. Perhaps in the long run that is the approach that will be used in some instances.
The objects that use the function should be scalar and vector. The reason that bfloat16 has not been vectorized is that the vector types for the bfloat16 class have not been implemented yet. Once they are, we will add the bfloat16 vector versions of this function. bfloat16x2 is vectorized since we are relying on an older implementation of bf16x2 which uses uint32_t as a storage type.
However, for the time being we are interested in representing backend-specific features in DPC++, and since this feature is exposed to the user as a free function in the CUDA headers, we think that is reason enough to bring it into DPC++ as an extension.
Can you share a link to the CUDA math headers that contain the full list of math/ML functions?
I can't find a link to the headers online, but you can find `__hfma_relu` in any regular install of CUDA 11.6:

```
/usr/local/cuda-11.6 $ grep "fma_relu" * -HnrI
include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
targets/x86_64-linux/include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
targets/x86_64-linux/include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
targets/x86_64-linux/include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
targets/x86_64-linux/include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
```
Sorry, you can actually find it all here:
https://docs.nvidia.com/cuda/cuda-math-api/index.html
What do you think the approach should be with these functions? Should we:

- Implement `bfloat16` versions of the core SYCL math functions. In the meantime we could make sure that when, for instance, `exp(my_bf16)` is called, the argument is cast to a `float` to ensure correctness, until the `bfloat16` specialization of `exp` is fully implemented. The problem is that there are a lot of core math functions, and maybe there wouldn't be a clear distinction between those that have a native bf16 version and those that rely on casting. Functions that are not in the core SYCL math library could be added as extensions, as is the case for `fma_relu`.
- Do what CUDA does and add new free functions catering specifically to bf16, like `hexp` for instance (CUDA uses the same functions for bf16 and half). This involves introducing more and more functions, and the list is likely to get longer.
- Not try to add support for these bf16 functions at all.

It is worth noting that not all the functions listed above have their own builtins, but it seems that all of them produce far less PTX than their `float` implementation counterparts, so it would be worthwhile exposing these special bf16 functions in some way.
The reason we have added `fma_relu` is so that users can target the PTX builtin relating to fma_relu. We did this relatively blindly because we thought it was a good idea to have access to all PTX builtins, which we still consider correct.
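A minimal sketch of the cast-to-float fallback described in the first option above, assuming the `bfloat16` class from the bfloat16 extension is available and convertible to and from `float`; the helper name `bf16_exp` is purely illustrative:

```c++
#include <sycl/sycl.hpp>
// Assumes the bfloat16 class from the sycl_ext_oneapi_bfloat16 extension is
// available and included; its header path is not spelled out here since it is
// still in flux at the time of this PR.

namespace oneapi_exp = sycl::ext::oneapi::experimental;

// Hypothetical fallback: until a native bfloat16 specialization of exp exists,
// round-trip through float so the result stays correct (at float precision).
inline oneapi_exp::bfloat16 bf16_exp(oneapi_exp::bfloat16 x) {
  float xf = static_cast<float>(x);           // bfloat16 -> float
  return oneapi_exp::bfloat16{sycl::exp(xf)}; // compute in float, convert back
}
```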
This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using math functions that take marrays. bfloat16 element-wise operations support is also included, and this PR adds bfloat16 scalar/marray impls replacing the existing uint16_t "storage type" implementations for the fma, fmax, fmin, and fabs math functions. The bfloat16 fma_relu function impl has now been added directly in #5749. The existing temporary uint16_t implementations (introduced in #5748 with unmerged tests intel/llvm-test-suite#897) have been removed, since these bfloat16 implementations replace them. Signed-off-by: jack.kirk <[email protected]>
This extension adds `fma_relu`, a fused multiply-add operation that returns `a * b + c > 0 ? a * b + c : 0`. `fma_relu` is implemented here only for datatypes `sycl::half`, `bfloat16` (using storage type `uint16_t`) and `bfloat16x2` (using storage type `uint32_t`).

This PR depends on:
- Intel PR: #5724
- Upstream patch: https://reviews.llvm.org/D118977
- Upstream patch: https://reviews.llvm.org/D116673
- Merged extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc
- Tests PR: intel/llvm-test-suite#898
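A minimal usage sketch of the extension described above, assuming the free function ends up in `sycl::ext::oneapi::experimental` and that the relevant headers are pulled in by `<sycl/sycl.hpp>` (both are assumptions here):

```c++
#include <sycl/sycl.hpp>

namespace oneapi_exp = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  // Shared USM allocation so the host can read the device result directly.
  sycl::half *out = sycl::malloc_shared<sycl::half>(1, q);

  q.single_task([=] {
    sycl::half a = 2.0f, b = -3.0f, c = 1.0f;
    // a * b + c = -5.0, so the fused relu clamps the result to 0.
    out[0] = oneapi_exp::fma_relu(a, b, c);
  }).wait();

  sycl::free(out, q);
  return 0;
}
```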