[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

npmiller · 2021-12-14T18:03:37Z

This patch add __nvvm_reflect support for __CUDA_PREC_SQRT and adds
a -Xclang -fcuda-prec-sqrt flag which is equivalent to the nvcc
-prec-sqrt flag, except that it defaults to false for clang++ and
to true for nvcc.

The reason for that is that the SYCL specification doesn't require a
correctly rounded sqrt so we likely want to keep the fast sqrt as a
default and use the flag when higher precision is required.

See additional discussion on #4041 and #5116

I tested this with hellinger-sycl from HeCBench:

No extra flag, so default sycl::sqrt being used, with lower precision but better performance:

% nvprof ./main
==394249== NVPROF is profiling process 394249, command: ./main
Problem size: c(768,3072) = a(768,1536) * b(1536,3072)
Fail - The result is incorrect for element: [0, 31], expected: 0.250686, but found: 0.250686
Fail - The result is incorrect for element: [0, 37], expected: 0.241163, but found: 0.241163
Fail - The result is incorrect for element: [0, 39], expected: 0.240011, but found: 0.240011
Fail - The result is incorrect for element: [0, 51], expected: 0.245439, but found: 0.245439
Fail - The result is incorrect for element: [0, 54], expected: 0.244692, but found: 0.244692
FAIL
==394249== Profiling application: ./main
==394249== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.91%  5.95323s       100  59.532ms  59.036ms  66.540ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.04%  2.6810ms         1  2.6810ms  2.6810ms  2.6810ms  [CUDA memcpy DtoH]
                    0.04%  2.4583ms         2  1.2291ms  511.12us  1.9472ms  [CUDA memcpy HtoD]

Added -Xclang -fcuda-prec-sqrt to the CFLAGS, so correctly rounded sqrt but lower performance:

% nvprof ./main
==394357== NVPROF is profiling process 394357, command: ./main
Problem size: c(768,3072) = a(768,1536) * b(1536,3072)
PASS
==394357== Profiling application: ./main
==394357== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.95%  10.1790s       100  101.79ms  100.61ms  122.74ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.03%  2.6858ms         1  2.6858ms  2.6858ms  2.6858ms  [CUDA memcpy DtoH]
                    0.02%  2.4209ms         2  1.2105ms  488.78us  1.9321ms  [CUDA memcpy HtoD]

This patch add `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc` `-prec-sqrt` flag, except that it defaults to `false` for `clang++` and to `true` for `nvcc`. The reason for that is that the SYCL specification doesn't require a correctly rounded `sqrt` so we likely want to keep the fast `sqrt` as a default and use the flag when higher precision is required. See additional discussion on intel#4041 and intel#5116

smanna12 · 2021-12-14T21:04:42Z

Could you please add a test? Thanks

andykaylor · 2021-12-15T18:53:37Z

The problem of wanting correctly rounded sqrt and divide is not limited to CUDA, so I'd like to see a more general option provided and not requiring a '-Xclang' prefix. I'd also like it to be more explicit that this refers to single precision operations. What do you think of something like this:

-fprec-sqrt-f32, -fno-prec-sqrt-f32

Require single precision floating point square root operations to be correctly rounded. If -fno-prec-sqrt-f32 is used, these operations may be performed with up to 2.5 ulp relative error.

This option (and similar options for other precisions) could even be useful in the host CPU compiler, as could equivalent options for division.

npmiller · 2021-12-16T14:42:32Z

Could you please add a test? Thanks

I've expanded the llvm test checking __nvvm_reflect, added a clang test for the flag, and updated the NVPTX backend documentation to add the reflect flag.

npmiller · 2021-12-16T15:13:57Z

@andykaylor do you mean a really generic option? I think maybe a SYCL option would be more appropriate here, maybe something like: -fsycl-prec-sqrt-f32, that could be used for other SYCL targets.

I'm not sure a generic option makes that much sense because for regular C/C++, sqrt would defer to the standard library. For SYCL we can tweak the standard library in libclc fairly easily, for regular C/C++ that seems like a much more consequential change.

So essentially, I think the -fcuda-prec-sqrt flag still makes sense to add as it just wires up an existing CUDA mechanism and could be used outside of SYCL. Then I think something like -fsycl-prec-sqrt-f32 that we can also use for other targets would also make sense, and for CUDA we could simply wire that to the -fcuda-prec-sqrt flag. But when it comes to an even more generic flag, I'm not sure it makes sense, or at least I think this is something that would need to be discussed and proposed upstream first.

andykaylor · 2021-12-16T16:44:58Z

@andykaylor do you mean a really generic option? I think maybe a SYCL option would be more appropriate here, maybe something like: -fsycl-prec-sqrt-f32, that could be used for other SYCL targets.

I did intend the fully generic option, but I think you have a point about there being more to discuss to make that happen.

If you specify -fno-math-errno, a call to sqrt will be represented by llvm.sqrt in IR and the backend is free to lower that to something other than a library call. Even with math-errno, some backends will check for negative input and only call the library function for that case. (Example: https://godbolt.org/z/8GG9c1zP3) The ability to approximate sqrt allows some more optimized lowering sequences, particularly in the case of reciprocal square root.

The problem, as I saw when putting together the details for my response, is in setting the limit for how much error the flag should allow. The Intel Software Optimization Guide, for example, offers an instruction sequence that would approximate sqrt with 4 ulp error, but if we say the flag allows 4 ulp error, then it doesn't do what we need for SYCL since the SYCL spec only allows 2.5 ulp error even in the relaxed mode.

So, having thought this through, I am persuaded that your suggestion of adding a -fsycl-prec-sqrt-f32 option that gets wired to the -fcuda-prec-sqrt makes sense. We should add -fsycl-prec-div-f32 while we're at it.

elizabethandrews

FE changes LGTM

smanna12

FE changes look good to me.

AGindinson

LGTM for Options.td

It follows the approach from intel#5141 and intel#5309 adding intermediate fcuda-prec-div flag. Signed-off-by: Sidorov, Dmitry <[email protected]>

npmiller requested review from AGindinson, bader, elizabethandrews, hchilama, mdtoguchi, premanandrao, pvchupin and smanna12 as code owners December 14, 2021 18:03

[SYCL][CUDA] Fix formatting

ade2555

bader requested a review from andykaylor December 14, 2021 18:29

npmiller mentioned this pull request Dec 15, 2021

[CUDA] sycl::sqrt leads to IEEE754 incompatible results on NVidia cards #4041

Closed

[SYCL][CUDA] Add nvvm-reflect test for prec-sqrt

2f6ade7

npmiller added 3 commits December 16, 2021 10:34

[SYCL][CUDA] Fix nvvm ftz test

50372de

[SYCL][CUDA] Update NVPTX docs with new reflect flag

0ec92ec

[SYCL][CUDA] Add -fcuda-prec-sqrt flag test

9f722e6

bader approved these changes Dec 17, 2021

View reviewed changes

elizabethandrews approved these changes Dec 20, 2021

View reviewed changes

smanna12 approved these changes Dec 20, 2021

View reviewed changes

AGindinson approved these changes Dec 31, 2021

View reviewed changes

bader merged commit ebf9ea8 into intel:sycl Dec 31, 2021

npmiller mentioned this pull request Jan 14, 2022

[SYCL] Add -fsycl-fp32-prec-sqrt flag #5309

Merged

npmiller mentioned this pull request Mar 13, 2024

[HIP] DPC++ does not use correctly rounded sqrt, even when using -fno-fast-math #12961

Open

MrSidims mentioned this pull request Feb 18, 2025

[SYCL][CUDA][HIP] Propagate -foffload-fp32-prec-sqrt #17044

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

Uh oh!

npmiller commented Dec 14, 2021

Uh oh!

smanna12 commented Dec 14, 2021

Uh oh!

andykaylor commented Dec 15, 2021

Uh oh!

npmiller commented Dec 16, 2021

Uh oh!

npmiller commented Dec 16, 2021

Uh oh!

andykaylor commented Dec 16, 2021

Uh oh!

elizabethandrews left a comment

Uh oh!

smanna12 left a comment

Uh oh!

AGindinson left a comment

Uh oh!

Uh oh!

[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

Uh oh!

Conversation

npmiller commented Dec 14, 2021

Uh oh!

smanna12 commented Dec 14, 2021

Uh oh!

andykaylor commented Dec 15, 2021

Uh oh!

npmiller commented Dec 16, 2021

Uh oh!

npmiller commented Dec 16, 2021

Uh oh!

andykaylor commented Dec 16, 2021

Uh oh!

elizabethandrews left a comment

Choose a reason for hiding this comment

Uh oh!

smanna12 left a comment

Choose a reason for hiding this comment

Uh oh!

AGindinson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!