[LIBCLC][CUDA] Use generic sqrt implementation #5116

Closed
npmiller wants to merge 1 commit

Conversation

npmiller (Contributor) commented Dec 9, 2021

This fixes #4041: the generic libclc `sqrt` implementation falls back on the LLVM intrinsic, which generates the correctly rounded `sqrt.rn.f`, whereas the `__nv_` sqrt built-in generates the "native" version `sqrt.approx.f`, which doesn't have the same precision.

I've run both the sample from #4041 and the hellinger-sycl benchmark, and both pass with this patch.

We may have to review the other built-ins as well; a lot of them use the `__nv_` variants, which may also not have good enough precision.
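
For illustration, here is a minimal sketch of the two lowering paths described above. This is not the actual libclc code; the function names are made up, and the lowering comments describe the usual NVPTX behaviour discussed in this PR:

```cpp
// Sketch only: the function names are hypothetical, not the actual libclc sources.

// Generic libclc path: the clang builtin lowers to the llvm.sqrt.f32 intrinsic,
// which the NVPTX backend emits as the correctly rounded sqrt.rn.f32.
float device_sqrt_generic(float x) { return __builtin_sqrtf(x); }

// libdevice path: __nv_sqrtf comes from CUDA's libdevice and, as described
// above, ends up as the faster but less precise sqrt.approx.f32.
extern "C" float __nv_sqrtf(float);
float device_sqrt_libdevice(float x) { return __nv_sqrtf(x); }
```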

npmiller requested a review from bader as a code owner December 9, 2021 18:48
npmiller changed the title from [SYCL][CUDA] Use generic sqrt implementation to [LIBCLC][CUDA] Use generic sqrt implementation Dec 9, 2021
bader added the cuda (CUDA back-end) and libclc (libclc project related issues) labels Dec 9, 2021
bader (Contributor) left a comment

Considering that the accuracy of the CUDA built-ins is equal to or better than the OpenCL built-ins (see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/cuda-vs-opencl-math-builtin-precisions.md for more details), it seems okay.
OTOH, I assumed that the SPIR-V built-ins are implemented as wrappers around the corresponding CUDA built-ins, so it's not clear how we managed to change the mapping for a SPIR-V built-in. Was it done for a performance improvement? Won't this change degrade sqrt performance?

npmiller (Contributor, Author)

> Considering that the accuracy of the CUDA built-ins is equal to or better than the OpenCL built-ins (see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/cuda-vs-opencl-math-builtin-precisions.md for more details), it seems okay. OTOH, I assumed that the SPIR-V built-ins are implemented as wrappers around the corresponding CUDA built-ins, so it's not clear how we managed to change the mapping for a SPIR-V built-in. Was it done for a performance improvement? Won't this change degrade sqrt performance?

Okay, so I had a further look into this. First of all, it does degrade performance significantly; on the hellinger-sycl benchmark I'm getting the following performance drop:

sqrt.approx.f, current sycl branch:

 GPU activities:   99.91%  5.93577s       100  59.358ms  59.055ms  66.582ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.05%  2.6821ms         1  2.6821ms  2.6821ms  2.6821ms  [CUDA memcpy DtoH]
                    0.04%  2.4431ms         2  1.2216ms  493.77us  1.9494ms  [CUDA memcpy HtoD]

sqrt.rn.f, this patch:

 GPU activities:   99.95%  10.1411s       100  101.41ms  100.33ms  122.04ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.03%  2.6770ms         1  2.6770ms  2.6770ms  2.6770ms  [CUDA memcpy DtoH]
                    0.02%  2.4431ms         2  1.2216ms  492.69us  1.9504ms  [CUDA memcpy HtoD]

Nothing in the git history suggests this was done for performance. However, looking further at the specification, it says the following for the CUDA built-in:

0 ulp (when compiled with -prec-sqrt=true) otherwise 1 ulp if compute capability ≥ 5.2 and 3 ulp otherwise.

And for the OpenCL 1.2 built-in:

≤ 3 ulp

So I believe the approx variant actually has enough precision for the SYCL requirements.

However, we should probably support compiler flags to raise the precision, and the issue in the original ticket is likely more that -ffp-model=precise doesn't switch from approx back to the full-precision instruction.
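
For reference, "n ulp" means the result may differ from the correctly rounded value by at most n units in the last place. Below is a small host-side sketch, not part of this PR, of how one could measure that distance when checking device results; ulp_distance is a made-up helper and assumes finite inputs of the same sign:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Distance in representable floats between two finite values of the same sign,
// i.e. the difference of their IEEE 754 bit patterns interpreted as integers.
std::int64_t ulp_distance(float a, float b) {
  std::int32_t ia, ib;
  std::memcpy(&ia, &a, sizeof ia);
  std::memcpy(&ib, &b, sizeof ib);
  return std::llabs(static_cast<std::int64_t>(ia) - static_cast<std::int64_t>(ib));
}

// Usage: compare a value computed on the device (e.g. by sycl::sqrt) against the
// correctly rounded host result; the OpenCL/SYCL bound quoted above is <= 3 ulp:
//   bool within_bound = ulp_distance(device_result, std::sqrt(x)) <= 3;
```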

npmiller (Contributor, Author)

Closing this as it's the wrong approach.

@npmiller npmiller closed this Dec 10, 2021
bader (Contributor) commented Dec 10, 2021

> However, we should probably support compiler flags to raise the precision, and the issue in the original ticket is likely more that -ffp-model=precise doesn't switch from approx back to the full-precision instruction.

That makes sense to me. Thanks for looking into this.
@andykaylor, just FYI.

zjin-lcf (Contributor)

Does -ffast-math enable the fast sqrt?

npmiller (Contributor, Author)

> Does -ffast-math enable the fast sqrt?

The fast sqrt is the default one, so it will be used with or without -ffast-math.

zjin-lcf (Contributor)

The default should be the slow one when CUDA support is enabled, so users will not see a result mismatch. Is that right?

npmiller (Contributor, Author)

> The default should be the slow one when CUDA support is enabled, so users will not see a result mismatch. Is that right?

I don't think so because the fast one fulfills the precision requirements for SYCL:

  • SYCL requirements: ≤ 3 ulp
  • CUDA: 0 ulp (when compiled with -prec-sqrt=true) otherwise 1 ulp if compute capability ≥ 5.2 and 3 ulp otherwise.

And so I think the default should be the fastest version that fulfills the SYCL specification requirements. It may not be the best when porting from CUDA, but I think that's what makes the most sense from the SYCL point of view.

What I'm looking into is adding the -prec-sqrt nvcc flag to clang, so users porting from CUDA can use it to raise the precision if they need to, while SYCL applications would still get the expected performance and precision when using sqrt.

zjin-lcf (Contributor) commented Dec 13, 2021

Nowadays, most NVIDIA GPUs have compute capability ≥ 5.2, so the error is 1 ulp. However, SYCL allows more than 1 ulp, and that has caused result mismatches when porting a CUDA program. If the SYCL spec needs to be modified, please let the committee know.

I understand the SYCL point of view. Thanks.

npmiller added a commit to npmiller/llvm that referenced this pull request Dec 14, 2021
This patch adds `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds
a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc`
`-prec-sqrt` flag, except that it defaults to `false` for `clang++` and
to `true` for `nvcc`.

The reason for that is that the SYCL specification doesn't require a
correctly rounded `sqrt` so we likely want to keep the fast `sqrt` as a
default and use the flag when higher precision is required.

See additional discussion on intel#4041 and intel#5116
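
A minimal sketch of the mechanism this commit message describes, assuming the usual `__nvvm_reflect` idiom from libdevice and clang's NVPTX `__nvvm_sqrt_*` builtins; the exact code in the patch may differ, and `my_sqrt` is a made-up name:

```cpp
// Sketch only, not the actual libclc patch. In device code targeting NVPTX,
// the NVVMReflect pass replaces calls to __nvvm_reflect("__CUDA_PREC_SQRT")
// with a compile-time constant (driven here by the -Xclang -fcuda-prec-sqrt
// flag), so the branch below folds away and only one sqrt flavour remains.
extern "C" int __nvvm_reflect(const char *);

float my_sqrt(float x) {
  if (__nvvm_reflect("__CUDA_PREC_SQRT"))
    return __nvvm_sqrt_rn_f(x);     // correctly rounded sqrt.rn.f32
  return __nvvm_sqrt_approx_f(x);   // fast approximate sqrt.approx.f32
}
```
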
bader pushed a commit that referenced this pull request Dec 31, 2021
Labels: cuda (CUDA back-end), libclc (libclc project related issues)

Successfully merging this pull request may close these issues:

[CUDA] sycl::sqrt leads to IEEE754 incompatible results on NVidia cards
3 participants