[LIBCLC][CUDA] Use generic sqrt implementation #5116

Closed
npmiller wants to merge 1 commit

Conversation

npmiller (Contributor) commented Dec 9, 2021

This fixes #4041: the generic libclc `sqrt` implementation falls back on the LLVM intrinsic, which generates the correctly rounded `sqrt.rn.f`, whereas the `__nv_` sqrt built-in generates the "native" version `sqrt.approx.f`, which doesn't have the same precision.

I've run both the sample from #4041 and the hellinger-sycl benchmark, and both pass with this patch.

We may have to review the other built-ins as well; a lot of them use the `__nv_` variants, which may also not have good enough precision.
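
For illustration, here is a minimal sketch of the two lowering paths described above. This is not the actual libclc code; the function names are made up, and the lowering comments describe the usual NVPTX behaviour discussed in this PR:

```cpp
// Sketch only: the function names are hypothetical, not the actual libclc sources.

// Generic libclc path: the clang builtin lowers to the llvm.sqrt.f32 intrinsic,
// which the NVPTX backend emits as the correctly rounded sqrt.rn.f32.
float device_sqrt_generic(float x) { return __builtin_sqrtf(x); }

// libdevice path: __nv_sqrtf comes from CUDA's libdevice and, as described
// above, ends up as the faster but less precise sqrt.approx.f32.
extern "C" float __nv_sqrtf(float);
float device_sqrt_libdevice(float x) { return __nv_sqrtf(x); }
```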

npmiller requested a review from bader as a code owner December 9, 2021 18:48
npmiller changed the title from [SYCL][CUDA] Use generic sqrt implementation to [LIBCLC][CUDA] Use generic sqrt implementation Dec 9, 2021
bader added the cuda (CUDA back-end) and libclc (libclc project related issues) labels Dec 9, 2021
bader (Contributor) left a comment

Considering that the accuracy of the CUDA built-ins is equal to or better than the OpenCL built-ins (see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/cuda-vs-opencl-math-builtin-precisions.md for more details), it seems okay.
OTOH, I assumed that the SPIR-V built-ins are implemented as wrappers around the corresponding CUDA built-ins, so it's not clear how we managed to change the mapping for a SPIR-V built-in. Was it done for a performance improvement? Won't this change degrade sqrt performance?

npmiller (Contributor, Author)

> Considering that the accuracy of the CUDA built-ins is equal to or better than the OpenCL built-ins (see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/cuda-vs-opencl-math-builtin-precisions.md for more details), it seems okay. OTOH, I assumed that the SPIR-V built-ins are implemented as wrappers around the corresponding CUDA built-ins, so it's not clear how we managed to change the mapping for a SPIR-V built-in. Was it done for a performance improvement? Won't this change degrade sqrt performance?

Okay, so I had a further look into this. First of all, it does degrade performance significantly; on the hellinger-sycl benchmark I'm getting the following performance drop:

sqrt.approx.f, current sycl branch:

 GPU activities:   99.91%  5.93577s       100  59.358ms  59.055ms  66.582ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.05%  2.6821ms         1  2.6821ms  2.6821ms  2.6821ms  [CUDA memcpy DtoH]
                    0.04%  2.4431ms         2  1.2216ms  493.77us  1.9494ms  [CUDA memcpy HtoD]

sqrt.rn.f, this patch:

 GPU activities:   99.95%  10.1411s       100  101.41ms  100.33ms  122.04ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.03%  2.6770ms         1  2.6770ms  2.6770ms  2.6770ms  [CUDA memcpy DtoH]
                    0.02%  2.4431ms         2  1.2216ms  492.69us  1.9504ms  [CUDA memcpy HtoD]

Nothing in the git history suggests this was done for performance. However, looking further at the specification, it says the following for the CUDA built-in:

0 ulp (when compiled with -prec-sqrt=true) otherwise 1 ulp if compute capability ≥ 5.2 and 3 ulp otherwise.

And for the OpenCL 1.2 built-in:

≤ 3 ulp

So I believe the approx variant actually has enough precision for the SYCL requirements.

However, we should probably support compiler flags to raise the precision, and the issue in the original ticket is likely more that -ffp-model=precise doesn't switch from approx back to the full-precision instruction.
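
For reference, "n ulp" means the result may differ from the correctly rounded value by at most n units in the last place. Below is a small host-side sketch, not part of this PR, of how one could measure that distance when checking device results; ulp_distance is a made-up helper and assumes finite inputs of the same sign:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Distance in representable floats between two finite values of the same sign,
// i.e. the difference of their IEEE 754 bit patterns interpreted as integers.
std::int64_t ulp_distance(float a, float b) {
  std::int32_t ia, ib;
  std::memcpy(&ia, &a, sizeof ia);
  std::memcpy(&ib, &b, sizeof ib);
  return std::llabs(static_cast<std::int64_t>(ia) - static_cast<std::int64_t>(ib));
}

// Usage: compare a value computed on the device (e.g. by sycl::sqrt) against the
// correctly rounded host result; the OpenCL/SYCL bound quoted above is <= 3 ulp:
//   bool within_bound = ulp_distance(device_result, std::sqrt(x)) <= 3;
```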

npmiller (Contributor, Author)

Closing this as it's the wrong approach.

@npmiller npmiller closed this Dec 10, 2021
bader (Contributor) commented Dec 10, 2021

> However, we should probably support compiler flags to raise the precision, and the issue in the original ticket is likely more that -ffp-model=precise doesn't switch from approx back to the full-precision instruction.

That makes sense to me. Thanks for looking into this.
@andykaylor, just FYI.

zjin-lcf (Contributor)

Does -ffast-math enable the fast sqrt?

npmiller (Contributor, Author)

> Does -ffast-math enable the fast sqrt?

The fast sqrt is the default one, so it will be used with or without -ffast-math.

zjin-lcf (Contributor)

The default should be the slow one when CUDA support is enabled, so users will not see a result mismatch. Is that right?

npmiller (Contributor, Author)

> The default should be the slow one when CUDA support is enabled, so users will not see a result mismatch. Is that right?

I don't think so because the fast one fulfills the precision requirements for SYCL:

  • SYCL requirements: ≤ 3 ulp
  • CUDA: 0 ulp (when compiled with -prec-sqrt=true) otherwise 1 ulp if compute capability ≥ 5.2 and 3 ulp otherwise.

And so I think the default should be the fastest version that fulfills the SYCL specification requirements. It may not be the best when porting from CUDA, but I think that's what makes the most sense from the SYCL point of view.

What I'm looking into is adding the -prec-sqrt nvcc flag to clang, so users porting from CUDA can use it to raise the precision if they need to, while SYCL applications would still get the expected performance and precision when using sqrt.

zjin-lcf (Contributor) commented Dec 13, 2021

Nowadays, most NVIDIA GPUs have compute capability ≥ 5.2, so the error is 1 ulp. However, SYCL allows more than 1 ulp, and that has caused result mismatches when porting a CUDA program. If the SYCL spec needs to be modified, please let the committee know.

I understand the SYCL point of view. Thanks.

npmiller added a commit to npmiller/llvm that referenced this pull request Dec 14, 2021
This patch adds `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds
a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc`
`-prec-sqrt` flag, except that it defaults to `false` for `clang++` and
to `true` for `nvcc`.

The reason for that is that the SYCL specification doesn't require a
correctly rounded `sqrt` so we likely want to keep the fast `sqrt` as a
default and use the flag when higher precision is required.

See additional discussion on intel#4041 and intel#5116
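
A minimal sketch of the mechanism this commit message describes, assuming the usual `__nvvm_reflect` idiom from libdevice and clang's NVPTX `__nvvm_sqrt_*` builtins; the exact code in the patch may differ, and `my_sqrt` is a made-up name:

```cpp
// Sketch only, not the actual libclc patch. In device code targeting NVPTX,
// the NVVMReflect pass replaces calls to __nvvm_reflect("__CUDA_PREC_SQRT")
// with a compile-time constant (driven here by the -Xclang -fcuda-prec-sqrt
// flag), so the branch below folds away and only one sqrt flavour remains.
extern "C" int __nvvm_reflect(const char *);

float my_sqrt(float x) {
  if (__nvvm_reflect("__CUDA_PREC_SQRT"))
    return __nvvm_sqrt_rn_f(x);     // correctly rounded sqrt.rn.f32
  return __nvvm_sqrt_approx_f(x);   // fast approximate sqrt.approx.f32
}
```
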
bader pushed a commit that referenced this pull request Dec 31, 2021
Labels: cuda (CUDA back-end), libclc (libclc project related issues)

Successfully merging this pull request may close these issues:

[CUDA] sycl::sqrt leads to IEEE754 incompatible results on NVidia cards
3 participants