[LIBCLC][CUDA] Use generic sqrt implementation #5116
This fixes intel#4041. The generic libclc `sqrt` implementation falls back on the LLVM intrinsic, which generates the correctly rounded `sqrt.rn.f`, whereas `__nv_sqrt` generates the "native" version `sqrt.approx.f`, which doesn't have the same precision.
Considering that the accuracy of CUDA built-ins is equal to or better than that of OpenCL built-ins (see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/cuda-vs-opencl-math-builtin-precisions.md for more details), it seems okay.
OTOH, I assumed that SPIR-V built-ins are implemented as wrappers around the corresponding CUDA built-ins, so it's not clear how we managed to change the mapping for the SPIR-V built-in. Was it done for a performance improvement? Won't this change degrade `sqrt` performance?
Okay, so I had a further look into this. First of all, it does degrade performance significantly. Nothing in the git history suggests the original mapping was done for performance; however, looking further at the specification, the CUDA built-in and the OpenCL 1.2 built-in are given different precision guarantees. We should probably support compiler flags to raise precision, and the issue in the original ticket is likely more a matter of which precision is the default.
Closing this as it's the wrong approach.
That makes sense to me. Thanks for looking into this.
Does `-ffast-math` enable the fast `sqrt`?
The fast `sqrt` is the default one, so it will be used with or without `-ffast-math`.
The default should be the slow one when CUDA support is enabled, so users will not see result mismatches. Is that right?
I don't think so, because the fast one fulfills the precision requirements for SYCL. So I think the default should be the fastest version that fulfills the SYCL specification requirements. It may not be the best when porting from CUDA, but I think that's what makes the most sense from the SYCL point of view. What I'm looking into is adding a flag to raise the precision when needed.
Nowadays, most Nvidia GPUs are >= 5.2 and have ulp = 1. However, ulp > 1 in SYCL, and it has caused result mismatches when porting a CUDA program. If the SYCL spec needs to be modified, please let the committee know. I understand the SYCL point of view. Thanks.
This patch adds `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc` `-prec-sqrt` flag, except that it defaults to `false` for `clang++` and to `true` for `nvcc`. The reason for that is that the SYCL specification doesn't require a correctly rounded `sqrt`, so we likely want to keep the fast `sqrt` as a default and use the flag when higher precision is required. See additional discussion on intel#4041 and intel#5116.
I've run both the sample on #4041 and the `hellinger-sycl` benchmark, and both pass with this patch. We may have to review the other built-ins as well; a lot of them use the `__nv_` variants, which may also not have good enough precision.