-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch add `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc` `-prec-sqrt` flag, except that it defaults to `false` for `clang++` and to `true` for `nvcc`. The reason for that is that the SYCL specification doesn't require a correctly rounded `sqrt` so we likely want to keep the fast `sqrt` as a default and use the flag when higher precision is required. See additional discussion on intel#4041 and intel#5116
Could you please add a test? Thanks |
The problem of wanting correctly rounded sqrt and divide is not limited to CUDA, so I'd like to see a more general option provided and not requiring a '-Xclang' prefix. I'd also like it to be more explicit that this refers to single precision operations. What do you think of something like this: -fprec-sqrt-f32, -fno-prec-sqrt-f32 Require single precision floating point square root operations to be correctly rounded. If -fno-prec-sqrt-f32 is used, these operations may be performed with up to 2.5 ulp relative error. This option (and similar options for other precisions) could even be useful in the host CPU compiler, as could equivalent options for division. |
I've expanded the llvm test checking |
@andykaylor do you mean a really generic option? I think maybe a SYCL option would be more appropriate here, maybe something like: I'm not sure a generic option makes that much sense because for regular C/C++, So essentially, I think the |
I did intend the fully generic option, but I think you have a point about there being more to discuss to make that happen. If you specify -fno-math-errno, a call to sqrt will be represented by llvm.sqrt in IR and the backend is free to lower that to something other than a library call. Even with math-errno, some backends will check for negative input and only call the library function for that case. (Example: https://godbolt.org/z/8GG9c1zP3) The ability to approximate sqrt allows some more optimized lowering sequences, particularly in the case of reciprocal square root. The problem, as I saw when putting together the details for my response, is in setting the limit for how much error the flag should allow. The Intel Software Optimization Guide, for example, offers an instruction sequence that would approximate sqrt with 4 ulp error, but if we say the flag allows 4 ulp error, then it doesn't do what we need for SYCL since the SYCL spec only allows 2.5 ulp error even in the relaxed mode. So, having thought this through, I am persuaded that your suggestion of adding a -fsycl-prec-sqrt-f32 option that gets wired to the -fcuda-prec-sqrt makes sense. We should add -fsycl-prec-div-f32 while we're at it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FE changes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FE changes look good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for Options.td
It follows the approach from intel#5141 and intel#5309 adding intermediate fcuda-prec-div flag. Signed-off-by: Sidorov, Dmitry <[email protected]>
This patch add
__nvvm_reflect
support for__CUDA_PREC_SQRT
and addsa
-Xclang -fcuda-prec-sqrt
flag which is equivalent to thenvcc
-prec-sqrt
flag, except that it defaults tofalse
forclang++
andto
true
fornvcc
.The reason for that is that the SYCL specification doesn't require a
correctly rounded
sqrt
so we likely want to keep the fastsqrt
as adefault and use the flag when higher precision is required.
See additional discussion on #4041 and #5116
I tested this with
hellinger-sycl
from HeCBench:No extra flag, so default
sycl::sqrt
being used, with lower precision but better performance:Added
-Xclang -fcuda-prec-sqrt
to theCFLAGS
, so correctly roundedsqrt
but lower performance: