Skip to content

[SYCL][CUDA] Add -fcuda-prec-sqrt flag #5141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 31, 2021
Merged

Conversation

npmiller
Copy link
Contributor

This patch add __nvvm_reflect support for __CUDA_PREC_SQRT and adds
a -Xclang -fcuda-prec-sqrt flag which is equivalent to the nvcc
-prec-sqrt flag, except that it defaults to false for clang++ and
to true for nvcc.

The reason for that is that the SYCL specification doesn't require a
correctly rounded sqrt so we likely want to keep the fast sqrt as a
default and use the flag when higher precision is required.

See additional discussion on #4041 and #5116

I tested this with hellinger-sycl from HeCBench:

No extra flag, so default sycl::sqrt being used, with lower precision but better performance:

% nvprof ./main
==394249== NVPROF is profiling process 394249, command: ./main
Problem size: c(768,3072) = a(768,1536) * b(1536,3072)
Fail - The result is incorrect for element: [0, 31], expected: 0.250686, but found: 0.250686
Fail - The result is incorrect for element: [0, 37], expected: 0.241163, but found: 0.241163
Fail - The result is incorrect for element: [0, 39], expected: 0.240011, but found: 0.240011
Fail - The result is incorrect for element: [0, 51], expected: 0.245439, but found: 0.245439
Fail - The result is incorrect for element: [0, 54], expected: 0.244692, but found: 0.244692
FAIL
==394249== Profiling application: ./main
==394249== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.91%  5.95323s       100  59.532ms  59.036ms  66.540ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.04%  2.6810ms         1  2.6810ms  2.6810ms  2.6810ms  [CUDA memcpy DtoH]
                    0.04%  2.4583ms         2  1.2291ms  511.12us  1.9472ms  [CUDA memcpy HtoD]

Added -Xclang -fcuda-prec-sqrt to the CFLAGS, so correctly rounded sqrt but lower performance:

% nvprof ./main
==394357== NVPROF is profiling process 394357, command: ./main
Problem size: c(768,3072) = a(768,1536) * b(1536,3072)
PASS
==394357== Profiling application: ./main
==394357== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.95%  10.1790s       100  101.79ms  100.61ms  122.74ms  _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E9hellinger
                    0.03%  2.6858ms         1  2.6858ms  2.6858ms  2.6858ms  [CUDA memcpy DtoH]
                    0.02%  2.4209ms         2  1.2105ms  488.78us  1.9321ms  [CUDA memcpy HtoD]

This patch add `__nvvm_reflect` support for `__CUDA_PREC_SQRT` and adds
a `-Xclang -fcuda-prec-sqrt` flag which is equivalent to the `nvcc`
`-prec-sqrt` flag, except that it defaults to `false` for `clang++` and
to `true` for `nvcc`.

The reason for that is that the SYCL specification doesn't require a
correctly rounded `sqrt` so we likely want to keep the fast `sqrt` as a
default and use the flag when higher precision is required.

See additional discussion on intel#4041 and intel#5116
@bader bader requested a review from andykaylor December 14, 2021 18:29
@smanna12
Copy link
Contributor

Could you please add a test? Thanks

@andykaylor
Copy link
Contributor

The problem of wanting correctly rounded sqrt and divide is not limited to CUDA, so I'd like to see a more general option provided and not requiring a '-Xclang' prefix. I'd also like it to be more explicit that this refers to single precision operations. What do you think of something like this:

-fprec-sqrt-f32, -fno-prec-sqrt-f32

Require single precision floating point square root operations to be correctly rounded. If -fno-prec-sqrt-f32 is used, these operations may be performed with up to 2.5 ulp relative error.

This option (and similar options for other precisions) could even be useful in the host CPU compiler, as could equivalent options for division.

@npmiller
Copy link
Contributor Author

Could you please add a test? Thanks

I've expanded the llvm test checking __nvvm_reflect, added a clang test for the flag, and updated the NVPTX backend documentation to add the reflect flag.

@npmiller
Copy link
Contributor Author

@andykaylor do you mean a really generic option? I think maybe a SYCL option would be more appropriate here, maybe something like: -fsycl-prec-sqrt-f32, that could be used for other SYCL targets.

I'm not sure a generic option makes that much sense because for regular C/C++, sqrt would defer to the standard library. For SYCL we can tweak the standard library in libclc fairly easily, for regular C/C++ that seems like a much more consequential change.

So essentially, I think the -fcuda-prec-sqrt flag still makes sense to add as it just wires up an existing CUDA mechanism and could be used outside of SYCL. Then I think something like -fsycl-prec-sqrt-f32 that we can also use for other targets would also make sense, and for CUDA we could simply wire that to the -fcuda-prec-sqrt flag. But when it comes to an even more generic flag, I'm not sure it makes sense, or at least I think this is something that would need to be discussed and proposed upstream first.

@andykaylor
Copy link
Contributor

@andykaylor do you mean a really generic option? I think maybe a SYCL option would be more appropriate here, maybe something like: -fsycl-prec-sqrt-f32, that could be used for other SYCL targets.

I did intend the fully generic option, but I think you have a point about there being more to discuss to make that happen.

If you specify -fno-math-errno, a call to sqrt will be represented by llvm.sqrt in IR and the backend is free to lower that to something other than a library call. Even with math-errno, some backends will check for negative input and only call the library function for that case. (Example: https://godbolt.org/z/8GG9c1zP3) The ability to approximate sqrt allows some more optimized lowering sequences, particularly in the case of reciprocal square root.

The problem, as I saw when putting together the details for my response, is in setting the limit for how much error the flag should allow. The Intel Software Optimization Guide, for example, offers an instruction sequence that would approximate sqrt with 4 ulp error, but if we say the flag allows 4 ulp error, then it doesn't do what we need for SYCL since the SYCL spec only allows 2.5 ulp error even in the relaxed mode.

So, having thought this through, I am persuaded that your suggestion of adding a -fsycl-prec-sqrt-f32 option that gets wired to the -fcuda-prec-sqrt makes sense. We should add -fsycl-prec-div-f32 while we're at it.

Copy link
Contributor

@elizabethandrews elizabethandrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FE changes LGTM

Copy link
Contributor

@smanna12 smanna12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FE changes look good to me.

Copy link
Contributor

@AGindinson AGindinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for Options.td

@bader bader merged commit ebf9ea8 into intel:sycl Dec 31, 2021
MrSidims added a commit to MrSidims/llvm that referenced this pull request Feb 18, 2025
It follows the approach from intel#5141
and intel#5309 adding intermediate
fcuda-prec-div flag.

Signed-off-by: Sidorov, Dmitry <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants