Commit 65898a3

[SYCL][NVPTX] Enable approximate div/sqrt with -ffast-math (#15553)
The generation of approximate div/sqrt in the NVPTX backend is driven by the "unsafe-fp-math" function attribute. Presumably, when the optimization was first added, there was no way of getting at this information from ISel, or perhaps there was no suitable instruction-level representation to begin with. Even today, the `afn` fast-math flag is appropriate for relaxing sqrt to an approximate version, but while some targets apply that reasoning to fdiv, it is not clear that this is a valid reading of the language reference manual.

The problem with using the function attribute is that, when inlining, it must be set on *both* the caller and the callee, otherwise it is wiped. CUDA's devicelib bytecode library has hundreds of functions with unsafe-fp-math explicitly disabled. If we inline those functions into SYCL kernels, we disable the backend's ability to generate approximate functions, not just inside the devicelib function but across the entire kernel.

This might explain why some performance reports we've received suggest that inlining certain maths functions can make things worse even when the CUDA compiler does the same thing (e.g., #14358, though this needs to be verified).

Presumably for this reason, the NVPTX backend has two codegen options that override the function attribute and always generate approximate div/sqrt instructions. This patch therefore explicitly sets these options when compiling SYCL for NVPTX GPUs. It does not do so for regular C/C++ or CUDA code, to limit the wider impact on existing code.
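The inlining problem described above can be illustrated with a small standalone model. This is only a sketch: `FunctionModel`, `inlineInto`, `PrecOverride` and `usesApproxDiv` are hypothetical names invented for illustration and are not LLVM APIs, and the backend's real decision logic is more detailed than this.

```cpp
#include <string>

// Hypothetical stand-in for a function that carries a boolean
// "unsafe-fp-math" attribute. Not an LLVM type.
struct FunctionModel {
  std::string Name;
  bool UnsafeFPMath;
};

// Models the merge rule from the commit message: after inlining, the
// caller keeps the attribute only if BOTH caller and callee had it set.
inline void inlineInto(FunctionModel &Caller, const FunctionModel &Callee) {
  Caller.UnsafeFPMath = Caller.UnsafeFPMath && Callee.UnsafeFPMath;
}

// Hypothetical tri-state for a codegen option that may or may not have
// been passed explicitly on the command line.
enum class PrecOverride { Unset, Approx, Precise };

// Simplified model of the backend decision: an explicit option wins,
// otherwise the function attribute decides.
inline bool usesApproxDiv(const FunctionModel &F, PrecOverride Opt) {
  if (Opt == PrecOverride::Approx)
    return true;
  if (Opt == PrecOverride::Precise)
    return false;
  return F.UnsafeFPMath;
}
```

In this model, inlining a single callee that has the attribute disabled clears it for the whole caller, so the approximate path is lost unless the override is set explicitly.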
1 parent 7f59dea commit 65898a3

2 files changed: 27 additions, 0 deletions

clang/lib/Driver/ToolChains/Cuda.cpp

Lines changed: 9 additions & 0 deletions
@@ -946,6 +946,15 @@ void CudaToolChain::addClangTargetOptions(
 
     if (DriverArgs.hasArg(options::OPT_fsycl_fp32_prec_sqrt))
       CC1Args.push_back("-fcuda-prec-sqrt");
+
+    bool FastRelaxedMath = DriverArgs.hasFlag(
+        options::OPT_ffast_math, options::OPT_fno_fast_math, false);
+    bool UnsafeMathOpt =
+        DriverArgs.hasFlag(options::OPT_funsafe_math_optimizations,
+                           options::OPT_fno_unsafe_math_optimizations, false);
+    if (FastRelaxedMath || UnsafeMathOpt)
+      CC1Args.append({"-mllvm", "--nvptx-prec-divf32=0", "-mllvm",
+                      "--nvptx-prec-sqrtf32=0"});
   } else {
     CC1Args.append(
         {"-fcuda-is-device", "-mllvm", "-enable-memcpyopt-without-libcalls"});
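
The flag handling in this hunk can be modelled outside of clang as follows. This is a minimal sketch under stated assumptions: `lastFlagWins` and `nvptxFastMathArgs` are hypothetical helpers written for illustration, operating on plain strings rather than on the driver's real argument list, and `lastFlagWins` only approximates the "last occurrence of the positive or negative option decides" behaviour of `hasFlag`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true if the last occurrence of Pos/Neg in
// Args is Pos, false if it is Neg, and Default if neither appears.
inline bool lastFlagWins(const std::vector<std::string> &Args,
                         const std::string &Pos, const std::string &Neg,
                         bool Default) {
  bool Result = Default;
  for (const std::string &A : Args) {
    if (A == Pos)
      Result = true;
    else if (A == Neg)
      Result = false;
  }
  return Result;
}

// Hypothetical helper mirroring the added driver logic: emit the two
// NVPTX overrides when either fast-math style flag is in effect.
inline std::vector<std::string>
nvptxFastMathArgs(const std::vector<std::string> &Args) {
  bool FastRelaxedMath =
      lastFlagWins(Args, "-ffast-math", "-fno-fast-math", false);
  bool UnsafeMathOpt =
      lastFlagWins(Args, "-funsafe-math-optimizations",
                   "-fno-unsafe-math-optimizations", false);
  if (FastRelaxedMath || UnsafeMathOpt)
    return {"-mllvm", "--nvptx-prec-divf32=0", "-mllvm",
            "--nvptx-prec-sqrtf32=0"};
  return {};
}
```

Under this model, passing `-ffast-math` followed by `-fno-fast-math` produces no extra arguments, because the later negative flag takes precedence.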

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+// REQUIRES: nvptx-registered-target
+
+// RUN: %clang -### -nocudalib \
+// RUN: -fsycl -fsycl-targets=nvptx64-nvidia-cuda %s 2>&1 \
+// RUN: | FileCheck --check-prefix=CHECK-DEFAULT %s
+
+// RUN: %clang -### -nocudalib \
+// RUN: -fsycl -fsycl-targets=nvptx64-nvidia-cuda -ffast-math %s 2>&1 \
+// RUN: | FileCheck --check-prefix=CHECK-FAST %s
+
+// RUN: %clang -### -nocudalib \
+// RUN: -fsycl -fsycl-targets=nvptx64-nvidia-cuda -funsafe-math-optimizations %s 2>&1 \
+// RUN: | FileCheck --check-prefix=CHECK-FAST %s
+
+// CHECK-FAST: "-mllvm" "--nvptx-prec-divf32=0" "-mllvm" "--nvptx-prec-sqrtf32=0"
+
+// CHECK-DEFAULT-NOT: "nvptx-prec-divf32=0"
+// CHECK-DEFAULT-NOT: "nvptx-prec-sqrtf32=0"
