-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[Clang][NVVM] Support -f[no-]cuda-prec-sqrt
and propagate precision flag to NVVMReflect
#134244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
NVCC provides the `-prec-sqrt` flag to control whether a precise or approximate square root function is used. However, LLVM previously always use the approximated version. With this change, Clang introduces the `-f[no-]cuda-prec-sqrt` flag, allowing users to specify precision behavior. The default is set to false to maintain existing behavior.
A module flag is now set based on the `-f[no]-cuda-prec-sqrt` flag, allowing the NVVMReflect pass to recognize and apply the specified square root precision.
The `__nv_sqrtf` intrinsic in libdevice.bc, defined by NVIDIA, depends not only on `__nvvm_reflect("__CUDA_FTZ")` but also on `__nvvm_reflect("__CUDA_PREC_SQRT")`. However, the NVVMReflect pass previously failed to recognize `__CUDA_PREC_SQRT`, causing its value to default to `0`. This change enables the NVVMReflect pass to correctly pick up the module flag "nvvm-reflect-prec-sqrt", which Clang sets based on the `-fcuda-prec-sqrt` flag, ensuring proper behavior.
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-backend-nvptx @llvm/pr-subscribers-clang-codegen Author: Lai-YT (Lai-YT) ChangesThis PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊 What's Changed?The flag I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏 Misc.The module flag llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu Lines 47 to 51 in 52f3cad
This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨ Full diff: https://github.com/llvm/llvm-project/pull/134244.diff 8 Files Affected:
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
/// CUDA runtime back-end for incorporating them into host-side object file.
std::string CudaGpuBinaryFileName;
+ /// Whether a precise or approximate square root should be used for CUDA
+ /// device code.
+ bool CudaPreciseSqrt;
+
/// List of filenames passed in using the -fembed-offload-object option. These
/// are offloading binaries containing device images and metadata.
std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
Alias<fgpu_flush_denormals_to_zero>;
def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+ CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+ PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+ NegFlag<SetFalse, [], [ClangOption], "Disable">,
+ BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
}
if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+ // Indicate whether __nvvm_reflect should be configured to use precise
+ // square root. (This corresponds to its "__CUDA_PREC_SQRT" property.)
+ getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+ CodeGenOpts.CudaPreciseSqrt);
// Indicate whether __nvvm_reflect should be configured to flush denormal
// floating point values to 0. (This corresponds to its "__CUDA_FTZ"
// property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
#include "llvm/ADT/StringExtras.h"
#include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
#include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/FormatAdapters.h"
#include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
if (CudaInstallation.version() >= CudaVersion::CUDA_90)
CC1Args.push_back("-fcuda-allow-variadic-functions");
+ if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+ options::OPT_fno_cuda_prec_sqrt, false))
+ CC1Args.append({"-fcuda-prec-sqrt"});
+
if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
options::OPT_fno_cuda_short_ptr, false))
CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
ReflectVal = Flag->getSExtValue();
+ } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+ // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+ // flag.
+ if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+ F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+ ReflectVal = Flag->getSExtValue();
} else if (ReflectArg == "__CUDA_ARCH") {
ReflectVal = SmVersion * 10;
}
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+ ; CHECK-NOT: call i32 @__nvvm_reflect
+ %reflect = call i32 @__nvvm_reflect(ptr @.str)
+ ; PREC_SQRT_0: ret i32 0
+ ; PREC_SQRT_1: ret i32 1
+ ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.
|
@llvm/pr-subscribers-clang-driver Author: Lai-YT (Lai-YT) ChangesThis PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊 What's Changed?The flag I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏 Misc.The module flag llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu Lines 47 to 51 in 52f3cad
This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨ Full diff: https://github.com/llvm/llvm-project/pull/134244.diff 8 Files Affected:
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
/// CUDA runtime back-end for incorporating them into host-side object file.
std::string CudaGpuBinaryFileName;
+ /// Whether a precise or approximate square root should be used for CUDA
+ /// device code.
+ bool CudaPreciseSqrt;
+
/// List of filenames passed in using the -fembed-offload-object option. These
/// are offloading binaries containing device images and metadata.
std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
Alias<fgpu_flush_denormals_to_zero>;
def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+ CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+ PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+ NegFlag<SetFalse, [], [ClangOption], "Disable">,
+ BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
}
if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+ // Indicate whether __nvvm_reflect should be configured to use precise
+ // square root. (This corresponds to its "__CUDA_PREC_SQRT" property.)
+ getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+ CodeGenOpts.CudaPreciseSqrt);
// Indicate whether __nvvm_reflect should be configured to flush denormal
// floating point values to 0. (This corresponds to its "__CUDA_FTZ"
// property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
#include "llvm/ADT/StringExtras.h"
#include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
#include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/FormatAdapters.h"
#include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
if (CudaInstallation.version() >= CudaVersion::CUDA_90)
CC1Args.push_back("-fcuda-allow-variadic-functions");
+ if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+ options::OPT_fno_cuda_prec_sqrt, false))
+ CC1Args.append({"-fcuda-prec-sqrt"});
+
if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
options::OPT_fno_cuda_short_ptr, false))
CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
ReflectVal = Flag->getSExtValue();
+ } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+ // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+ // flag.
+ if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+ F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+ ReflectVal = Flag->getSExtValue();
} else if (ReflectArg == "__CUDA_ARCH") {
ReflectVal = SmVersion * 10;
}
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+ ; CHECK-NOT: call i32 @__nvvm_reflect
+ %reflect = call i32 @__nvvm_reflect(ptr @.str)
+ ; PREC_SQRT_0: ret i32 0
+ ; PREC_SQRT_1: ret i32 1
+ ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.
|
@llvm/pr-subscribers-clang Author: Lai-YT (Lai-YT) ChangesThis PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊 What's Changed?The flag I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏 Misc.The module flag llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu Lines 47 to 51 in 52f3cad
This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨ Full diff: https://github.com/llvm/llvm-project/pull/134244.diff 8 Files Affected:
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
/// CUDA runtime back-end for incorporating them into host-side object file.
std::string CudaGpuBinaryFileName;
+ /// Whether a precise or approximate square root should be used for CUDA
+ /// device code.
+ bool CudaPreciseSqrt;
+
/// List of filenames passed in using the -fembed-offload-object option. These
/// are offloading binaries containing device images and metadata.
std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
Alias<fgpu_flush_denormals_to_zero>;
def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+ CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+ PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+ NegFlag<SetFalse, [], [ClangOption], "Disable">,
+ BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
}
if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+ // Indicate whether __nvvm_reflect should be configured to use precise
+ // square root. (This corresponds to its "__CUDA_PREC_SQRT" property.)
+ getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+ CodeGenOpts.CudaPreciseSqrt);
// Indicate whether __nvvm_reflect should be configured to flush denormal
// floating point values to 0. (This corresponds to its "__CUDA_FTZ"
// property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
#include "llvm/ADT/StringExtras.h"
#include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
#include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/FormatAdapters.h"
#include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
if (CudaInstallation.version() >= CudaVersion::CUDA_90)
CC1Args.push_back("-fcuda-allow-variadic-functions");
+ if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+ options::OPT_fno_cuda_prec_sqrt, false))
+ CC1Args.append({"-fcuda-prec-sqrt"});
+
if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
options::OPT_fno_cuda_short_ptr, false))
CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN: -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN: FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
ReflectVal = Flag->getSExtValue();
+ } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+ // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+ // flag.
+ if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+ F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+ ReflectVal = Flag->getSExtValue();
} else if (ReflectArg == "__CUDA_ARCH") {
ReflectVal = SmVersion * 10;
}
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN: | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+ ; CHECK-NOT: call i32 @__nvvm_reflect
+ %reflect = call i32 @__nvvm_reflect(ptr @.str)
+ ; PREC_SQRT_0: ret i32 0
+ ; PREC_SQRT_1: ret i32 1
+ ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.
|
Why is this a module flag, instead of a function attribute? |
The other value handled by the NVVMReflect pass is Also, since the decision to use precise square root functions seems to apply at the module level, using a module flag feels reasonable. |
At first glance, it seems like a good idea to allow people to choose how their square roots are lowered on a per-function level: some code cares about precise square roots, some doesn't, and you should be able to make choices on a case-by-case basis. But looking at the code structure, maybe cuda doesn't allow that? |
Yeah, AFAIK, CUDA—or more precisely, NVCC—only allows this to be specified at the module level. 😣 |
Hi @Artem-B, would you mind taking a look at this PR when you have time? |
@AlexMaclean who authored #89417 and possibly other NVIDIA folks may have some thoughts on this. In general, making it per-function attribute makes sense on LLVM level. We will also need to reconcile it with the llvm-project/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp Lines 94 to 96 in 10bef36
However, propagating it to NVVMReflect pass complicates things, as libdevice we're linking with is linked once per module. I think we may need to disentangle libdevice from the IR generated by clang. Currently, CUDA compilation. call to What we need to do is change |
It seems like we already have perhaps too many mechanisms to control how sqrt gets lowered. There is the I think for more fine grained responsiveness to instruction and function level options it makes sense to use the existing intrinsics. While, it is consistent with the existing design to treat NVVMReflect as operating globally across the entire module. I'm not sure it makes sense to introduce a new module flag and clang cl opt though... I personally agree with @Artem-B that |
Got it. I did come across Thanks so much for the insights, @Artem-B and @AlexMaclean! 🤗 |
This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊
What's Changed?
The flag
-fcuda-prec-sqrt
has been added to the Clang driver and Clang frontend. This sets theCodeGenOpts.CudaPreciseSqrt
option, which influences the value of the module flag"nvvm-reflect-prec-sqrt"
. This flag is then resolved by the NVVMReflect pass for__nvvm_reflect("__CUDA_PREC_SQRT")
.I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏
Misc.
The module flag
"nvvm-reflect-prec-sqrt"
is added before"nvvm-reflect-ftz"
. This ordering is intentional because one of the tests for"nvvm-reflect-ftz"
relies on it being the last module flag. You can see the related test here:llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu
Lines 47 to 51 in 52f3cad
This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨