[Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` #134244

Lai-YT · 2025-04-03T12:58:04Z

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu

Lines 47 to 51 in 52f3cad

    
           // PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1} 
        
           // PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}

This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨

NVCC provides the `-prec-sqrt` flag to control whether a precise or approximate square root function is used. However, LLVM previously always use the approximated version. With this change, Clang introduces the `-f[no-]cuda-prec-sqrt` flag, allowing users to specify precision behavior. The default is set to false to maintain existing behavior.

A module flag is now set based on the `-f[no]-cuda-prec-sqrt` flag, allowing the NVVMReflect pass to recognize and apply the specified square root precision.

The `__nv_sqrtf` intrinsic in libdevice.bc, defined by NVIDIA, depends not only on `__nvvm_reflect("__CUDA_FTZ")` but also on `__nvvm_reflect("__CUDA_PREC_SQRT")`. However, the NVVMReflect pass previously failed to recognize `__CUDA_PREC_SQRT`, causing its value to default to `0`. This change enables the NVVMReflect pass to correctly pick up the module flag "nvvm-reflect-prec-sqrt", which Clang sets based on the `-fcuda-prec-sqrt` flag, ensuring proper behavior.

github-actions · 2025-04-03T12:58:22Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-04-03T12:58:54Z

@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-clang-codegen

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu

Lines 47 to 51 in 52f3cad

    
           // PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1} 
        
           // PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}

This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨

Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

(modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
(modified) clang/include/clang/Driver/Options.td (+5)
(modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
(modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
(added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
(added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
(modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
(added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)

diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

llvmbot · 2025-04-03T12:58:55Z

@llvm/pr-subscribers-clang-driver

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu

Lines 47 to 51 in 52f3cad

    
           // PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1} 
        
           // PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}

This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨

Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

(modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
(modified) clang/include/clang/Driver/Options.td (+5)
(modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
(modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
(added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
(added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
(modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
(added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)

diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

llvmbot · 2025-04-03T12:58:55Z

@llvm/pr-subscribers-clang

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

llvm-project/clang/test/CodeGenCUDA/flush-denormals.cu

Lines 47 to 51 in 52f3cad

    
           // PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1} 
        
           // PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]} 
        
           // PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}

This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨

Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

(modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
(modified) clang/include/clang/Driver/Options.td (+5)
(modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
(modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
(added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
(added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
(modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
(added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)

diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

efriedma-quic · 2025-04-03T20:31:30Z

Why is this a module flag, instead of a function attribute?

Lai-YT · 2025-04-04T02:14:02Z

Why is this a module flag, instead of a function attribute?

The other value handled by the NVVMReflect pass is "__CUDA_FTZ", which Clang currently sets via a module flag. I followed the same approach here for consistency.

Also, since the decision to use precise square root functions seems to apply at the module level, using a module flag feels reasonable.
@efriedma-quic, do you see any pros or cons to using function attributes instead in this case? 🙋

efriedma-quic · 2025-04-04T19:03:09Z

At first glance, it seems like a good idea to allow people to choose how their square roots are lowered on a per-function level: some code cares about precise square roots, some doesn't, and you should be able to make choices on a case-by-case basis.

But looking at the code structure, maybe cuda doesn't allow that?

Lai-YT · 2025-04-05T02:46:32Z

At first glance, it seems like a good idea to allow people to choose how their square roots are lowered on a per-function level: some code cares about precise square roots, some doesn't, and you should be able to make choices on a case-by-case basis.

But looking at the code structure, maybe cuda doesn't allow that?

Yeah, AFAIK, CUDA—or more precisely, NVCC—only allows this to be specified at the module level. 😣

Lai-YT · 2025-04-07T15:14:06Z

Hi @Artem-B, would you mind taking a look at this PR when you have time?
I'd really appreciate any feedback you might have. 🙏

Artem-B · 2025-04-07T19:21:30Z

@AlexMaclean who authored #89417 and possibly other NVIDIA folks may have some thoughts on this.

In general, making it per-function attribute makes sense on LLVM level.

We will also need to reconcile it with the

llvm-project/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

Lines 94 to 96 in 10bef36

    
           static cl::opt<bool> UsePrecSqrtF32( 
        
               "nvptx-prec-sqrtf32", cl::Hidden, 
        
               cl::desc("NVPTX Specific: 0 use sqrt.approx, 1 use sqrt.rn."),

However, propagating it to NVVMReflect pass complicates things, as libdevice we're linking with is linked once per module.

I think we may need to disentangle libdevice from the IR generated by clang.

Currently, CUDA compilation. call to sqrtf() maps to __nv_sqrtf(__a) which is served by libdevice bitcode and which chooses precise or approximate version of LLVM intrinsic based on NVVMReflect.

What we need to do is change sqrtf() to use clang builtins() so we retain per-function control on lowering it.
Once we have that in place, we can independently control sqrtf precision via function and/or module attributes, and do it independently from the choice we make via NVVMReflect for __nv_sqrtf().

AlexMaclean · 2025-04-08T19:20:29Z

It seems like we already have perhaps too many mechanisms to control how sqrt gets lowered. There is the __nv_sqrtf libdevice function which chooses between specific (1:1 to PTX) intrinsics based on NVVMReflect and then there is also llvm.sqrt and nvvm.sqrt.f which are lowered and optimized based on command-line options and function and instruction level flags, each in its own way.

I think for more fine grained responsiveness to instruction and function level options it makes sense to use the existing intrinsics. While, it is consistent with the existing design to treat NVVMReflect as operating globally across the entire module. I'm not sure it makes sense to introduce a new module flag and clang cl opt though...

I personally agree with @Artem-B that __nv_sqrtf+NVVMReflect may not be the way to go. Using one of the intrinsics seems like a better approach but I may be missing something.

Lai-YT · 2025-04-09T02:52:41Z

Got it. I did come across nvptx-prec-sqrtf32, but since NVVMReflect makes the decision earlier by selecting the intrinsic in libdevice, I figured I couldn't rely on it.
I completely agree that unifying the behavior would be ideal moving forward.

Thanks so much for the insights, @Artem-B and @AlexMaclean! 🤗
It sounds like there's not much more I can do with this PR as it stands.

Lai-YT added 3 commits April 3, 2025 20:35

[Clang][CodeGen] Add module flag for square root precision

203f061

A module flag is now set based on the `-f[no]-cuda-prec-sqrt` flag, allowing the NVVMReflect pass to recognize and apply the specified square root precision.

Artem-B assigned Artem-B and unassigned Artem-B Apr 7, 2025

Lai-YT closed this Apr 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` #134244

[Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` #134244

Uh oh!

Lai-YT commented Apr 3, 2025

Uh oh!

github-actions bot commented Apr 3, 2025

Uh oh!

llvmbot commented Apr 3, 2025 •

edited

Loading

What's Changed?

Misc.

Uh oh!

llvmbot commented Apr 3, 2025

What's Changed?

Misc.

Uh oh!

llvmbot commented Apr 3, 2025

What's Changed?

Misc.

Uh oh!

efriedma-quic commented Apr 3, 2025

Uh oh!

Lai-YT commented Apr 4, 2025

Uh oh!

efriedma-quic commented Apr 4, 2025

Uh oh!

Lai-YT commented Apr 5, 2025

Uh oh!

Lai-YT commented Apr 7, 2025

Uh oh!

Artem-B commented Apr 7, 2025

Uh oh!

AlexMaclean commented Apr 8, 2025

Uh oh!

Lai-YT commented Apr 9, 2025

Uh oh!

Uh oh!

	// PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
	// PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1}

	// PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
	// PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}

[Clang][NVVM] Support -f[no-]cuda-prec-sqrt and propagate precision flag to NVVMReflect #134244

[Clang][NVVM] Support -f[no-]cuda-prec-sqrt and propagate precision flag to NVVMReflect #134244

Uh oh!

Conversation

Lai-YT commented Apr 3, 2025

What's Changed?

Misc.

Uh oh!

github-actions bot commented Apr 3, 2025

Uh oh!

llvmbot commented Apr 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's Changed?

Misc.

Uh oh!

llvmbot commented Apr 3, 2025

What's Changed?

Misc.

Uh oh!

llvmbot commented Apr 3, 2025

What's Changed?

Misc.

Uh oh!

efriedma-quic commented Apr 3, 2025

Uh oh!

Lai-YT commented Apr 4, 2025

Uh oh!

efriedma-quic commented Apr 4, 2025

Uh oh!

Lai-YT commented Apr 5, 2025

Uh oh!

Lai-YT commented Apr 7, 2025

Uh oh!

Artem-B commented Apr 7, 2025

Uh oh!

AlexMaclean commented Apr 8, 2025

Uh oh!

Lai-YT commented Apr 9, 2025

Uh oh!

Uh oh!

[Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` #134244

[Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` #134244

llvmbot commented Apr 3, 2025 •

edited

Loading