Skip to content

[Clang][NVVM] Support -f[no-]cuda-prec-sqrt and propagate precision flag to NVVMReflect #134244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

Lai-YT
Copy link

@Lai-YT Lai-YT commented Apr 3, 2025

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

// PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1}
// PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}


This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨

Lai-YT added 3 commits April 3, 2025 20:35
NVCC provides the `-prec-sqrt` flag to control whether a precise or
approximate square root function is used. However, LLVM previously
always use the approximated version.

With this change, Clang introduces the `-f[no-]cuda-prec-sqrt` flag,
allowing users to specify precision behavior. The default is set to
false to maintain existing behavior.
A module flag is now set based on the `-f[no]-cuda-prec-sqrt` flag,
allowing the NVVMReflect pass to recognize and apply the specified
square root precision.
The `__nv_sqrtf` intrinsic in libdevice.bc, defined by NVIDIA, depends
not only on `__nvvm_reflect("__CUDA_FTZ")` but also on
`__nvvm_reflect("__CUDA_PREC_SQRT")`. However, the NVVMReflect pass
previously failed to recognize `__CUDA_PREC_SQRT`, causing its value to
default to `0`.

This change enables the NVVMReflect pass to correctly pick up the module
flag "nvvm-reflect-prec-sqrt", which Clang sets based on the
`-fcuda-prec-sqrt` flag, ensuring proper behavior.
Copy link

github-actions bot commented Apr 3, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. backend:NVPTX labels Apr 3, 2025
@llvmbot
Copy link
Member

llvmbot commented Apr 3, 2025

@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-clang-codegen

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

// PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1}
// PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}


This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨


Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

  • (modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
  • (modified) clang/include/clang/Driver/Options.td (+5)
  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
  • (added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
  • (added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
  • (modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
  • (added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

@llvmbot
Copy link
Member

llvmbot commented Apr 3, 2025

@llvm/pr-subscribers-clang-driver

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

// PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1}
// PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}


This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨


Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

  • (modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
  • (modified) clang/include/clang/Driver/Options.td (+5)
  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
  • (added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
  • (added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
  • (modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
  • (added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

@llvmbot
Copy link
Member

llvmbot commented Apr 3, 2025

@llvm/pr-subscribers-clang

Author: Lai-YT (Lai-YT)

Changes

This PR demonstrates a potential solution to the issue raised in #131749. The goal is to provide a more concrete standpoint, and I'm open to feedback. Major changes are welcome, and if this isn't the direction we want to take, I'm completely fine with not merging this PR. 😊

What's Changed?

The flag -fcuda-prec-sqrt has been added to the Clang driver and Clang frontend. This sets the CodeGenOpts.CudaPreciseSqrt option, which influences the value of the module flag "nvvm-reflect-prec-sqrt". This flag is then resolved by the NVVMReflect pass for __nvvm_reflect("__CUDA_PREC_SQRT").

I'm unsure about the ideal location for this flag, so I might have made some mistakes. Any reviews or suggestions are greatly appreciated. 🙏

Misc.

The module flag "nvvm-reflect-prec-sqrt" is added before "nvvm-reflect-ftz". This ordering is intentional because one of the tests for "nvvm-reflect-ftz" relies on it being the last module flag. You can see the related test here:

// PTXFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 1}
// PTXNOFTZ:!llvm.module.flags = !{{{.*}}[[MODFLAG:![0-9]+]]}
// PTXNOFTZ:[[MODFLAG]] = !{i32 4, !"nvvm-reflect-ftz", i32 0}


This is my first PR for LLVM, so if I come across as impolite in any way, please let me know, and I will address it right away. ✨


Full diff: https://github.com/llvm/llvm-project/pull/134244.diff

8 Files Affected:

  • (modified) clang/include/clang/Basic/CodeGenOptions.h (+4)
  • (modified) clang/include/clang/Driver/Options.td (+5)
  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (+4)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+5)
  • (added) clang/test/CodeGenCUDA/prec-sqrt.cu (+15)
  • (added) clang/test/Driver/cuda-prec-sqrt.cu (+6)
  • (modified) llvm/lib/Target/NVPTX/NVVMReflect.cpp (+6)
  • (added) llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll (+28)
diff --git a/clang/include/clang/Basic/CodeGenOptions.h b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], "fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], "fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

@efriedma-quic
Copy link
Collaborator

Why is this a module flag, instead of a function attribute?

@Lai-YT
Copy link
Author

Lai-YT commented Apr 4, 2025

Why is this a module flag, instead of a function attribute?

The other value handled by the NVVMReflect pass is "__CUDA_FTZ", which Clang currently sets via a module flag. I followed the same approach here for consistency.

Also, since the decision to use precise square root functions seems to apply at the module level, using a module flag feels reasonable.
@efriedma-quic, do you see any pros or cons to using function attributes instead in this case? 🙋

@efriedma-quic
Copy link
Collaborator

At first glance, it seems like a good idea to allow people to choose how their square roots are lowered on a per-function level: some code cares about precise square roots, some doesn't, and you should be able to make choices on a case-by-case basis.

But looking at the code structure, maybe cuda doesn't allow that?

@Lai-YT
Copy link
Author

Lai-YT commented Apr 5, 2025

At first glance, it seems like a good idea to allow people to choose how their square roots are lowered on a per-function level: some code cares about precise square roots, some doesn't, and you should be able to make choices on a case-by-case basis.

But looking at the code structure, maybe cuda doesn't allow that?

Yeah, AFAIK, CUDA—or more precisely, NVCC—only allows this to be specified at the module level. 😣

@Lai-YT
Copy link
Author

Lai-YT commented Apr 7, 2025

Hi @Artem-B, would you mind taking a look at this PR when you have time?
I'd really appreciate any feedback you might have. 🙏

@Artem-B Artem-B assigned Artem-B and unassigned Artem-B Apr 7, 2025
@Artem-B
Copy link
Member

Artem-B commented Apr 7, 2025

@AlexMaclean who authored #89417 and possibly other NVIDIA folks may have some thoughts on this.

In general, making it per-function attribute makes sense on LLVM level.

We will also need to reconcile it with the

static cl::opt<bool> UsePrecSqrtF32(
"nvptx-prec-sqrtf32", cl::Hidden,
cl::desc("NVPTX Specific: 0 use sqrt.approx, 1 use sqrt.rn."),

However, propagating it to NVVMReflect pass complicates things, as libdevice we're linking with is linked once per module.

I think we may need to disentangle libdevice from the IR generated by clang.

Currently, CUDA compilation. call to sqrtf() maps to __nv_sqrtf(__a) which is served by libdevice bitcode and which chooses precise or approximate version of LLVM intrinsic based on NVVMReflect.

What we need to do is change sqrtf() to use clang builtins() so we retain per-function control on lowering it.
Once we have that in place, we can independently control sqrtf precision via function and/or module attributes, and do it independently from the choice we make via NVVMReflect for __nv_sqrtf().

@AlexMaclean
Copy link
Member

It seems like we already have perhaps too many mechanisms to control how sqrt gets lowered. There is the __nv_sqrtf libdevice function which chooses between specific (1:1 to PTX) intrinsics based on NVVMReflect and then there is also llvm.sqrt and nvvm.sqrt.f which are lowered and optimized based on command-line options and function and instruction level flags, each in its own way.

I think for more fine grained responsiveness to instruction and function level options it makes sense to use the existing intrinsics. While, it is consistent with the existing design to treat NVVMReflect as operating globally across the entire module. I'm not sure it makes sense to introduce a new module flag and clang cl opt though...

I personally agree with @Artem-B that __nv_sqrtf+NVVMReflect may not be the way to go. Using one of the intrinsics seems like a better approach but I may be missing something.

@Lai-YT
Copy link
Author

Lai-YT commented Apr 9, 2025

Got it. I did come across nvptx-prec-sqrtf32, but since NVVMReflect makes the decision earlier by selecting the intrinsic in libdevice, I figured I couldn't rely on it.
I completely agree that unifying the behavior would be ideal moving forward.

Thanks so much for the insights, @Artem-B and @AlexMaclean! 🤗
It sounds like there's not much more I can do with this PR as it stands.

@Lai-YT Lai-YT closed this Apr 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:NVPTX clang:codegen IR generation bugs: mangling, exceptions, etc. clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants