[NVPTX] Allow compiling LLVM-IR without -march set #79873


Merged
merged 3 commits on Jan 31, 2024

Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Jan 29, 2024

Summary:
The NVPTX tools require an architecture to be used, but when emitting
generic LLVM-IR we should be able to leave it unspecified. The
`target-cpu` attribute is then not set on the functions, so it can be
changed when the IR is linked into other code. This allows the
standalone `--target=nvptx64-nvidia-cuda` toolchain to create LLVM-IR
similar to what CUDA's deviceRTL looks like from C/C++.
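As a rough sketch of the intended usage (commands based on the patch's driver tests below; the file name and exact output are illustrative, not verified here):

```shell
# Emit generic LLVM-IR: no "target-cpu" attribute is attached to functions.
clang --target=nvptx64-nvidia-cuda -flto -c kernel.c -o kernel.o

# Invoking the full pipeline without -march is now a hard error rather than
# a silent default:
clang --target=nvptx64-nvidia-cuda kernel.c
# error: Must pass in an explicit nvptx64 gpu architecture to 'ptxas'
```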

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Jan 29, 2024
@llvmbot
Member

llvmbot commented Jan 29, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The NVPTX tools require an architecture to be used, but when emitting
generic LLVM-IR we should be able to leave it unspecified. The
target-cpu attribute is then not set on the functions, so it can be
changed when the IR is linked into other code. This allows the
standalone --target=nvptx64-nvidia-cuda toolchain to create LLVM-IR
similar to what CUDA's deviceRTL looks like from C/C++.


Full diff: https://github.com/llvm/llvm-project/pull/79873.diff

6 Files Affected:

  • (modified) clang/include/clang/Basic/DiagnosticDriverKinds.td (+2)
  • (modified) clang/lib/Basic/Targets/NVPTX.cpp (+6-1)
  • (modified) clang/lib/Basic/Targets/NVPTX.h (+2-1)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+13-6)
  • (modified) clang/test/Driver/cuda-cross-compiling.c (+14-10)
  • (modified) clang/test/Preprocessor/predefined-arch-macros.c (+12)
diff --git a/clang/include/clang/Basic/DiagnosticDriverKinds.td b/clang/include/clang/Basic/DiagnosticDriverKinds.td
index 094fe1950941270..476528375fb8889 100644
--- a/clang/include/clang/Basic/DiagnosticDriverKinds.td
+++ b/clang/include/clang/Basic/DiagnosticDriverKinds.td
@@ -57,6 +57,8 @@ def warn_drv_avr_stdlib_not_linked: Warning<
   InGroup<AVRRtlibLinkingQuirks>;
 def err_drv_cuda_bad_gpu_arch : Error<"unsupported CUDA gpu architecture: %0">;
 def err_drv_offload_bad_gpu_arch : Error<"unsupported %0 gpu architecture: %1">;
+def err_drv_offload_missing_gpu_arch : Error<
+  "Must pass in an explicit %0 gpu architecture to '%1'">;
 def err_drv_no_cuda_installation : Error<
   "cannot find CUDA installation; provide its path via '--cuda-path', or pass "
   "'-nocudainc' to build without CUDA includes">;
diff --git a/clang/lib/Basic/Targets/NVPTX.cpp b/clang/lib/Basic/Targets/NVPTX.cpp
index 0b9d97f69d146af..7687e3faad770d0 100644
--- a/clang/lib/Basic/Targets/NVPTX.cpp
+++ b/clang/lib/Basic/Targets/NVPTX.cpp
@@ -59,7 +59,7 @@ NVPTXTargetInfo::NVPTXTargetInfo(const llvm::Triple &Triple,
   // Define available target features
   // These must be defined in sorted order!
   NoAsmVariants = true;
-  GPU = CudaArch::SM_20;
+  GPU = CudaArch::UNUSED;
 
   if (TargetPointerWidth == 32)
     resetDataLayout("e-p:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
@@ -169,6 +169,11 @@ void NVPTXTargetInfo::getTargetDefines(const LangOptions &Opts,
                                        MacroBuilder &Builder) const {
   Builder.defineMacro("__PTX__");
   Builder.defineMacro("__NVPTX__");
+
+  // Skip setting architecture dependent macros if undefined.
+  if (GPU == CudaArch::UNUSED && !HostTarget)
+    return;
+
   if (Opts.CUDAIsDevice || Opts.OpenMPIsTargetDevice || !HostTarget) {
     // Set __CUDA_ARCH__ for the GPU specified.
     std::string CUDAArchCode = [this] {
diff --git a/clang/lib/Basic/Targets/NVPTX.h b/clang/lib/Basic/Targets/NVPTX.h
index 20d76b702a9426e..f476d49047c0138 100644
--- a/clang/lib/Basic/Targets/NVPTX.h
+++ b/clang/lib/Basic/Targets/NVPTX.h
@@ -79,7 +79,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXTargetInfo : public TargetInfo {
   initFeatureMap(llvm::StringMap<bool> &Features, DiagnosticsEngine &Diags,
                  StringRef CPU,
                  const std::vector<std::string> &FeaturesVec) const override {
-    Features[CudaArchToString(GPU)] = true;
+    if (GPU != CudaArch::UNUSED)
+      Features[CudaArchToString(GPU)] = true;
     Features["ptx" + std::to_string(PTXVersion)] = true;
     return TargetInfo::initFeatureMap(Features, Diags, CPU, FeaturesVec);
   }
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 8a9d0caaccf30bb..ca54d2d55426b9f 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -389,7 +389,11 @@ void NVPTX::Assembler::ConstructJob(Compilation &C, const JobAction &JA,
     GPUArchName = JA.getOffloadingArch();
   } else {
     GPUArchName = Args.getLastArgValue(options::OPT_march_EQ);
-    assert(!GPUArchName.empty() && "Must have an architecture passed in.");
+    if (GPUArchName.empty()) {
+      C.getDriver().Diag(diag::err_drv_offload_missing_gpu_arch)
+          << getToolChain().getArchName() << getShortName();
+      return;
+    }
   }
 
   // Obtain architecture from the action.
@@ -593,7 +597,11 @@ void NVPTX::Linker::ConstructJob(Compilation &C, const JobAction &JA,
     CmdArgs.push_back("-v");
 
   StringRef GPUArch = Args.getLastArgValue(options::OPT_march_EQ);
-  assert(!GPUArch.empty() && "At least one GPU Arch required for nvlink.");
+  if (GPUArch.empty()) {
+    C.getDriver().Diag(diag::err_drv_offload_missing_gpu_arch)
+        << getToolChain().getArchName() << getShortName();
+    return;
+  }
 
   CmdArgs.push_back("-arch");
   CmdArgs.push_back(Args.MakeArgString(GPUArch));
@@ -726,9 +734,8 @@ NVPTXToolChain::NVPTXToolChain(const Driver &D, const llvm::Triple &Triple,
 llvm::opt::DerivedArgList *
 NVPTXToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
                               StringRef BoundArch,
-                              Action::OffloadKind DeviceOffloadKind) const {
-  DerivedArgList *DAL =
-      ToolChain::TranslateArgs(Args, BoundArch, DeviceOffloadKind);
+                              Action::OffloadKind OffloadKind) const {
+  DerivedArgList *DAL = ToolChain::TranslateArgs(Args, BoundArch, OffloadKind);
   if (!DAL)
     DAL = new DerivedArgList(Args.getBaseArgs());
 
@@ -738,7 +745,7 @@ NVPTXToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
     if (!llvm::is_contained(*DAL, A))
       DAL->append(A);
 
-  if (!DAL->hasArg(options::OPT_march_EQ)) {
+  if (!DAL->hasArg(options::OPT_march_EQ) && OffloadKind != Action::OFK_None) {
     DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
                       CudaArchToString(CudaArch::CudaDefault));
   } else if (DAL->getLastArgValue(options::OPT_march_EQ) == "native") {
diff --git a/clang/test/Driver/cuda-cross-compiling.c b/clang/test/Driver/cuda-cross-compiling.c
index 12d0af3b45f32f6..6c9e2cac736b792 100644
--- a/clang/test/Driver/cuda-cross-compiling.c
+++ b/clang/test/Driver/cuda-cross-compiling.c
@@ -59,16 +59,6 @@
 
 // LINK: nvlink{{.*}}"-o" "a.out" "-arch" "sm_61" {{.*}} "{{.*}}.cubin"
 
-//
-// Test the generated arguments default to a value with no architecture. 
-//
-// RUN: %clang --target=nvptx64-nvidia-cuda -### --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 \
-// RUN:   | FileCheck -check-prefix=DEFAULT %s
-
-//      DEFAULT: -cc1" "-triple" "nvptx64-nvidia-cuda" "-S" {{.*}} "-target-cpu" "sm_52" "-target-feature" "+ptx{{[0-9]+}}" {{.*}} "-o" "[[PTX:.+]].s"
-// DEFAULT-NEXT: ptxas{{.*}}"-m64" "-O0" "--gpu-name" "sm_52" "--output-file" "[[CUBIN:.+]].cubin" "[[PTX]].s" "-c"
-// DEFAULT-NEXT: nvlink{{.*}}"-o" "a.out" "-arch" "sm_52" {{.*}} "[[CUBIN]].cubin"
-
 //
 // Test to ensure that we enable handling global constructors in a freestanding
 // Nvidia compilation.
@@ -77,3 +67,17 @@
 // RUN:   | FileCheck -check-prefix=LOWERING %s
 
 // LOWERING: -cc1" "-triple" "nvptx64-nvidia-cuda" {{.*}} "-mllvm" "--nvptx-lower-global-ctor-dtor"
+
+//
+// Tests for handling a missing architecture.
+//
+// RUN: not %clang -target nvptx64-nvidia-cuda %s -### 2>&1 \
+// RUN:   | FileCheck -check-prefix=MISSING %s
+
+// MISSING: error: Must pass in an explicit nvptx64 gpu architecture to 'ptxas'
+// MISSING: error: Must pass in an explicit nvptx64 gpu architecture to 'nvlink'
+
+// RUN: %clang -target nvptx64-nvidia-cuda -flto -c %s -### 2>&1 \
+// RUN:   | FileCheck -check-prefix=GENERIC %s
+
+// GENERIC-NOT: -cc1" "-triple" "nvptx64-nvidia-cuda" {{.*}} "-target-cpu"
diff --git a/clang/test/Preprocessor/predefined-arch-macros.c b/clang/test/Preprocessor/predefined-arch-macros.c
index 27c7b4a271fee80..be9323adb4f640d 100644
--- a/clang/test/Preprocessor/predefined-arch-macros.c
+++ b/clang/test/Preprocessor/predefined-arch-macros.c
@@ -4292,6 +4292,18 @@
 // RUN:   | FileCheck -match-full-lines %s -check-prefix=CHECK_SYSTEMZ_ZVECTOR
 // CHECK_SYSTEMZ_ZVECTOR: #define __VEC__ 10304
 
+// Begin nvptx tests ----------------
+
+// RUN: %clang -march=sm_75 -E -dM %s -o - 2>&1 \
+// RUN:     -target nvptx64-unknown-unknown \
+// RUN:   | FileCheck -match-full-lines %s -check-prefixes=CHECK_NVPTX,CHECK_ARCH_SM_75
+// RUN: %clang -E -dM %s -o - 2>&1 \
+// RUN:     -target nvptx64-unknown-unknown \
+// RUN:   | FileCheck -match-full-lines %s -check-prefixes=CHECK_NVPTX,CHECK_ARCH_UNSET
+// CHECK_ARCH_SM_75: #define __CUDA_ARCH__ 750
+// CHECK_ARCH_UNSET-NOT: #define __CUDA_ARCH__
+// CHECK_NVPTX: #define __NVPTX__ 1
+
 // Begin amdgcn tests ----------------
 
 // RUN: %clang -march=amdgcn -E -dM %s -o - 2>&1 \

@Artem-B
Member

Artem-B commented Jan 29, 2024

Relying on something not being defined is probably not the best way to handle a 'generic' target. For starters, it makes it hard or impossible to recreate the same compilation state by undoing an already-specified option. It also breaks the established assumption that there is a default target CPU/GPU. If we do want to have a generic GPU target, then we should grow an explicit 'generic' GPU variant, IMO. It would be the functional opposite of 'native'.

@jhuber6
Contributor Author

jhuber6 commented Jan 29, 2024

Relying on something not being defined is probably not the best way to handle a 'generic' target. For starters, it makes it hard or impossible to recreate the same compilation state by undoing an already-specified option. It also breaks the established assumption that there is a default target CPU/GPU. If we do want to have a generic GPU target, then we should grow an explicit 'generic' GPU variant, IMO. It would be the functional opposite of 'native'.

AMDGPU uses a missing -mcpu on the OpenCL target to build their "generic" device libraries, and it's also missing from NVIDIA's libdevice.10.bc in the same way. I think there's some precedent from both vendors to treat missing attributes as a more generic target. The default target, to me, is more the domain of the driver, so if you're using CUDA and don't specify anything you get sm_52. This patch also turns a missing architecture into a hard error before it reaches tools like ptxas and nvlink.
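The driver-side policy described here can be sketched as a small standalone function. This is a simplified illustration of what the patch's `NVPTXToolChain::TranslateArgs` change does, not clang's actual code; the function and enum names are made up for the sketch:

```cpp
#include <optional>
#include <string>

// Simplified model of the architecture-selection policy in this patch:
//  - an explicit -march always wins;
//  - CUDA/OpenMP offloading without -march falls back to the driver
//    default (sm_52, i.e. CudaArch::CudaDefault);
//  - the standalone nvptx64 target without -march stays unset, so no
//    "target-cpu" attribute is emitted ("generic" IR).
enum class OffloadKind { None, Cuda, OpenMP };

std::optional<std::string> selectGPUArch(std::optional<std::string> March,
                                         OffloadKind Kind) {
  if (March)
    return March;                // explicit -march=sm_XX
  if (Kind != OffloadKind::None)
    return std::string("sm_52"); // offloading keeps the old default
  return std::nullopt;           // standalone: leave the CPU unset
}
```

With this shape, an unset architecture on the standalone target yields no value at all, matching the new behavior where ptxas and nvlink later refuse to run without an explicit architecture.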

@Artem-B
Member

Artem-B commented Jan 29, 2024

I think there's some precedent from both vendors to treat missing attributes as a more generic target.

It sounds more like a bug than a feature to me.

The major difference between "you get sm_xx by default" and this "you get generic by default" is that with a specific sm_XX, I can override it both ways -- I can enable/disable it if I need to, regardless of how it was specified before my overriding options.

With the magic unnameable 'generic' target, I can only disable it by specifying it, but there's no way to enable it once a preceding option names some specific architecture.

It makes little difference when you control the complete build, but that is not the case for all builds. E.g. TensorFlow builds with Bazel, and the end user does not have access to whatever compiler flags the global build rules may set. So if you want to build for a generic GPU target, you will have to jump through way more hoops than is reasonable, as opposed to specifying a few overriding options you're interested in.

I'm fine with defaulting to such generic target, but I do believe we need to handle it the same way as specific targets.

@jhuber6
Contributor Author

jhuber6 commented Jan 29, 2024

I think there's some precedent from both vendors to treat missing attributes as a more generic target.

It sounds more like a bug than a feature to me.

The major difference between "you get sm_xx by default" and this "you get generic by default" is that with a specific sm_XX, I can override it both ways -- I can enable/disable it if I need to, regardless of how it was specified before my overriding options.

With the magic unnameable 'generic' target, I can only disable it by specifying it, but there's no way to enable it once a preceding option names some specific architecture.

It makes little difference when you control the complete build, but that is not the case for all builds. E.g. TensorFlow builds with Bazel, and the end user does not have access to whatever compiler flags the global build rules may set. So if you want to build for a generic GPU target, you will have to jump through way more hoops than is reasonable, as opposed to specifying a few overriding options you're interested in.

I'm fine with defaulting to such generic target, but I do believe we need to handle it the same way as specific targets.

We could make the driver require some special argument, but the question is how that should be reflected in the LLVM-IR. Right now if you specify target-cpu you get target-cpu attributes, which is what we don't want. The standard behavior on all other targets is to omit target-dependent attributes when -march or -mcpu isn't set, which is exactly what we want here. At the driver level we could hide this behind -march=generic if that's important, but inside clang I think the attribute should simply be left off when the architecture is missing, since that's the standard handling.
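The feature-map side of this can be sketched similarly. This is a simplified stand-in for the `NVPTXTargetInfo::initFeatureMap` change in the diff above; the real code uses `llvm::StringMap` and `CudaArch` rather than the standard-library types used here:

```cpp
#include <map>
#include <optional>
#include <string>

// When no GPU architecture was specified, only the PTX version feature is
// set; the sm_XX feature is skipped, mirroring the `GPU != CudaArch::UNUSED`
// check in the patch.
std::map<std::string, bool>
buildNVPTXFeatures(const std::optional<std::string> &GPU, int PTXVersion) {
  std::map<std::string, bool> Features;
  if (GPU)
    Features[*GPU] = true; // e.g. "sm_75"; skipped for the "generic" case
  Features["ptx" + std::to_string(PTXVersion)] = true; // always present
  return Features;
}
```

In the generic case the resulting feature set carries only the PTX version, so nothing architecture-specific leaks into the emitted IR.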

@Artem-B
Member

Artem-B commented Jan 29, 2024

Right now if you specify target-cpu you get target-cpu attributes, which is what we don't want.

I'm fine handling 'generic' in a special way under the hood and not specifying target-CPU.

My concern is about user-facing interface. Command line options must be overridable.
For the CPU I would be able to specify the variant that matches the default.
For GPU I'll have no way to explicitly pick 'generic' as the target. I think this is important.

@jhuber6
Contributor Author

jhuber6 commented Jan 30, 2024

Right now if you specify target-cpu you get target-cpu attributes, which is what we don't want.

I'm fine handling 'generic' in a special way under the hood and not specifying target-CPU.

My concern is about user-facing interface. Command line options must be overridable. For the CPU I would be able to specify the variant that matches the default. For GPU I'll have no way to explicitly pick 'generic' as the target. I think this is important.

I can add generic for overriding purposes, but I'd prefer that leaving it off implies generic as well, at least for this standalone target version. This diverges a little from standard CPU targets, because GPUs have no backward compatibility, so making something "generic" requires a bit more work.

@Artem-B
Member

Artem-B commented Jan 30, 2024

Considering that it's for the stand-alone compilation only, I'm not going to block this patch.
That said, please add a TODO somewhere to address the issue with explicitly targeting the generic variant.

@jhuber6 jhuber6 merged commit 7155c1e into llvm:main Jan 31, 2024
jhuber6 added a commit that referenced this pull request Mar 15, 2024
…85222)

This PR adds `-march=generic` support for the NVPTX backend. This
fulfills a TODO introduced in #79873.

With this PR, users can explicitly request the "default" CUDA
architecture, which makes sure that no specific architecture is
specified.

This PR does not address any compatibility issues between different CUDA
versions.

---------

Co-authored-by: Joseph Huber <[email protected]>