[NVPTX] Add support for maxclusterrank in launch_bounds #66496

jchlanda · 2023-09-15T11:24:02Z

Since SM_90, CUDA supports an additional argument to the launch_bounds attribute, maxBlocksPerCluster, which expresses the maximum number of CTAs that can be part of a cluster. See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-dimension-directives-maxclusterrank and https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds for details.
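As a rough illustration of the source-level feature, the sketch below declares a kernel with all three launch bounds. The fallback macros follow the style of the test header Inputs/cuda.h so that the snippet also parses as plain host C++; the kernel name `scaleKernel` and the constants are hypothetical and chosen only for this example.

```cpp
// Fallback definitions so the snippet also compiles as plain host C++.
// Under a CUDA compiler the toolkit headers normally provide these macros.
#ifndef __CUDACC__
#define __global__
#define __launch_bounds__(...)
#endif

// Hypothetical limits, for illustration only.
constexpr int kMaxThreadsPerBlock = 256;  // becomes "maxntidx" metadata
constexpr int kMinBlocksPerSM = 2;        // becomes "minctasm" metadata
constexpr int kMaxBlocksPerCluster = 4;   // becomes "maxclusterrank" (sm_90+)

__global__ void
__launch_bounds__(kMaxThreadsPerBlock, kMinBlocksPerSM, kMaxBlocksPerCluster)
scaleKernel(float *data, float factor) {
  // Kernel body omitted; only the attribute matters here.
  (void)data;
  (void)factor;
}
```

With this patch, the third argument is only honoured when compiling for sm_90 or newer; on older targets clang warns and drops it.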

llvmbot · 2023-09-15T11:25:12Z

@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Changes

Since SM_90, CUDA supports an additional argument to the launch_bounds attribute, maxBlocksPerCluster, which expresses the maximum number of CTAs that can be part of a cluster. See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-dimension-directives-maxclusterrank and https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds for details.

Patch is 24.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66496.diff

13 Files Affected:

(modified) clang/include/clang/Basic/Attr.td (+2-1)
(modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+4)
(modified) clang/include/clang/Sema/Sema.h (+3-2)
(modified) clang/lib/CodeGen/Targets/NVPTX.cpp (+10-2)
(modified) clang/lib/Parse/ParseOpenMP.cpp (+2-1)
(modified) clang/lib/Sema/SemaDeclAttr.cpp (+39-7)
(modified) clang/lib/Sema/SemaTemplateInstantiateDecl.cpp (+9-1)
(modified) clang/test/CodeGenCUDA/launch-bounds.cu (+69)
(modified) clang/test/SemaCUDA/launch_bounds.cu (+3-1)
(added) clang/test/SemaCUDA/launch_bounds_sm_90.cu (+45)
(modified) llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp (+36-43)
(modified) llvm/lib/Target/NVPTX/NVPTXUtilities.cpp (+4)
(modified) llvm/lib/Target/NVPTX/NVPTXUtilities.h (+1)

diff --git a/clang/include/clang/Basic/Attr.td b/clang/include/clang/Basic/Attr.td
index c95db7e8049d47a..3c51261bd3eb081 100644
--- a/clang/include/clang/Basic/Attr.td
+++ b/clang/include/clang/Basic/Attr.td
@@ -1267,7 +1267,8 @@ def CUDAInvalidTarget : InheritableAttr {
 
 def CUDALaunchBounds : InheritableAttr {
   let Spellings = [GNU<"launch_bounds">, Declspec<"__launch_bounds__">];
-  let Args = [ExprArgument<"MaxThreads">, ExprArgument<"MinBlocks", 1>];
+  let Args = [ExprArgument<"MaxThreads">, ExprArgument<"MinBlocks", 1>,
+              ExprArgument<"MaxBlocks", 1>];
   let LangOpts = [CUDA];
   let Subjects = SubjectList<[ObjCMethod, FunctionLike]>;
   // An AST node is created for this attribute, but is not used by other parts
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 0ac4df8edb242f6..088e3a45c7babba 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -11836,6 +11836,10 @@ def err_sycl_special_type_num_init_method : Error<
   "types with 'sycl_special_class' attribute must have one and only one '__init' "
   "method defined">;
 
+def warn_cuda_maxclusterrank_sm_90 : Warning<
+  "maxclusterrank requires sm_90 or higher, CUDA arch provided: %0, ignoring "
+  "%1 attribute">, InGroup<IgnoredAttributes>;
+
 def err_bit_int_bad_size : Error<"%select{signed|unsigned}0 _BitInt must "
                                  "have a bit size of at least %select{2|1}0">;
 def err_bit_int_max_size : Error<"%select{signed|unsigned}0 _BitInt of bit "
diff --git a/clang/include/clang/Sema/Sema.h b/clang/include/clang/Sema/Sema.h
index 47379e00a7445e3..dca7b66da3796d9 100644
--- a/clang/include/clang/Sema/Sema.h
+++ b/clang/include/clang/Sema/Sema.h
@@ -11051,12 +11051,13 @@ class Sema final {
   /// Create an CUDALaunchBoundsAttr attribute.
   CUDALaunchBoundsAttr *CreateLaunchBoundsAttr(const AttributeCommonInfo &CI,
                                                Expr *MaxThreads,
-                                               Expr *MinBlocks);
+                                               Expr *MinBlocks,
+                                               Expr *MaxBlocks);
 
   /// AddLaunchBoundsAttr - Adds a launch_bounds attribute to a particular
   /// declaration.
   void AddLaunchBoundsAttr(Decl *D, const AttributeCommonInfo &CI,
-                           Expr *MaxThreads, Expr *MinBlocks);
+                           Expr *MaxThreads, Expr *MinBlocks, Expr *MaxBlocks);
 
   /// AddModeAttr - Adds a mode attribute to a particular declaration.
   void AddModeAttr(Decl *D, const AttributeCommonInfo &CI, IdentifierInfo *Name,
diff --git a/clang/lib/CodeGen/Targets/NVPTX.cpp b/clang/lib/CodeGen/Targets/NVPTX.cpp
index 0d4bbd795648008..64d019a10514d60 100644
--- a/clang/lib/CodeGen/Targets/NVPTX.cpp
+++ b/clang/lib/CodeGen/Targets/NVPTX.cpp
@@ -296,8 +296,8 @@ void CodeGenModule::handleCUDALaunchBoundsAttr(
     NVPTXTargetCodeGenInfo::addNVVMMetadata(F, "maxntidx",
                                             MaxThreads.getExtValue());
 
-  // min blocks is an optional argument for CUDALaunchBoundsAttr. If it was
-  // not specified in __launch_bounds__ or if the user specified a 0 value,
+  // min and max blocks is an optional argument for CUDALaunchBoundsAttr. If it
+  // was not specified in __launch_bounds__ or if the user specified a 0 value,
   // we don't have to add a PTX directive.
   if (Attr->getMinBlocks()) {
     llvm::APSInt MinBlocks(32);
@@ -307,6 +307,14 @@ void CodeGenModule::handleCUDALaunchBoundsAttr(
       NVPTXTargetCodeGenInfo::addNVVMMetadata(F, "minctasm",
                                               MinBlocks.getExtValue());
   }
+  if (Attr->getMaxBlocks()) {
+    llvm::APSInt MaxBlocks(32);
+    MaxBlocks = Attr->getMaxBlocks()->EvaluateKnownConstInt(getContext());
+    if (MaxBlocks > 0)
+      // Create !{<func-ref>, metadata !"maxclusterrank", i32 <val>} node
+      NVPTXTargetCodeGenInfo::addNVVMMetadata(F, "maxclusterrank",
+                                              MaxBlocks.getExtValue());
+  }
 }
 
 std::unique_ptr<TargetCodeGenInfo>
diff --git a/clang/lib/Parse/ParseOpenMP.cpp b/clang/lib/Parse/ParseOpenMP.cpp
index 605b97617432ed3..8a8a126bf7244d4 100644
--- a/clang/lib/Parse/ParseOpenMP.cpp
+++ b/clang/lib/Parse/ParseOpenMP.cpp
@@ -3739,7 +3739,8 @@ OMPClause *Parser::ParseOpenMPOMPXAttributesClause(bool ParseOnly) {
         continue;
       if (auto *A = Actions.CreateLaunchBoundsAttr(
               PA, PA.getArgAsExpr(0),
-              PA.getNumArgs() > 1 ? PA.getArgAsExpr(1) : nullptr))
+              PA.getNumArgs() > 1 ? PA.getArgAsExpr(1) : nullptr,
+              PA.getNumArgs() > 2 ? PA.getArgAsExpr(2) : nullptr))
         Attrs.push_back(A);
       continue;
     default:
diff --git a/clang/lib/Sema/SemaDeclAttr.cpp b/clang/lib/Sema/SemaDeclAttr.cpp
index cc98713241395ec..e62a0d4fc29f9cd 100644
--- a/clang/lib/Sema/SemaDeclAttr.cpp
+++ b/clang/lib/Sema/SemaDeclAttr.cpp
@@ -5607,6 +5607,21 @@ bool Sema::CheckRegparmAttr(const ParsedAttr &AL, unsigned &numParams) {
   return false;
 }
 
+// Helper to get CudaArch.
+static CudaArch getCudaArch(const TargetInfo &TI) {
+  if (!TI.hasFeature("ptx")) {
+    return CudaArch::UNKNOWN;
+  }
+  for (const auto &Feature : TI.getTargetOpts().FeatureMap) {
+    if (Feature.getValue()) {
+      CudaArch Arch = StringToCudaArch(Feature.getKey());
+      if (Arch != CudaArch::UNKNOWN)
+        return Arch;
+    }
+  }
+  return CudaArch::UNKNOWN;
+}
+
 // Checks whether an argument of launch_bounds attribute is
 // acceptable, performs implicit conversion to Rvalue, and returns
 // non-nullptr Expr result on success. Otherwise, it returns nullptr
@@ -5650,8 +5665,8 @@ static Expr *makeLaunchBoundsArgExpr(Sema &S, Expr *E,
 
 CUDALaunchBoundsAttr *
 Sema::CreateLaunchBoundsAttr(const AttributeCommonInfo &CI, Expr *MaxThreads,
-                             Expr *MinBlocks) {
-  CUDALaunchBoundsAttr TmpAttr(Context, CI, MaxThreads, MinBlocks);
+                             Expr *MinBlocks, Expr *MaxBlocks) {
+  CUDALaunchBoundsAttr TmpAttr(Context, CI, MaxThreads, MinBlocks, MaxBlocks);
   MaxThreads = makeLaunchBoundsArgExpr(*this, MaxThreads, TmpAttr, 0);
   if (MaxThreads == nullptr)
     return nullptr;
@@ -5662,22 +5677,39 @@ Sema::CreateLaunchBoundsAttr(const AttributeCommonInfo &CI, Expr *MaxThreads,
       return nullptr;
   }
 
+  if (MaxBlocks) {
+    // Feature '.maxclusterrank' requires .target sm_90 or higher.
+    auto SM = getCudaArch(Context.getTargetInfo());
+    if (SM == CudaArch::UNKNOWN || SM < CudaArch::SM_90) {
+      Diag(MaxBlocks->getBeginLoc(), diag::warn_cuda_maxclusterrank_sm_90)
+          << CudaArchToString(SM) << CI << MaxBlocks->getSourceRange();
+      // Ignore it by setting MaxBlocks to null;
+      MaxBlocks = nullptr;
+    } else {
+      MaxBlocks = makeLaunchBoundsArgExpr(*this, MaxBlocks, TmpAttr, 2);
+      if (MaxBlocks == nullptr)
+        return nullptr;
+    }
+  }
+
   return ::new (Context)
-      CUDALaunchBoundsAttr(Context, CI, MaxThreads, MinBlocks);
+      CUDALaunchBoundsAttr(Context, CI, MaxThreads, MinBlocks, MaxBlocks);
 }
 
 void Sema::AddLaunchBoundsAttr(Decl *D, const AttributeCommonInfo &CI,
-                               Expr *MaxThreads, Expr *MinBlocks) {
-  if (auto *Attr = CreateLaunchBoundsAttr(CI, MaxThreads, MinBlocks))
+                               Expr *MaxThreads, Expr *MinBlocks,
+                               Expr *MaxBlocks) {
+  if (auto *Attr = CreateLaunchBoundsAttr(CI, MaxThreads, MinBlocks, MaxBlocks))
     D->addAttr(Attr);
 }
 
 static void handleLaunchBoundsAttr(Sema &S, Decl *D, const ParsedAttr &AL) {
-  if (!AL.checkAtLeastNumArgs(S, 1) || !AL.checkAtMostNumArgs(S, 2))
+  if (!AL.checkAtLeastNumArgs(S, 1) || !AL.checkAtMostNumArgs(S, 3))
     return;
 
   S.AddLaunchBoundsAttr(D, AL, AL.getArgAsExpr(0),
-                        AL.getNumArgs() > 1 ? AL.getArgAsExpr(1) : nullptr);
+                        AL.getNumArgs() > 1 ? AL.getArgAsExpr(1) : nullptr,
+                        AL.getNumArgs() > 2 ? AL.getArgAsExpr(2) : nullptr);
 }
 
 static void handleArgumentWithTypeTagAttr(Sema &S, Decl *D,
diff --git a/clang/lib/Sema/SemaTemplateInstantiateDecl.cpp b/clang/lib/Sema/SemaTemplateInstantiateDecl.cpp
index 37a7d6204413a38..3f7268f5450a6fa 100644
--- a/clang/lib/Sema/SemaTemplateInstantiateDecl.cpp
+++ b/clang/lib/Sema/SemaTemplateInstantiateDecl.cpp
@@ -302,7 +302,15 @@ static void instantiateDependentCUDALaunchBoundsAttr(
     MinBlocks = Result.getAs<Expr>();
   }
 
-  S.AddLaunchBoundsAttr(New, Attr, MaxThreads, MinBlocks);
+  Expr *MaxBlocks = nullptr;
+  if (Attr.getMaxBlocks()) {
+    Result = S.SubstExpr(Attr.getMaxBlocks(), TemplateArgs);
+    if (Result.isInvalid())
+      return;
+    MaxBlocks = Result.getAs<Expr>();
+  }
+
+  S.AddLaunchBoundsAttr(New, Attr, MaxThreads, MinBlocks, MaxBlocks);
 }
 
 static void
diff --git a/clang/test/CodeGenCUDA/launch-bounds.cu b/clang/test/CodeGenCUDA/launch-bounds.cu
index 58bcc410201f35f..31ca9216b413e92 100644
--- a/clang/test/CodeGenCUDA/launch-bounds.cu
+++ b/clang/test/CodeGenCUDA/launch-bounds.cu
@@ -1,9 +1,13 @@
 // RUN: %clang_cc1 %s -triple nvptx-unknown-unknown -fcuda-is-device -emit-llvm -o - | FileCheck %s
+// RUN: %clang_cc1 %s -triple nvptx-unknown-unknown -target-cpu sm_90 -DUSE_MAX_BLOCKS -fcuda-is-device -emit-llvm -o - | FileCheck -check-prefix=CHECK_MAX_BLOCKS %s
 
 #include "Inputs/cuda.h"
 
 #define MAX_THREADS_PER_BLOCK 256
 #define MIN_BLOCKS_PER_MP     2
+#ifdef USE_MAX_BLOCKS
+#define MAX_BLOCKS_PER_MP     4
+#endif
 
 // Test both max threads per block and Min cta per sm.
 extern "C" {
@@ -17,6 +21,21 @@ Kernel1()
 // CHECK: !{{[0-9]+}} = !{ptr @Kernel1, !"maxntidx", i32 256}
 // CHECK: !{{[0-9]+}} = !{ptr @Kernel1, !"minctasm", i32 2}
 
+#ifdef USE_MAX_BLOCKS
+// Test max threads per block and min/max cta per sm.
+extern "C" {
+__global__ void
+__launch_bounds__( MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP, MAX_BLOCKS_PER_MP )
+Kernel1_sm_90()
+{
+}
+}
+
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @Kernel1_sm_90, !"maxntidx", i32 256}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @Kernel1_sm_90, !"minctasm", i32 2}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @Kernel1_sm_90, !"maxclusterrank", i32 4}
+#endif // USE_MAX_BLOCKS
+
 // Test only max threads per block. Min cta per sm defaults to 0, and
 // CodeGen doesn't output a zero value for minctasm.
 extern "C" {
@@ -50,6 +69,20 @@ template __global__ void Kernel4<MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP>();
 // CHECK: !{{[0-9]+}} = !{ptr @{{.*}}Kernel4{{.*}}, !"maxntidx", i32 256}
 // CHECK: !{{[0-9]+}} = !{ptr @{{.*}}Kernel4{{.*}}, !"minctasm", i32 2}
 
+#ifdef USE_MAX_BLOCKS
+template <int max_threads_per_block, int min_blocks_per_mp, int max_blocks_per_mp>
+__global__ void
+__launch_bounds__(max_threads_per_block, min_blocks_per_mp, max_blocks_per_mp)
+Kernel4_sm_90()
+{
+}
+template __global__ void Kernel4_sm_90<MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP, MAX_BLOCKS_PER_MP>();
+
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel4_sm_90{{.*}}, !"maxntidx", i32 256}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel4_sm_90{{.*}}, !"minctasm", i32 2}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel4_sm_90{{.*}}, !"maxclusterrank", i32 4}
+#endif //USE_MAX_BLOCKS
+
 const int constint = 100;
 template <int max_threads_per_block, int min_blocks_per_mp>
 __global__ void
@@ -63,6 +96,23 @@ template __global__ void Kernel5<MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP>();
 // CHECK: !{{[0-9]+}} = !{ptr @{{.*}}Kernel5{{.*}}, !"maxntidx", i32 356}
 // CHECK: !{{[0-9]+}} = !{ptr @{{.*}}Kernel5{{.*}}, !"minctasm", i32 258}
 
+#ifdef USE_MAX_BLOCKS
+
+template <int max_threads_per_block, int min_blocks_per_mp, int max_blocks_per_mp>
+__global__ void
+__launch_bounds__(max_threads_per_block + constint,
+                  min_blocks_per_mp + max_threads_per_block,
+                  max_blocks_per_mp + max_threads_per_block)
+Kernel5_sm_90()
+{
+}
+template __global__ void Kernel5_sm_90<MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP, MAX_BLOCKS_PER_MP>();
+
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel5_sm_90{{.*}}, !"maxntidx", i32 356}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel5_sm_90{{.*}}, !"minctasm", i32 258}
+// CHECK_MAX_BLOCKS: !{{[0-9]+}} = !{ptr @{{.*}}Kernel5_sm_90{{.*}}, !"maxclusterrank", i32 260}
+#endif //USE_MAX_BLOCKS
+
 // Make sure we don't emit negative launch bounds values.
 __global__ void
 __launch_bounds__( -MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP )
@@ -80,7 +130,26 @@ Kernel7()
 // CHECK:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel7{{.*}}, !"maxntidx",
 // CHECK-NOT: !{{[0-9]+}} = !{ptr @{{.*}}Kernel7{{.*}}, !"minctasm",
 
+#ifdef USE_MAX_BLOCKS
+__global__ void
+__launch_bounds__( MAX_THREADS_PER_BLOCK, -MIN_BLOCKS_PER_MP, -MAX_BLOCKS_PER_MP )
+Kernel7_sm_90()
+{
+}
+// CHECK_MAX_BLOCKS:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel7_sm_90{{.*}}, !"maxntidx",
+// CHECK_MAX_BLOCKS-NOT: !{{[0-9]+}} = !{ptr @{{.*}}Kernel7_sm_90{{.*}}, !"minctasm",
+// CHECK_MAX_BLOCKS-NOT: !{{[0-9]+}} = !{ptr @{{.*}}Kernel7_sm_90{{.*}}, !"maxclusterrank",
+#endif // USE_MAX_BLOCKS
+
 const char constchar = 12;
 __global__ void __launch_bounds__(constint, constchar) Kernel8() {}
 // CHECK:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel8{{.*}}, !"maxntidx", i32 100
 // CHECK:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel8{{.*}}, !"minctasm", i32 12
+
+#ifdef USE_MAX_BLOCKS
+const char constchar_2 = 14;
+__global__ void __launch_bounds__(constint, constchar, constchar_2) Kernel8_sm_90() {}
+// CHECK_MAX_BLOCKS:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel8_sm_90{{.*}}, !"maxntidx", i32 100
+// CHECK_MAX_BLOCKS:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel8_sm_90{{.*}}, !"minctasm", i32 12
+// CHECK_MAX_BLOCKS:     !{{[0-9]+}} = !{ptr @{{.*}}Kernel8_sm_90{{.*}}, !"maxclusterrank", i32 14
+#endif // USE_MAX_BLOCKS
diff --git a/clang/test/SemaCUDA/launch_bounds.cu b/clang/test/SemaCUDA/launch_bounds.cu
index 0ca0c0145d8bbb6..b1f29480da30c65 100644
--- a/clang/test/SemaCUDA/launch_bounds.cu
+++ b/clang/test/SemaCUDA/launch_bounds.cu
@@ -12,7 +12,7 @@ __launch_bounds__(0x10000000000000000) void TestWayTooBigArg(void); // expected-
 __launch_bounds__(-128, 7) void TestNegArg1(void); // expected-warning {{'launch_bounds' attribute parameter 0 is negative and will be ignored}}
 __launch_bounds__(128, -7) void TestNegArg2(void); // expected-warning {{'launch_bounds' attribute parameter 1 is negative and will be ignored}}
 
-__launch_bounds__(1, 2, 3) void Test3Args(void); // expected-error {{'launch_bounds' attribute takes no more than 2 arguments}}
+__launch_bounds__(1, 2, 3, 4) void Test4Args(void); // expected-error {{'launch_bounds' attribute takes no more than 3 arguments}}
 __launch_bounds__() void TestNoArgs(void); // expected-error {{'launch_bounds' attribute takes at least 1 argument}}
 
 int TestNoFunction __launch_bounds__(128, 7); // expected-warning {{'launch_bounds' attribute only applies to Objective-C methods, functions, and function pointers}}
@@ -47,3 +47,5 @@ __launch_bounds__(Args) void TestTemplateVariadicArgs(void) {} // expected-error
 
 template <int... Args>
 __launch_bounds__(1, Args) void TestTemplateVariadicArgs2(void) {} // expected-error {{expression contains unexpanded parameter pack 'Args'}}
+
+__launch_bounds__(1, 2, 3) void Test3Args(void); // expected-warning {{maxclusterrank requires sm_90 or higher, CUDA arch provided: unknown, ignoring 'launch_bounds' attribute}}
diff --git a/clang/test/SemaCUDA/launch_bounds_sm_90.cu b/clang/test/SemaCUDA/launch_bounds_sm_90.cu
new file mode 100644
index 000000000000000..6b2369983b74fbb
--- /dev/null
+++ b/clang/test/SemaCUDA/launch_bounds_sm_90.cu
@@ -0,0 +1,45 @@
+// RUN: %clang_cc1 -std=c++11 -fsyntax-only -triple nvptx-unknown-unknown -target-cpu sm_90  -verify %s
+
+#include "Inputs/cuda.h"
+
+__launch_bounds__(128, 7) void Test2Args(void);
+__launch_bounds__(128) void Test1Arg(void);
+
+__launch_bounds__(0xffffffff) void TestMaxArg(void);
+__launch_bounds__(0x100000000) void TestTooBigArg(void); // expected-error {{integer constant expression evaluates to value 4294967296 that cannot be represented in a 32-bit unsigned integer type}}
+__launch_bounds__(0x10000000000000000) void TestWayTooBigArg(void); // expected-error {{integer literal is too large to be represented in any integer type}}
+__launch_bounds__(1, 1, 0x10000000000000000) void TestWayTooBigArg(void); // expected-error {{integer literal is too large to be represented in any integer type}}
+
+__launch_bounds__(-128, 7) void TestNegArg1(void); // expected-warning {{'launch_bounds' attribute parameter 0 is negative and will be ignored}}
+__launch_bounds__(128, -7) void TestNegArg2(void); // expected-warning {{'launch_bounds' attribute parameter 1 is negative and will be ignored}}
+__launch_bounds__(128, 1, -7) void TestNegArg2(void); // expected-warning {{'launch_bounds' attribute parameter 2 is negative and will be ignored}}
+
+
+__launch_bounds__(1, 2, 3, 4) void Test4Args(void); // expected-error {{'launch_bounds' attribute takes no more than 3 arguments}}
+__launch_bounds__() void TestNoArgs(void); // expected-error {{'launch_bounds' attribute takes at least 1 argument}}
+
+int TestNoFunction __launch_bounds__(128, 7, 13); // expected-warning {{'launch_bounds' attribute only applies to Objective-C methods, functions, and function pointers}}
+
+__launch_bounds__(true) void TestBool(void);
+__launch_bounds__(128, 1, 128.0) void TestFP(void); // expected-error {{'launch_bounds' attribute requires parameter 2 to be an integer constant}}
+__launch_bounds__(128, 1, (void*)0) void TestNullptr(void); // expected-error {{'launch_bounds' ...

jchlanda · 2023-09-21T09:54:40Z

A friendly ping.

ldrumm

lovely tests. Looks good modulo nits

ldrumm · 2023-09-21T10:33:51Z

clang/lib/CodeGen/Targets/NVPTX.cpp

+    MaxBlocks = Attr->getMaxBlocks()->EvaluateKnownConstInt(getContext());
+    if (MaxBlocks > 0)
+      // Create !{<func-ref>, metadata !"maxclusterrank", i32 <val>} node
+      NVPTXTargetCodeGenInfo::addNVVMMetadata(F, "maxclusterrank",


Do we have enough information to assert this is non-negative?

That's a good question. makeLaunchBoundsArgExpr does check for negative values, but it lets the value pass (unlike values wider than 32 bits, for which it returns nullptr). I didn't want to change that, so the negative case is caught here.
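The guard discussed in this thread can be sketched in isolation. This is a minimal stand-alone illustration, assuming the argument has already been evaluated to an integer; `launchBoundToEmit` is a hypothetical helper, not a function in clang.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not part of clang: decides whether an evaluated
// launch-bound argument should produce an NVVM metadata node.
// Zero and negative values yield no metadata, mirroring the
// `MaxBlocks > 0` guard in the patch. Values that do not fit in 32 bits
// are also dropped here; in clang they are rejected earlier, in Sema.
std::optional<std::uint32_t> launchBoundToEmit(std::int64_t Evaluated) {
  if (Evaluated <= 0 ||
      Evaluated > std::numeric_limits<std::uint32_t>::max())
    return std::nullopt;
  return static_cast<std::uint32_t>(Evaluated);
}
```

Returning an empty optional keeps "no directive" distinct from any real bound, which matches the CodeGen tests that expect no `maxclusterrank` entry for negative arguments.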

clang/lib/Sema/SemaDeclAttr.cpp

llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp

ldrumm · 2023-09-21T10:53:39Z

including Artem as NVPTX is involved

Artem-B · 2023-09-21T15:57:25Z

clang/include/clang/Basic/DiagnosticSemaKinds.td

@@ -11836,6 +11836,10 @@ def err_sycl_special_type_num_init_method : Error<
  "types with 'sycl_special_class' attribute must have one and only one '__init' "
  "method defined">;

+def warn_cuda_maxclusterrank_sm_90 : Warning<
+  "maxclusterrank requires sm_90 or higher, CUDA arch provided: %0, ignoring "
+  "%1 attribute">, InGroup<IgnoredAttributes>;


Are we ignoring the whole launch_bounds attribute, or only the MaxBlocks parameter?

The whole thing, this is analogous to how we currently handle:

__launch_bounds__(128, -2)

we issue a warning:

/home/dev/llvm/clang/test/SemaCUDA/launch_bounds_running_test.cu:5:24: warning: 'launch_bounds' attribute parameter 1 is negative and will be ignored [-Wcuda-compat]
    5 | __launch_bounds__(128, -2) void Test2Args(void);
      |                        ^~
/home/dev/llvm/clang/test/SemaCUDA/Inputs/cuda.h:14:61: note: expanded from macro '__launch_bounds__'
   14 | #define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))
      |                                                             ^~~~~~~~~~~
1 warning generated when compiling for host.

vs max cluster rank:

/home/dev/llvm/clang/test/SemaCUDA/launch_bounds_running_test.cu:5:27: warning: 'launch_bounds' attribute parameter 2 is negative and will be ignored [-Wcuda-compat]
    5 | __launch_bounds__(128, 2, -8) void Test2Args(void);
      |                           ^~
/home/dev/llvm/clang/test/SemaCUDA/Inputs/cuda.h:14:61: note: expanded from macro '__launch_bounds__'
   14 | #define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))
      |                                                             ^~~~~~~~~~~
1 warning generated when compiling for host.

and the resulting asm contains neither of the directives.

Artem-B · 2023-09-21T16:08:41Z

clang/lib/Sema/SemaDeclAttr.cpp

@@ -5607,6 +5607,21 @@ bool Sema::CheckRegparmAttr(const ParsedAttr &AL, unsigned &numParams) {
  return false;
 }

+// Helper to get CudaArch.
+static CudaArch getCudaArch(const TargetInfo &TI) {


Considering that we do have TargetInfo pointer here, instead of trying to figure out the target GPU via features, can we just extract CudaArch directly from NVPTXTargetInfo::GPU ?

Is that the kind of thing you had in mind:

diff --git a/clang/lib/Basic/Targets/NVPTX.h b/clang/lib/Basic/Targets/NVPTX.h
index 6fa0b8df97d7..20d76b702a94 100644
--- a/clang/lib/Basic/Targets/NVPTX.h
+++ b/clang/lib/Basic/Targets/NVPTX.h
@@ -181,6 +181,8 @@ public:
   bool hasBitIntType() const override { return true; }
   bool hasBFloat16Type() const override { return true; }
+
+  CudaArch getGPU() const { return GPU; }
 };
 } // namespace targets
 } // namespace clang
diff --git a/clang/lib/Sema/SemaDeclAttr.cpp b/clang/lib/Sema/SemaDeclAttr.cpp
index c4ecaec7728b..636bb0694d36 100644
--- a/clang/lib/Sema/SemaDeclAttr.cpp
+++ b/clang/lib/Sema/SemaDeclAttr.cpp
@@ -10,6 +10,7 @@
 //
 //===----------------------------------------------------------------------===//

+#include "../Basic/Targets/NVPTX.h"
 #include "clang/AST/ASTConsumer.h"
 #include "clang/AST/ASTContext.h"
 #include "clang/AST/ASTMutationListener.h"
@@ -5609,17 +5610,7 @@ bool Sema::CheckRegparmAttr(const ParsedAttr &AL, unsigned &numParams) {

 // Helper to get CudaArch.
 static CudaArch getCudaArch(const TargetInfo &TI) {
-  if (!TI.hasFeature("ptx")) {
-    return CudaArch::UNKNOWN;
-  }
-  for (const auto &Feature : TI.getTargetOpts().FeatureMap) {
-    if (Feature.getValue()) {
-      CudaArch Arch = StringToCudaArch(Feature.getKey());
-      if (Arch != CudaArch::UNKNOWN)
-        return Arch;
-    }
-  }
-  return CudaArch::UNKNOWN;
+  return static_cast<const targets::NVPTXTargetInfo *>(&TI)->getGPU();
 }

 // Checks whether an argument of launch_bounds attribute is

You may need to verify that TI->getTriple()->isNVPTX() before casting, but other than that, LGTM.

Done in: 3c17966
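The "verify the triple before casting" advice can be shown with toy types. This is a minimal sketch, not clang code: `ToyTargetInfo`, `ToyNVPTXTargetInfo`, `getToyCudaArch` and `kUnknownArch` are all hypothetical stand-ins, and the sketch assumes every NVPTX triple is backed by the NVPTX subclass.

```cpp
#include <string>
#include <utility>

// Toy stand-ins for clang's TargetInfo hierarchy; all names are hypothetical.
struct ToyTargetInfo {
  explicit ToyTargetInfo(std::string Arch) : ArchName(std::move(Arch)) {}
  virtual ~ToyTargetInfo() = default;
  bool isNVPTX() const { return ArchName == "nvptx" || ArchName == "nvptx64"; }
  std::string ArchName;
};

struct ToyNVPTXTargetInfo : ToyTargetInfo {
  ToyNVPTXTargetInfo(std::string Arch, int SM)
      : ToyTargetInfo(std::move(Arch)), SmVersion(SM) {}
  int getGPU() const { return SmVersion; }
  int SmVersion;
};

constexpr int kUnknownArch = 0;

int getToyCudaArch(const ToyTargetInfo &TI) {
  // Check the triple first: the static_cast below is only valid when the
  // object really is the NVPTX subclass.
  if (!TI.isNVPTX())
    return kUnknownArch;
  return static_cast<const ToyNVPTXTargetInfo &>(TI).getGPU();
}
```

Without the triple check, a host-side target object would be reinterpreted as the NVPTX subclass, which is undefined behaviour.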

clang/lib/Sema/SemaDeclAttr.cpp

clang/test/SemaCUDA/launch_bounds.cu

clang/test/SemaCUDA/launch_bounds_sm_90.cu

Artem-B · 2023-09-21T16:28:53Z

llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp

+  if (getMaxNReg(F, Maxnreg))
+    O << ".maxnreg " << Maxnreg << "\n";
+
+  unsigned Maxclusterrank = 0;


Do we want to ignore this directive if the metadata exists, but we're targeting a pre-sm_90 GPU?

It may be useful for non-clang LLVM users (e.g. XLA) to be able to always specify launch bounds metadata and let LLVM decide what it can do with it. Generating the directive for older GPUs would result in a ptxas error, while ignoring it would still allow the kernels to compile and work, just as if the metadata were correctly absent. I don't think there's much point in requiring users to jump through more hoops just to achieve exactly the same result.

You are right: ptxas rejects a sample that uses .maxclusterrank on a pre-SM_90 target with a hard error:

ptxas --gpu-name sm_75 --output-file cluster_rank.o cluster_rank.s
ptxas cluster_rank.s, line 18; error : Feature '.maxclusterrank' requires .target sm_90 or higher
ptxas fatal : Ptx assembly aborted due to errors

Do I understand you right, that you'd like to see a check similar to what we do in SemaDeclAttr and filter out the directive on targets < SM_90?

We do not have a good way to issue any diagnostics from LLVM, so the choice would be to either reject the IR as invalid, or make an effort to compile to valid PTX. Right now we're neither here nor there.

I'd be fine with either of the options above. That said, ignoring metadata which we can't apply seems OK to me.

I've talked to @alinas who has more experience dealing with IR and she also thinks that ignoring maxclusterrank metadata on older GPUs is the right choice here.

Sure, done in: 261840a
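The behaviour agreed on in this thread, dropping the directive when the target predates sm_90, might look roughly like the sketch below. `clusterRankDirective` is a hypothetical free function used only for illustration; the real change lives in NVPTXAsmPrinter.cpp and is not reproduced here.

```cpp
#include <string>

// Hypothetical helper, not the real NVPTXAsmPrinter code.
// Returns the PTX text for the cluster-rank bound, or an empty string
// when the metadata is absent (0) or the target cannot accept it.
std::string clusterRankDirective(unsigned SmVersion, unsigned MaxClusterRank) {
  // ptxas rejects .maxclusterrank below sm_90, so stay silent there.
  if (MaxClusterRank == 0 || SmVersion < 90)
    return "";
  return ".maxclusterrank " + std::to_string(MaxClusterRank) + "\n";
}
```

Emitting nothing on older targets keeps IR producers such as XLA free to attach the metadata unconditionally.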

sam-mccall · 2023-09-27T09:03:18Z

clang/lib/Sema/SemaDeclAttr.cpp

@@ -10,6 +10,7 @@
 //
 //===----------------------------------------------------------------------===//

+#include "../Basic/Targets/NVPTX.h"


This header is not part of clangBasic's interface, but rather its implementation (lib/Basic rather than include/clang/Basic).
Sema can't depend on it - if you need to use its APIs from outside clangBasic they should be moved to a public header.

The bazel build shows the problem: https://buildkite.com/llvm-project/upstream-bazel/builds/75928#018ad568-2f7c-4dda-ae90-3b4d787caad7. Breaking bazel is not itself reason to revert, but here it's flagging a real problem that CMake doesn't catch.

I've reverted as 0afbcb2 - sorry to do this so abruptly, but I can't fix this myself & such problems block downstream use of LLVM.

@sam-mccall, apologies for introducing the bug and thank you for drawing my attention to it.

I've got the fix for the problem:

diff --git a/clang/lib/Sema/SemaDeclAttr.cpp b/clang/lib/Sema/SemaDeclAttr.cpp
index 10d1c910d9cd..3b87300e24bc 100644
--- a/clang/lib/Sema/SemaDeclAttr.cpp
+++ b/clang/lib/Sema/SemaDeclAttr.cpp
@@ -10,7 +10,6 @@
 //
 //===----------------------------------------------------------------------===//

-#include "../Basic/Targets/NVPTX.h"
 #include "clang/AST/ASTConsumer.h"
 #include "clang/AST/ASTContext.h"
 #include "clang/AST/ASTMutationListener.h"
@@ -5612,7 +5611,8 @@ bool Sema::CheckRegparmAttr(const ParsedAttr &AL, unsigned &numParams) {
 static CudaArch getCudaArch(const TargetInfo &TI) {
   if (!TI.getTriple().isNVPTX())
     llvm_unreachable("getCudaArch is only valid for NVPTX triple");
-  return static_cast<const targets::NVPTXTargetInfo *>(&TI)->getGPU();
+  auto &TO = TI.getTargetOpts();
+  return StringToCudaArch(TO.CPU);
 }

 // Checks whether an argument of launch_bounds attribute is

Would you be so kind as to point me to the process for "reverting the revert" and folding the fix into the original patch?
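The fix derives the architecture from the target CPU string rather than from a private header. A self-contained approximation of that idea is sketched below; `parseSmVersion` and `supportsMaxClusterRank` are hypothetical names standing in for clang's StringToCudaArch and the SM_90 comparison, and the parser only understands names of the form `sm_NN` with an optional trailing `a`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical parser standing in for clang's StringToCudaArch.
// Returns the numeric SM version for names such as "sm_90" or "sm_90a",
// and -1 for anything else (including an empty CPU string).
int parseSmVersion(const std::string &Cpu) {
  const std::string Prefix = "sm_";
  if (Cpu.size() <= Prefix.size() ||
      Cpu.compare(0, Prefix.size(), Prefix) != 0)
    return -1;

  int Version = 0;
  std::size_t I = Prefix.size();
  const std::size_t DigitsStart = I;
  while (I < Cpu.size() && std::isdigit(static_cast<unsigned char>(Cpu[I]))) {
    Version = Version * 10 + (Cpu[I] - '0');
    if (Version > 10000) // implausible value; also avoids overflow
      return -1;
    ++I;
  }
  if (I == DigitsStart) // "sm_" followed by no digits
    return -1;
  // Accept a single trailing 'a' (as in "sm_90a") and nothing else.
  if (I < Cpu.size() && !(I + 1 == Cpu.size() && Cpu[I] == 'a'))
    return -1;
  return Version;
}

// True when the named target can accept the .maxclusterrank directive.
bool supportsMaxClusterRank(const std::string &Cpu) {
  return parseSmVersion(Cpu) >= 90;
}
```

An empty CPU string maps to "unknown", which corresponds to the warning the Sema test expects when no -target-cpu is given.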

@jchlanda ah, that's simpler than I expected! Wish I'd found that before reverting...

That looks good to me. Rather than opening a new review, I think you can just git revert 0afbcb2, make the changes you described above, git commit -a --amend, change the description to "Reland [NVPTX] ...", run the tests and push.

Or if you prefer, send it as a new PR, happy to approve it.

jchlanda requested review from rnk, nikic and jdoerfert September 15, 2023 11:24

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. labels Sep 15, 2023

jchlanda mentioned this pull request Sep 15, 2023

[SYCL] Introduce min_work_groups_per_cu and max_work_groups_per_mp intel/llvm#11192

Merged

nikic removed their request for review September 15, 2023 13:24

jchlanda requested review from ldrumm and nikic September 21, 2023 09:54

ldrumm approved these changes Sep 21, 2023

ldrumm requested a review from Artem-B September 21, 2023 10:53

Artem-B reviewed Sep 21, 2023

jchlanda requested a review from Artem-B September 22, 2023 10:29

Artem-B approved these changes Sep 25, 2023

jchlanda force-pushed the jakub/launch_bounds_maxclusterrank branch from 261840a to 437c41f Compare September 26, 2023 18:08

jchlanda merged commit dfab31b into llvm:main Sep 27, 2023

sam-mccall added a commit that referenced this pull request Sep 27, 2023

Revert "[NVPTX] Add support for maxclusterrank in launch_bounds (#66496…

0afbcb2

…)" This reverts commit dfab31b. SemaDeclAttr.cpp cannot depend on Basic's private headers (lib/Basic/Targets/NVPTX.h)

sam-mccall reviewed Sep 27, 2023

jchlanda mentioned this pull request Sep 28, 2023

Reland [NVPTX] Add support for maxclusterrank in launch_bounds (#66496) #67667

Merged

jchlanda added a commit to jchlanda/llvm-project that referenced this pull request Sep 28, 2023

Reland [NVPTX] Add support for maxclusterrank in launch_bounds (llvm#…

6d17781

…66496) This reverts commit 0afbcb2.

jchlanda added a commit that referenced this pull request Sep 29, 2023

Reland [NVPTX] Add support for maxclusterrank in launch_bounds (#66496)…

3f8d4a8

… (#67667) This reverts commit 0afbcb2.

legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request Sep 29, 2023

Reland [NVPTX] Add support for maxclusterrank in launch_bounds (llvm#…

24b81dc

…66496) (llvm#67667) This reverts commit 0afbcb2.
