Skip to content

[NVPTX] Improve device function byval parameter lowering #129188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 28, 2025

Conversation

AlexMaclean
Copy link
Member

PTX supports 2 methods of accessing device function parameters:

  • "simple" case: If a parameters is only loaded, and all loads can address the parameter via a constant offset, then the parameter may be loaded via the ".param" address space. This case is not possible if the parameters is stored to or has it's address taken. This method is preferable when possible.

  • "move param" case: For more complex cases the address of the param may be placed in a register via a "mov" instruction. This mov also implicitly moves the param to the ".local" address space and allows for it to be written to. This essentially defers the responsibilty of the byval copy to the PTX calling convention.

The handling of these cases in the NVPTX backend for byval pointers has some major issues. We currently attempt to determine if a copy is necessary in NVPTXLowerArgs and either explicitly make an additional copy in the IR, or insert "addrspacecast" to move the param to the param address space. Unfortunately the criteria for determining which case is possible are not correct, leading to miscompilations (https://godbolt.org/z/Gq1fP7a3G). Further, the criteria for the "simple" case aren't enforceable in LLVM IR across other transformations and instruction selection, making deciding between the 2 cases in NVPTXLowerArgs brittle and buggy.

This patch aims to fix these issues and improve address space related optimization. In NVPTXLowerArgs, we conservatively assume that all parameters will use the "move param" case and the local address space. Responsibility for switching to the "simple" case is given to a new MachineIR pass, NVPTXForwardParams, which runs once it has become clear whether or not this is possible. This ensures that the correct address space is known for the "move param" case allowing for optimization, while still using the "simple" case where ever possible.

@llvmbot
Copy link
Member

llvmbot commented Feb 28, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

PTX supports 2 methods of accessing device function parameters:

  • "simple" case: If a parameters is only loaded, and all loads can address the parameter via a constant offset, then the parameter may be loaded via the ".param" address space. This case is not possible if the parameters is stored to or has it's address taken. This method is preferable when possible.

  • "move param" case: For more complex cases the address of the param may be placed in a register via a "mov" instruction. This mov also implicitly moves the param to the ".local" address space and allows for it to be written to. This essentially defers the responsibilty of the byval copy to the PTX calling convention.

The handling of these cases in the NVPTX backend for byval pointers has some major issues. We currently attempt to determine if a copy is necessary in NVPTXLowerArgs and either explicitly make an additional copy in the IR, or insert "addrspacecast" to move the param to the param address space. Unfortunately the criteria for determining which case is possible are not correct, leading to miscompilations (https://godbolt.org/z/Gq1fP7a3G). Further, the criteria for the "simple" case aren't enforceable in LLVM IR across other transformations and instruction selection, making deciding between the 2 cases in NVPTXLowerArgs brittle and buggy.

This patch aims to fix these issues and improve address space related optimization. In NVPTXLowerArgs, we conservatively assume that all parameters will use the "move param" case and the local address space. Responsibility for switching to the "simple" case is given to a new MachineIR pass, NVPTXForwardParams, which runs once it has become clear whether or not this is possible. This ensures that the correct address space is known for the "move param" case allowing for optimization, while still using the "simple" case where ever possible.


Patch is 35.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129188.diff

13 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/NVPTX/NVPTX.h (+1)
  • (added) llvm/lib/Target/NVPTX/NVPTXForwardParams.cpp (+169)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+2-2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+12-4)
  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+3-18)
  • (modified) llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp (+27-36)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+3)
  • (added) llvm/test/CodeGen/NVPTX/forward-ld-param.ll (+142)
  • (modified) llvm/test/CodeGen/NVPTX/i128-array.ll (+7-8)
  • (modified) llvm/test/CodeGen/NVPTX/lower-args-gridconstant.ll (+26-24)
  • (modified) llvm/test/CodeGen/NVPTX/lower-args.ll (+8-15)
  • (modified) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+10-10)
diff --git a/llvm/lib/Target/NVPTX/CMakeLists.txt b/llvm/lib/Target/NVPTX/CMakeLists.txt
index dfbda84534732..1cffde138eab7 100644
--- a/llvm/lib/Target/NVPTX/CMakeLists.txt
+++ b/llvm/lib/Target/NVPTX/CMakeLists.txt
@@ -16,6 +16,7 @@ set(NVPTXCodeGen_sources
   NVPTXAtomicLower.cpp
   NVPTXAsmPrinter.cpp
   NVPTXAssignValidGlobalNames.cpp
+  NVPTXForwardParams.cpp
   NVPTXFrameLowering.cpp
   NVPTXGenericToNVVM.cpp
   NVPTXISelDAGToDAG.cpp
diff --git a/llvm/lib/Target/NVPTX/NVPTX.h b/llvm/lib/Target/NVPTX/NVPTX.h
index ca915cd3f3732..62f51861ac55a 100644
--- a/llvm/lib/Target/NVPTX/NVPTX.h
+++ b/llvm/lib/Target/NVPTX/NVPTX.h
@@ -52,6 +52,7 @@ FunctionPass *createNVPTXLowerUnreachablePass(bool TrapUnreachable,
                                               bool NoTrapAfterNoreturn);
 MachineFunctionPass *createNVPTXPeephole();
 MachineFunctionPass *createNVPTXProxyRegErasurePass();
+MachineFunctionPass *createNVPTXForwardParamsPass();
 
 struct NVVMIntrRangePass : PassInfoMixin<NVVMIntrRangePass> {
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
diff --git a/llvm/lib/Target/NVPTX/NVPTXForwardParams.cpp b/llvm/lib/Target/NVPTX/NVPTXForwardParams.cpp
new file mode 100644
index 0000000000000..47d44b985363d
--- /dev/null
+++ b/llvm/lib/Target/NVPTX/NVPTXForwardParams.cpp
@@ -0,0 +1,169 @@
+//- NVPTXForwardParams.cpp - NVPTX Forward Device Params Removing Local Copy -//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// PTX supports 2 methods of accessing device function parameters:
+//
+//   - "simple" case: If a parameters is only loaded, and all loads can address
+//     the parameter via a constant offset, then the parameter may be loaded via
+//     the ".param" address space. This case is not possible if the parameters
+//     is stored to or has it's address taken. This method is preferable when
+//     possible. Ex:
+//
+//            ld.param.u32    %r1, [foo_param_1];
+//            ld.param.u32    %r2, [foo_param_1+4];
+//
+//   - "move param" case: For more complex cases the address of the param may be
+//     placed in a register via a "mov" instruction. This "mov" also implicitly
+//     moves the param to the ".local" address space and allows for it to be
+//     written to. This essentially defers the responsibilty of the byval copy
+//     to the PTX calling convention.
+//
+//            mov.b64         %rd1, foo_param_0;
+//            st.local.u32    [%rd1], 42;
+//            add.u64         %rd3, %rd1, %rd2;
+//            ld.local.u32    %r2, [%rd3];
+//
+// In NVPTXLowerArgs and SelectionDAG, we pessimistically assume that all
+// parameters will use the "move param" case and the local address space. This
+// pass is responsible for switching to the "simple" case when possible, as it
+// is more efficient.
+//
+// We do this by simply traversing uses of the param "mov" instructions an
+// trivially checking if they are all loads.
+//
+//===----------------------------------------------------------------------===//
+
+#include "NVPTX.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/TargetRegisterInfo.h"
+#include "llvm/Support/ErrorHandling.h"
+
+using namespace llvm;
+
+static bool traverseMoveUse(MachineInstr &U, const MachineRegisterInfo &MRI,
+                            SmallVectorImpl<MachineInstr *> &RemoveList,
+                            SmallVectorImpl<MachineInstr *> &LoadInsts) {
+  switch (U.getOpcode()) {
+  case NVPTX::LD_f32:
+  case NVPTX::LD_f64:
+  case NVPTX::LD_i16:
+  case NVPTX::LD_i32:
+  case NVPTX::LD_i64:
+  case NVPTX::LD_i8:
+  case NVPTX::LDV_f32_v2:
+  case NVPTX::LDV_f32_v4:
+  case NVPTX::LDV_f64_v2:
+  case NVPTX::LDV_f64_v4:
+  case NVPTX::LDV_i16_v2:
+  case NVPTX::LDV_i16_v4:
+  case NVPTX::LDV_i32_v2:
+  case NVPTX::LDV_i32_v4:
+  case NVPTX::LDV_i64_v2:
+  case NVPTX::LDV_i64_v4:
+  case NVPTX::LDV_i8_v2:
+  case NVPTX::LDV_i8_v4: {
+    LoadInsts.push_back(&U);
+    return true;
+  }
+  case NVPTX::cvta_local:
+  case NVPTX::cvta_local_64:
+  case NVPTX::cvta_to_local:
+  case NVPTX::cvta_to_local_64: {
+    for (auto &U2 : MRI.use_instructions(U.operands_begin()->getReg()))
+      if (!traverseMoveUse(U2, MRI, RemoveList, LoadInsts))
+        return false;
+
+    RemoveList.push_back(&U);
+    return true;
+  }
+  default:
+    return false;
+  }
+}
+
+static bool eliminateMove(MachineInstr &Mov, const MachineRegisterInfo &MRI,
+                          SmallVectorImpl<MachineInstr *> &RemoveList) {
+  SmallVector<MachineInstr *, 16> MaybeRemoveList;
+  SmallVector<MachineInstr *, 16> LoadInsts;
+
+  for (auto &U : MRI.use_instructions(Mov.operands_begin()->getReg()))
+    if (!traverseMoveUse(U, MRI, MaybeRemoveList, LoadInsts))
+      return false;
+
+  RemoveList.append(MaybeRemoveList);
+  RemoveList.push_back(&Mov);
+
+  const MachineOperand *ParamSymbol = Mov.uses().begin();
+  assert(ParamSymbol->isSymbol());
+
+  constexpr unsigned LDInstBasePtrOpIdx = 6;
+  constexpr unsigned LDInstAddrSpaceOpIdx = 2;
+  for (auto *LI : LoadInsts) {
+    (LI->uses().begin() + LDInstBasePtrOpIdx)
+        ->ChangeToES(ParamSymbol->getSymbolName());
+    (LI->uses().begin() + LDInstAddrSpaceOpIdx)
+        ->ChangeToImmediate(NVPTX::AddressSpace::Param);
+  }
+  return true;
+}
+
+static bool forwardDeviceParams(MachineFunction &MF) {
+  const auto &MRI = MF.getRegInfo();
+
+  bool Changed = false;
+  SmallVector<MachineInstr *, 16> RemoveList;
+  for (auto &MI : make_early_inc_range(*MF.begin()))
+    if (MI.getOpcode() == NVPTX::MOV32_PARAM ||
+        MI.getOpcode() == NVPTX::MOV64_PARAM)
+      Changed |= eliminateMove(MI, MRI, RemoveList);
+
+  for (auto *MI : RemoveList)
+    MI->eraseFromParent();
+
+  return Changed;
+}
+
+/// ----------------------------------------------------------------------------
+///                       Pass (Manager) Boilerplate
+/// ----------------------------------------------------------------------------
+
+namespace llvm {
+void initializeNVPTXForwardParamsPassPass(PassRegistry &);
+} // namespace llvm
+
+namespace {
+struct NVPTXForwardParamsPass : public MachineFunctionPass {
+  static char ID;
+  NVPTXForwardParamsPass() : MachineFunctionPass(ID) {
+    initializeNVPTXForwardParamsPassPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+} // namespace
+
+char NVPTXForwardParamsPass::ID = 0;
+
+INITIALIZE_PASS(NVPTXForwardParamsPass, "nvptx-forward-params",
+                "NVPTX Forward Params", false, false)
+
+bool NVPTXForwardParamsPass::runOnMachineFunction(MachineFunction &MF) {
+  return forwardDeviceParams(MF);
+}
+
+MachineFunctionPass *llvm::createNVPTXForwardParamsPass() {
+  return new NVPTXForwardParamsPass();
+}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index 8a5cdd7412bf3..0461ed4712221 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -2197,11 +2197,11 @@ static SDValue selectBaseADDR(SDValue N, SelectionDAG *DAG) {
   if (N.getOpcode() == NVPTXISD::Wrapper)
     return N.getOperand(0);
 
-  // addrspacecast(MoveParam(arg_symbol) to addrspace(PARAM)) -> arg_symbol
+  // addrspacecast(Wrapper(arg_symbol) to addrspace(PARAM)) -> arg_symbol
   if (AddrSpaceCastSDNode *CastN = dyn_cast<AddrSpaceCastSDNode>(N))
     if (CastN->getSrcAddressSpace() == ADDRESS_SPACE_GENERIC &&
         CastN->getDestAddressSpace() == ADDRESS_SPACE_PARAM &&
-        CastN->getOperand(0).getOpcode() == NVPTXISD::MoveParam)
+        CastN->getOperand(0).getOpcode() == NVPTXISD::Wrapper)
       return selectBaseADDR(CastN->getOperand(0).getOperand(0), DAG);
 
   if (auto *FIN = dyn_cast<FrameIndexSDNode>(N))
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 684098681d1ab..0ae6f9004b458 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -3376,10 +3376,18 @@ SDValue NVPTXTargetLowering::LowerFormalArguments(
     assert(ObjectVT == Ins[InsIdx].VT &&
            "Ins type did not match function type");
     SDValue Arg = getParamSymbol(DAG, i, PtrVT);
-    SDValue p = DAG.getNode(NVPTXISD::MoveParam, dl, ObjectVT, Arg);
-    if (p.getNode())
-      p.getNode()->setIROrder(i + 1);
-    InVals.push_back(p);
+
+    SDValue P;
+    if (isKernelFunction(*F)) {
+      P = DAG.getNode(NVPTXISD::Wrapper, dl, ObjectVT, Arg);
+      P.getNode()->setIROrder(i + 1);
+    } else {
+      P = DAG.getNode(NVPTXISD::MoveParam, dl, ObjectVT, Arg);
+      P.getNode()->setIROrder(i + 1);
+      P = DAG.getAddrSpaceCast(dl, ObjectVT, P, ADDRESS_SPACE_LOCAL,
+                               ADDRESS_SPACE_GENERIC);
+    }
+    InVals.push_back(P);
   }
 
   if (!OutChains.empty())
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 36a0a06bdb8aa..6edb0998760b8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -2324,7 +2324,7 @@ def SDTCallArgProfile : SDTypeProfile<0, 2, [SDTCisInt<0>]>;
 def SDTCallArgMarkProfile : SDTypeProfile<0, 0, []>;
 def SDTCallVoidProfile : SDTypeProfile<0, 1, []>;
 def SDTCallValProfile : SDTypeProfile<1, 0, []>;
-def SDTMoveParamProfile : SDTypeProfile<1, 1, []>;
+def SDTMoveParamProfile : SDTypeProfile<1, 1, [SDTCisInt<0>, SDTCisInt<1>]>;
 def SDTStoreRetvalProfile : SDTypeProfile<0, 2, [SDTCisInt<0>]>;
 def SDTStoreRetvalV2Profile : SDTypeProfile<0, 3, [SDTCisInt<0>]>;
 def SDTStoreRetvalV4Profile : SDTypeProfile<0, 5, [SDTCisInt<0>]>;
@@ -2688,29 +2688,14 @@ def DeclareScalarRegInst :
             ".reg .b$size param$a;",
             [(DeclareScalarParam (i32 imm:$a), (i32 imm:$size), (i32 1))]>;
 
-class MoveParamInst<ValueType T, NVPTXRegClass regclass, string asmstr> :
-  NVPTXInst<(outs regclass:$dst), (ins regclass:$src),
-            !strconcat("mov", asmstr, " \t$dst, $src;"),
-            [(set T:$dst, (MoveParam T:$src))]>;
-
 class MoveParamSymbolInst<NVPTXRegClass regclass, Operand srcty, ValueType vt,
                           string asmstr> :
   NVPTXInst<(outs regclass:$dst), (ins srcty:$src),
             !strconcat("mov", asmstr, " \t$dst, $src;"),
             [(set vt:$dst, (MoveParam texternalsym:$src))]>;
 
-def MoveParamI64 : MoveParamInst<i64, Int64Regs, ".b64">;
-def MoveParamI32 : MoveParamInst<i32, Int32Regs, ".b32">;
-
-def MoveParamSymbolI64 : MoveParamSymbolInst<Int64Regs, i64imm, i64, ".b64">;
-def MoveParamSymbolI32 : MoveParamSymbolInst<Int32Regs, i32imm, i32, ".b32">;
-
-def MoveParamI16 :
-  NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$src),
-            "cvt.u16.u32 \t$dst, $src;", // ??? Why cvt.u16.u32 ?
-            [(set i16:$dst, (MoveParam i16:$src))]>;
-def MoveParamF64 : MoveParamInst<f64, Float64Regs, ".f64">;
-def MoveParamF32 : MoveParamInst<f32, Float32Regs, ".f32">;
+def MOV64_PARAM : MoveParamSymbolInst<Int64Regs, i64imm, i64, ".b64">;
+def MOV32_PARAM : MoveParamSymbolInst<Int32Regs, i32imm, i32, ".b32">;
 
 class PseudoUseParamInst<NVPTXRegClass regclass, ValueType vt> :
   NVPTXInst<(outs), (ins regclass:$src),
diff --git a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
index c763b54c8dbfe..6cef245e2d98e 100644
--- a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
@@ -153,6 +153,7 @@
 #include "llvm/Pass.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/NVPTXAddrSpace.h"
 #include <numeric>
 #include <queue>
 
@@ -373,19 +374,19 @@ static void adjustByValArgAlignment(Argument *Arg, Value *ArgInParamAS,
   Type *StructType = Arg->getParamByValType();
   const DataLayout &DL = Func->getDataLayout();
 
-  uint64_t NewArgAlign =
-      TLI->getFunctionParamOptimizedAlign(Func, StructType, DL).value();
-  uint64_t CurArgAlign =
-      Arg->getAttribute(Attribute::Alignment).getValueAsInt();
+  const Align NewArgAlign =
+      TLI->getFunctionParamOptimizedAlign(Func, StructType, DL);
+  const Align CurArgAlign = Arg->getParamAlign().valueOrOne();
 
   if (CurArgAlign >= NewArgAlign)
     return;
 
-  LLVM_DEBUG(dbgs() << "Try to use alignment " << NewArgAlign << " instead of "
-                    << CurArgAlign << " for " << *Arg << '\n');
+  LLVM_DEBUG(dbgs() << "Try to use alignment " << NewArgAlign.value()
+                    << " instead of " << CurArgAlign.value() << " for " << *Arg
+                    << '\n');
 
   auto NewAlignAttr =
-      Attribute::get(Func->getContext(), Attribute::Alignment, NewArgAlign);
+      Attribute::getWithAlignment(Func->getContext(), NewArgAlign);
   Arg->removeAttr(Attribute::Alignment);
   Arg->addAttr(NewAlignAttr);
 
@@ -402,7 +403,6 @@ static void adjustByValArgAlignment(Argument *Arg, Value *ArgInParamAS,
   SmallVector<Load> Loads;
   std::queue<LoadContext> Worklist;
   Worklist.push({ArgInParamAS, 0});
-  bool IsGridConstant = isParamGridConstant(*Arg);
 
   while (!Worklist.empty()) {
     LoadContext Ctx = Worklist.front();
@@ -411,15 +411,9 @@ static void adjustByValArgAlignment(Argument *Arg, Value *ArgInParamAS,
     for (User *CurUser : Ctx.InitialVal->users()) {
       if (auto *I = dyn_cast<LoadInst>(CurUser)) {
         Loads.push_back({I, Ctx.Offset});
-        continue;
-      }
-
-      if (auto *I = dyn_cast<BitCastInst>(CurUser)) {
-        Worklist.push({I, Ctx.Offset});
-        continue;
-      }
-
-      if (auto *I = dyn_cast<GetElementPtrInst>(CurUser)) {
+      } else if (isa<BitCastInst>(CurUser) || isa<AddrSpaceCastInst>(CurUser)) {
+        Worklist.push({cast<Instruction>(CurUser), Ctx.Offset});
+      } else if (auto *I = dyn_cast<GetElementPtrInst>(CurUser)) {
         APInt OffsetAccumulated =
             APInt::getZero(DL.getIndexSizeInBits(ADDRESS_SPACE_PARAM));
 
@@ -431,26 +425,13 @@ static void adjustByValArgAlignment(Argument *Arg, Value *ArgInParamAS,
         assert(Offset != OffsetLimit && "Expect Offset less than UINT64_MAX");
 
         Worklist.push({I, Ctx.Offset + Offset});
-        continue;
       }
-
-      if (isa<MemTransferInst>(CurUser))
-        continue;
-
-      // supported for grid_constant
-      if (IsGridConstant &&
-          (isa<CallInst>(CurUser) || isa<StoreInst>(CurUser) ||
-           isa<PtrToIntInst>(CurUser)))
-        continue;
-
-      llvm_unreachable("All users must be one of: load, "
-                       "bitcast, getelementptr, call, store, ptrtoint");
     }
   }
 
   for (Load &CurLoad : Loads) {
-    Align NewLoadAlign(std::gcd(NewArgAlign, CurLoad.Offset));
-    Align CurLoadAlign(CurLoad.Inst->getAlign());
+    Align NewLoadAlign(std::gcd(NewArgAlign.value(), CurLoad.Offset));
+    Align CurLoadAlign = CurLoad.Inst->getAlign();
     CurLoad.Inst->setAlignment(std::max(NewLoadAlign, CurLoadAlign));
   }
 }
@@ -641,7 +622,7 @@ void NVPTXLowerArgs::handleByValParam(const NVPTXTargetMachine &TM,
     copyByValParam(*Func, *Arg);
 }
 
-void NVPTXLowerArgs::markPointerAsGlobal(Value *Ptr) {
+static void markPointerAsAS(Value *Ptr, const unsigned AS) {
   if (Ptr->getType()->getPointerAddressSpace() != ADDRESS_SPACE_GENERIC)
     return;
 
@@ -658,7 +639,7 @@ void NVPTXLowerArgs::markPointerAsGlobal(Value *Ptr) {
   }
 
   Instruction *PtrInGlobal = new AddrSpaceCastInst(
-      Ptr, PointerType::get(Ptr->getContext(), ADDRESS_SPACE_GLOBAL),
+      Ptr, PointerType::get(Ptr->getContext(), AS),
       Ptr->getName(), InsertPt);
   Value *PtrInGeneric = new AddrSpaceCastInst(PtrInGlobal, Ptr->getType(),
                                               Ptr->getName(), InsertPt);
@@ -667,6 +648,10 @@ void NVPTXLowerArgs::markPointerAsGlobal(Value *Ptr) {
   PtrInGlobal->setOperand(0, Ptr);
 }
 
+void NVPTXLowerArgs::markPointerAsGlobal(Value *Ptr) {
+  markPointerAsAS(Ptr, ADDRESS_SPACE_GLOBAL);
+}
+
 // =============================================================================
 // Main function for this pass.
 // =============================================================================
@@ -724,9 +709,15 @@ bool NVPTXLowerArgs::runOnKernelFunction(const NVPTXTargetMachine &TM,
 bool NVPTXLowerArgs::runOnDeviceFunction(const NVPTXTargetMachine &TM,
                                          Function &F) {
   LLVM_DEBUG(dbgs() << "Lowering function args of " << F.getName() << "\n");
+
+  const auto *TLI =
+      cast<NVPTXTargetLowering>(TM.getSubtargetImpl()->getTargetLowering());
+
   for (Argument &Arg : F.args())
-    if (Arg.getType()->isPointerTy() && Arg.hasByValAttr())
-      handleByValParam(TM, &Arg);
+    if (Arg.getType()->isPointerTy() && Arg.hasByValAttr()) {
+      markPointerAsAS(&Arg, ADDRESS_SPACE_LOCAL);
+      adjustByValArgAlignment(&Arg, &Arg, TLI);
+    }
   return true;
 }
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index f2afa6fc20bfa..229fecf2d3b10 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -100,6 +100,7 @@ void initializeNVPTXLowerUnreachablePass(PassRegistry &);
 void initializeNVPTXCtorDtorLoweringLegacyPass(PassRegistry &);
 void initializeNVPTXLowerArgsPass(PassRegistry &);
 void initializeNVPTXProxyRegErasurePass(PassRegistry &);
+void initializeNVPTXForwardParamsPassPass(PassRegistry &);
 void initializeNVVMIntrRangePass(PassRegistry &);
 void initializeNVVMReflectPass(PassRegistry &);
 void initializeNVPTXAAWrapperPassPass(PassRegistry &);
@@ -127,6 +128,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeNVPTXTarget() {
   initializeNVPTXCtorDtorLoweringLegacyPass(PR);
   initializeNVPTXLowerAggrCopiesPass(PR);
   initializeNVPTXProxyRegErasurePass(PR);
+  initializeNVPTXForwardParamsPassPass(PR);
   initializeNVPTXDAGToDAGISelLegacyPass(PR);
   initializeNVPTXAAWrapperPassPass(PR);
   initializeNVPTXExternalAAWrapperPass(PR);
@@ -429,6 +431,7 @@ bool NVPTXPassConfig::addInstSelector() {
 }
 
 void NVPTXPassConfig::addPreRegAlloc() {
+  addPass(createNVPTXForwardParamsPass());
   // Remove Proxy Register pseudo instructions used to keep `callseq_end` alive.
   addPass(createNVPTXProxyRegErasurePass());
 }
diff --git a/llvm/test/CodeGen/NVPTX/forward-ld-param.ll b/llvm/test/CodeGen/NVPTX/forward-ld-param.ll
new file mode 100644
index 0000000000000..c4e56d197edc0
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/forward-ld-param.ll
@@ -0,0 +1,142 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+target triple = "nvptx64-nvidia-cuda"
+
+define i32 @test_ld_param_const(ptr byval(i32) %a) {
+; CHECK-LABEL: test_ld_param_const(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .b64 %rd<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.u32 %r1, [test_ld_param_const_param_0+4];
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %p2 = getelementptr i32, ptr %a, i32 1
+  %ld = load i32, ptr %p2
+  ret i32 %ld
+}
+
+define i32 @test_ld_param_non_const(ptr byval([10 x i32]) %a, i32 %b) {
+; CHECK-LABEL: test_ld_param_non_const(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .b64 %rd<6>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    mov.b64 %rd1, test_ld_param_non_const_param_0;
+; CHECK-NEXT:    cvta.local.u64 %rd2, %rd1;
+; CHECK-NEXT:    cvta.to.local.u64 %rd3, %rd2;
+; CHECK-NEXT:    ld.param.s32 %rd4, [test_ld_param_non_const_param_1];
+; CHECK-NEXT:    add.s64 %rd5, %rd3, %rd4;
+; CHECK-NEXT:    ld.local.u32 %r1, [%rd5];
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %p2 = getelementptr i8, ptr %a, i32 %b
+  %ld = load i32, ptr %p2
+  ret i32 %ld
+}
+
+declare void @escape(ptr)
+declare void @byval_user(ptr byval(i32))
+
+define void @test_ld_param_e...
[truncated]

Comment on lines 29 to 30
; CHECK-NEXT: cvta.local.u64 %rd2, %rd1;
; CHECK-NEXT: cvta.to.local.u64 %rd3, %rd2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: once #129157 lands this will be cleaned up.

Copy link

github-actions bot commented Feb 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream/param-alloca branch from e880f61 to 3d04601 Compare February 28, 2025 16:28
@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream/param-alloca branch from 3d04601 to 6a66526 Compare February 28, 2025 18:21
@Artem-B
Copy link
Member

Artem-B commented Feb 28, 2025

Unfortunately the criteria for determining which case is possible are not correct, leading to miscompilations (https://godbolt.org/z/Gq1fP7a3G).

Can you elaborate on what exactly is incorrect about the example?
AFAICT, the code is still valid, even if ptxas itself has to make a copy: https://godbolt.org/z/ffd3d3G6z

My understanding is that ld.param from the parameter address is still legal and that we assumed that ptxas is smart enough to avoid local copies. Even if it does not, the code is still valid, even if it may be suboptimal.

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. The less local storage we drag around, the better. LGTM.

Comment on lines 414 to 416
} else if (isa<BitCastInst>(CurUser) || isa<AddrSpaceCastInst>(CurUser)) {
Worklist.push({cast<Instruction>(CurUser), Ctx.Offset});
} else if (auto *I = dyn_cast<GetElementPtrInst>(CurUser)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that all continue are gone, we don't need {} around single-statement blocks either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

; PTX-NEXT: add.u64 %rd2, %SPL, 0;
; PTX-NEXT: ld.param.s32 %rd3, [non_kernel_function_param_2];
; PTX-NEXT: ld.param.u64 %rd4, [non_kernel_function_param_0+8];
; PTX-NEXT: st.local.u64 [%rd2+8], %rd4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think elimination of this unnecessary local store will be another important effect of this change. Nice.

@akshayrdeodhar
Copy link
Contributor

LGTM!

@akshayrdeodhar
Copy link
Contributor

akshayrdeodhar commented Feb 28, 2025

Unfortunately the criteria for determining which case is possible are not correct, leading to miscompilations (https://godbolt.org/z/Gq1fP7a3G).

Can you elaborate on what exactly is incorrect about the example? AFAICT, the code is still valid, even if ptxas itself has to make a copy: https://godbolt.org/z/ffd3d3G6z

My understanding is that ld.param from the parameter address is still legal and that we assumed that ptxas is smart enough to avoid local copies. Even if it does not, the code is still valid, even if it may be suboptimal.

@Artem-B - from the PTX docs:

Aside from passing structures by value, .param space is also required whenever a formal parameter has its address taken within the called function. In PTX, the address of a function input parameter may be moved into a register using the mov instruction. Note that the parameter will be copied to the stack if necessary, and so the address will be in the .local state space and is accessed via ld.local and st.local instructions. It is not possible to use mov to get the address of or a locally-scoped .param space variable. Starting PTX ISA version 6.0, it is possible to use mov instruction to get address of return parameter of device function.

The example uses ld.local

@Artem-B
Copy link
Member

Artem-B commented Feb 28, 2025

Counter-example, where PTX docs allow mov + ld.param (though only for kernel arguments): https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-function-parameters

The address of a kernel parameter may be moved into a register using the mov instruction. The resulting address is in the .param state space and is accessed using ld.param instructions.

Example

.entry bar ( .param .b32 len )
{
    .reg .u32 %ptr, %n;

    mov.u32      %ptr, len;
    ld.param.u32 %n, [%ptr];
    ...

And indeed, mov+ld.param for kernels does work w/o generating a local copy: https://godbolt.org/z/7E4Tf5zrE

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream/param-alloca branch from 6a66526 to 1a4100e Compare February 28, 2025 20:47
@AlexMaclean
Copy link
Member Author

Counter-example, where PTX docs allow mov + ld.param (though only for kernel arguments): https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-function-parameters

The address of a kernel parameter may be moved into a register using the mov instruction. The resulting address is in the .param state space and is accessed using ld.param instructions.

Example

.entry bar ( .param .b32 len )
{
    .reg .u32 %ptr, %n;

    mov.u32      %ptr, len;
    ld.param.u32 %n, [%ptr];
    ...

And indeed, mov+ld.param for kernels does work w/o generating a local copy: https://godbolt.org/z/7E4Tf5zrE

While they can look similar in simple cases, kernel and device parameters have very different behavior and semantics in PTX. A mov of a kernel parameter is just a mov, but with a device parameter it carries lots of implications. This change is only intended to address problems with device parameters and will leave kernel parameter handling unchanged.

@AlexMaclean AlexMaclean merged commit 0065343 into llvm:main Feb 28, 2025
11 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Feb 28, 2025

LLVM Buildbot has detected a new failure on builder lldb-x86_64-debian running on lldb-x86_64-debian while building llvm at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/162/builds/17107

Here is the relevant piece of the build log for the reference
Step 6 (test) failure: build (failure)
...
PASS: lldb-unit :: Utility/./UtilityTests/62/127 (2745 of 2789)
PASS: lldb-unit :: Utility/./UtilityTests/78/127 (2746 of 2789)
PASS: lldb-unit :: Utility/./UtilityTests/79/127 (2747 of 2789)
PASS: lldb-shell :: Unwind/basic-block-sections.test (2748 of 2789)
PASS: lldb-shell :: Subprocess/vfork-follow-parent-wp.test (2749 of 2789)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteLaunch.py (2750 of 2789)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteThreadsInStopReply.py (2751 of 2789)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteProcessInfo.py (2752 of 2789)
PASS: lldb-api :: tools/lldb-dap/variables/TestDAP_variables.py (2753 of 2789)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteExpeditedRegisters.py (2754 of 2789)
FAIL: lldb-api :: tools/lldb-dap/launch/TestDAP_launch.py (2755 of 2789)
******************** TEST 'lldb-api :: tools/lldb-dap/launch/TestDAP_launch.py' FAILED ********************
Script:
--
/usr/bin/python3 /home/worker/2.0.1/lldb-x86_64-debian/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/./lib --env LLVM_INCLUDE_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/include --env LLVM_TOOLS_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/./bin --arch x86_64 --build-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex --lldb-module-cache-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/lldb --compiler /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/clang --dsymutil /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/worker/2.0.1/lldb-x86_64-debian/build/./bin --lldb-obj-root /home/worker/2.0.1/lldb-x86_64-debian/build/tools/lldb --lldb-libs-dir /home/worker/2.0.1/lldb-x86_64-debian/build/./lib -t /home/worker/2.0.1/lldb-x86_64-debian/llvm-project/lldb/test/API/tools/lldb-dap/launch -p TestDAP_launch.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision 006534315972728390d82fc8381c9ab1bf6e6490)
  clang revision 006534315972728390d82fc8381c9ab1bf6e6490
  llvm revision 006534315972728390d82fc8381c9ab1bf6e6490
Skipping the following test categories: ['libc++', 'dsym', 'gmodules', 'debugserver', 'objc']
========= DEBUG ADAPTER PROTOCOL LOGS =========
1740781714.427150965 stdin/stdout --> 
Content-Length: 344

{
  "arguments": {
    "adapterID": "lldb-native",
    "clientID": "vscode",
    "columnsStartAt1": true,
    "linesStartAt1": true,
    "locale": "en-us",
    "pathFormat": "path",
    "sourceInitFile": false,
    "supportsRunInTerminalRequest": true,
    "supportsStartDebuggingRequest": true,
    "supportsVariablePaging": true,
    "supportsVariableType": true
  },
  "command": "initialize",
  "seq": 1,
  "type": "request"
}
1740781714.429972410 stdin/stdout <-- 
Content-Length: 1631


@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 1, 2025

LLVM Buildbot has detected a new failure on builder clang-ppc64le-linux-multistage running on ppc64le-clang-multistage-test while building llvm at step 5 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/76/builds/7371

Here is the relevant piece of the build log for the reference
Step 5 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/reduction.test' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 1: /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/stage1/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100 |     /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/stage1/bin/FileCheck /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/reduction.test
+ /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/stage1/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100
+ /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/stage1/bin/FileCheck /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/reduction.test
/home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/reduction.test:7:14: error: CHECK-NOT: excluded string found in input
# CHECK-NOT: V[[REG:[0-9]+]] V[[REG]] V{{[0-9]+}}M8 V[[REG]]
             ^
<stdin>:5:31: note: found here
 - 'PseudoVWREDSUMU_VS_M8_E32 V26 V26 V24M8 V26 i_0xffffffffffffffff i_0x5 i_0x0'
                              ^~~~~~~~~~~~~~~~~
<stdin>:5:32: note: captured var "REG"
 - 'PseudoVWREDSUMU_VS_M8_E32 V26 V26 V24M8 V26 i_0xffffffffffffffff i_0x5 i_0x0'
                               ^~

Input file: <stdin>
Check file: /home/buildbots/llvm-external-buildbots/workers/ppc64le-clang-multistage-test/clang-ppc64le-multistage/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/reduction.test

-dump-input=help explains the following input dump.

Input was:
<<<<<<
         1: --- 
         2: mode: latency 
         3: key: 
         4:  instructions: 
         5:  - 'PseudoVWREDSUMU_VS_M8_E32 V26 V26 V24M8 V26 i_0xffffffffffffffff i_0x5 i_0x0' 
not:7'0                                   !~~~~~~~~~~~~~~~~                                    error: no match expected
not:7'1                                    !~                                                  captured var "REG"
         6:  config: 'vtype = {AVL: VLMAX, SEW: e32, Policy: tu/mu}' 
         7:  register_initial_values: 
         8:  - 'V26=0x0' 
         9:  - 'V24M8=0x0' 
        10: cpu_name: sifive-p670 
         .
         .
         .
>>>>>>

--

********************


@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 1, 2025

LLVM Buildbot has detected a new failure on builder clang-armv7-global-isel running on linaro-clang-armv7-global-isel while building llvm at step 7 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/39/builds/4774

Here is the relevant piece of the build log for the reference
Step 7 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/filter.test' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
RUN: at line 1: /home/tcwg-buildbot/worker/clang-armv7-global-isel/stage1/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK     --riscv-filter-config='vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=100 | /home/tcwg-buildbot/worker/clang-armv7-global-isel/stage1/bin/FileCheck /home/tcwg-buildbot/worker/clang-armv7-global-isel/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
+ /home/tcwg-buildbot/worker/clang-armv7-global-isel/stage1/bin/FileCheck /home/tcwg-buildbot/worker/clang-armv7-global-isel/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
+ /home/tcwg-buildbot/worker/clang-armv7-global-isel/stage1/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK '--riscv-filter-config=vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=100
PseudoVNCLIPU_WX_M1_MASK: Failed to produce any snippet via: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, for uses, one unique register for each position
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/tcwg-buildbot/worker/clang-armv7-global-isel/stage1/bin/FileCheck /home/tcwg-buildbot/worker/clang-armv7-global-isel/llvm/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test

--

********************


@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 1, 2025

LLVM Buildbot has detected a new failure on builder clang-x64-windows-msvc running on windows-gcebot2 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/63/builds/4359

Here is the relevant piece of the build log for the reference
Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/clang-windows.py ...' (failure)
...
  Passed           : 46095 (99.01%)
  Expectedly Failed:    39 (0.08%)
[1324/1325] Running the LLVM regression tests
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:57: note: using lit tools: C:\Program Files\Git\usr\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using ld.lld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\ld.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using lld-link: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\lld-link.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using ld64.lld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\ld64.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using wasm-ld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\wasm-ld.exe
-- Testing: 57688 tests, 32 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM :: tools/llvm-exegesis/RISCV/rvv/reduction.test (54088 of 57688)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/reduction.test' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\llvm-exegesis.exe -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100 |     c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\llvm-exegesis.exe' -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test'
# .---command stderr------------
# | C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test:7:14: error: CHECK-NOT: excluded string found in input
# | # CHECK-NOT: V[[REG:[0-9]+]] V[[REG]] V{{[0-9]+}}M8 V[[REG]]
# |              ^
# | <stdin>:5:31: note: found here
# |  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0'
# |                               ^~~~~~~~~~~~~~~~~
# | <stdin>:5:32: note: captured var "REG"
# |  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0'
# |                                ^~
# | 
# | Input file: <stdin>
# | Check file: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |          1: --- 
# |          2: mode: latency 
# |          3: key: 
# |          4:  instructions: 
# |          5:  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0' 
# | not:7'0                                   !~~~~~~~~~~~~~~~~                                    error: no match expected
# | not:7'1                                    !~                                                  captured var "REG"
# |          6:  config: 'vtype = {AVL: VLMAX, SEW: e32, Policy: tu/mu}' 
# |          7:  register_initial_values: 
# |          8:  - 'V16=0x0' 
# |          9:  - 'V16M8=0x0' 
Step 12 (stage 2 check) failure: stage 2 check (failure)
...
  Passed           : 46095 (99.01%)
  Expectedly Failed:    39 (0.08%)
[1324/1325] Running the LLVM regression tests
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:57: note: using lit tools: C:\Program Files\Git\usr\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using ld.lld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\ld.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using lld-link: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\lld-link.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using ld64.lld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\ld64.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:512: note: using wasm-ld: c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\wasm-ld.exe
-- Testing: 57688 tests, 32 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM :: tools/llvm-exegesis/RISCV/rvv/reduction.test (54088 of 57688)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/reduction.test' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\llvm-exegesis.exe -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100 |     c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\llvm-exegesis.exe' -mtriple=riscv64 -mcpu=sifive-p670 -benchmark-phase=assemble-measured-code --mode=latency --opcode-name=PseudoVWREDSUMU_VS_M8_E32 --min-instructions=100
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage2\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test'
# .---command stderr------------
# | C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test:7:14: error: CHECK-NOT: excluded string found in input
# | # CHECK-NOT: V[[REG:[0-9]+]] V[[REG]] V{{[0-9]+}}M8 V[[REG]]
# |              ^
# | <stdin>:5:31: note: found here
# |  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0'
# |                               ^~~~~~~~~~~~~~~~~
# | <stdin>:5:32: note: captured var "REG"
# |  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0'
# |                                ^~
# | 
# | Input file: <stdin>
# | Check file: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\test\tools\llvm-exegesis\RISCV\rvv\reduction.test
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |          1: --- 
# |          2: mode: latency 
# |          3: key: 
# |          4:  instructions: 
# |          5:  - 'PseudoVWREDSUMU_VS_M8_E32 V16 V16 V16M8 V16 i_0xffffffffffffffff i_0x5 i_0x0' 
# | not:7'0                                   !~~~~~~~~~~~~~~~~                                    error: no match expected
# | not:7'1                                    !~                                                  captured var "REG"
# |          6:  config: 'vtype = {AVL: VLMAX, SEW: e32, Policy: tu/mu}' 
# |          7:  register_initial_values: 
# |          8:  - 'V16=0x0' 
# |          9:  - 'V16M8=0x0' 

@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 1, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/2384

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/38/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-11224-38-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=38 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
PTX supports 2 methods of accessing device function parameters:

- "simple" case: If a parameters is only loaded, and all loads can
address the parameter via a constant offset, then the parameter may be
loaded via the ".param" address space. This case is not possible if the
parameters is stored to or has it's address taken. This method is
preferable when possible.

- "move param" case: For more complex cases the address of the param may
be placed in a register via a "mov" instruction. This mov also
implicitly moves the param to the ".local" address space and allows for
it to be written to. This essentially defers the responsibilty of the
byval copy to the PTX calling convention.

The handling of these cases in the NVPTX backend for byval pointers has
some major issues. We currently attempt to determine if a copy is
necessary in NVPTXLowerArgs and either explicitly make an additional
copy in the IR, or insert "addrspacecast" to move the param to the param
address space. Unfortunately the criteria for determining which case is
possible are not correct, leading to miscompilations
(https://godbolt.org/z/Gq1fP7a3G). Further, the criteria for the
"simple" case aren't enforceable in LLVM IR across other transformations
and instruction selection, making deciding between the 2 cases in
NVPTXLowerArgs brittle and buggy.

This patch aims to fix these issues and improve address space related
optimization. In NVPTXLowerArgs, we conservatively assume that all
parameters will use the "move param" case and the local address space.
Responsibility for switching to the "simple" case is given to a new
MachineIR pass, NVPTXForwardParams, which runs once it has become clear
whether or not this is possible. This ensures that the correct address
space is known for the "move param" case allowing for optimization,
while still using the "simple" case where ever possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants