
AMDGPU/GlobalISel: AMDGPURegBankSelect #112863


Merged: 1 commit, Jan 24, 2025

Conversation

@petar-avramovic (Collaborator) commented Oct 18, 2024

Assign register banks to virtual registers. This pass does not use the
generic RegBankSelect. After register bank selection, all register
operands of G_ instructions have an LLT and a register bank exclusively;
if an operand had a register class, it is reassigned to the appropriate
register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1, values
Vcc - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider which instructions are available,
so in some cases G_ instructions with a given register bank assignment
can't be instruction-selected. This is solved in RegBankLegalize.

Exceptions where uniformity analysis does not work as needed (a condensed
sketch of the resulting rule follows below):

S32/S64 lane masks:

  • need to end up with the sgpr register class after instruction selection
  • in most cases uniformity analysis declares them uniform (forced by
    tablegen), resulting in the sgpr S32/S64 reg bank
  • when uniformity analysis declares them divergent (some phis), use the
    intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:

  • COPY to vgpr with an implicit use of $exec inside the cycle
  • this copy is declared uniform by uniformity analysis
  • make sure the assigned bank is vgpr
    Note: uniformity analysis does not consider registers with a vgpr def
    to be divergent (a vgpr can hold a uniform value).
  • TODO: an implicit use of $exec could be implemented as an indicator
    that an instruction is divergent
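
The bank assignment described above reduces to a small decision rule. Below
is a minimal C++ sketch of that rule, condensed from the pass code in this
patch; chooseRegBank is a hypothetical helper name, and MUI, ILMA, MRI and
RBI stand for the uniformity info, lane mask analyzer, register info and
register bank info the pass already holds:

    // Hypothetical helper condensing the rule; not the literal pass code.
    const RegisterBank &chooseRegBank(Register Reg) {
      if (MUI.isUniform(Reg) || ILMA.isS32S64LaneMask(Reg))
        return RBI.getRegBank(AMDGPU::SGPRRegBankID); // uniform or lane mask
      if (MRI.getType(Reg) == LLT::scalar(1))
        return RBI.getRegBank(AMDGPU::VCCRegBankID);  // divergent S1
      return RBI.getRegBank(AMDGPU::VGPRRegBankID);   // divergent non-S1
    }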


@llvmbot (Member) commented Oct 18, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: Petar Avramovic (petar-avramovic)

Changes

Assign register banks to virtual registers.
Defs and uses of G_ instructions have register banks exclusively;
if they had a register class, reassign to the appropriate register bank.

Assign register banks using machine uniformity analysis:
SGPR - uniform values and some lane masks
VGPR - divergent, non-S1, values
VCC - divergent S1 values (lane masks)

RBSelect does not consider which instructions are available, so in some
cases G_ instructions with a given register bank assignment can't be
instruction-selected. This is solved in RBLegalize.

Exceptions where uniformity analysis does not work as needed:

S32/S64 lane masks:

  • need to end up with the SGPR register class after instruction selection
  • in most cases uniformity analysis declares them uniform (forced by
    tablegen), resulting in the sgpr S32/S64 reg bank
  • when uniformity analysis declares them divergent (some phis), use the
    intrinsic lane mask analyzer to still assign the sgpr register bank

Temporal divergence copy:

  • COPY to vgpr with an implicit use of $exec inside the cycle
  • this copy is declared uniform by uniformity analysis
  • make sure the assigned bank is vgpr
    Note: uniformity analysis does not consider registers with a vgpr def
    to be divergent (a vgpr can hold a uniform value).
  • TODO: an implicit use of $exec could be implemented as an indicator
    that an instruction is divergent

Patch is 118.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112863.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp (+38)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp (+193-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir (+480-458)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-select.mir (+238-227)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
index a98d4488bf77fe..6f6ad5cf82cae1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.cpp
@@ -7,13 +7,16 @@
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
 #include "GCNSubtarget.h"
 #include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGenTypes/LowLevelType.h"
 #include "llvm/IR/Constants.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
 
 using namespace llvm;
+using namespace AMDGPU;
 using namespace MIPatternMatch;
 
 std::pair<Register, unsigned>
@@ -69,3 +72,38 @@ AMDGPU::getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
 
   return std::pair(Reg, 0);
 }
+
+IntrinsicLaneMaskAnalyzer::IntrinsicLaneMaskAnalyzer(MachineFunction &MF)
+    : MRI(MF.getRegInfo()) {
+  initLaneMaskIntrinsics(MF);
+}
+
+bool IntrinsicLaneMaskAnalyzer::isS32S64LaneMask(Register Reg) {
+  return S32S64LaneMask.contains(Reg);
+}
+
+void IntrinsicLaneMaskAnalyzer::initLaneMaskIntrinsics(MachineFunction &MF) {
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::G_INTRINSIC &&
+          MI.getOperand(MI.getNumExplicitDefs()).getIntrinsicID() ==
+              Intrinsic::amdgcn_if_break) {
+        S32S64LaneMask.insert(MI.getOperand(3).getReg());
+        findLCSSAPhi(MI.getOperand(0).getReg());
+      }
+
+      if (MI.getOpcode() == AMDGPU::SI_IF ||
+          MI.getOpcode() == AMDGPU::SI_ELSE) {
+        findLCSSAPhi(MI.getOperand(0).getReg());
+      }
+    }
+  }
+}
+
+void IntrinsicLaneMaskAnalyzer::findLCSSAPhi(Register Reg) {
+  S32S64LaneMask.insert(Reg);
+  for (auto &LCSSAPhi : MRI.use_instructions(Reg)) {
+    if (LCSSAPhi.isPHI())
+      S32S64LaneMask.insert(LCSSAPhi.getOperand(0).getReg());
+  }
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
index 5972552b9a4fe8..4d504d0204d81a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelUtils.h
@@ -9,6 +9,8 @@
 #ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
 #define LLVM_LIB_TARGET_AMDGPU_AMDGPUGLOBALISELUTILS_H
 
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/Register.h"
 #include <utility>
 
@@ -26,6 +28,26 @@ std::pair<Register, unsigned>
 getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg,
                           GISelKnownBits *KnownBits = nullptr,
                           bool CheckNUW = false);
+
+// Currently finds S32/S64 lane masks that can be declared as divergent by
+// uniformity analysis (all are phis at the moment).
+// These are defined as i32/i64 in some IR intrinsics (not as i1).
+// Tablegen forces most S32/S64 lane masks to be uniform (by declaring the
+// lane mask IR intrinsics uniform), since this results in them ending up
+// with the sgpr reg class after instruction-select; we don't search for
+// all of them.
+class IntrinsicLaneMaskAnalyzer {
+  DenseSet<Register> S32S64LaneMask;
+  MachineRegisterInfo &MRI;
+
+public:
+  IntrinsicLaneMaskAnalyzer(MachineFunction &MF);
+  bool isS32S64LaneMask(Register Reg);
+
+private:
+  void initLaneMaskIntrinsics(MachineFunction &MF);
+  // This will not be needed when we turn off LCSSA for global-isel.
+  void findLCSSAPhi(Register Reg);
+};
 }
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
index c53a68ff72a8ad..905ad432fe6e0d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURBSelect.cpp
@@ -16,7 +16,12 @@
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPU.h"
+#include "AMDGPUGlobalISelUtils.h"
+#include "AMDGPURegisterBankInfo.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
 #include "llvm/InitializePasses.h"
 
 #define DEBUG_TYPE "rb-select"
@@ -39,6 +44,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
   StringRef getPassName() const override { return "AMDGPU RB select"; }
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineUniformityAnalysisPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -54,6 +60,7 @@ class AMDGPURBSelect : public MachineFunctionPass {
 
 INITIALIZE_PASS_BEGIN(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
                       false)
+INITIALIZE_PASS_DEPENDENCY(MachineUniformityAnalysisPass)
 INITIALIZE_PASS_END(AMDGPURBSelect, DEBUG_TYPE, "AMDGPU RB select", false,
                     false)
 
@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;
 
 FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }
 
-bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
+bool shouldRBSelect(MachineInstr &MI) {
+  if (isTargetSpecificOpcode(MI.getOpcode()) && !MI.isPreISelOpcode())
+    return false;
+
+  if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
+    return false;
+
+  if (MI.isInlineAsm())
+    return false;
+
+  return true;
+}
+
+void setRB(MachineInstr &MI, MachineOperand &DefOP, MachineIRBuilder B,
+           MachineRegisterInfo &MRI, const RegisterBank &RB) {
+  Register Reg = DefOP.getReg();
+  // A register that already has a register class got it during pre-inst
+  // selection of another instruction. Maybe a cross-bank copy was required,
+  // so we insert a copy that can be removed later. This simplifies the
+  // post-rb-legalize artifact combiner and avoids the need to special-case
+  // some patterns.
+  if (MRI.getRegClassOrNull(Reg)) {
+    LLT Ty = MRI.getType(Reg);
+    Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+    DefOP.setReg(NewReg);
+
+    auto &MBB = *MI.getParent();
+    B.setInsertPt(MBB, MI.isPHI() ? MBB.getFirstNonPHI()
+                                  : std::next(MI.getIterator()));
+    B.buildCopy(Reg, NewReg);
+
+    // The problem was discovered for a uniform S1 that was used as both a
+    // lane mask (vcc) and a regular sgpr S1.
+    // - The lane-mask (vcc) use was by si_if; this use is divergent and
+    //   requires a non-trivial sgpr-S1-to-vcc copy. But pre-inst-selection
+    //   of si_if sets sreg_64_xexec(S1) on the def of the uniform S1,
+    //   making it a lane mask.
+    // - The regular sgpr S1 (uniform) instruction is now broken since it
+    //   uses sreg_64_xexec(S1), which is divergent.
+
+    // "Clear" reg classes from uses on generic instructions and set
+    // register banks instead.
+    for (auto &UseMI : MRI.use_instructions(Reg)) {
+      if (shouldRBSelect(UseMI)) {
+        for (MachineOperand &Op : UseMI.operands()) {
+          if (Op.isReg() && Op.isUse() && Op.getReg() == Reg)
+            Op.setReg(NewReg);
+        }
+      }
+    }
+
+  } else {
+    MRI.setRegBank(Reg, RB);
+  }
+}
+
+void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
+              MachineRegisterInfo &MRI, const RegisterBank &RB) {
+  Register Reg = UseOP.getReg();
+
+  LLT Ty = MRI.getType(Reg);
+  Register NewReg = MRI.createVirtualRegister({&RB, Ty});
+  UseOP.setReg(NewReg);
+
+  if (MI.isPHI()) {
+    auto DefMI = MRI.getVRegDef(Reg)->getIterator();
+    MachineBasicBlock *DefMBB = DefMI->getParent();
+    B.setInsertPt(*DefMBB, DefMBB->SkipPHIsAndLabels(std::next(DefMI)));
+  } else {
+    B.setInstr(MI);
+  }
+
+  B.buildCopy(NewReg, Reg);
+}
+
+// Temporal divergence copy: COPY to vgpr with implicit use of $exec inside of
+// the cycle
+// Note: uniformity analysis does not consider that registers with vgpr def are
+// divergent (you can have uniform value in vgpr).
+// - TODO: implicit use of $exec could be implemented as indicator that
+//   instruction is divergent
+bool isTemporalDivergenceCopy(Register Reg, MachineRegisterInfo &MRI) {
+  MachineInstr *MI = MRI.getVRegDef(Reg);
+  if (MI->getOpcode() == AMDGPU::COPY) {
+    for (auto Op : MI->implicit_operands()) {
+      if (!Op.isReg())
+        continue;
+      Register Reg = Op.getReg();
+      if (Reg == AMDGPU::EXEC) {
+        return true;
+      }
+    }
+  }
+
+  return false;
+}
+
+Register getVReg(MachineOperand &Op) {
+  if (!Op.isReg())
+    return 0;
+
+  Register Reg = Op.getReg();
+  if (!Reg.isVirtual())
+    return 0;
+
+  return Reg;
+}
+
+bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) {
+  MachineUniformityInfo &MUI =
+      getAnalysis<MachineUniformityAnalysisPass>().getUniformityInfo();
+  AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(MF);
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const RegisterBankInfo &RBI = *MF.getSubtarget().getRegBankInfo();
+
+  MachineIRBuilder B(MF);
+
+  // Assign register banks to ALL def registers on G_ instructions.
+  // Same for copies if they have no register bank or class on def.
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      if (!shouldRBSelect(MI))
+        continue;
+
+      for (MachineOperand &DefOP : MI.defs()) {
+        Register DefReg = getVReg(DefOP);
+        if (!DefReg)
+          continue;
+
+        // Copies can have register class on def registers.
+        if (MI.isCopy() && MRI.getRegClassOrNull(DefReg)) {
+          continue;
+        }
+
+        if (MUI.isUniform(DefReg) || ILMA.isS32S64LaneMask(DefReg)) {
+          setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+        } else {
+          if (MRI.getType(DefReg) == LLT::scalar(1))
+            setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+          else
+            setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+        }
+      }
+    }
+  }
+
+  // At this point all virtual registers have register class or bank
+  // - Defs of G_ instructions have register banks.
+  // - Defs and uses of inst-selected instructions have register class.
+  // - Defs and uses of copies can have either register class or bank
+  // and most notably
+  // - Uses of G_ instructions can have either register class or bank
+
+  // Reassign uses of G_ instructions to only have register banks.
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      if (!shouldRBSelect(MI))
+        continue;
+
+      // Copies can have register class on use registers.
+      if (MI.isCopy())
+        continue;
+
+      for (MachineOperand &UseOP : MI.uses()) {
+        Register UseReg = getVReg(UseOP);
+        if (!UseReg)
+          continue;
+
+        if (!MRI.getRegClassOrNull(UseReg))
+          continue;
+
+        if (!isTemporalDivergenceCopy(UseReg, MRI) &&
+            (MUI.isUniform(UseReg) || ILMA.isS32S64LaneMask(UseReg))) {
+          setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
+        } else {
+          if (MRI.getType(UseReg) == LLT::scalar(1))
+            setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
+          else
+            setRBUse(MI, UseOP, B, MRI, RBI.getRegBank(AMDGPU::VGPRRegBankID));
+        }
+      }
+    }
+  }
+
+  // Defs and uses of G_ instructions have register banks exclusively.
+
+  return true;
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
index 880057813adf54..208bf686c98ba8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-rb-legalize.mir
@@ -11,22 +11,22 @@ body: |
     ; CHECK-LABEL: name: uniform_in_vgpr
     ; CHECK: liveins: $sgpr0, $sgpr1, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
-    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[COPY]](s32)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY1]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[COPY]](s32)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY1]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %0:_(s32) = COPY $sgpr0
-    %1:_(s32) = COPY $sgpr1
-    %3:_(s32) = COPY $vgpr0
-    %4:_(s32) = COPY $vgpr1
-    %2:_(p1) = G_MERGE_VALUES %3(s32), %4(s32)
-    %6:_(s32) = G_FPTOUI %0(s32)
-    %7:_(s32) = G_ADD %6, %1
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %3:vgpr(s32) = COPY $vgpr0
+    %4:vgpr(s32) = COPY $vgpr1
+    %2:vgpr(p1) = G_MERGE_VALUES %3(s32), %4(s32)
+    %6:sgpr(s32) = G_FPTOUI %0(s32)
+    %7:sgpr(s32) = G_ADD %6, %1
     G_STORE %7(s32), %2(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -41,26 +41,26 @@ body: |
     ; CHECK-LABEL: name: back_to_back_uniform_in_vgpr
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
-    ; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[COPY1]]
-    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:_(s32) = G_FPTOUI [[FADD]](s32)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[FPTOUI]], [[COPY2]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY3]](s32), [[COPY4]](s32)
+    ; CHECK-NEXT: [[FADD:%[0-9]+]]:sgpr(s32) = G_FADD [[COPY]], [[COPY1]]
+    ; CHECK-NEXT: [[FPTOUI:%[0-9]+]]:sgpr(s32) = G_FPTOUI [[FADD]](s32)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[FPTOUI]], [[COPY2]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %0:_(s32) = COPY $sgpr0
-    %1:_(s32) = COPY $sgpr1
-    %2:_(s32) = COPY $sgpr2
-    %4:_(s32) = COPY $vgpr0
-    %5:_(s32) = COPY $vgpr1
-    %3:_(p1) = G_MERGE_VALUES %4(s32), %5(s32)
-    %7:_(s32) = G_FADD %0, %1
-    %8:_(s32) = G_FPTOUI %7(s32)
-    %9:_(s32) = G_ADD %8, %2
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %2:sgpr(s32) = COPY $sgpr2
+    %4:vgpr(s32) = COPY $vgpr0
+    %5:vgpr(s32) = COPY $vgpr1
+    %3:vgpr(p1) = G_MERGE_VALUES %4(s32), %5(s32)
+    %7:sgpr(s32) = G_FADD %0, %1
+    %8:sgpr(s32) = G_FPTOUI %7(s32)
+    %9:sgpr(s32) = G_ADD %8, %2
     G_STORE %9(s32), %3(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -75,36 +75,36 @@ body: |
     ; CHECK-LABEL: name: buffer_load_uniform
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4, $vgpr0, $vgpr1
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
-    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
-    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $sgpr4
-    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr0
-    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:_(s32) = COPY $vgpr1
-    ; CHECK-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
-    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
-    ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
-    ; CHECK-NEXT: [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32), [[UV3:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
-    ; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[UV1]], [[C1]]
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr(s32) = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr(s32) = COPY $sgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:sgpr(s32) = COPY $sgpr2
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:sgpr(s32) = COPY $sgpr3
+    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32), [[COPY3]](s32)
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:sgpr(s32) = COPY $sgpr4
+    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[MV:%[0-9]+]]:vgpr(p1) = G_MERGE_VALUES [[COPY5]](s32), [[COPY6]](s32)
+    ; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 0
+    ; CHECK-NEXT: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C]](s32), [[COPY4]], [[C]], 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+    ; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 1
+    ; CHECK-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32), [[UV2:%[0-9]+]]:sgpr(s32), [[UV3:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[AMDGPU_BUFFER_LOAD]](<4 x s32>)
+    ; CHECK-NEXT: [[ADD:%[0-9]+]]:sgpr(s32) = G_ADD [[UV1]], [[C1]]
     ; CHECK-NEXT: G_STORE [[ADD]](s32), [[MV]](p1) :: (store (s32), addrspace 1)
     ; CHECK-NEXT: S_ENDPGM 0
-    %3:_(s32) = COPY $sgpr0
-    %4:_(s32) = COPY $sgpr1
-    %5:_(s32) = COPY $sgpr2
-    %6:_(s32) = COPY $sgpr3
-    %0:_(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
-    %1:_(s32) = COPY $sgpr4
-    %7:_(s32) = COPY $vgpr0
-    %8:_(s32) = COPY $vgpr1
-    %2:_(p1) = G_MERGE_VALUES %7(s32), %8(s32)
-    %11:_(s32) = G_CONSTANT i32 0
-    %10:_(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
-    %13:_(s32) = G_CONSTANT i32 1
-    %15:_(s32), %16:_(s32), %17:_(s32), %18:_(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
-    %14:_(s32) = G_ADD %16, %13
+    %3:sgpr(s32) = COPY $sgpr0
+    %4:sgpr(s32) = COPY $sgpr1
+    %5:sgpr(s32) = COPY $sgpr2
+    %6:sgpr(s32) = COPY $sgpr3
+    %0:sgpr(<4 x s32>) = G_BUILD_VECTOR %3(s32), %4(s32), %5(s32), %6(s32)
+    %1:sgpr(s32) = COPY $sgpr4
+    %7:vgpr(s32) = COPY $vgpr0
+    %8:vgpr(s32) = COPY $vgpr1
+    %2:vgpr(p1) = G_MERGE_VALUES %7(s32), %8(s32)
+    %11:sgpr(s32) = G_CONSTANT i32 0
+    %10:sgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD %0(<4 x s32>), %11(s32), %1, %11, 0, 0, 0 :: (dereferenceable load (<4 x s32>), align 1, addrspace 8)
+    %13:sgpr(s32) = G_CONSTANT i32 1
+    %15:sgpr(s32), %16:sgpr(s32), %17:sgpr(s32), %18:sgpr(s32) = G_UNMERGE_VALUES %10(<4 x s32>)
+    %14:sgpr(s32) = G_ADD %16, %13
     G_STORE %14(s32), %2(p1) :: (store (s32), addrspace 1)
     S_ENDPGM 0
 ...
@@ -119,36 +119,36 @@ body: |
     ; CHECK-LABEL: name: buffer_load_divergent
     ; CHECK: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0, $vgpr1, $vgpr2
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
-    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
-    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $sgpr3
-    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]...
[truncated]

@llvmbot (Member) commented Oct 18, 2024

@llvm/pr-subscribers-backend-amdgpu


@arsenm (Contributor) left a comment

Don't forget about AGPRs

Comment on lines 77 to 78
if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
return false;
Contributor

These should have failed isPreISelOpcode

Collaborator Author

I copied that from existing regbankselect. MI.isPreISelOpcode() || MI.isCopy() also works
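
For reference, a sketch of that simplified predicate (an alternative being
discussed here, not what the patch currently does):

    static bool shouldRBSelect(MachineInstr &MI) {
      // Generic (pre-isel) instructions and COPYs need bank assignment;
      // PHI, IMPLICIT_DEF and inline asm all fail isPreISelOpcode().
      return MI.isPreISelOpcode() || MI.isCopy();
    }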

if (MI.getOpcode() == AMDGPU::PHI || MI.getOpcode() == AMDGPU::IMPLICIT_DEF)
return false;

if (MI.isInlineAsm())
Contributor

These should have failed isPreISelOpcode

setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::SGPRRegBankID));
} else {
if (MRI.getType(DefReg) == LLT::scalar(1))
setRB(MI, DefOP, B, MRI, RBI.getRegBank(AMDGPU::VCCRegBankID));
Contributor

Can you directly use the pointer to the const regbank struct?
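
A sketch of the suggestion: fetch the const RegisterBank pointers once up
front instead of calling RBI.getRegBank for every operand (variable names
here are assumptions):

    const RegisterBank *SgprRB = &RBI.getRegBank(AMDGPU::SGPRRegBankID);
    const RegisterBank *VgprRB = &RBI.getRegBank(AMDGPU::VGPRRegBankID);
    const RegisterBank *VccRB = &RBI.getRegBank(AMDGPU::VCCRegBankID);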

if (MI.isCopy())
continue;

for (MachineOperand &UseOP : MI.uses()) {
Contributor

What about the defs?

Collaborator Author

The previous loop assigned register banks to all defs.
This loop prepares uses for RBLegalize so that they have register banks only.
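
A condensed sketch of the two phases being described, mirroring the
structure of runOnMachineFunction in this patch (assignBankToDef and
rewriteUseToBank are hypothetical helper names standing in for the inline
loop bodies):

    // Phase 1: assign a bank (sgpr/vgpr/vcc) to every def of a G_
    // instruction, based on uniformity analysis.
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : MBB)
        for (MachineOperand &DefOP : MI.defs())
          assignBankToDef(MI, DefOP);

    // Phase 2: rewrite uses that still carry a register class by inserting
    // a cross-bank COPY, so RBLegalize only ever sees register banks.
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : MBB)
        for (MachineOperand &UseOP : MI.uses())
          rewriteUseToBank(MI, UseOP);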

}
}

void setRBUse(MachineInstr &MI, MachineOperand &UseOP, MachineIRBuilder B,
Contributor

static. Also don't pass MachineIRBuilder by value


member function
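
A sketch of the requested signature change (made static, with the builder
passed by reference rather than by value):

    static void setRBUse(MachineInstr &MI, MachineOperand &UseOP,
                         MachineIRBuilder &B, MachineRegisterInfo &MRI,
                         const RegisterBank &RB);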

}

} else {
MRI.setRegBank(Reg, RB);
Contributor

Do you need to call the observer?

Collaborator Author

Since this is our pass I felt there was no need to complicate it with observers

@@ -63,4 +70,189 @@ char &llvm::AMDGPURBSelectID = AMDGPURBSelect::ID;

FunctionPass *llvm::createAMDGPURBSelectPass() { return new AMDGPURBSelect(); }

bool AMDGPURBSelect::runOnMachineFunction(MachineFunction &MF) { return true; }
bool shouldRBSelect(MachineInstr &MI) {


Why free-standing functions, when it is your register bank select pass?

@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from ff34aa1 to 0c40f68 on October 22, 2024 16:20
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 2124eb3 to df50c85 on October 22, 2024 16:20
@petar-avramovic changed the title from "AMDGPU/GlobalISel: RBSelect" to "AMDGPU/GlobalISel: StandaloneRegBankSelect" on Oct 23, 2024
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch 2 times, most recently from 8bf0a23 to 84284ba on October 28, 2024 14:48
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from df50c85 to 36c8a96 on October 28, 2024 14:48
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 84284ba to 3b0aaef on October 28, 2024 14:57
Comment on lines 120 to 121
if (!Op.isReg())
continue;
Contributor

This is illegal for a copy

if (!MI->isCopy())
return false;

for (auto Op : MI->implicit_operands()) {
Contributor

Don't know why this is only looking at the implicit operands, and not the only real operand (which I would expect is the only one that matters). Just hardcode looking at the first operand?

Collaborator Author

There is nothing useful in the real operands; it is a simple sgpr-to-vgpr COPY, which is quite common. What makes this COPY special is the implicit exec, for example:

    %21:vgpr_32(s32) = COPY %8:sgpr(s32), implicit $exec_lo

The implicit exec is there to stop other passes from moving the COPY outside of the loop.

@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 1bcfb1d to f6baaa8 on November 5, 2024 15:50
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 3866424 to 6fbb4c0 on November 5, 2024 15:50
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from f6baaa8 to 0bec6e8 on November 8, 2024 14:33
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 6fbb4c0 to c5dd703 on November 8, 2024 14:34
Comment on lines +94 to +97
if (MI.getOpcode() == AMDGPU::SI_IF ||
MI.getOpcode() == AMDGPU::SI_ELSE) {
findLCSSAPhi(MI.getOperand(0).getReg());
}
Contributor

I don't understand why you are mixing matching the intrinsic form of if.break above with the selected pseudos for if and else

Collaborator Author

This is a consequence of what the legalizer does: si.if and si.else are inst-selected to SI_IF and SI_ELSE in AMDGPULegalizerInfo::legalizeIntrinsic, while if.break is still an intrinsic at reg bank selection

if (!MI->isCopy() || MI->getNumImplicitOperands() != 1)
return false;

return MI->implicit_operands().begin()->getReg() == TRI.getExec();
Contributor

Just hardcode the operand index
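
A sketch of the hardcoded-index variant (assuming the copy always has the
shape "%dst = COPY %src, implicit $exec", so the implicit operand is
operand 2):

    if (!MI->isCopy() || MI->getNumOperands() != 3)
      return false;
    return MI->getOperand(2).isReg() &&
           MI->getOperand(2).getReg() == TRI.getExec();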

// are divergent (you can have uniform value in vgpr).
// - TODO: implicit use of $exec could be implemented as indicator that
// instruction is divergent
bool isTemporalDivergenceCopy(Register Reg) {
Contributor

Special-casing a copy with an implicit exec operand doesn't feel hermetic. We already have ugly hacks to insert these copies at one point, and it's not a holistic strategy. There are ample opportunities to lose the operand, such as introducing new derived operations which do not have the use.

Where are these exec operands getting inserted? Should we have a different pseudo instead?

Collaborator Author

Where are these exec operands getting inserted? Should we have a different pseudo instead?

In temporal-divergence lowering, an upcoming change in AMDGPUGlobalISelDivergenceLowering. I will just remove this check in this patch.

@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 0bec6e8 to d0f20d3 on November 12, 2024 10:54
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from c5dd703 to 62dddba on November 12, 2024 10:54
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from d0f20d3 to 3d603c4 on November 27, 2024 11:23
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 62dddba to f36fadc on November 27, 2024 11:24
@petar-avramovic (Collaborator Author)

ping

@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-skeleton branch from 3d603c4 to f153650 on November 28, 2024 18:00
@petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from f36fadc to 6cf16c0 on November 28, 2024 18:01
Base automatically changed from users/petar-avramovic/new-rbs-skeleton to main on December 3, 2024 21:02
}
}

// At this point all virtual registers have register class or bank
Contributor

Can you do this in one loop with a post order traversal?

Collaborator Author

Merged two loops into one. Order does not matter; it could instead iterate the virtual registers directly:

    for (unsigned i = 0; i < MRI.getNumVirtRegs(); ++i) {
      Register Reg = Register::index2VirtReg(i);
      ...
    }

but that still needs to look up the def instruction and the list of uses, so I don't know if that would be a better option.

@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch 2 times, most recently from 41857ee to 8f99d8a Compare December 5, 2024 13:30
Comment on lines 211 to 212
assert(!MRI.getRegBankOrNull(DefReg));
if (!MRI.getRegClassOrNull(DefReg))
Contributor

Either the assert or the if is redundant.

Contributor

ping?

Collaborator Author

Sorry, I missed that; will fix in the next update.

Collaborator Author

Small rework; I still want to have both checks. Before banks are assigned, a vreg def should have either nothing or a register class. The assert is there to check that no bank is already assigned. A sketch is below.
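A minimal sketch of that invariant as it would sit in the def-assignment loop; setBankFromUniformity is a hypothetical helper, not code from the patch:

// Before bank assignment a vreg def carries either nothing or a register
// class from earlier passes; a bank must not be assigned yet.
assert(!MRI.getRegBankOrNull(DefReg) && "bank already assigned");
if (!MRI.getRegClassOrNull(DefReg)) {
  // No register class either: pick the bank from uniformity analysis.
  setBankFromUniformity(DefReg); // hypothetical helper
}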

@petar-avramovic
Collaborator Author

There is something wrong with the one-loop approach; multiple tests are failing. They all have temporal-divergent values, so I assume that is the issue. Need to investigate.

@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 8f99d8a to c19466b Compare December 16, 2024 16:16
@petar-avramovic
Collaborator Author

All fixed (the error was in upcoming patches). Ready for review.

@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from c19466b to 8bfa4f4 Compare December 17, 2024 14:17
Collaborator

@nhaehnle nhaehnle left a comment

I do have a comment/question, but I think we're at the point where it just makes sense to make progress and land this.

Comment on lines +102 to +108
void IntrinsicLaneMaskAnalyzer::findLCSSAPhi(Register Reg) {
  S32S64LaneMask.insert(Reg);
  for (const MachineInstr &LCSSAPhi : MRI.use_instructions(Reg)) {
    if (LCSSAPhi.isPHI())
      S32S64LaneMask.insert(LCSSAPhi.getOperand(0).getReg());
  }
}
Collaborator

I'm curious: How sure are you/we that there can never be "nested" phis here, perhaps due to breaking out of nested loops?

Collaborator Author

IntrinsicLaneMaskAnalyzer is very trivial. It only handles LCSSA phis and if.break phis.

Control flow intrinsics should only be used by other control flow intrinsics. But since we run LCSSA, they can also be used by LCSSA phis.

findLCSSAPhi is really meant to cover intrinsics used for control flow that are affected by the LCSSA pass.
%64:_(s32) = G_PHI %15(s32), %bb.3 is a lane mask. Uniformity analysis says it is divergent. If it were S1 we would be fine, but since it is S32 we need to special-case it into sgpr.

    %15:sreg_32_xm0_xexec(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.if.break), %45(s1), %14(s32)
    SI_LOOP %15(s32), %bb.1, implicit-def $exec, implicit-def $scc, implicit $exec
    G_BR %bb.6

  bb.6:
    %64:_(s32) = G_PHI %15(s32), %bb.3
    G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.end.cf), %64(s32)
    S_ENDPGM 0

Contributor

Is this just working around a bug in uniformity analysis? We already special case amdgcn_if and else in isAlwaysUniform. Did this just miss if.break?

Collaborator Author

if.break is already forced uniform.
Here we special-case LCSSA-phi lane masks.
The phi is divergent because it uses a value (the uniform if.break in this case) outside a cycle with a divergent exit, but that phi is a lane mask, so it needs to end up in sgpr. A sketch of the consumer side follows.
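A minimal sketch of how the analyzer result overrides uniformity analysis during bank assignment; isS32S64LaneMask is the analyzer query from this patch, while the surrounding loop and variable names (ILMA, RBI, Reg) are assumed:

// Uniformity analysis calls this S32/S64 phi divergent, but the lane-mask
// analyzer knows it is a lane mask, so force the sgpr bank anyway.
if (ILMA.isS32S64LaneMask(Reg)) {
  MRI.setRegBank(Reg, RBI.getRegBank(AMDGPU::SGPRRegBankID));
  continue;
}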

@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch from 8bfa4f4 to 5a76ae6 Compare January 13, 2025 10:57
@petar-avramovic petar-avramovic force-pushed the users/petar-avramovic/new-rbs-rb-select branch 2 times, most recently from db43f84 to 1c5977e Compare January 24, 2025 09:32
@petar-avramovic petar-avramovic merged commit f8a56df into main Jan 24, 2025
6 of 8 checks passed
@petar-avramovic petar-avramovic deleted the users/petar-avramovic/new-rbs-rb-select branch January 24, 2025 10:06
@llvm-ci
Collaborator

llvm-ci commented Jan 24, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12350

Here is the relevant piece of the build log for reference:
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate      -Xclang "-fprofile-instrument=clang"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate -Xclang -fprofile-instrument=clang
# RUN: at line 3
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="CLANG-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=CLANG-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c:32:20: error: CLANG-PGO-NEXT: expected string not found in input
# | // CLANG-PGO-NEXT: [ 0 11 20 ]
# |                    ^
# | <stdin>:3:28: note: scanning from here
# | ======== Counters =========
# |                            ^
# | <stdin>:4:1: note: possible intended match here
# | [ 0 13 20 ]
# | ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            1: ======= GPU Profile ======= 
# |            2: Target: amdgcn-amd-amdhsa 
# |            3: ======== Counters ========= 
# | next:32'0                                X error: no match found
# |            4: [ 0 13 20 ] 
# | next:32'0     ~~~~~~~~~~~~
# | next:32'1     ?            possible intended match
# |            5: [ 10 ] 
# | next:32'0     ~~~~~~~
# |            6: [ 20 ] 
# | next:32'0     ~~~~~~~
# |            7: ========== Data =========== 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            8: { 9515997471539760012 4749112401 0xffffffffffffffd8 0x0 0x0 0x0 3 [...] 0 } 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            9: { 3666282617048535130 24 0xffffffffffffffb0 0x0 0x0 0x0 1 [...] 0 } 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            .
# |            .
# |            .
...
