[AMDGPU] Introduce "amdgpu-uniform-intrinsic-combine" pass to combine uniform AMDGPU lane Intrinsics. #116953
Conversation
@llvm/pr-subscribers-backend-amdgpu Author: Pankaj Dwivedi (PankajDwivedi-25) Changes: As discussed on the draft PR #99878, running UniformityAnalysis after InstCombine may not fully exploit all the available scopes, but it does not introduce any dependency trouble. This pass allows folding of lane* intrinsics for a subset of patterns. Co-authored by @Acim-Maravic. Patch is 26.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/116953.diff 5 Files Affected:
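To make the intent concrete, here is a minimal IR sketch of the folds this combine targets (the function and value names below are made up for illustration and are not part of the patch): a lane intrinsic whose operand is already uniform has its uses replaced by that operand, and redundant chains of lane intrinsics within a block collapse onto the inner call; the now-dead calls are left for later cleanup.

define amdgpu_kernel void @lane_intrinsic_folds(ptr addrspace(1) %out, i32 %sgpr, i32 %lane) {
  ; %sgpr is a kernel argument, hence uniform: uses of %a are replaced by %sgpr.
  %a = call i32 @llvm.amdgcn.permlane64.i32(i32 %sgpr)
  ; %tid is divergent, so %b stays, but a readfirstlane result is always
  ; uniform: uses of %c and %d are replaced by %b.
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %b = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %tid)
  %c = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %b)
  %d = call i32 @llvm.amdgcn.readlane.i32(i32 %b, i32 %lane)
  %s1 = add i32 %a, %c
  %s2 = add i32 %s1, %d
  store i32 %s2, ptr addrspace(1) %out
  ret void
}
declare i32 @llvm.amdgcn.permlane64.i32(i32)
declare i32 @llvm.amdgcn.workitem.id.x()
declare i32 @llvm.amdgcn.readfirstlane.i32(i32)
declare i32 @llvm.amdgcn.readlane.i32(i32, i32)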
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..ca72ac6d3670ce 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -92,7 +92,7 @@ class SILowerI1CopiesPass : public PassInfoMixin<SILowerI1CopiesPass> {
void initializeAMDGPUDAGToDAGISelLegacyPass(PassRegistry &);
-void initializeAMDGPUAlwaysInlinePass(PassRegistry&);
+void initializeAMDGPUAlwaysInlinePass(PassRegistry &);
Pass *createAMDGPUAnnotateKernelFeaturesPass();
Pass *createAMDGPUAttributorLegacyPass();
@@ -226,11 +226,11 @@ extern char &GCNRegPressurePrinterID;
// Passes common to R600 and SI
FunctionPass *createAMDGPUPromoteAlloca();
-void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
+void initializeAMDGPUPromoteAllocaPass(PassRegistry &);
extern char &AMDGPUPromoteAllocaID;
FunctionPass *createAMDGPUPromoteAllocaToVector();
-void initializeAMDGPUPromoteAllocaToVectorPass(PassRegistry&);
+void initializeAMDGPUPromoteAllocaToVectorPass(PassRegistry &);
extern char &AMDGPUPromoteAllocaToVectorID;
struct AMDGPUPromoteAllocaPass : PassInfoMixin<AMDGPUPromoteAllocaPass> {
@@ -299,7 +299,7 @@ class AMDGPULateCodeGenPreparePass
const GCNTargetMachine &TM;
public:
- AMDGPULateCodeGenPreparePass(const GCNTargetMachine &TM) : TM(TM) {};
+ AMDGPULateCodeGenPreparePass(const GCNTargetMachine &TM) : TM(TM){};
PreservedAnalyses run(Function &, FunctionAnalysisManager &);
};
@@ -325,7 +325,7 @@ class AMDGPUAttributorPass : public PassInfoMixin<AMDGPUAttributorPass> {
public:
AMDGPUAttributorPass(TargetMachine &TM, AMDGPUAttributorOptions Options = {})
- : TM(TM), Options(Options) {};
+ : TM(TM), Options(Options){};
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
};
@@ -339,7 +339,7 @@ class AMDGPUAnnotateUniformValuesPass
FunctionPass *createAMDGPUAnnotateUniformValuesLegacy();
ModulePass *createAMDGPUPrintfRuntimeBinding();
-void initializeAMDGPUPrintfRuntimeBindingPass(PassRegistry&);
+void initializeAMDGPUPrintfRuntimeBindingPass(PassRegistry &);
extern char &AMDGPUPrintfRuntimeBindingID;
void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
@@ -350,15 +350,15 @@ struct AMDGPUPrintfRuntimeBindingPass
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
};
-ModulePass* createAMDGPUUnifyMetadataPass();
-void initializeAMDGPUUnifyMetadataPass(PassRegistry&);
+ModulePass *createAMDGPUUnifyMetadataPass();
+void initializeAMDGPUUnifyMetadataPass(PassRegistry &);
extern char &AMDGPUUnifyMetadataID;
struct AMDGPUUnifyMetadataPass : PassInfoMixin<AMDGPUUnifyMetadataPass> {
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
};
-void initializeSIOptimizeExecMaskingPreRAPass(PassRegistry&);
+void initializeSIOptimizeExecMaskingPreRAPass(PassRegistry &);
extern char &SIOptimizeExecMaskingPreRAID;
void initializeSIOptimizeVGPRLiveRangePass(PassRegistry &);
@@ -367,7 +367,7 @@ extern char &SIOptimizeVGPRLiveRangeID;
void initializeAMDGPUAnnotateUniformValuesLegacyPass(PassRegistry &);
extern char &AMDGPUAnnotateUniformValuesLegacyPassID;
-void initializeAMDGPUCodeGenPreparePass(PassRegistry&);
+void initializeAMDGPUCodeGenPreparePass(PassRegistry &);
extern char &AMDGPUCodeGenPrepareID;
void initializeAMDGPURemoveIncompatibleFunctionsPass(PassRegistry &);
@@ -400,10 +400,10 @@ class SIAnnotateControlFlowPass
void initializeSIAnnotateControlFlowLegacyPass(PassRegistry &);
extern char &SIAnnotateControlFlowLegacyPassID;
-void initializeSIMemoryLegalizerPass(PassRegistry&);
+void initializeSIMemoryLegalizerPass(PassRegistry &);
extern char &SIMemoryLegalizerID;
-void initializeSIModeRegisterPass(PassRegistry&);
+void initializeSIModeRegisterPass(PassRegistry &);
extern char &SIModeRegisterID;
void initializeAMDGPUInsertDelayAluPass(PassRegistry &);
@@ -412,25 +412,25 @@ extern char &AMDGPUInsertDelayAluID;
void initializeSIInsertHardClausesPass(PassRegistry &);
extern char &SIInsertHardClausesID;
-void initializeSIInsertWaitcntsPass(PassRegistry&);
+void initializeSIInsertWaitcntsPass(PassRegistry &);
extern char &SIInsertWaitcntsID;
-void initializeSIFormMemoryClausesPass(PassRegistry&);
+void initializeSIFormMemoryClausesPass(PassRegistry &);
extern char &SIFormMemoryClausesID;
-void initializeSIPostRABundlerPass(PassRegistry&);
+void initializeSIPostRABundlerPass(PassRegistry &);
extern char &SIPostRABundlerID;
void initializeGCNCreateVOPDPass(PassRegistry &);
extern char &GCNCreateVOPDID;
-void initializeAMDGPUUnifyDivergentExitNodesPass(PassRegistry&);
+void initializeAMDGPUUnifyDivergentExitNodesPass(PassRegistry &);
extern char &AMDGPUUnifyDivergentExitNodesID;
ImmutablePass *createAMDGPUAAWrapperPass();
-void initializeAMDGPUAAWrapperPassPass(PassRegistry&);
+void initializeAMDGPUAAWrapperPassPass(PassRegistry &);
ImmutablePass *createAMDGPUExternalAAWrapperPass();
-void initializeAMDGPUExternalAAWrapperPass(PassRegistry&);
+void initializeAMDGPUExternalAAWrapperPass(PassRegistry &);
void initializeAMDGPUArgumentUsageInfoPass(PassRegistry &);
@@ -453,6 +453,17 @@ void initializeAMDGPUSetWavePriorityPass(PassRegistry &);
void initializeGCNRewritePartialRegUsesPass(llvm::PassRegistry &);
extern char &GCNRewritePartialRegUsesID;
+void initializeAMDGPUUniformIntrinsicCombinePass(PassRegistry &);
+extern char &AMDGPUUniformIntrinsicCombineID;
+FunctionPass *createAMDGPUUniformIntrinsicCombinePass();
+
+struct AMDGPUUniformIntrinsicCombinePass
+ : public PassInfoMixin<AMDGPUUniformIntrinsicCombinePass> {
+ const AMDGPUTargetMachine &TM;
+ AMDGPUUniformIntrinsicCombinePass(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
namespace AMDGPU {
enum TargetIndex {
TI_CONSTDATA_START,
@@ -488,7 +499,7 @@ static inline bool addrspacesMayAlias(unsigned AS1, unsigned AS2) {
return ASAliasRules[AS1][AS2];
}
-}
+} // namespace AMDGPU
} // End namespace llvm
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..1f6a791ed73eb1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,9 +22,9 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
-MODULE_PASS("amdgpu-perf-hint",
- AMDGPUPerfHintAnalysisPass(
- *static_cast<const GCNTargetMachine *>(this)))
+MODULE_PASS(
+ "amdgpu-perf-hint",
+ AMDGPUPerfHintAnalysisPass(*static_cast<const GCNTargetMachine *>(this)))
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
#undef MODULE_PASS
@@ -32,12 +32,11 @@ MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
#ifndef MODULE_PASS_WITH_PARAMS
#define MODULE_PASS_WITH_PARAMS(NAME, CLASS, CREATE_PASS, PARSER, PARAMS)
#endif
-MODULE_PASS_WITH_PARAMS(
- "amdgpu-attributor", "AMDGPUAttributorPass",
- [=](AMDGPUAttributorOptions Options) {
- return AMDGPUAttributorPass(*this, Options);
- },
- parseAMDGPUAttributorPassOptions, "closed-world")
+MODULE_PASS_WITH_PARAMS("amdgpu-attributor", "AMDGPUAttributorPass",
+ [=](AMDGPUAttributorOptions Options) {
+ return AMDGPUAttributorPass(*this, Options);
+ },
+ parseAMDGPUAttributorPassOptions, "closed-world")
#undef MODULE_PASS_WITH_PARAMS
#ifndef FUNCTION_PASS
@@ -47,9 +46,9 @@ FUNCTION_PASS("amdgpu-annotate-uniform", AMDGPUAnnotateUniformValuesPass())
FUNCTION_PASS("amdgpu-codegenprepare", AMDGPUCodeGenPreparePass(*this))
FUNCTION_PASS("amdgpu-image-intrinsic-opt",
AMDGPUImageIntrinsicOptimizerPass(*this))
-FUNCTION_PASS("amdgpu-late-codegenprepare",
- AMDGPULateCodeGenPreparePass(
- *static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS(
+ "amdgpu-late-codegenprepare",
+ AMDGPULateCodeGenPreparePass(*static_cast<const GCNTargetMachine *>(this)))
FUNCTION_PASS("amdgpu-lower-kernel-arguments",
AMDGPULowerKernelArgumentsPass(*this))
FUNCTION_PASS("amdgpu-lower-kernel-attributes",
@@ -64,7 +63,11 @@ FUNCTION_PASS("amdgpu-rewrite-undef-for-phi", AMDGPURewriteUndefForPHIPass())
FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
-FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS(
+ "si-annotate-control-flow",
+ SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-uniformIntrinsic-combine",
+ AMDGPUUniformIntrinsicCombinePass(*this))
#undef FUNCTION_PASS
#ifndef FUNCTION_ANALYSIS
@@ -82,13 +85,13 @@ FUNCTION_ALIAS_ANALYSIS("amdgpu-aa", AMDGPUAA())
#ifndef FUNCTION_PASS_WITH_PARAMS
#define FUNCTION_PASS_WITH_PARAMS(NAME, CLASS, CREATE_PASS, PARSER, PARAMS)
#endif
-FUNCTION_PASS_WITH_PARAMS(
- "amdgpu-atomic-optimizer",
- "AMDGPUAtomicOptimizerPass",
- [=](ScanOptions Strategy) {
- return AMDGPUAtomicOptimizerPass(*this, Strategy);
- },
- parseAMDGPUAtomicOptimizerStrategy, "strategy=dpp|iterative|none")
+FUNCTION_PASS_WITH_PARAMS("amdgpu-atomic-optimizer",
+ "AMDGPUAtomicOptimizerPass",
+ [=](ScanOptions Strategy) {
+ return AMDGPUAtomicOptimizerPass(*this, Strategy);
+ },
+ parseAMDGPUAtomicOptimizerStrategy,
+ "strategy=dpp|iterative|none")
#undef FUNCTION_PASS_WITH_PARAMS
#ifndef MACHINE_FUNCTION_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
new file mode 100644
index 00000000000000..d35c30cf42a7a6
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
@@ -0,0 +1,184 @@
+//===-- AMDGPUUniformIntrinsicCombine.cpp
+//-----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This pass combines uniform intrinsic instructions.
+/// Uniform intrinsic combine uses pattern matching to identify and optimize
+/// redundant intrinsic instructions.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/Analysis/UniformityAnalysis.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstVisitor.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Target/TargetMachine.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+#define DEBUG_TYPE "amdgpu-uniformIntrinsic-combine"
+
+using namespace llvm;
+using namespace llvm::AMDGPU;
+using namespace llvm::PatternMatch;
+
+namespace {
+
+class AMDGPUUniformIntrinsicCombine : public FunctionPass {
+public:
+ static char ID;
+ AMDGPUUniformIntrinsicCombine() : FunctionPass(ID) {}
+
+ bool runOnFunction(Function &F) override;
+
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addPreserved<DominatorTreeWrapperPass>();
+ AU.addRequired<UniformityInfoWrapperPass>();
+ AU.addRequired<TargetPassConfig>();
+ }
+};
+
+class AMDGPUUniformIntrinsicCombineImpl
+ : public InstVisitor<AMDGPUUniformIntrinsicCombineImpl> {
+private:
+ const UniformityInfo *UI;
+
+ void optimizeUniformIntrinsicInst(IntrinsicInst &II) const;
+
+public:
+ AMDGPUUniformIntrinsicCombineImpl() = delete;
+
+ AMDGPUUniformIntrinsicCombineImpl(const UniformityInfo *UI) : UI(UI) {}
+
+ bool run(Function &F);
+};
+
+} // namespace
+
+char AMDGPUUniformIntrinsicCombine::ID = 0;
+
+char &llvm::AMDGPUUniformIntrinsicCombineID = AMDGPUUniformIntrinsicCombine::ID;
+
+bool AMDGPUUniformIntrinsicCombine::runOnFunction(Function &F) {
+ if (skipFunction(F)) {
+ return false;
+ }
+
+ const UniformityInfo *UI =
+ &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+
+ return AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+}
+
+PreservedAnalyses
+AMDGPUUniformIntrinsicCombinePass::run(Function &F,
+ FunctionAnalysisManager &AM) {
+
+ const auto *UI = &AM.getResult<UniformityInfoAnalysis>(F);
+ const DataLayout *DL = &F.getDataLayout();
+
+ // @todo check if it is required that this method must return bool, if so
+ // figure out what can be returned.
+ bool IsChanged = AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+
+ if (!IsChanged) {
+ return PreservedAnalyses::all();
+ }
+
+ PreservedAnalyses PA;
+ PA.preserve<DominatorTreeAnalysis>();
+ return PA;
+}
+
+bool AMDGPUUniformIntrinsicCombineImpl::run(Function &F) {
+
+ // @todo check if it is required that this method must return bool, if so
+ // figure out what can be returned.
+ const bool IsChanged{false};
+
+ // Iterate over each instruction in the function to get the desired intrinsic
+ // inst to check for optimization.
+ for (BasicBlock &BB : F) {
+ for (Instruction &I : BB) {
+ if (auto *Call = dyn_cast<CallInst>(&I)) {
+ if (auto *Intrinsic = dyn_cast<IntrinsicInst>(Call)) {
+ optimizeUniformIntrinsicInst(*Intrinsic);
+ }
+ }
+ }
+ }
+
+ return IsChanged;
+}
+
+void AMDGPUUniformIntrinsicCombineImpl::optimizeUniformIntrinsicInst(
+ IntrinsicInst &II) const {
+ llvm::Intrinsic::ID IID = II.getIntrinsicID();
+
+ switch (IID) {
+ case Intrinsic::amdgcn_permlane64: {
+ Value *Src = II.getOperand(0);
+ if (UI->isUniform(Src)) {
+ return II.replaceAllUsesWith(Src);
+ }
+ break;
+ }
+ case Intrinsic::amdgcn_readfirstlane:
+ case Intrinsic::amdgcn_readlane: {
+ Value *Srcv = II.getOperand(0);
+ if (UI->isUniform(Srcv)) {
+ return II.replaceAllUsesWith(Srcv);
+ }
+
+ // The rest of these may not be safe if the exec may not be the same between
+ // the def and use.
+ Value *Src = II.getArgOperand(0);
+ Instruction *SrcInst = dyn_cast<Instruction>(Src);
+ if (SrcInst && SrcInst->getParent() != II.getParent())
+ break;
+
+ // readfirstlane (readfirstlane x) -> readfirstlane x
+ // readlane (readfirstlane x), y -> readfirstlane x
+ if (match(Src,
+ PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readfirstlane>())) {
+ return II.replaceAllUsesWith(Src);
+ }
+
+ if (IID == Intrinsic::amdgcn_readfirstlane) {
+ // readfirstlane (readlane x, y) -> readlane x, y
+ if (match(Src, PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readlane>())) {
+ return II.replaceAllUsesWith(Src);
+ }
+ } else {
+ // readlane (readlane x, y), y -> readlane x, y
+ if (match(Src, PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readlane>(
+ PatternMatch::m_Value(),
+ PatternMatch::m_Specific(II.getArgOperand(1))))) {
+ return II.replaceAllUsesWith(Src);
+ }
+ }
+ break;
+ }
+ }
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUUniformIntrinsicCombine, DEBUG_TYPE,
+ "AMDGPU uniformIntrinsic Combine", false, false)
+INITIALIZE_PASS_DEPENDENCY(UniformityInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
+INITIALIZE_PASS_END(AMDGPUUniformIntrinsicCombine, DEBUG_TYPE,
+ "AMDGPU uniformIntrinsic Combine", false, false)
+
+FunctionPass *llvm::createAMDGPUUniformIntrinsicCombinePass() {
+ return new AMDGPUUniformIntrinsicCombine();
+}
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..13e0fc61a82443 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -61,6 +61,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUHSAMetadataStreamer.cpp
AMDGPUInsertDelayAlu.cpp
AMDGPUInstCombineIntrinsic.cpp
+ AMDGPUUniformIntrinsicCombine.cpp
AMDGPUInstrInfo.cpp
AMDGPUInstructionSelector.cpp
AMDGPUISelDAGToDAG.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll
new file mode 100644
index 00000000000000..cdea0f9ec8fca5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll
@@ -0,0 +1,221 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX10
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX11
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX12
+
+define amdgpu_kernel void @permlane64_constant(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_constant(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; GFX-NEXT: store i32 77, ptr addrspace(1) [[OUT]], align 4
+; GFX-NEXT: ret void
+;
+ %v = call i32 @llvm.amdgcn.permlane64(i32 77)
+ store i32 %v, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @permlane64_undef(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_undef(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT: ret void
+;
+ %v = call i32 @llvm.amdgcn.permlane64(i32 undef)
+ store i32 %v, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @permlane64_sgpr(ptr addrspace(1) %out, i32 %src) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_sgpr(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[SRC:%.*]]) #[[ATTR0]] {
+; GFX-NEXT: ret void
+;
+ %v = call i32 @llvm.amdgcn.permlane64(i32 undef)
+ store i32 %v, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @permlane64_vgpr(i32 addrspace(1)* %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_vgpr(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT: [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
+; GFX-NEXT: [[V:%.*]] = call i32 @llvm.amdgcn.permlane64.i32(i32 [[TID]])
+; GFX-NEXT: [[TMP1:%.*]] = sext i32 [[TID]] to i64
+; GFX-NEXT: [[OUT_PTR:%.*]] = getelementptr i32, ptr addrspace(1) [[OUT]], i64 [[TMP1]]
+; GFX-NEXT: store i32 [[V]], ptr addrspace(1) [[OUT_PTR]], align 4
+; GFX-NEXT: ret void
+;
+ %tid = call i32 @llvm.amdgcn.workitem.id.x()
+ %v = call i32 @llvm.amdgcn.permlane64(i32 %tid)
+ %out_ptr = getelementptr i32, i32 addrspace(1)* %out, i32 %tid
+ store i32 %v, i32 addrspace(1)* %out_ptr
+ ret void
+}
+
+define amdgpu_kernel void @permlane64_vgpr_expression(i32 addrspace(1)* %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_vgpr_expression(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT: [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
+; GFX-NEXT: [[TID2:%.*]] = add i32 [[TID]], 1
+; GFX-NEXT: [[V:%.*]] = call i32 @llvm.amdgcn.permlane64.i32(i32 [[TID2]])
+; GFX-NEXT: [[TMP1:%.*]] = sext i32 [[TID]] to i64
+; GFX-NEXT: [[OUT_PTR:%.*]] = getelementptr i32, ptr addrspace(1) [[OUT]], i64 [[TMP1]]
+; GFX-NEXT: store i32 [[V]], ptr addrspace(1) [[OUT_PTR]], align 4
+; GFX-NEXT: ret void
+;
+ %tid = call i32 @llvm.amdgcn.workitem.id.x()
+ %tid2 = add i32 %tid, 1
+ %v = call i32 @llvm.amdgcn.permlane64(i32 %tid2)
+ %out_ptr = getelementptr i32, i32 addrspace(1)* %out, i32 %tid
+ store i32 %v, i32 addrspace(1)* %out_ptr
+ ret void
+}
+
+define amdgpu_kernel void @readlane_constant(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @readlan...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed 5ea4ecb to 1f450cf
This change should include an actual use of the pass in the AMDGPU pipeline using AMDGPUTargetMachine. The command-line option should be used to enable/disable the pass in the pipeline.
There need to be tests that demonstrate how this optimization enables other optimizations like CSE and hoisting/sinking since convergent calls are getting eliminated.
Also, there must be a test that shows how this pass can be used to eliminate a trivial waterfall loop, that is, a loop which depends on "wfall" where all threads in the wave finish their work in the first iteration itself.
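(Editorial sketch, not taken from the PR, of the kind of loop this refers to, with made-up names: if %idx is a uniform kernel argument, the combine replaces the readfirstlane's use with %idx, the compare becomes trivially true, and later passes can collapse the loop to a single iteration.)

define amdgpu_kernel void @trivial_waterfall(ptr addrspace(1) %out, i32 %idx) {
entry:
  br label %loop
loop:
  ; Waterfall step: pick the first active lane's value; the per-lane work
  ; that would be predicated on %done is elided here.
  %first = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %idx)
  %done = icmp eq i32 %first, %idx
  br i1 %done, label %exit, label %loop
exit:
  store i32 %first, ptr addrspace(1) %out
  ret void
}
declare i32 @llvm.amdgcn.readfirstlane.i32(i32)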
I didn't understand this sentence. Can you please elaborate?
Also add a test where a loop has a divergent exit, and a value that is uniform inside the loop is used by one of these intrinsics outside the loop. It should not be optimized because of temporal divergence.
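(Editorial sketch of such a temporal-divergence case, with made-up names: %inc is uniform on every iteration of the loop, but the exit condition is divergent, so lanes leave on different iterations and the value reaching the use outside the loop is not uniform; the intrinsic must not be folded there.)

define amdgpu_kernel void @temporal_divergence(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %inc, %loop ]
  %inc = add i32 %i, 1                ; uniform while inside the loop
  %cond = icmp ult i32 %inc, %tid     ; divergent exit condition
  br i1 %cond, label %loop, label %exit
exit:
  ; %inc observed here differs per lane, so this readfirstlane must stay.
  %r = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %inc)
  store i32 %r, ptr addrspace(1) %out
  ret void
}
declare i32 @llvm.amdgcn.workitem.id.x()
declare i32 @llvm.amdgcn.readfirstlane.i32(i32)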
This should be a separate command-line option like …
I am concerned this does not have enough context to be definitively correct after later transformations. Consider a case like this:
if (ballot_all(x)) {
  uniform_x = readfirstlane x
  speculatable_use(uniform_x)
  ...
}
This transform will then produce:
if (ballot_all(x)) {
  speculatable_use(x)
  ...
}
Which may then later be incorrectly hoisted:
speculatable_use(x) // x is not actually uniform anymore
if (ballot_all(x)) { ... }
There needs to be something tying the uniformity to the use position.
If … Alternatively, @jayfoad, is this a use-case for …
Agreed. The only kind of …
But it is already marked as convergent, so I don't think there's a problem here after all.
No, it could be anything, like a simple …
If it was a simple …
This still sounds overly conservative. If a speculatable use depends on the value being uniform, then it has cross-thread semantics and needs to be marked convergent. Still not seeing a reason to always maintain the position of the use.
Actually there is more going on here than just a lack of …
I see this case is not getting optimized in the generated checks, and the loop structure remains intact.
I still don't see any problem here. I would assume that any speculatable use does not have side effects, so really it just computes some result, e.g.: …
After hoisting: …
So x may not be uniform when we call speculatable_use, but we only use y if x was uniform. This seems fine, even if speculatable_use returns poison when its argument is not uniform. (As long as it doesn't trigger undefined behavior...)
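(The snippets referred to in this comment are not preserved in this capture; the following is an editorial sketch of the shape being argued about, with made-up names and a plain multiply standing in for the speculatable use: even when the computation of %y is hoisted into the entry block, its result is only consumed on the path where the ballot was all-ones, assuming a fully active wave64.)

define amdgpu_kernel void @guarded_speculatable_use(ptr addrspace(1) %out, i1 %x, i32 %v) {
entry:
  ; After hoisting, the speculatable computation runs even when %x is not
  ; uniformly true ...
  %y = mul i32 %v, 3
  %bal = call i64 @llvm.amdgcn.ballot.i64(i1 %x)
  %all = icmp eq i64 %bal, -1          ; "ballot_all(x)" for a fully active wave64
  br i1 %all, label %then, label %end
then:
  ; ... but %y is only used here, where every lane had %x set.
  store i32 %y, ptr addrspace(1) %out
  br label %end
end:
  ret void
}
declare i64 @llvm.amdgcn.ballot.i64(i1)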
So bottom-line is that both …
Force-pushed 51eacb9 to 7ee0cca
Force-pushed a4af024 to f5b900d
Force-pushed fe2841a to 17bb994
Force-pushed 17bb994 to bb3a69e
PING
@jayfoad, waiting for your approval if it is good to go.
if (UI.isDivergentUse(II.getOperandUse(0)))
  return false;
LLVM_DEBUG(dbgs() << "Replacing " << II << " with " << *Src << "\n");
II.replaceAllUsesWith(Src);
This is replacing a contextually / path-dependent query with a value that is not. I think we need to attach some kind of convergent use call to capture the point here. What if later code motion moves it such that an assumed uniform value is no longer use-point uniform?
You can maybe get away with replacing only the uses dominated by this instruction, but I'd need to think about whether there are still potential hazards if later transforms introduce divergence.
Haven't we already discussed something similar earlier? #116953 (review)
We are replacing an "always uniform" value X (the intrinsic call) with its uniform operand (Y), which can potentially become divergent after other transformations. But that is no different from a later transformation treating X (the intrinsic call) as divergent by simply ignoring the uniformity analysis at that point of the optimization pipeline. By replacing X with Y, we lose the ability to preserve uniformity, but this is not a correctness issue. It is still a performance issue, and then the question is whether the current optimization is useful in general or not.
Tagging @nhaehnle. I think I now understand the concern that @arsenm raised. At some static instance in a program, say an operation X uses a value defined by another operation Y. The uniformity of Y depends on its own input operands as well as the threads that are converged at Y, which in turn depends on the control paths that reach Y. If we make a decision at X which depends on Y being uniform, then effectively we are freezing the convergence at Y and choosing only those executions of the whole program where this convergence is preserved at Y. This information needs to be recorded somewhere (maybe at X, at Y, or both). In particular, if X is "always uniform" as in the current optimization, then replacing X with Y means that all possible executions of the new program must guarantee that Y is also always uniform. So we cannot just simply check for uniform values and optimize anything based on that. At every such optimization, we have to restrict the possible executions of the optimized program.
This pass introduces optimizations for AMDGPU intrinsics by leveraging the uniformity of their arguments. When an intrinsic's arguments are detected as uniform, redundant computations are eliminated, and the intrinsic calls are simplified accordingly.
By utilizing the UniformityInfo analysis, this pass identifies cases where intrinsic calls are uniform across all lanes, allowing transformations that reduce unnecessary operations and improve the IR's efficiency.
These changes enhance performance by streamlining intrinsic usage in uniform scenarios without altering the program's semantics.
For background, see PR #99878
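As a usage sketch (a hypothetical test, modeled on the RUN lines of the test added in this patch and the pass name registered in AMDGPUPassRegistry.def), the combine can be exercised directly through opt; note that the pass only rewrites the uses, leaving the dead intrinsic call for later cleanup:

; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -passes="amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s

define amdgpu_kernel void @fold_readfirstlane_of_sgpr(ptr addrspace(1) %out, i32 %src) {
; CHECK-LABEL: @fold_readfirstlane_of_sgpr(
; CHECK:       store i32 %src, ptr addrspace(1) %out
  %v = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %src)
  store i32 %v, ptr addrspace(1) %out
  ret void
}

declare i32 @llvm.amdgcn.readfirstlane.i32(i32)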