[SYCL] Extend global offset intrinsic removal #11909

Merged · 10 commits · Dec 8, 2023

Changes from all commits

52 changes: 28 additions & 24 deletions llvm/include/llvm/SYCLLowerIR/GlobalOffset.h
@@ -12,6 +12,7 @@
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/SYCLLowerIR/TargetHelpers.h"
#include "llvm/Transforms/Utils/Cloning.h"

namespace llvm {

@@ -38,41 +39,38 @@ class GlobalOffsetPass : public PassInfoMixin<GlobalOffsetPass> {
/// `Func` belongs, contains both the original function and its clone with the
/// signature extended with the implicit offset parameter and `_with_offset`
/// appended to the name.
/// An alloca of 3 zeros (corresponding to offsets in x, y and z) is added to
/// the original kernel, in order to keep the interface of the kernel's call
/// graph unified, regardless of whether the global offset has been used.
///
/// \param Func Kernel to be processed.
void processKernelEntryPoint(Function *Func);
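
As an illustration of the alloca-of-zeros setup described above, here is a
minimal IRBuilder sketch. `emitZeroOffset` is a hypothetical helper, not part
of the pass; it assumes `F` is the original kernel and `TargetAS` the
target's alloca address space.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Hypothetical helper mirroring the behavior described above.
static llvm::Value *emitZeroOffset(llvm::Function &F, unsigned TargetAS) {
  llvm::Module &M = *F.getParent();
  llvm::BasicBlock &Entry = F.getEntryBlock();
  llvm::IRBuilder<> Builder(&Entry, Entry.getFirstInsertionPt());
  // [3 x i32] holding the x, y and z offsets.
  llvm::Type *OffsetTy =
      llvm::ArrayType::get(llvm::Type::getInt32Ty(M.getContext()), 3);
  llvm::AllocaInst *Offset = Builder.CreateAlloca(OffsetTy, TargetAS);
  // Zero-initialize all three elements (3 * 4 = 12 bytes).
  Builder.CreateMemSet(Offset, Builder.getInt8(0), /*Size=*/12,
                       Offset->getAlign());
  // Hand back a pointer to the first element.
  return Builder.CreateConstInBoundsGEP2_32(OffsetTy, Offset, 0, 0);
}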

/// This function adds an implicit parameter to the function containing a
/// call instruction to the implicit offset intrinsic or another function
/// (which eventually calls the intrinsic). If the call instruction is to
/// the implicit offset intrinsic, then the intrinsic is replaced with the
/// parameter that was added.
/// For a function containing a call instruction to the implicit offset
/// intrinsic, or another function which eventually calls the intrinsic,
/// this function clones the function and adds an implicit parameter to the
/// clone.
/// If the call instruction is to the implicit offset intrinsic then the
/// intrinsic inside the cloned function is replaced with the parameter that
/// was added.
///
/// Once the function, say `F`, containing a call to `Callee` has the
/// implicit parameter added, callers of `F` are processed by recursively
/// calling this function, passing `F` to `CalleeWithImplicitParam`.
///
/// Since the cloning of entry points may alter the users of a function, the
/// cloning must be done as early as possible, so as to ensure that no users are
/// added to previous callees in the call-tree.
/// Once the clone of a function, say `F`, containing a call to `Callee`
/// has the implicit parameter added, callers of `F` are processed by
/// getting cloned, and the clones are processed by recursively calling
/// this function, passing `F` as `Callee` and its clone as
/// `CalleeWithImplicitParam`.
///
/// \param Callee is either a function to which this transformation has
/// already been applied, or the implicit offset intrinsic.
///
/// \param CalleeWithImplicitParam indicates whether `Callee` is the
/// implicit intrinsic (when `nullptr`) or another function (when not
/// `nullptr`) - this is used to know whether calls to it need to have the
/// implicit parameter added or replaced with the implicit parameter.
/// `nullptr`) - this is used to know whether calls to it inside clones need
/// to have the implicit parameter appended or be replaced with the
/// implicit parameter.
void addImplicitParameterToCallers(Module &M, Value *Callee,
Function *CalleeWithImplicitParam);
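
A minimal sketch of the two cases this parameter distinguishes, using a
hypothetical `rewriteCallSite` helper for a single call site (attribute and
debug-info propagation omitted):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"

// Hypothetical helper: rewrite one call according to whether the callee is
// the intrinsic itself or an already-extended function.
static void rewriteCallSite(llvm::CallInst *CallToOld,
                            llvm::Value *ImplicitOffset,
                            llvm::Function *CalleeWithImplicitParam) {
  if (!CalleeWithImplicitParam) {
    // Call to the intrinsic itself: the new parameter stands in for it.
    CallToOld->replaceAllUsesWith(ImplicitOffset);
  } else {
    // Call to an already-processed function: forward the offset as an
    // extra trailing argument.
    llvm::SmallVector<llvm::Value *, 8> Args(CallToOld->args());
    Args.push_back(ImplicitOffset);
    llvm::IRBuilder<> Builder(CallToOld);
    llvm::CallInst *NewCall = Builder.CreateCall(CalleeWithImplicitParam, Args);
    CallToOld->replaceAllUsesWith(NewCall);
  }
  CallToOld->eraseFromParent();
}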

/// For a given function `Func` extend signature to contain an implicit
/// offset argument.
/// For a given function `Func` create a clone and extend its signature to
/// contain an implicit offset argument.
///
/// \param Func A function to add offset to.
/// \param Func The function to be cloned and extended with the offset argument.
///
/// \param ImplicitArgumentType Architecture-dependent type of the implicit
/// argument holding the global offset.
@@ -81,13 +79,15 @@ class GlobalOffsetPass : public PassInfoMixin<GlobalOffsetPass> {
/// keep it intact and create a clone of it with `_with_offset` appended to
/// the name.
///
/// \returns A pair of new function with the offset argument added and a
/// \param IsKernel Indicates whether Func is a kernel entry point.
///
/// \returns A pair of the new function with the offset argument added and a
/// pointer to the implicit argument (either a function argument or a bitcast
/// converting it to the correct type).
std::pair<Function *, Value *>
addOffsetArgumentToFunction(Module &M, Function *Func,
Type *ImplicitArgumentType = nullptr,
bool KeepOriginal = false);
bool KeepOriginal = false, bool IsKernel = false);
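
A condensed sketch of the cloning mechanics as a hypothetical free function
`cloneWithOffsetArg`; the real method additionally migrates attributes and
metadata and handles the kernel byval/byref cases:

#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Cloning.h"

// Hypothetical sketch: clone `Func` with one extra trailing parameter of
// type `OffsetPtrTy` (assumed to be the architecture-dependent offset type).
static llvm::Function *cloneWithOffsetArg(llvm::Module &M,
                                          llvm::Function *Func,
                                          llvm::Type *OffsetPtrTy) {
  llvm::FunctionType *OldTy = Func->getFunctionType();
  llvm::SmallVector<llvm::Type *, 8> Params(OldTy->params().begin(),
                                            OldTy->params().end());
  Params.push_back(OffsetPtrTy);
  auto *NewTy = llvm::FunctionType::get(OldTy->getReturnType(), Params,
                                        OldTy->isVarArg());
  llvm::Function *NewFunc = llvm::Function::Create(
      NewTy, Func->getLinkage(), Func->getName() + "_with_offset", M);
  // Map old arguments to new ones so the cloned body refers to them.
  llvm::ValueToValueMapTy VMap;
  auto *NewArg = NewFunc->arg_begin();
  for (llvm::Argument &Arg : Func->args())
    VMap[&Arg] = &*NewArg++;
  llvm::SmallVector<llvm::ReturnInst *, 8> Returns;
  llvm::CloneFunctionInto(NewFunc, Func, VMap,
                          llvm::CloneFunctionChangeType::GlobalChanges,
                          Returns);
  return NewFunc;
}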

/// Create a mapping of kernel entry points to their metadata nodes. While
/// iterating over kernels, make sure that a given kernel entry point has no
@@ -102,8 +102,12 @@ class GlobalOffsetPass : public PassInfoMixin<GlobalOffsetPass> {
SmallVectorImpl<KernelPayload> &KernelPayloads);
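
For the CUDA path, such a mapping can be built from the module's
`nvvm.annotations` named metadata. A rough sketch with a hypothetical
`collectKernels` helper (the AMDGPU path differs and is not shown):

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Module.h"

// Hypothetical sketch: collect CUDA kernel entry points and their
// annotation nodes.
static void collectKernels(
    llvm::Module &M, llvm::DenseMap<llvm::Function *, llvm::MDNode *> &Out) {
  llvm::NamedMDNode *Annotations = M.getNamedMetadata("nvvm.annotations");
  if (!Annotations)
    return;
  for (llvm::MDNode *Op : Annotations->operands()) {
    if (Op->getNumOperands() < 2)
      continue;
    // Operand 0 names the function, operand 1 the annotation kind.
    auto *FuncMD = llvm::dyn_cast<llvm::ValueAsMetadata>(Op->getOperand(0));
    auto *KindMD = llvm::dyn_cast<llvm::MDString>(Op->getOperand(1));
    if (!FuncMD || !KindMD || KindMD->getString() != "kernel")
      continue;
    if (auto *F = llvm::dyn_cast<llvm::Function>(FuncMD->getValue()))
      Out[F] = Op;
  }
}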

private:
/// Keep track of which functions have been processed to avoid processing
/// twice.
/// Keep track of all cloned offset functions to avoid processing them.
llvm::SmallPtrSet<Function *, 8> Clones;
/// Save clone mappings to obtain pointers to CallInsts during processing.
llvm::ValueToValueMapTy GlobalVMap;
/// Keep track of which non-offset functions have been processed to avoid
/// processing twice.
llvm::DenseMap<Function *, Value *> ProcessedFunctions;
/// Keep a map of all entry point functions with metadata.
llvm::DenseMap<Function *, MDNode *> EntryPointMetadata;
133 changes: 55 additions & 78 deletions llvm/lib/SYCLLowerIR/GlobalOffset.cpp
@@ -83,34 +83,7 @@ PreservedAnalyses GlobalOffsetPass::run(Module &M, ModuleAnalysisManager &) {
if (!ImplicitOffsetIntrinsic || ImplicitOffsetIntrinsic->use_empty())
return PreservedAnalyses::all();

if (!EnableGlobalOffset) {
SmallVector<CallInst *, 4> Worklist;
SmallVector<LoadInst *, 4> LI;
SmallVector<Instruction *, 4> PtrUses;

// Collect all GEPs and Loads from the intrinsic's CallInsts
for (Value *V : ImplicitOffsetIntrinsic->users()) {
Worklist.push_back(cast<CallInst>(V));
for (Value *V2 : V->users())
getLoads(cast<Instruction>(V2), PtrUses, LI);
}

// Replace each use of a collected Load with a Constant 0
for (LoadInst *L : LI)
L->replaceAllUsesWith(ConstantInt::get(L->getType(), 0));

// Remove all collected Loads and GEPs from the kernel.
// PtrUses is returned by `getLoads` in topological order.
// Walk it backwards so we don't violate users.
for (auto *I : reverse(PtrUses))
I->eraseFromParent();

// Remove all collected CallInsts from the kernel.
for (CallInst *CI : Worklist) {
auto *I = cast<Instruction>(CI);
I->eraseFromParent();
}
} else {
if (EnableGlobalOffset) {
// For AMD, allocas and pointers have to be in CONSTANT_PRIVATE (5); NVVM is
// happy with ADDRESS_SPACE_GENERIC (0).
TargetAS = AT == ArchType::Cuda ? 0 : 5;
@@ -133,6 +106,32 @@ PreservedAnalyses GlobalOffsetPass::run(Module &M, ModuleAnalysisManager &) {
// Add implicit parameters to all direct and indirect users of the offset
addImplicitParameterToCallers(M, ImplicitOffsetIntrinsic, nullptr);
}
SmallVector<CallInst *, 4> Worklist;
SmallVector<LoadInst *, 4> Loads;
SmallVector<Instruction *, 4> PtrUses;

// Collect all GEPs and Loads from the intrinsic's CallInsts
for (Value *V : ImplicitOffsetIntrinsic->users()) {
Worklist.push_back(cast<CallInst>(V));
for (Value *V2 : V->users())
getLoads(cast<Instruction>(V2), PtrUses, Loads);
}

// Replace each use of a collected Load with a Constant 0
for (LoadInst *L : Loads)
L->replaceAllUsesWith(ConstantInt::get(L->getType(), 0));

// Remove all collected Loads and GEPs from the kernel.
// PtrUses is returned by `getLoads` in topological order.
// Walk it backwards so we don't violate users.
for (auto *I : reverse(PtrUses))
I->eraseFromParent();

// Remove all collected CallInsts from the kernel.
for (CallInst *CI : Worklist) {
auto *I = cast<Instruction>(CI);
I->eraseFromParent();
}

// Assert that all uses of `ImplicitOffsetIntrinsic` are removed and delete
// it.
@@ -161,7 +160,8 @@ void GlobalOffsetPass::processKernelEntryPoint(Function *Func) {

auto *NewFunc = addOffsetArgumentToFunction(
M, Func, KernelImplicitArgumentType->getPointerTo(),
/*KeepOriginal=*/true)
/*KeepOriginal=*/true,
/*IsKernel=*/true)
.first;
Argument *NewArgument = std::prev(NewFunc->arg_end());
// Pass byval to the kernel for NVIDIA, AMD's calling convention disallows
@@ -177,62 +177,43 @@ void GlobalOffsetPass::processKernelEntryPoint(Function *Func) {
FuncMetadata->getOperand(1),
FuncMetadata->getOperand(2)};
KernelMetadata->addOperand(MDNode::get(Ctx, NewMetadata));

// Create alloca of zeros for the implicit offset in the original func.
BasicBlock *EntryBlock = &Func->getEntryBlock();
IRBuilder<> Builder(EntryBlock, EntryBlock->getFirstInsertionPt());
Type *ImplicitOffsetType =
ArrayType::get(Type::getInt32Ty(M.getContext()), 3);
AllocaInst *ImplicitOffset =
Builder.CreateAlloca(ImplicitOffsetType, TargetAS);
uint64_t AllocByteSize =
ImplicitOffset->getAllocationSizeInBits(M.getDataLayout()).value() / 8;
CallInst *MemsetCall =
Builder.CreateMemSet(ImplicitOffset, Builder.getInt8(0), AllocByteSize,
ImplicitOffset->getAlign());
MemsetCall->addParamAttr(0, Attribute::NonNull);
MemsetCall->addDereferenceableParamAttr(0, AllocByteSize);
ProcessedFunctions[Func] = Builder.CreateConstInBoundsGEP2_32(
ImplicitOffsetType, ImplicitOffset, 0, 0);
}

void GlobalOffsetPass::addImplicitParameterToCallers(
Module &M, Value *Callee, Function *CalleeWithImplicitParam) {

// Make sure that all entry point callers are processed.
SmallVector<User *, 8> Users{Callee->users()};
for (User *U : Users) {
auto *Call = dyn_cast<CallInst>(U);
if (!Call)
continue;

Function *Caller = Call->getFunction();
if (EntryPointMetadata.count(Caller) != 0) {
processKernelEntryPoint(Caller);
}
}

// User collection may have changed, so we reinitialize it.
Users = SmallVector<User *, 8>{Callee->users()};
for (User *U : Users) {
auto *CallToOld = dyn_cast<CallInst>(U);
if (!CallToOld)
return;

auto *Caller = CallToOld->getFunction();

// Determine if `Caller` needs processed or if this is another callsite
// from an already-processed function.
Function *NewFunc;
// Only original function uses are considered.
// Clones are processed through a global VMap.
if (Clones.contains(Caller))
continue;

// Kernel entry points need additional processing and metadata changes.
if (EntryPointMetadata.count(Caller) != 0)
Contributor:
Zero is false.

Contributor (author):
I don't know if it's a good idea to change it. There are two reasons:
  1. We're dealing here with natural numbers (counts), not bools
  2. I didn't introduce the change originally, so it would add more diff

Contributor:
EntryPointMetadata is a map and therefore .count() is often used as contains, i.e. implicit conversion to bool should be ok in that context. But I agree with (2).

processKernelEntryPoint(Caller);

// Determine if `Caller` needs to be processed or if this is another
// callsite from a non-offset function or an already-processed function.
Value *ImplicitOffset = ProcessedFunctions[Caller];
bool AlreadyProcessed = ImplicitOffset != nullptr;

Function *NewFunc;
if (AlreadyProcessed) {
NewFunc = Caller;
} else {
std::tie(NewFunc, ImplicitOffset) =
addOffsetArgumentToFunction(M, Caller);
addOffsetArgumentToFunction(M, Caller,
/*KernelImplicitArgumentType*/ nullptr,
/*KeepOriginal=*/true);
}

CallToOld = cast<CallInst>(GlobalVMap[CallToOld]);
if (!CalleeWithImplicitParam) {
// Replace intrinsic call with parameter.
CallToOld->replaceAllUsesWith(ImplicitOffset);
@@ -269,15 +250,12 @@ void GlobalOffsetPass::addImplicitParameterToCallers(

// Process callers of the old function.
addImplicitParameterToCallers(M, Caller, NewFunc);

// Now that the old function is dead, delete it.
Caller->dropAllReferences();
Caller->eraseFromParent();
}
}

std::pair<Function *, Value *> GlobalOffsetPass::addOffsetArgumentToFunction(
Module &M, Function *Func, Type *ImplicitArgumentType, bool KeepOriginal) {
Module &M, Function *Func, Type *ImplicitArgumentType, bool KeepOriginal,
bool IsKernel) {
FunctionType *FuncTy = Func->getFunctionType();
const AttributeList &FuncAttrs = Func->getAttributes();
ImplicitArgumentType =
@@ -316,23 +294,22 @@ std::pair<Function *, Value *> GlobalOffsetPass::addOffsetArgumentToFunction(
// TODO: Are there better naming alternatives that allow for unmangling?
NewFunc->setName(Func->getName() + "_with_offset");

ValueToValueMapTy VMap;
for (Function::arg_iterator FuncArg = Func->arg_begin(),
FuncEnd = Func->arg_end(),
NewFuncArg = NewFunc->arg_begin();
FuncArg != FuncEnd; ++FuncArg, ++NewFuncArg) {
VMap[FuncArg] = NewFuncArg;
GlobalVMap[FuncArg] = NewFuncArg;
}

SmallVector<ReturnInst *, 8> Returns;
CloneFunctionInto(NewFunc, Func, VMap,
CloneFunctionInto(NewFunc, Func, GlobalVMap,
CloneFunctionChangeType::GlobalChanges, Returns);
// In order to keep the signatures of functions called by the kernel
// unified, the pass has to copy global offset to an array allocated in
// addrspace(3). This is done as kernels can't allocate and fill the
// array in constant address space, which would be required for the case
// with no global offset.
if (AT == ArchType::AMDHSA) {
// array in constant address space.
// No longer strictly required, but kept for compatibility while the old
// behavior is deprecated.
if (IsKernel && AT == ArchType::AMDHSA) {
BasicBlock *EntryBlock = &NewFunc->getEntryBlock();
IRBuilder<> Builder(EntryBlock, EntryBlock->getFirstInsertionPt());
Type *ImplicitOffsetType =
Expand Down Expand Up @@ -399,8 +376,8 @@ std::pair<Function *, Value *> GlobalOffsetPass::addOffsetArgumentToFunction(
Type::getInt32Ty(M.getContext())->getPointerTo(TargetAS));
}

ProcessedFunctions[NewFunc] = ImplicitOffset;

ProcessedFunctions[Func] = ImplicitOffset;
Clones.insert(NewFunc);
// Return the new function and the offset argument.
return {NewFunc, ImplicitOffset};
}
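
For reference, on AMDGPU the kernel clone receives the offset byref and
copies it into an addrspace(5) alloca so that callees can keep taking
`ptr addrspace(5)` (compare the CHECK lines in the test update below). A
hedged sketch with a hypothetical `copyOffsetToPrivate` helper:

#include <iterator>

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Hypothetical sketch: copy the trailing byref offset argument of `Kernel`
// into a private (addrspace(5)) alloca, as the AMDGPU kernel clone does.
static llvm::Value *copyOffsetToPrivate(llvm::Function &Kernel) {
  llvm::Module &M = *Kernel.getParent();
  llvm::Argument *ByRefOffset = std::prev(Kernel.arg_end());
  llvm::BasicBlock &Entry = Kernel.getEntryBlock();
  llvm::IRBuilder<> Builder(&Entry, Entry.getFirstInsertionPt());
  llvm::Type *OffsetTy =
      llvm::ArrayType::get(llvm::Type::getInt32Ty(M.getContext()), 3);
  // Private (scratch) address space on AMDGPU is 5.
  llvm::AllocaInst *Local = Builder.CreateAlloca(OffsetTy, /*AddrSpace=*/5);
  // Cast the byref pointer to the constant address space (4) and copy the
  // 3 x i32 = 12 bytes, as in the generated IR below.
  llvm::Value *Src = Builder.CreateAddrSpaceCast(
      ByRefOffset, llvm::PointerType::get(M.getContext(), 4));
  Builder.CreateMemCpy(Local, Local->getAlign(), Src, llvm::MaybeAlign(1),
                       /*Size=*/12);
  return Local;
}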
22 changes: 12 additions & 10 deletions llvm/test/CodeGen/AMDGPU/global-offset-dbg.ll
@@ -11,28 +11,27 @@ declare ptr addrspace(5) @llvm.amdgcn.implicit.offset()
; CHECK-NOT: llvm.amdgcn.implicit.offset

define weak_odr dso_local i64 @_ZTS14other_function() !dbg !11 {
; CHECK: define weak_odr dso_local i64 @_ZTS14other_function(ptr addrspace(5) %0) !dbg !11 {
; CHECK: define weak_odr dso_local i64 @_ZTS14other_function() !dbg !11 {
%1 = tail call ptr addrspace(5) @llvm.amdgcn.implicit.offset()
%2 = getelementptr inbounds i32, ptr addrspace(5) %1, i64 2
%3 = load i32, ptr addrspace(5) %2, align 4
%4 = zext i32 %3 to i64
ret i64 %4
}

; CHECK: weak_odr dso_local i64 @_ZTS14other_function_with_offset(ptr addrspace(5) %0) !dbg !14 {

; Function Attrs: noinline
define weak_odr dso_local void @_ZTS14example_kernel() !dbg !14 {
; CHECK: define weak_odr dso_local void @_ZTS14example_kernel() !dbg !14 {
; CHECK: define weak_odr dso_local void @_ZTS14example_kernel() !dbg !15 {
entry:
%0 = call i64 @_ZTS14other_function(), !dbg !15
; CHECK: %2 = call i64 @_ZTS14other_function(ptr addrspace(5) %1), !dbg !15
; CHECK: %0 = call i64 @_ZTS14other_function(), !dbg !16
ret void
}

; CHECK: define weak_odr dso_local void @_ZTS14example_kernel_with_offset(ptr byref([3 x i32]) %0) !dbg !16 {
; CHECK: %1 = alloca [3 x i32], align 4, addrspace(5), !dbg !17
; CHECK: %2 = addrspacecast ptr %0 to ptr addrspace(4), !dbg !17
; CHECK: call void @llvm.memcpy.p5.p4.i64(ptr addrspace(5) align 4 %1, ptr addrspace(4) align 1 %2, i64 12, i1 false), !dbg !17
; CHECK: %3 = call i64 @_ZTS14other_function(ptr addrspace(5) %1), !dbg !17
; CHECK: define weak_odr dso_local void @_ZTS14example_kernel_with_offset(ptr byref([3 x i32]) %0) !dbg !17 {
; CHECK: call i64 @_ZTS14other_function_with_offset(ptr addrspace(5) %1), !dbg !18

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!3, !4}
@@ -53,5 +52,8 @@ entry:
!13 = !{null}
!14 = distinct !DISubprogram(name: "example_kernel", scope: !1, file: !1, line: 10, type: !12, scopeLine: 10, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)
!15 = !DILocation(line: 1, column: 2, scope: !14)
; CHECK: !16 = distinct !DISubprogram(name: "example_kernel", scope: !1, file: !1, line: 10, type: !12, scopeLine: 10, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)
; CHECK: !17 = !DILocation(line: 1, column: 2, scope: !16)
; CHECK: !14 = distinct !DISubprogram(name: "other_function", scope: !1, file: !1, line: 3, type: !12, scopeLine: 3, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)
; CHECK: !15 = distinct !DISubprogram(name: "example_kernel", scope: !1, file: !1, line: 10, type: !12, scopeLine: 10, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)
; CHECK: !16 = !DILocation(line: 1, column: 2, scope: !15)
; CHECK: !17 = distinct !DISubprogram(name: "example_kernel", scope: !1, file: !1, line: 10, type: !12, scopeLine: 10, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)
; CHECK: !18 = !DILocation(line: 1, column: 2, scope: !17)