-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[PreISelIntrinsicLowering] Produce a memset_pattern16 libcall for llvm.experimental.memset.pattern when available #120420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PreISelIntrinsicLowering] Produce a memset_pattern16 libcall for llvm.experimental.memset.pattern when available #120420
Conversation
…m.experimental.memset.pattern when available This is to enable a transition of LoopIdiomRecognize to selecting the llvm.experimental.memset.pattern intrinsic as requested in llvm#118632 (as opposed to supporting selection of the libcall or the intrinsic). As such, although it _is_ a FIXME to add costing considerations on whether to lower to the libcall (when available) or expand directly, lacking such logic is helpful at this stage in order to minimise any potential code gen changes in this transition.
@llvm/pr-subscribers-llvm-transforms Author: Alex Bradbury (asb) ChangesThis is to enable a transition of LoopIdiomRecognize to selecting the llvm.experimental.memset.pattern intrinsic as requested in #118632 (as opposed to supporting selection of the libcall or the intrinsic). As such, although it is a FIXME to add costing considerations on whether to lower to the libcall (when available) or expand directly, lacking such logic is helpful at this stage in order to minimise any unexpected code gen changes in this transition. Full diff: https://github.com/llvm/llvm-project/pull/120420.diff 2 Files Affected:
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index 4a3d1673c2a7c1..9067f29faa7b0c 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -32,6 +32,7 @@
#include "llvm/Support/Casting.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/Scalar/LowerConstantIntrinsics.h"
+#include "llvm/Transforms/Utils/BuildLibCalls.h"
#include "llvm/Transforms/Utils/LowerMemIntrinsics.h"
using namespace llvm;
@@ -232,6 +233,59 @@ static bool canEmitLibcall(const TargetMachine *TM, Function *F,
return TLI->getLibcallName(LC) != nullptr;
}
+// Return a value appropriate for use with the memset_pattern16 libcall, if
+// possible and if we know how. (Adapted from equivalent helper in
+// LoopIdiomRecognize).
+static Constant *getMemSetPattern16Value(MemSetPatternInst *Inst,
+ const TargetLibraryInfo &TLI) {
+ // FIXME: This could check for UndefValue because it can be merged into any
+ // other valid pattern.
+
+ // Don't emit libcalls if a non-default address space is being used.
+ if (Inst->getRawDest()->getType()->getPointerAddressSpace() != 0)
+ return nullptr;
+
+ Value *V = Inst->getValue();
+ const DataLayout &DL = Inst->getDataLayout();
+ Module *M = Inst->getModule();
+
+ if (!isLibFuncEmittable(M, &TLI, LibFunc_memset_pattern16))
+ return nullptr;
+
+ // If the value isn't a constant, we can't promote it to being in a constant
+ // array. We could theoretically do a store to an alloca or something, but
+ // that doesn't seem worthwhile.
+ Constant *C = dyn_cast<Constant>(V);
+ if (!C || isa<ConstantExpr>(C))
+ return nullptr;
+
+ // Only handle simple values that are a power of two bytes in size.
+ uint64_t Size = DL.getTypeSizeInBits(V->getType());
+ if (Size == 0 || (Size & 7) || (Size & (Size - 1)))
+ return nullptr;
+
+ // Don't care enough about darwin/ppc to implement this.
+ if (DL.isBigEndian())
+ return nullptr;
+
+ // Convert to size in bytes.
+ Size /= 8;
+
+ // TODO: If CI is larger than 16-bytes, we can try slicing it in half to see
+ // if the top and bottom are the same (e.g. for vectors and large integers).
+ if (Size > 16)
+ return nullptr;
+
+ // If the constant is exactly 16 bytes, just use it.
+ if (Size == 16)
+ return C;
+
+ // Otherwise, we'll use an array of the constants.
+ unsigned ArraySize = 16 / Size;
+ ArrayType *AT = ArrayType::get(V->getType(), ArraySize);
+ return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C));
+}
+
// TODO: Handle atomic memcpy and memcpy.inline
// TODO: Pass ScalarEvolution
bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
@@ -322,7 +376,41 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
}
case Intrinsic::experimental_memset_pattern: {
auto *Memset = cast<MemSetPatternInst>(Inst);
- expandMemSetPatternAsLoop(Memset);
+ const TargetLibraryInfo &TLI = LookupTLI(*Memset->getFunction());
+ if (Constant *PatternValue = getMemSetPattern16Value(Memset, TLI)) {
+ // FIXME: There is currently no profitability calculation for emitting
+ // the libcall vs expanding the memset.pattern directly.
+ IRBuilder<> Builder(Inst);
+ Module *M = Memset->getModule();
+ const DataLayout &DL = Memset->getDataLayout();
+
+ StringRef FuncName = "memset_pattern16";
+ FunctionCallee MSP = getOrInsertLibFunc(
+ M, TLI, LibFunc_memset_pattern16, Builder.getVoidTy(),
+ Memset->getRawDest()->getType(), Builder.getPtrTy(),
+ Memset->getLength()->getType());
+ inferNonMandatoryLibFuncAttrs(M, FuncName, TLI);
+
+ // Otherwise we should form a memset_pattern16. PatternValue is known
+ // to be an constant array of 16-bytes. Put the value into a mergable
+ // global.
+ GlobalVariable *GV = new GlobalVariable(
+ *M, PatternValue->getType(), true, GlobalValue::PrivateLinkage,
+ PatternValue, ".memset_pattern");
+ GV->setUnnamedAddr(
+ GlobalValue::UnnamedAddr::Global); // Ok to merge these.
+ GV->setAlignment(Align(16));
+ Value *PatternPtr = GV;
+ Value *NumBytes = Builder.CreateMul(
+ Builder.getInt64(
+ DL.getTypeSizeInBits(Memset->getValue()->getType()) / 8),
+ Memset->getLength());
+ CallInst *MemsetPattern16Call = Builder.CreateCall(
+ MSP, {Memset->getRawDest(), PatternPtr, NumBytes});
+ MemsetPattern16Call->setAAMetadata(Memset->getAAMetadata());
+ } else {
+ expandMemSetPatternAsLoop(Memset);
+ }
Changed = true;
Memset->eraseFromParent();
break;
diff --git a/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll b/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll
new file mode 100644
index 00000000000000..6d5f5b8a6d4fb2
--- /dev/null
+++ b/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll
@@ -0,0 +1,69 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=x86_64-apple-darwin10.0.0 -passes=pre-isel-intrinsic-lowering -S -o - %s | FileCheck %s
+
+define void @memset_pattern_i128_1_dynvalue(ptr %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1_dynvalue(
+; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]]
+; CHECK: [[LOADSTORELOOP]]:
+; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], %[[LOADSTORELOOP]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i128, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT: store i128 [[VALUE]], ptr [[TMP2]], align 1
+; CHECK-NEXT: [[TMP3]] = add i64 [[TMP1]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1
+; CHECK-NEXT: br i1 [[TMP4]], label %[[LOADSTORELOOP]], label %[[SPLIT]]
+; CHECK: [[SPLIT]]:
+; CHECK-NEXT: ret void
+;
+ tail call void @llvm.experimental.memset.pattern(ptr %a, i128 %value, i64 1, i1 0)
+ ret void
+}
+
+define void @memset_pattern_i128_1(ptr %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1(
+; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern, i64 16)
+; CHECK-NEXT: ret void
+;
+ tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 1, i1 0)
+ ret void
+}
+
+define void @memset_pattern_i128_1_nz_as(ptr addrspace(1) %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1_nz_as(
+; CHECK-SAME: ptr addrspace(1) [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]]
+; CHECK: [[LOADSTORELOOP]]:
+; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], %[[LOADSTORELOOP]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i128, ptr addrspace(1) [[A]], i64 [[TMP1]]
+; CHECK-NEXT: store i128 -113427455635030943652277463699152839203, ptr addrspace(1) [[TMP2]], align 1
+; CHECK-NEXT: [[TMP3]] = add i64 [[TMP1]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1
+; CHECK-NEXT: br i1 [[TMP4]], label %[[LOADSTORELOOP]], label %[[SPLIT]]
+; CHECK: [[SPLIT]]:
+; CHECK-NEXT: ret void
+;
+ tail call void @llvm.experimental.memset.pattern(ptr addrspace(1) %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 1, i1 0)
+ ret void
+}
+
+define void @memset_pattern_i128_16(ptr %a) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_16(
+; CHECK-SAME: ptr [[A:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern.1, i64 256)
+; CHECK-NEXT: ret void
+;
+ tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 16, i1 0)
+ ret void
+}
+
+define void @memset_pattern_i128_x(ptr %a, i64 %x) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_x(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 16, [[X]]
+; CHECK-NEXT: call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern.2, i64 [[TMP1]])
+; CHECK-NEXT: ret void
+;
+ tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 %x, i1 0)
+ ret void
+}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Longer term this should probably be handled by libcall legalization, but I'm fine with having it in PreISelIntrinsicLowering for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a test with e.g. an i64 constant? If I read your code correctly that is also handled by splatting, so we should have coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, definitely an oversight not to cover the 'splatting'. I've added test cases to cover this.
define void @memset_pattern_i128_1_dynvalue(ptr %a, i128 %value) nounwind { | ||
; CHECK-LABEL: define void @memset_pattern_i128_1_dynvalue( | ||
; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0:[0-9]+]] { | ||
; CHECK-NEXT: br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Branch on false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a reflection of the naive/unoptimised baseline expansion for memset.pattern, which will be improved in future patches.
llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll
Outdated
Show resolved
Hide resolved
M, TLI, LibFunc_memset_pattern16, Builder.getVoidTy(), | ||
Memset->getRawDest()->getType(), Builder.getPtrTy(), | ||
Memset->getLength()->getType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally RuntimeLibcalls would know the type signature and you wouldn't need to write it out here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm interpreting this as a general comment for something it would be good to improve in the future rather than a blocking comment for this PR - but let me know if I understood you incorrectly!
// to be an constant array of 16-bytes. Put the value into a mergable | ||
// global. | ||
GlobalVariable *GV = new GlobalVariable( | ||
*M, PatternValue->getType(), true, GlobalValue::PrivateLinkage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best take the address space from the pointer type you're passing this to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just like the LoopIdiomRecognize logic I'm porting across, we avoid generating the libcall if we have a non-zero AS. I've added an assert which makes that clearer and will of course help ensure this piece of code gets updated if that restriction is lifted.
Memset->getLength()); | ||
CallInst *MemsetPattern16Call = Builder.CreateCall( | ||
MSP, {Memset->getRawDest(), PatternPtr, NumBytes}); | ||
MemsetPattern16Call->setAAMetadata(Memset->getAAMetadata()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metadata preserve not tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a tbaa test case.
; | ||
tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 %x, i1 0) | ||
ret void | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these support non-int types? Test with FP and pointer? Plus a poison, null, and nontrivial constantexpr test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, llvm.experimental.memset.pattern only accepts integer types.
DL.getTypeSizeInBits(Memset->getValue()->getType()) / 8), | ||
Memset->getLength()); | ||
CallInst *MemsetPattern16Call = Builder.CreateCall( | ||
MSP, {Memset->getRawDest(), PatternPtr, NumBytes}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve callsite attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were there particular attributes you imagine we'd lose out by not preserving? I'd be concerned about doing something semantically incorrect by blindly copying them, though it seems mergeAttributesAndFlags in SimplifyLibCalls would be one eample to copy. As I think failing to preserve attributes is conservatively correct I've put a TODO with a reference to this for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The important one would be the parameter alignments. You're replacing one form of the call with another, I don't see how the attributes wouldn't just transfer 1:1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think copying the attributes for the first parameter (the destination ptr) should be fine, and good point that we'll lose the alignment right now. But for the second argument we're going from an integer value argument containing the pattern to a pointer to a 16-byte pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented copying attributes for the first parameter and added a test that demonstrates this.
// Otherwise, we'll use an array of the constants. | ||
unsigned ArraySize = 16 / Size; | ||
ArrayType *AT = ArrayType::get(V->getType(), ArraySize); | ||
return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just directly put the original constant in the global array? Why do you need to process it into a specific data type? Also seems like you could use getConstantDataArrayInfo if you really want to get it into this array form
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Nikita rightly pointed out, I was missing an important test case that would have perhaps made this clearer. This logic (ported across from LoopIdiomRecognize) handles what icalls "splatting". e.g. if you have an i16 used in a memset in a loop, it will create a 128-bit pattern by duplicating it that is appropriate for use with memset_pattern16. You could alternatively assembly a new i128 constant, but it's not clear that would be better or simpler than the ConstantArray approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the type isn't significant if you're just constructing a source pointer to read from
Don't need temporary std::vector to produce this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're probably talking past each other, for which I can only apologise. I'll try to explain my understanding again and hopefully the source of confusion (likely on my end!) is then more obvious.
So for memset_pattern16 we need to create a pointer argument that points to a 16 byte pattern. If the original memset.pattern intrinsic had an i128 constant argument, then that's great - we make a GlobalVariable from that directly. If it had a narrower argument, the logic here is creating a ConstantArray with that element repeated the appropriate number of times (e.g. i16 repeated 8 times). I agree the type of the pointer isn't significant, but creating a ConstantArray doesn't seem like the worst way of producing a pointer to 16bytes of a repeated value, given I have a narrower value. We could instead write logic that inspects the bit width and creates a new APInt and a constant based on that as appropriate, but I think it would be more complex (perhaps there's a handy helper I'm missing?).
PatternValue, ".memset_pattern"); | ||
GV->setUnnamedAddr( | ||
GlobalValue::UnnamedAddr::Global); // Ok to merge these. | ||
GV->setAlignment(Align(16)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inheriting this from LoopIdiomRecognize, and my preference would be to keep aspects like this untouched in the initial switchover before later looking to relax. I've put a TODO to re-evaluate. I would guess it was chosen on the assumption that a naturally aligned i128 is most efficiently accessed.
…iclowering-learn-to-emit-memsetpattern16-libcall
…nts for narrower memset.pattern
I believe all review comments are addressed and this is ready for re-review. Thank you. Although nikic gave a LGTM, I'll await @arsenm's feedback as you had a lot of comments coming in after that. |
DL.getTypeSizeInBits(Memset->getValue()->getType()) / 8), | ||
Memset->getLength()); | ||
CallInst *MemsetPattern16Call = Builder.CreateCall( | ||
MSP, {Memset->getRawDest(), PatternPtr, NumBytes}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The important one would be the parameter alignments. You're replacing one form of the call with another, I don't see how the attributes wouldn't just transfer 1:1
// Otherwise, we'll use an array of the constants. | ||
unsigned ArraySize = 16 / Size; | ||
ArrayType *AT = ArrayType::get(V->getType(), ArraySize); | ||
return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the type isn't significant if you're just constructing a source pointer to read from
Don't need temporary std::vector to produce this
return C; | ||
|
||
// Otherwise, we'll use an array of the constants. | ||
unsigned ArraySize = 16 / Size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid truncating to unsigned
…iclowering-learn-to-emit-memsetpattern16-libcall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/minor comments addressed. I think all of @arsenm's comments are addressed, so up to you if you want to wait for him to confirm.
…iclowering-learn-to-emit-memsetpattern16-libcall
This is to enable a transition of LoopIdiomRecognize to selecting the llvm.experimental.memset.pattern intrinsic as requested in #118632 (as opposed to supporting selection of the libcall or the intrinsic). As such, although it is a FIXME to add costing considerations on whether to lower to the libcall (when available) or expand directly, lacking such logic is helpful at this stage in order to minimise any unexpected code gen changes in this transition.