[PreISelIntrinsicLowering] Produce a memset_pattern16 libcall for llvm.experimental.memset.pattern when available #120420

Conversation

@asb (Contributor) commented Dec 18, 2024

This is to enable a transition of LoopIdiomRecognize to selecting the llvm.experimental.memset.pattern intrinsic as requested in #118632 (as opposed to supporting selection of the libcall or the intrinsic). As such, although it is a FIXME to add costing considerations on whether to lower to the libcall (when available) or expand directly, lacking such logic is helpful at this stage in order to minimise any unexpected code gen changes in this transition.

@llvmbot (Member) commented Dec 18, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alex Bradbury (asb)


Full diff: https://github.com/llvm/llvm-project/pull/120420.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp (+89-1)
  • (added) llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll (+69)
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index 4a3d1673c2a7c1..9067f29faa7b0c 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -32,6 +32,7 @@
 #include "llvm/Support/Casting.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Scalar/LowerConstantIntrinsics.h"
+#include "llvm/Transforms/Utils/BuildLibCalls.h"
 #include "llvm/Transforms/Utils/LowerMemIntrinsics.h"
 
 using namespace llvm;
@@ -232,6 +233,59 @@ static bool canEmitLibcall(const TargetMachine *TM, Function *F,
   return TLI->getLibcallName(LC) != nullptr;
 }
 
+// Return a value appropriate for use with the memset_pattern16 libcall, if
+// possible and if we know how. (Adapted from equivalent helper in
+// LoopIdiomRecognize).
+static Constant *getMemSetPattern16Value(MemSetPatternInst *Inst,
+                                         const TargetLibraryInfo &TLI) {
+  // FIXME: This could check for UndefValue because it can be merged into any
+  // other valid pattern.
+
+  // Don't emit libcalls if a non-default address space is being used.
+  if (Inst->getRawDest()->getType()->getPointerAddressSpace() != 0)
+    return nullptr;
+
+  Value *V = Inst->getValue();
+  const DataLayout &DL = Inst->getDataLayout();
+  Module *M = Inst->getModule();
+
+  if (!isLibFuncEmittable(M, &TLI, LibFunc_memset_pattern16))
+    return nullptr;
+
+  // If the value isn't a constant, we can't promote it to being in a constant
+  // array.  We could theoretically do a store to an alloca or something, but
+  // that doesn't seem worthwhile.
+  Constant *C = dyn_cast<Constant>(V);
+  if (!C || isa<ConstantExpr>(C))
+    return nullptr;
+
+  // Only handle simple values that are a power of two bytes in size.
+  uint64_t Size = DL.getTypeSizeInBits(V->getType());
+  if (Size == 0 || (Size & 7) || (Size & (Size - 1)))
+    return nullptr;
+
+  // Don't care enough about darwin/ppc to implement this.
+  if (DL.isBigEndian())
+    return nullptr;
+
+  // Convert to size in bytes.
+  Size /= 8;
+
+  // TODO: If CI is larger than 16-bytes, we can try slicing it in half to see
+  // if the top and bottom are the same (e.g. for vectors and large integers).
+  if (Size > 16)
+    return nullptr;
+
+  // If the constant is exactly 16 bytes, just use it.
+  if (Size == 16)
+    return C;
+
+  // Otherwise, we'll use an array of the constants.
+  unsigned ArraySize = 16 / Size;
+  ArrayType *AT = ArrayType::get(V->getType(), ArraySize);
+  return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C));
+}
+
 // TODO: Handle atomic memcpy and memcpy.inline
 // TODO: Pass ScalarEvolution
 bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
@@ -322,7 +376,41 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
     }
     case Intrinsic::experimental_memset_pattern: {
       auto *Memset = cast<MemSetPatternInst>(Inst);
-      expandMemSetPatternAsLoop(Memset);
+      const TargetLibraryInfo &TLI = LookupTLI(*Memset->getFunction());
+      if (Constant *PatternValue = getMemSetPattern16Value(Memset, TLI)) {
+        // FIXME: There is currently no profitability calculation for emitting
+        // the libcall vs expanding the memset.pattern directly.
+        IRBuilder<> Builder(Inst);
+        Module *M = Memset->getModule();
+        const DataLayout &DL = Memset->getDataLayout();
+
+        StringRef FuncName = "memset_pattern16";
+        FunctionCallee MSP = getOrInsertLibFunc(
+            M, TLI, LibFunc_memset_pattern16, Builder.getVoidTy(),
+            Memset->getRawDest()->getType(), Builder.getPtrTy(),
+            Memset->getLength()->getType());
+        inferNonMandatoryLibFuncAttrs(M, FuncName, TLI);
+
+        // Form the memset_pattern16 call. PatternValue is known to be a
+        // constant 16-byte pattern. Put the value into a mergeable global.
+        GlobalVariable *GV = new GlobalVariable(
+            *M, PatternValue->getType(), true, GlobalValue::PrivateLinkage,
+            PatternValue, ".memset_pattern");
+        GV->setUnnamedAddr(
+            GlobalValue::UnnamedAddr::Global); // Ok to merge these.
+        GV->setAlignment(Align(16));
+        Value *PatternPtr = GV;
+        Value *NumBytes = Builder.CreateMul(
+            Builder.getInt64(
+                DL.getTypeSizeInBits(Memset->getValue()->getType()) / 8),
+            Memset->getLength());
+        CallInst *MemsetPattern16Call = Builder.CreateCall(
+            MSP, {Memset->getRawDest(), PatternPtr, NumBytes});
+        MemsetPattern16Call->setAAMetadata(Memset->getAAMetadata());
+      } else {
+        expandMemSetPatternAsLoop(Memset);
+      }
       Changed = true;
       Memset->eraseFromParent();
       break;
diff --git a/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll b/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll
new file mode 100644
index 00000000000000..6d5f5b8a6d4fb2
--- /dev/null
+++ b/llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-pattern.ll
@@ -0,0 +1,69 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=x86_64-apple-darwin10.0.0 -passes=pre-isel-intrinsic-lowering -S -o - %s | FileCheck %s
+
+define void @memset_pattern_i128_1_dynvalue(ptr %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1_dynvalue(
+; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]]
+; CHECK:       [[LOADSTORELOOP]]:
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], %[[LOADSTORELOOP]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i128, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    store i128 [[VALUE]], ptr [[TMP2]], align 1
+; CHECK-NEXT:    [[TMP3]] = add i64 [[TMP1]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[LOADSTORELOOP]], label %[[SPLIT]]
+; CHECK:       [[SPLIT]]:
+; CHECK-NEXT:    ret void
+;
+  tail call void @llvm.experimental.memset.pattern(ptr %a, i128 %value, i64 1, i1 0)
+  ret void
+}
+
+define void @memset_pattern_i128_1(ptr %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1(
+; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern, i64 16)
+; CHECK-NEXT:    ret void
+;
+  tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 1, i1 0)
+  ret void
+}
+
+define void @memset_pattern_i128_1_nz_as(ptr addrspace(1) %a, i128 %value) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_1_nz_as(
+; CHECK-SAME: ptr addrspace(1) [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]]
+; CHECK:       [[LOADSTORELOOP]]:
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], %[[LOADSTORELOOP]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i128, ptr addrspace(1) [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    store i128 -113427455635030943652277463699152839203, ptr addrspace(1) [[TMP2]], align 1
+; CHECK-NEXT:    [[TMP3]] = add i64 [[TMP1]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[LOADSTORELOOP]], label %[[SPLIT]]
+; CHECK:       [[SPLIT]]:
+; CHECK-NEXT:    ret void
+;
+  tail call void @llvm.experimental.memset.pattern(ptr addrspace(1) %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 1, i1 0)
+  ret void
+}
+
+define void @memset_pattern_i128_16(ptr %a) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_16(
+; CHECK-SAME: ptr [[A:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern.1, i64 256)
+; CHECK-NEXT:    ret void
+;
+  tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 16, i1 0)
+  ret void
+}
+
+define void @memset_pattern_i128_x(ptr %a, i64 %x) nounwind {
+; CHECK-LABEL: define void @memset_pattern_i128_x(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 16, [[X]]
+; CHECK-NEXT:    call void @memset_pattern16(ptr [[A]], ptr @.memset_pattern.2, i64 [[TMP1]])
+; CHECK-NEXT:    ret void
+;
+  tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 %x, i1 0)
+  ret void
+}
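The size screening in `getMemSetPattern16Value` and the byte count fed to the libcall can be modelled in isolation. The following is a standalone sketch, not LLVM's own code; the helper names `isUsablePatternSize` and `patternByteCount` are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Invented helper mirroring the checks in getMemSetPattern16Value: the
// pattern element must be non-empty, a whole number of bytes, a power of
// two, and no wider than the 16 bytes memset_pattern16 reads.
bool isUsablePatternSize(uint64_t SizeInBits) {
  if (SizeInBits == 0)
    return false;
  if (SizeInBits & 7) // Not a whole number of bytes.
    return false;
  if (SizeInBits & (SizeInBits - 1)) // Not a power of two.
    return false;
  return SizeInBits / 8 <= 16;
}

// Invented helper mirroring the NumBytes computation in the lowering:
// memset_pattern16 takes a total byte count, while the intrinsic's length
// argument counts pattern elements, hence the multiply.
uint64_t patternByteCount(uint64_t EltSizeInBits, uint64_t Count) {
  return (EltSizeInBits / 8) * Count;
}
```

For example, an i128 element with length 16 gives 16 * 16 = 256 bytes, matching the `i64 256` argument checked in `@memset_pattern_i128_16` above.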

@nikic (Contributor) left a comment:

LGTM

Longer term this should probably be handled by libcall legalization, but I'm fine with having it in PreISelIntrinsicLowering for now.

Contributor:

Can you please add a test with e.g. an i64 constant? If I read your code correctly that is also handled by splatting, so we should have coverage.

Contributor (author):

Good catch, definitely an oversight not to cover the 'splatting'. I've added test cases to cover this.

define void @memset_pattern_i128_1_dynvalue(ptr %a, i128 %value) nounwind {
; CHECK-LABEL: define void @memset_pattern_i128_1_dynvalue(
; CHECK-SAME: ptr [[A:%.*]], i128 [[VALUE:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: br i1 false, label %[[SPLIT:.*]], label %[[LOADSTORELOOP:.*]]
Contributor:

Branch on false?

Contributor (author):

This is just a reflection of the naive/unoptimised baseline expansion for memset.pattern, which will be improved in future patches.

Comment on lines 389 to 391
M, TLI, LibFunc_memset_pattern16, Builder.getVoidTy(),
Memset->getRawDest()->getType(), Builder.getPtrTy(),
Memset->getLength()->getType());
Contributor:

Ideally RuntimeLibcalls would know the type signature and you wouldn't need to write it out here

Contributor (author):

I'm interpreting this as a general comment for something it would be good to improve in the future rather than a blocking comment for this PR - but let me know if I understood you incorrectly!

// to be an constant array of 16-bytes. Put the value into a mergable
// global.
GlobalVariable *GV = new GlobalVariable(
*M, PatternValue->getType(), true, GlobalValue::PrivateLinkage,
Contributor:

Best take the address space from the pointer type you're passing this to

Contributor (author):

Just like the LoopIdiomRecognize logic I'm porting across, we avoid generating the libcall if we have a non-zero AS. I've added an assert which makes that clearer and will of course help ensure this piece of code gets updated if that restriction is lifted.

Memset->getLength());
CallInst *MemsetPattern16Call = Builder.CreateCall(
MSP, {Memset->getRawDest(), PatternPtr, NumBytes});
MemsetPattern16Call->setAAMetadata(Memset->getAAMetadata());
Contributor:

Metadata preserve not tested

Contributor (author):

I've added a tbaa test case.

;
tail call void @llvm.experimental.memset.pattern(ptr %a, i128 u0xaaaaaaaabbbbbbbbccccccccdddddddd, i64 %x, i1 0)
ret void
}
Contributor:

Do these support non-int types? Test with FP and pointer? Plus a poison, null, and nontrivial constantexpr test

Contributor (author):

No, llvm.experimental.memset.pattern only accepts integer types.

DL.getTypeSizeInBits(Memset->getValue()->getType()) / 8),
Memset->getLength());
CallInst *MemsetPattern16Call = Builder.CreateCall(
MSP, {Memset->getRawDest(), PatternPtr, NumBytes});
Contributor:

Preserve callsite attributes

Contributor (author):

Were there particular attributes you imagine we'd lose by not preserving them? I'd be concerned about doing something semantically incorrect by blindly copying them, though mergeAttributesAndFlags in SimplifyLibCalls would be one example to copy. As failing to preserve attributes is conservatively correct, I've put a TODO with a reference to this for the time being.

Contributor:

The important one would be the parameter alignments. You're replacing one form of the call with another, I don't see how the attributes wouldn't just transfer 1:1

@asb (author) commented Jan 15, 2025:

I think copying the attributes for the first parameter (the destination ptr) should be fine, and good point that we'll lose the alignment right now. But for the second argument we're going from an integer value argument containing the pattern to a pointer to a 16-byte pattern.

Contributor (author):

I've implemented copying attributes for the first parameter and added a test that demonstrates this.

// Otherwise, we'll use an array of the constants.
unsigned ArraySize = 16 / Size;
ArrayType *AT = ArrayType::get(V->getType(), ArraySize);
return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C));
Contributor:

Why not just directly put the original constant in the global array? Why do you need to process it into a specific data type? Also seems like you could use getConstantDataArrayInfo if you really want to get it into this array form

Contributor (author):

As Nikita rightly pointed out, I was missing an important test case that would perhaps have made this clearer. This logic (ported across from LoopIdiomRecognize) handles what it calls "splatting": e.g. if you have an i16 used in a memset in a loop, it creates a 128-bit pattern, appropriate for use with memset_pattern16, by duplicating the value. You could alternatively assemble a new i128 constant, but it's not clear that would be better or simpler than the ConstantArray approach here.

Contributor:

But the type isn't significant if you're just constructing a source pointer to read from

Don't need temporary std::vector to produce this

Contributor (author):

I think we're probably talking past each other, for which I can only apologise. I'll try to explain my understanding again and hopefully the source of confusion (likely on my end!) is then more obvious.

So for memset_pattern16 we need to create a pointer argument that points to a 16-byte pattern. If the original memset.pattern intrinsic had an i128 constant argument, then that's great - we make a GlobalVariable from that directly. If it had a narrower argument, the logic here creates a ConstantArray with that element repeated the appropriate number of times (e.g. i16 repeated 8 times). I agree the type of the pointer isn't significant, but creating a ConstantArray doesn't seem like the worst way of producing a pointer to 16 bytes of a repeated value, given I have a narrower value. We could instead write logic that inspects the bit width and creates a new APInt and a constant based on that as appropriate, but I think it would be more complex (perhaps there's a handy helper I'm missing?).

PatternValue, ".memset_pattern");
GV->setUnnamedAddr(
GlobalValue::UnnamedAddr::Global); // Ok to merge these.
GV->setAlignment(Align(16));
Contributor:

Why this alignment?

Contributor (author):

I'm inheriting this from LoopIdiomRecognize, and my preference would be to keep aspects like this untouched in the initial switchover before later looking to relax. I've put a TODO to re-evaluate. I would guess it was chosen on the assumption that a naturally aligned i128 is most efficiently accessed.

@asb (author) commented Jan 15, 2025:

I believe all review comments are addressed and this is ready for re-review. Thank you. Although nikic gave an LGTM, I'll await @arsenm's feedback, as a lot of his comments came in after that.


return C;

// Otherwise, we'll use an array of the constants.
unsigned ArraySize = 16 / Size;
Contributor:

Avoid truncating to unsigned

asb added a commit that referenced this pull request Jan 15, 2025
…pattern (#120421)

Relates to (but isn't dependent on) #120420.

This allows alias analysis of the intrinsic of the same quality as for
the libcall, which we want in order to move LoopIdiomRecognize over to
selecting the intrinsic.
@asb asb requested a review from arsenm January 28, 2025 10:44
@asb (author) commented Jan 28, 2025:

Re-requested review from @arsenm - I believe all review comments are addressed either with code modifications or responses to the comment (I think this is the main one to check).

@preames (Collaborator) left a comment:

LGTM w/minor comments addressed. I think all of @arsenm's comments are addressed, so up to you if you want to wait for him to confirm.

@asb asb merged commit 8fcb126 into llvm:main Jan 30, 2025
5 of 7 checks passed