
[SLPVectorizer][X86] Free load cost for stores with constant pointers #118016


Conversation


@antoniofrighetto commented Nov 28, 2024

When estimating the cost of stores of constant buildvectors, do not take into account the cost of the additional load needed to materialize a vector from the constant pool when the store's pointer is a constant. In such cases, the load is avoided in the first place: the only operations required are computing the constant's address (rip+base_addr+offset) and the store itself.

Fixes: #111126.


llvmbot commented Nov 28, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-x86

Author: Antonio Frighetto (antoniofrighetto)

Full diff: https://github.com/llvm/llvm-project/pull/118016.diff

3 Files Affected:

  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+3-2)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll (+3-9)
  • (added) llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll (+20)
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 179e29e40614e7..689355a168c2a1 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -5157,8 +5157,9 @@ InstructionCost X86TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
 
   InstructionCost Cost = 0;
 
-  // Add a cost for constant load to vector.
-  if (Opcode == Instruction::Store && OpInfo.isConstant())
+  // Add a cost for constant load to vector, if pointer is not a constant.
+  if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+      SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
     Cost += getMemoryOpCost(Instruction::Load, Src, DL.getABITypeAlign(Src),
                             /*AddressSpace=*/0, CostKind);
 
diff --git a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
index da1808fc278c09..75829cb71814f6 100644
--- a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
+++ b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
@@ -57,9 +57,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    [[CMP_PROL:%.*]] = icmp eq i8 [[TMP3_PROL]], 0
 ; CHECK-NEXT:    br i1 [[CMP_PROL]], label %[[IF_THEN_PROL:.*]], label %[[FOR_INC_PROL:.*]]
 ; CHECK:       [[IF_THEN_PROL]]:
-; CHECK-NEXT:    [[ARRAYIDX_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 2
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX_PROL]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[R_022]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[R_022]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 4
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5_PROL]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC_PROL]]
@@ -82,9 +80,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i8 [[TMP3]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[FOR_INC:.*]]
 ; CHECK:       [[IF_THEN]]:
-; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 2
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[R_117]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[R_117]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 4
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC]]
@@ -96,9 +92,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    br i1 [[CMP_1]], label %[[IF_THEN_1:.*]], label %[[FOR_INC_1]]
 ; CHECK:       [[IF_THEN_1]]:
 ; CHECK-NEXT:    [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 6
-; CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 8
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX_1]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[ADD_PTR]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[ADD_PTR]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 10
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5_1]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC_1]]
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
new file mode 100644
index 00000000000000..1d9e4d20d20af6
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
@@ -0,0 +1,20 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=x86_64-unknown-linux-gnu --passes=slp-vectorizer -S -o - %s | FileCheck %s
+
+@arr = global [20 x i64] zeroinitializer, align 16
+
+define void @store_from_constant_ptr() {
+; CHECK-LABEL: define void @store_from_constant_ptr() {
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr @arr, align 16
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+; CHECK-NEXT:    ret void
+;
+  store i64 1, ptr @arr, align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 8), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 24), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 40), align 8
+  ret void
+}

@antoniofrighetto
Contributor Author

FWIW, the load emitted by clang-15 (https://godbolt.org/z/nbofoe6zM) shouldn't be related to this, and in any case it just seems like suboptimal codegen (GCC doesn't emit any load for this instance).

@antoniofrighetto
Contributor Author

@alexey-bataev, now that I look at it more closely, I think that's the load you were referring to in 0e7ed32? Maybe we should remove it altogether?

@alexey-bataev
Member

@alexey-bataev, now that I look at it more closely, I think that's the load you were referring to in 0e7ed32? Maybe we should remove it altogether?

Sorry, I do not quite understand. What would be better to remove?

-  if (Opcode == Instruction::Store && OpInfo.isConstant())
+  // Add a cost for constant load to vector, if pointer is not a constant.
+  if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+      SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
Collaborator

We're trying to account for the cost of the constant vector load, but you want to skip it if the store address is constant?

Collaborator

Looking at #111126 - shouldn't we be checking to see if OpInfo.isUniform() and adjusting the Type accordingly?

It looks like the test case thinks it has to load a <20 x i64> splat (i64 1) constant when it only has to load a <2 x i64> splat (i64 1).
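
For reference, a minimal sketch of what an OpInfo.isUniform()-based adjustment inside X86TTIImpl::getMemoryOpCost could look like (illustrative only; the 128-bit clamp and the exact structure are assumptions, not the reviewer's proposal nor the patch that eventually landed):

  // Sketch: if the stored constant is a uniform splat, the constant-pool load
  // is a single legal-width vector that every store reuses, so cost the load
  // at (at most) one 128-bit register rather than the full store type.
  if (Opcode == Instruction::Store && OpInfo.isConstant()) {
    Type *LoadTy = Src;
    if (OpInfo.isUniform())
      if (auto *VTy = dyn_cast<FixedVectorType>(Src)) {
        unsigned MaxElts =
            std::max<unsigned>(1, 128 / VTy->getScalarSizeInBits());
        if (VTy->getNumElements() > MaxElts)
          LoadTy = FixedVectorType::get(VTy->getElementType(), MaxElts);
      }
    Cost += getMemoryOpCost(Instruction::Load, LoadTy,
                            DL.getABITypeAlign(LoadTy),
                            /*AddressSpace=*/0, CostKind);
  }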

Contributor Author

Sorry if I'm missing something, but where does the Type become <20 x i64> in this case? AFAICT, there isn't any load associated with the store when the store address is constant on x86; only the memory address needs to be computed.

Collaborator

The cost you are disabling in the code above is for the constant load - not the memory address computation.

Contributor Author

That's correct. There is no cost for any constant load. There is only a memory address computation, which is not a load.

Collaborator

That's only true for scalars - a vector constant must be loaded from the constant pool unless it can be rematerialized (which for x86 means 0 or -1 splats only). That is the issue in #111126 - the load from LCPI0_0 is being overestimated (it's just a <2 x i64> load that is reused in all 10 stores, but we're treating it as a <20 x i64> load):

f():
        movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
        movaps  xmmword ptr [rip + arr], xmm0
        movaps  xmmword ptr [rip + arr+16], xmm0
        movaps  xmmword ptr [rip + arr+32], xmm0
        movaps  xmmword ptr [rip + arr+48], xmm0
        movaps  xmmword ptr [rip + arr+64], xmm0
        movaps  xmmword ptr [rip + arr+80], xmm0
        movaps  xmmword ptr [rip + arr+96], xmm0
        movaps  xmmword ptr [rip + arr+112], xmm0
        movaps  xmmword ptr [rip + arr+128], xmm0
        movaps  xmmword ptr [rip + arr+144], xmm0
        ret

I'm investigating an alternative patch but have run out of time tonight :/

Contributor Author

@antoniofrighetto commented Dec 3, 2024

Sorry, isn't this just bad codegen? In principle, we should be able to materialize the constant in a floating-point register; why would we need to load it from memory? GCC seems to catch this: https://godbolt.org/z/6baWMK4cW.

Contributor Author

Uhmm, we still need a load when dealing with arbitrary 64-bit constants.

Contributor Author

If anything, the check would need to take this into account (scalar <= INT_MAX), but feel free to assess whether there's a more favourable approach here.
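
For concreteness, a purely hypothetical guard for that condition (the helper name and the 32-bit bound are assumptions, roughly matching the "scalar <= INT_MAX" idea above; this is not code from either patch):

  // Hypothetical helper, illustrative only: the stored constant counts as
  // cheap when it is (a splat of) a scalar fitting in a 32-bit immediate,
  // i.e. something that could plausibly be rematerialized in a register
  // rather than loaded from the constant pool.
  auto IsCheapStoredConstant = [](const StoreInst *SI) {
    Constant *C = dyn_cast<Constant>(SI->getValueOperand());
    if (C && C->getType()->isVectorTy())
      C = C->getSplatValue(); // nullptr unless a uniform splat
    auto *CI = dyn_cast_or_null<ConstantInt>(C);
    return CI && CI->getValue().isSignedIntN(32);
  };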

RKSimon added a commit that referenced this pull request Dec 4, 2024
I'd copied the test case from #118016 instead of the original #111126 test case

RKSimon commented Dec 4, 2024

Proposed patch: #118642

@antoniofrighetto
Contributor Author

Closing this in favour of the approach in #118642.


antoniofrighetto commented Dec 4, 2024

@RKSimon, do you think it would still be valuable to materialize the constant through separate movs + punpcklqdq and save a load from memory, when possible? If so, that would happen in X86ISelLowering, correct?


RKSimon commented Dec 4, 2024

There's plenty of scope for better vector constant rematerialisation, but gpr->fpu transfer + shuffle perf isn't always great: it avoids memory latency but can cause a lot of throughput + register pressure issues - just trying to do this in DAGISel doesn't work well.

I started the X86FixupVectorConstants pass to begin reducing vector constant pool traffic, and there's plenty more to do to remove constant broadcasts etc. in X86ISelLowering; that should then put us in a position to do other things as well - similar to how MachineLICM attempts to hoist out constant loads if there are spare registers.
