
[SLPVectorizer][X86] Free load cost for stores with constant pointers #118016


Conversation


@antoniofrighetto commented Nov 28, 2024

When estimating the cost of stores of constant buildvectors, do not take into account the cost of the additional load needed to materialize a vector from the constant pool when the store's pointer is a constant. In such cases, the load is avoided in the first place: the only operations required are computing the constant's address (rip+base_addr+offset) and the store itself.

Fixes: #111126.


llvmbot commented Nov 28, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-x86

Author: Antonio Frighetto (antoniofrighetto)

Full diff: https://github.com/llvm/llvm-project/pull/118016.diff

3 Files Affected:

  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+3-2)
  • (modified) llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll (+3-9)
  • (added) llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll (+20)
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 179e29e40614e7..689355a168c2a1 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -5157,8 +5157,9 @@ InstructionCost X86TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
 
   InstructionCost Cost = 0;
 
-  // Add a cost for constant load to vector.
-  if (Opcode == Instruction::Store && OpInfo.isConstant())
+  // Add a cost for constant load to vector, if pointer is not a constant.
+  if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+      SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
     Cost += getMemoryOpCost(Instruction::Load, Src, DL.getABITypeAlign(Src),
                             /*AddressSpace=*/0, CostKind);
 
diff --git a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
index da1808fc278c09..75829cb71814f6 100644
--- a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
+++ b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
@@ -57,9 +57,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    [[CMP_PROL:%.*]] = icmp eq i8 [[TMP3_PROL]], 0
 ; CHECK-NEXT:    br i1 [[CMP_PROL]], label %[[IF_THEN_PROL:.*]], label %[[FOR_INC_PROL:.*]]
 ; CHECK:       [[IF_THEN_PROL]]:
-; CHECK-NEXT:    [[ARRAYIDX_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 2
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX_PROL]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[R_022]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[R_022]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 4
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5_PROL]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC_PROL]]
@@ -82,9 +80,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i8 [[TMP3]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[FOR_INC:.*]]
 ; CHECK:       [[IF_THEN]]:
-; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 2
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[R_117]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[R_117]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 4
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC]]
@@ -96,9 +92,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT:    br i1 [[CMP_1]], label %[[IF_THEN_1:.*]], label %[[FOR_INC_1]]
 ; CHECK:       [[IF_THEN_1]]:
 ; CHECK-NEXT:    [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 6
-; CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 8
-; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX_1]], align 2
-; CHECK-NEXT:    store i16 0, ptr [[ADD_PTR]], align 2
+; CHECK-NEXT:    store <2 x i16> zeroinitializer, ptr [[ADD_PTR]], align 2
 ; CHECK-NEXT:    [[ARRAYIDX5_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 10
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX5_1]], align 2
 ; CHECK-NEXT:    br label %[[FOR_INC_1]]
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
new file mode 100644
index 00000000000000..1d9e4d20d20af6
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
@@ -0,0 +1,20 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=x86_64-unknown-linux-gnu --passes=slp-vectorizer -S -o - %s | FileCheck %s
+
+@arr = global [20 x i64] zeroinitializer, align 16
+
+define void @store_from_constant_ptr() {
+; CHECK-LABEL: define void @store_from_constant_ptr() {
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr @arr, align 16
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+; CHECK-NEXT:    ret void
+;
+  store i64 1, ptr @arr, align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 8), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 24), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 40), align 8
+  ret void
+}

@antoniofrighetto
Contributor Author

FWIW, the load emitted by clang-15 (https://godbolt.org/z/nbofoe6zM) shouldn't be related to this, and in any case it just seems like suboptimal codegen (GCC doesn't emit any load for this instance).

@antoniofrighetto
Contributor Author

@alexey-bataev, now that I look at it more closely, I think that's the load you were referring to in 0e7ed32? Maybe we should remove it altogether?

@alexey-bataev
Member

@alexey-bataev, now that I look at it more closely, I think that's the load you were referring to in 0e7ed32? Maybe we should remove it altogether?

Sorry, I do not quite understand. What would be better to remove?

-  if (Opcode == Instruction::Store && OpInfo.isConstant())
+  // Add a cost for constant load to vector, if pointer is not a constant.
+  if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+      SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
Collaborator

We're trying to account for the cost of the constant vector load, but you want to skip it if the store address is constant?

Collaborator

Looking at #111126 - shouldn't we be checking to see if OpInfo.isUniform() and adjusting the Type accordingly?

It looks like the test case thinks it has to load a <20 x i64> splat (i64 1) constant when it only has to load a <2 x i64> splat (i64 1).
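
For reference, a minimal sketch of what an OpInfo.isUniform()-based adjustment inside X86TTIImpl::getMemoryOpCost could look like (illustrative only; the 128-bit clamp and the exact structure are assumptions, not the reviewer's proposal nor the patch that eventually landed):

  // Sketch: if the stored constant is a uniform splat, the constant-pool load
  // is a single legal-width vector that every store reuses, so cost the load
  // at (at most) one 128-bit register rather than the full store type.
  if (Opcode == Instruction::Store && OpInfo.isConstant()) {
    Type *LoadTy = Src;
    if (OpInfo.isUniform())
      if (auto *VTy = dyn_cast<FixedVectorType>(Src)) {
        unsigned MaxElts =
            std::max<unsigned>(1, 128 / VTy->getScalarSizeInBits());
        if (VTy->getNumElements() > MaxElts)
          LoadTy = FixedVectorType::get(VTy->getElementType(), MaxElts);
      }
    Cost += getMemoryOpCost(Instruction::Load, LoadTy,
                            DL.getABITypeAlign(LoadTy),
                            /*AddressSpace=*/0, CostKind);
  }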

Contributor Author

Sorry if I'm missing something, but where does the Type become <20 x i64> in this case? AFAICT, there isn't any load associated with the store when the store address is constant on x86; only the memory address needs to be computed.

Collaborator

The cost you are disabling in the code above is for the constant load - not the memory address computation.

Contributor Author

That's correct. There is no cost for any constant load. There is only a memory address computation, which is not a load.

Collaborator

That's only true for scalars - a vector constant must be loaded from the constant pool unless it can be rematerialized (which for x86 means 0 or -1 splats only). That is the issue in #111126 - the load from LCPI0_0 is being overestimated (it's just a <2 x i64> load that is reused in all 10 stores, but we're treating it as a <20 x i64> load):

f():
        movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
        movaps  xmmword ptr [rip + arr], xmm0
        movaps  xmmword ptr [rip + arr+16], xmm0
        movaps  xmmword ptr [rip + arr+32], xmm0
        movaps  xmmword ptr [rip + arr+48], xmm0
        movaps  xmmword ptr [rip + arr+64], xmm0
        movaps  xmmword ptr [rip + arr+80], xmm0
        movaps  xmmword ptr [rip + arr+96], xmm0
        movaps  xmmword ptr [rip + arr+112], xmm0
        movaps  xmmword ptr [rip + arr+128], xmm0
        movaps  xmmword ptr [rip + arr+144], xmm0
        ret

I'm investigating an alternative patch but have run out of time tonight :/

Contributor Author

@antoniofrighetto commented Dec 3, 2024

Sorry, isn't this just bad codegen? In principle, we should be able to materialize the constant in a floating-point register; why would we need to load it from memory? GCC seems to catch this: https://godbolt.org/z/6baWMK4cW.

Contributor Author

Uhmm, we still need a load when dealing with arbitrary 64-bit constants.

Contributor Author

If anything, the check would need to take this into account (scalar <= INT_MAX), but feel free to assess whether there's a more favourable approach here.
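
For concreteness, a purely hypothetical guard for that condition (the helper name and the 32-bit bound are assumptions, roughly matching the "scalar <= INT_MAX" idea above; this is not code from either patch):

  // Hypothetical helper, illustrative only: the stored constant counts as
  // cheap when it is (a splat of) a scalar fitting in a 32-bit immediate,
  // i.e. something that could plausibly be rematerialized in a register
  // rather than loaded from the constant pool.
  auto IsCheapStoredConstant = [](const StoreInst *SI) {
    Constant *C = dyn_cast<Constant>(SI->getValueOperand());
    if (C && C->getType()->isVectorTy())
      C = C->getSplatValue(); // nullptr unless a uniform splat
    auto *CI = dyn_cast_or_null<ConstantInt>(C);
    return CI && CI->getValue().isSignedIntN(32);
  };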

RKSimon added a commit that referenced this pull request Dec 4, 2024
I'd copied the test case from #118016 instead of the original #111126 test case

RKSimon commented Dec 4, 2024

Proposed patch: #118642

@antoniofrighetto
Contributor Author

Closing this in favour of the approach in #118642.


antoniofrighetto commented Dec 4, 2024

@RKSimon, do you think it would still be valuable to materialize the constant through separate movs + punpcklqdq and save a load from memory, when possible? If so, that would happen in X86ISelLowering, correct?


RKSimon commented Dec 4, 2024

There's plenty of scope for better vector constant rematerialisation, but gpr->fpu transfer + shuffle perf isn't always great: it avoids memory latency but can cause a lot of throughput + register pressure issues - just trying to do this in DAGISel doesn't work well.

I started the X86FixupVectorConstants pass to begin reducing vector constant pool traffic, and there's plenty more to do to remove constant broadcasts etc. in X86ISelLowering; that should then put us in a position to do other things as well - similar to how MachineLICM attempts to hoist out constant loads if there are spare registers.
