[SLPVectorizer][X86] Free load cost for stores with constant pointers #118016
Conversation
When estimating the cost for stores of constant buildvectors, do not take into account the cost of the additional load to materialize a vector from a constant pool when dealing with a constant pointer. In such cases, the load is avoided in the first place, as the only operations required are computing the address of the constant (`rip+base_addr+offset`) and the store itself. Fixes regression: llvm#111126.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-x86 Author: Antonio Frighetto (antoniofrighetto). Full diff: https://github.com/llvm/llvm-project/pull/118016.diff 3 Files Affected:
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 179e29e40614e7..689355a168c2a1 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -5157,8 +5157,9 @@ InstructionCost X86TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
InstructionCost Cost = 0;
- // Add a cost for constant load to vector.
- if (Opcode == Instruction::Store && OpInfo.isConstant())
+ // Add a cost for constant load to vector, if pointer is not a constant.
+ if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+ SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
Cost += getMemoryOpCost(Instruction::Load, Src, DL.getABITypeAlign(Src),
/*AddressSpace=*/0, CostKind);
diff --git a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
index da1808fc278c09..75829cb71814f6 100644
--- a/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
+++ b/llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
@@ -57,9 +57,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
; CHECK-NEXT: [[CMP_PROL:%.*]] = icmp eq i8 [[TMP3_PROL]], 0
; CHECK-NEXT: br i1 [[CMP_PROL]], label %[[IF_THEN_PROL:.*]], label %[[FOR_INC_PROL:.*]]
; CHECK: [[IF_THEN_PROL]]:
-; CHECK-NEXT: [[ARRAYIDX_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 2
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX_PROL]], align 2
-; CHECK-NEXT: store i16 0, ptr [[R_022]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[R_022]], align 2
; CHECK-NEXT: [[ARRAYIDX5_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 4
; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5_PROL]], align 2
; CHECK-NEXT: br label %[[FOR_INC_PROL]]
@@ -82,9 +80,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
; CHECK-NEXT: [[CMP:%.*]] = icmp eq i8 [[TMP3]], 0
; CHECK-NEXT: br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[FOR_INC:.*]]
; CHECK: [[IF_THEN]]:
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 2
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX]], align 2
-; CHECK-NEXT: store i16 0, ptr [[R_117]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[R_117]], align 2
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 4
; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5]], align 2
; CHECK-NEXT: br label %[[FOR_INC]]
@@ -96,9 +92,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
; CHECK-NEXT: br i1 [[CMP_1]], label %[[IF_THEN_1:.*]], label %[[FOR_INC_1]]
; CHECK: [[IF_THEN_1]]:
; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 6
-; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 8
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX_1]], align 2
-; CHECK-NEXT: store i16 0, ptr [[ADD_PTR]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[ADD_PTR]], align 2
; CHECK-NEXT: [[ARRAYIDX5_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 10
; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5_1]], align 2
; CHECK-NEXT: br label %[[FOR_INC_1]]
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
new file mode 100644
index 00000000000000..1d9e4d20d20af6
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/X86/buildvector_store_constant.ll
@@ -0,0 +1,20 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=x86_64-unknown-linux-gnu --passes=slp-vectorizer -S -o - %s | FileCheck %s
+
+@arr = global [20 x i64] zeroinitializer, align 16
+
+define void @store_from_constant_ptr() {
+; CHECK-LABEL: define void @store_from_constant_ptr() {
+; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr @arr, align 16
+; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+; CHECK-NEXT: ret void
+;
+  store i64 1, ptr @arr, align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 8), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 24), align 8
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
+  store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 40), align 8
+  ret void
+}
FWIW, the load emitted by clang-15 (https://godbolt.org/z/nbofoe6zM) shouldn't be related to this, and in any case, it just seems like suboptimal codegen (GCC doesn't emit any load for this instance).
@alexey-bataev, now that I look at it more closely, I think that's the load you were referring to in 0e7ed32? Maybe we should remove it altogether?
Sorry, I do not quite understand. What would be better to remove?
if (Opcode == Instruction::Store && OpInfo.isConstant())
// Add a cost for constant load to vector, if pointer is not a constant.
if (auto *SI = dyn_cast_or_null<StoreInst>(I);
    SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
We're trying to account for the cost of the constant vector load, but you want to skip it if the store address is constant?
Looking at #111126 - shouldn't we be checking to see if `OpInfo.isUniform()` and adjusting the Type accordingly? It looks like the test case thinks it has to load a `<20 x i64> splat(i64 1)` constant when it just has to load `<2 x i64> splat(i64 1)`.
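A rough sketch of what this suggestion might look like, assuming the splat load is costed at a 128-bit legal width; the width choice and exact placement are assumptions for illustration, not part of this patch (`Src`, `OpInfo`, `DL`, and `CostKind` are as in `X86TTIImpl::getMemoryOpCost`):

```cpp
// Sketch only: for a uniform (splat) constant store, cost the
// constant-pool load at a single legal 128-bit vector, since codegen
// loads it once and reuses the register for every store.
if (Opcode == Instruction::Store && OpInfo.isConstant()) {
  Type *LoadTy = Src;
  if (OpInfo.isUniform())
    if (auto *VTy = dyn_cast<FixedVectorType>(Src)) {
      unsigned EltBits =
          DL.getTypeSizeInBits(VTy->getElementType()).getFixedValue();
      if (EltBits && 128 % EltBits == 0)
        LoadTy = FixedVectorType::get(VTy->getElementType(), 128 / EltBits);
    }
  Cost += getMemoryOpCost(Instruction::Load, LoadTy,
                          DL.getABITypeAlign(LoadTy),
                          /*AddressSpace=*/0, CostKind);
}
```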
Sorry if I'm missing something, where is the Type being `<20 x i64>` in this case? AFAICT, there isn't any load associated with the store when the store address is constant on x86, since only the memory address needs to be computed.
The cost you are disabling in the code above is for the constant load - not the memory address computation.
That's correct. There is no cost for any constant load. There is only a memory address computation, which is not a load.
That's only true for scalars - a vector constant must be loaded from the constant pool unless rematerialized (which for x86 means 0 or -1 splats only). That is the issue in #111126 - the load from `LCPI0_0` is being overestimated (it's just a `<2 x i64>` load that is reused in all 10 stores, but we're treating it as a `<20 x i64>` load):
f():
movaps xmm0, xmmword ptr [rip + .LCPI0_0]
movaps xmmword ptr [rip + arr], xmm0
movaps xmmword ptr [rip + arr+16], xmm0
movaps xmmword ptr [rip + arr+32], xmm0
movaps xmmword ptr [rip + arr+48], xmm0
movaps xmmword ptr [rip + arr+64], xmm0
movaps xmmword ptr [rip + arr+80], xmm0
movaps xmmword ptr [rip + arr+96], xmm0
movaps xmmword ptr [rip + arr+112], xmm0
movaps xmmword ptr [rip + arr+128], xmm0
movaps xmmword ptr [rip + arr+144], xmm0
ret
I'm investigating an alternative patch but have run out of time tonight :/
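A minimal sketch of the rematerialization special case mentioned above (0 or -1 splats, producible with `xorps`/`pcmpeqd` rather than a constant-pool load); the helper name and placement are hypothetical, not from any patch:

```cpp
#include "llvm/IR/Constants.h"

// Hypothetical helper: true if an x86 vector constant can be
// rematerialized in a register without touching the constant pool,
// i.e. an all-zeros or all-ones value.
static bool isRematerializableVectorConstant(const llvm::Constant *C) {
  return C->isNullValue() || C->isAllOnesValue();
}
```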
Sorry, isn't this just bad codegen? In principle, we should be able to materialize the constant in a floating-point register; why would we need to load it from memory? GCC seems to catch this: https://godbolt.org/z/6baWMK4cW.
Uhmm, we still need a load when dealing with arbitrary 64-bit constants.
If anything, the check would need to take this into account (scalar <= INT_MAX), but feel free to assess whether there's a more favourable approach here.
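A minimal sketch of such a check, assuming the intent is to detect 64-bit values that fit a sign-extended 32-bit immediate and so can be materialized with a plain `mov` (the helper is illustrative, not from any patch):

```cpp
#include "llvm/ADT/APInt.h"

// Hypothetical check: a 64-bit constant representable as a sign-extended
// 32-bit immediate needs no load to materialize on x86-64.
static bool fitsInSExtImm32(const llvm::APInt &V) {
  return V.isSignedIntN(32);
}
```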
Proposed patch: #118642
Closing this, as the favoured approach is in #118642.
@RKSimon, do you think it would still be valuable to materialize the constant through separate movs + punpcklqdq, saving a load from memory, when possible? If anything, that would happen in X86ISelLowering, correct?
There's plenty of scope for better vector constant rematerialisation, but gpr->fpu transfer + shuffle perf isn't always great: it avoids memory latency but can cause a lot of throughput and register pressure issues, so just trying to do this in DAGISel doesn't work well. I started the X86FixupVectorConstants pass to begin reducing vector constant pool traffic, and there's plenty more to do to remove constant broadcasts etc. in X86ISelLowering. That should then put us in a position where we can do other things as well - similar to how MachineLICM attempts to hoist out constant loads if there are spare registers.