[SLPVectorizer][X86] Free load cost for stores with constant pointers #118016

Closed
5 changes: 3 additions & 2 deletions llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -5157,8 +5157,9 @@ InstructionCost X86TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,

   InstructionCost Cost = 0;

-  // Add a cost for constant load to vector.
-  if (Opcode == Instruction::Store && OpInfo.isConstant())
+  // Add a cost for constant load to vector, if pointer is not a constant.
+  if (auto *SI = dyn_cast_or_null<StoreInst>(I);
+      SI && !isa<Constant>(SI->getPointerOperand()) && OpInfo.isConstant())
Collaborator:

We're trying to account for the cost of the constant vector load, but you want to skip it if the store address is constant?

Collaborator:

Looking at #111126 - shouldn't we be checking whether OpInfo.isUniform() and adjusting the Type accordingly?

It looks like the test case thinks it has to load a <20 x i64> splat(i64 1) constant when it only has to load a <2 x i64> splat(i64 1).
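A rough sketch of that suggestion, purely for illustration (this is not the patch under review; the 128-bit register width used for the reused splat load is an assumption):

// Sketch (assumption): when the stored constant is a uniform splat, the
// backing constant-pool load is one vector register wide and is reused by
// every store, so cost it at that width instead of at the full store type.
if (Opcode == Instruction::Store && OpInfo.isConstant()) {
  Type *LoadTy = Src;
  if (OpInfo.isUniform() && isa<FixedVectorType>(Src)) {
    Type *EltTy = Src->getScalarType();
    unsigned EltBits = EltTy->getPrimitiveSizeInBits().getFixedValue();
    if (EltBits > 0 && EltBits <= 128 && 128 % EltBits == 0)
      LoadTy = FixedVectorType::get(EltTy, 128 / EltBits); // e.g. <2 x i64>
  }
  Cost += getMemoryOpCost(Instruction::Load, LoadTy, DL.getABITypeAlign(LoadTy),
                          /*AddressSpace=*/0, CostKind);
}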

Contributor Author:

Sorry if I'm missing something, but where does the Type end up as <20 x i64> in this case? AFAICT, there isn't any load associated with the store when the store address is constant on x86; all that is needed is to compute the memory address.

Collaborator:

The cost you are disabling in the code above is for the constant load - not the memory address computation.

Contributor Author:

That's correct. There is no cost for any constant load. There is only a memory address computation, which is not a load.

Collaborator:

That's only true for scalars - a vector constant must be loaded from the constant pool unless it can be rematerialized (which for x86 means 0 or -1 splats only). That is the issue in #111126 - the load from .LCPI0_0 is being overestimated (it's just a <2 x i64> load that is reused by all 10 stores, but we're treating it as a <20 x i64> load):

f():
        movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
        movaps  xmmword ptr [rip + arr], xmm0
        movaps  xmmword ptr [rip + arr+16], xmm0
        movaps  xmmword ptr [rip + arr+32], xmm0
        movaps  xmmword ptr [rip + arr+48], xmm0
        movaps  xmmword ptr [rip + arr+64], xmm0
        movaps  xmmword ptr [rip + arr+80], xmm0
        movaps  xmmword ptr [rip + arr+96], xmm0
        movaps  xmmword ptr [rip + arr+112], xmm0
        movaps  xmmword ptr [rip + arr+128], xmm0
        movaps  xmmword ptr [rip + arr+144], xmm0
        ret

I'm investigating an alternative patch but have run out of time tonight :/

Contributor Author (@antoniofrighetto, Dec 3, 2024):

Sorry, isn't this just bad codegen? In principle, we should be able to materialize the constant in a floating-point register; why would we need to load it from memory? GCC seems to catch this: https://godbolt.org/z/6baWMK4cW.

Contributor Author:

Uhmm, we still need a load when dealing with arbitrary 64-bit constants.

Contributor Author:

If anything, the check would need to take this into account (scalar <= INT_MAX), but feel free to assess whether there's a more favourable approach here.

     Cost += getMemoryOpCost(Instruction::Load, Src, DL.getABITypeAlign(Src),
                             /*AddressSpace=*/0, CostKind);
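To illustrate the scalar <= INT_MAX point above: an i64 store can encode its constant as a sign-extended 32-bit immediate, while wider integer values and FP/vector constants still have to be materialized first (typically via a constant-pool load). A hypothetical helper capturing that distinction might look like the following (the name and placement are assumptions, not part of this PR):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// Hypothetical: does storing this constant need a separate materialization
// (e.g. a constant-pool load), or can it be folded into the store itself?
static bool storeNeedsConstantMaterialization(const StoreInst &SI) {
  if (auto *CI = dyn_cast<ConstantInt>(SI.getValueOperand()))
    if (CI->getBitWidth() <= 64)
      // mov r/m64, imm32 sign-extends, so values that fit a signed 32-bit
      // immediate can be stored directly without any extra load.
      return !isInt<32>(CI->getSExtValue());
  return true; // FP and vector constants generally come from the constant pool
}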

12 changes: 3 additions & 9 deletions llvm/test/Transforms/LoopUnroll/unroll-cleanup.ll
@@ -57,9 +57,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT: [[CMP_PROL:%.*]] = icmp eq i8 [[TMP3_PROL]], 0
 ; CHECK-NEXT: br i1 [[CMP_PROL]], label %[[IF_THEN_PROL:.*]], label %[[FOR_INC_PROL:.*]]
 ; CHECK: [[IF_THEN_PROL]]:
-; CHECK-NEXT: [[ARRAYIDX_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 2
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX_PROL]], align 2
-; CHECK-NEXT: store i16 0, ptr [[R_022]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[R_022]], align 2
 ; CHECK-NEXT: [[ARRAYIDX5_PROL:%.*]] = getelementptr inbounds i8, ptr [[R_022]], i64 4
 ; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5_PROL]], align 2
 ; CHECK-NEXT: br label %[[FOR_INC_PROL]]
@@ -82,9 +80,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT: [[CMP:%.*]] = icmp eq i8 [[TMP3]], 0
 ; CHECK-NEXT: br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[FOR_INC:.*]]
 ; CHECK: [[IF_THEN]]:
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 2
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX]], align 2
-; CHECK-NEXT: store i16 0, ptr [[R_117]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[R_117]], align 2
 ; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 4
 ; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5]], align 2
 ; CHECK-NEXT: br label %[[FOR_INC]]
@@ -96,9 +92,7 @@ define void @_Z3fn1v(ptr %r, ptr %a) #0 {
 ; CHECK-NEXT: br i1 [[CMP_1]], label %[[IF_THEN_1:.*]], label %[[FOR_INC_1]]
 ; CHECK: [[IF_THEN_1]]:
 ; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 6
-; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 8
-; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX_1]], align 2
-; CHECK-NEXT: store i16 0, ptr [[ADD_PTR]], align 2
+; CHECK-NEXT: store <2 x i16> zeroinitializer, ptr [[ADD_PTR]], align 2
 ; CHECK-NEXT: [[ARRAYIDX5_1:%.*]] = getelementptr inbounds i8, ptr [[R_117]], i64 10
 ; CHECK-NEXT: store i16 0, ptr [[ARRAYIDX5_1]], align 2
 ; CHECK-NEXT: br label %[[FOR_INC_1]]
@@ -0,0 +1,20 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=x86_64-unknown-linux-gnu --passes=slp-vectorizer -S -o - %s | FileCheck %s

@arr = global [20 x i64] zeroinitializer, align 16

define void @store_from_constant_ptr() {
; CHECK-LABEL: define void @store_from_constant_ptr() {
; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr @arr, align 16
; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
; CHECK-NEXT: store <2 x i64> splat (i64 1), ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
; CHECK-NEXT: ret void
;
store i64 1, ptr @arr, align 16
store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 8), align 8
store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 16), align 16
store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 24), align 8
store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 32), align 16
store i64 1, ptr getelementptr inbounds (i8, ptr @arr, i64 40), align 8
ret void
}