[SelectionDAG][RISCV] Avoid store merging across function calls #130430
Conversation
This patch improves DAGCombiner's handling of potential store merges by detecting function calls between loads and stores. When a function call exists in the chain between a load and its corresponding store, we avoid merging these stores, as doing so would require costly register spilling. Currently, it's only enabled for riscv.
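As a rough illustration (the function and variable names below are hypothetical, not taken from the actual test), the pattern is two adjacent loads before a call feeding two adjacent stores after it. Merged into a single vector access, the loaded value has to survive the call in a vector register, and RISC-V has no callee-saved vector registers, so it gets spilled and reloaded around the call; kept as scalars, the values can simply live in callee-saved GPRs.

// Hypothetical sketch of the affected pattern (not the PR's .ll test case).
#include <cstdint>

void g(); // opaque external call

void copy_pair(const uint64_t *src, uint64_t *dst) {
  uint64_t lo = src[0]; // loads happen before the call...
  uint64_t hi = src[1];
  g();
  dst[0] = lo;          // ...stores happen after it
  dst[1] = hi;
}

The updated CHECK lines in the test diff below show exactly this trade: the vs1r.v/vl1r.v spill pair around call g is replaced by plain ld/sd through s3 and s4.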
@llvm/pr-subscribers-backend-risc-v

Author: Mikhail R. Gadelha (mikhailramalho)

Changes

This patch improves DAGCombiner's handling of potential store merges by detecting function calls between loads and stores. When a function call exists in the chain between a load and its corresponding store, we avoid merging these stores, as doing so would require costly register spilling.

We had to implement a hook on TLI, since TTI is unavailable in DAGCombine.

Currently, it's only enabled for riscv.

This is the DAG equivalent of PR #129258.

Full diff: https://github.com/llvm/llvm-project/pull/130430.diff

4 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 2089d47e9cbc8..f8dd6cdd6aec8 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3506,6 +3506,10 @@ class TargetLoweringBase {
/// The default implementation just freezes the set of reserved registers.
virtual void finalizeLowering(MachineFunction &MF) const;
+ /// Returns true if it's profitable to allow merging store of loads when there
+ /// are function calls between the load and the store.
+ virtual bool shouldMergeStoreOfLoadsOverCall(EVT) const { return true; }
+
//===----------------------------------------------------------------------===//
// GlobalISel Hooks
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index ef5f2210573e0..85b3682318e32 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -21553,6 +21553,56 @@ bool DAGCombiner::tryStoreMergeOfLoads(SmallVectorImpl<MemOpLink> &StoreNodes,
JointMemOpVT = EVT::getIntegerVT(Context, SizeInBits);
}
+ auto HasCallInLdStChain = [](SmallVectorImpl<MemOpLink> &StoreNodes,
+ SmallVectorImpl<MemOpLink> &LoadNodes,
+ unsigned NumStores) {
+ for (unsigned i = 0; i < NumStores; ++i) {
+ StoreSDNode *St = cast<StoreSDNode>(StoreNodes[i].MemNode);
+ SDValue Val = peekThroughBitcasts(St->getValue());
+ LoadSDNode *Ld = cast<LoadSDNode>(Val);
+ assert(Ld == LoadNodes[i].MemNode && "Load and store mismatch");
+
+ SmallPtrSet<const SDNode *, 32> Visited;
+ SmallVector<std::pair<const SDNode *, bool>, 8> Worklist;
+ Worklist.emplace_back(St->getOperand(0).getNode(), false);
+
+ while (!Worklist.empty()) {
+ auto [Node, FoundCall] = Worklist.pop_back_val();
+ if (!Visited.insert(Node).second || Node->getNumOperands() == 0)
+ continue;
+
+ switch (Node->getOpcode()) {
+ case ISD::CALLSEQ_END:
+ Worklist.emplace_back(Node->getOperand(0).getNode(), true);
+ break;
+ case ISD::TokenFactor:
+ for (SDValue Op : Node->ops())
+ Worklist.emplace_back(Op.getNode(), FoundCall);
+ break;
+ case ISD::LOAD:
+ if (Node == Ld)
+ return FoundCall;
+ [[fallthrough]];
+ default:
+ if (Node->getNumOperands() > 0)
+ Worklist.emplace_back(Node->getOperand(0).getNode(), FoundCall);
+ break;
+ }
+ }
+ return false;
+ }
+ return false;
+ };
+
+ // Check if there is a call in the load/store chain.
+ if (!TLI.shouldMergeStoreOfLoadsOverCall(JointMemOpVT) &&
+ HasCallInLdStChain(StoreNodes, LoadNodes, NumElem)) {
+ StoreNodes.erase(StoreNodes.begin(), StoreNodes.begin() + NumElem);
+ LoadNodes.erase(LoadNodes.begin(), LoadNodes.begin() + NumElem);
+ NumConsecutiveStores -= NumElem;
+ continue;
+ }
+
SDLoc LoadDL(LoadNodes[0].MemNode);
SDLoc StoreDL(StoreNodes[0].MemNode);
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index ffbc14a29006c..658d1bce2cf6e 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -1070,6 +1070,13 @@ class RISCVTargetLowering : public TargetLowering {
return false;
}
+ /// Disables storing and loading vectors by default when there are function
+ /// calls between the load and store, since these are more expensive than just
+ /// using scalars
+ bool shouldMergeStoreOfLoadsOverCall(EVT VT) const override {
+ return VT.isScalarInteger();
+ }
+
/// For available scheduling models FDIV + two independent FMULs are much
/// faster than two FDIVs.
unsigned combineRepeatedFPDivisors() const override;
diff --git a/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll b/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
index b2be401b4676f..71bb4d5f41e7d 100644
--- a/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
+++ b/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
@@ -13,40 +13,40 @@ define void @f(ptr %m, ptr %n, ptr %p, ptr %q, ptr %r, ptr %s, double %t) {
; CHECK-NEXT: sd s0, 32(sp) # 8-byte Folded Spill
; CHECK-NEXT: sd s1, 24(sp) # 8-byte Folded Spill
; CHECK-NEXT: sd s2, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s4, 0(sp) # 8-byte Folded Spill
; CHECK-NEXT: .cfi_offset ra, -8
; CHECK-NEXT: .cfi_offset s0, -16
; CHECK-NEXT: .cfi_offset s1, -24
; CHECK-NEXT: .cfi_offset s2, -32
-; CHECK-NEXT: csrr a6, vlenb
-; CHECK-NEXT: sub sp, sp, a6
-; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 1 * vlenb
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: .cfi_offset s4, -48
; CHECK-NEXT: mv s0, a5
; CHECK-NEXT: mv s1, a4
; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle64.v v8, (a0)
; CHECK-NEXT: vse64.v v8, (a1)
-; CHECK-NEXT: vle64.v v8, (a2)
-; CHECK-NEXT: addi a0, sp, 16
-; CHECK-NEXT: vs1r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT: ld s3, 0(a2)
+; CHECK-NEXT: ld s4, 8(a2)
; CHECK-NEXT: mv s2, a3
; CHECK-NEXT: call g
-; CHECK-NEXT: addi a0, sp, 16
-; CHECK-NEXT: vl1r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT: sd s3, 0(s2)
+; CHECK-NEXT: sd s4, 8(s2)
; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
-; CHECK-NEXT: vse64.v v8, (s2)
; CHECK-NEXT: vle64.v v8, (s1)
; CHECK-NEXT: vse64.v v8, (s0)
-; CHECK-NEXT: csrr a0, vlenb
-; CHECK-NEXT: add sp, sp, a0
-; CHECK-NEXT: .cfi_def_cfa sp, 48
; CHECK-NEXT: ld ra, 40(sp) # 8-byte Folded Reload
; CHECK-NEXT: ld s0, 32(sp) # 8-byte Folded Reload
; CHECK-NEXT: ld s1, 24(sp) # 8-byte Folded Reload
; CHECK-NEXT: ld s2, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s4, 0(sp) # 8-byte Folded Reload
; CHECK-NEXT: .cfi_restore ra
; CHECK-NEXT: .cfi_restore s0
; CHECK-NEXT: .cfi_restore s1
; CHECK-NEXT: .cfi_restore s2
+; CHECK-NEXT: .cfi_restore s3
+; CHECK-NEXT: .cfi_restore s4
; CHECK-NEXT: addi sp, sp, 48
; CHECK-NEXT: .cfi_def_cfa_offset 0
; CHECK-NEXT: ret
I'm surprised this even happens. What happens if you just do this unconditionally?
We still want to allow merging scalars... If we prevent merges unconditionally, we would miss that.
Luke requested a test case for when we merge scalars; I'm working on it now and will add it to the PR soon.
I chatted a bit with @mikhailramalho offline, just recording the key bits.
Edit: Re-reading this, I think I may have accidentally misled here. We could have the callee-saved heuristic in DAG as mentioned, but we could also just extend the TLI hook you added slightly to take the source type, and have this reasoning explained in a comment in the hook. The result is static for the target.
I mentioned earlier to have this as a follow-up patch, but maybe it's a small change and we can fit it in this PR? I'll take a look at how to extend the TLI hook to check for callee-saved regs.
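For reference, a sketch of the extended hook being discussed (this mirrors the two-argument form that appears later in the review; treat it as an illustration, not the final interface):

// In TargetLoweringBase: pass both the type being loaded/stored and the type
// the merge would produce, so the target can statically decide whether the
// merged value could live in a callee-saved register across a call.
virtual bool shouldMergeStoreOfLoadsOverCall(EVT SrcVT, EVT MergedVT) const {
  return true; // default: keep existing merging behaviour
}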
        break;
      }
    }
    return false;
Should this return false be removed? Otherwise we never check anything but the first load/store pair
Hm, thinking about it, not sure we need to check any of the others. (As in, we don't need the loop, and can only check the first pair.) My reasoning is that for a call to be before LD2 and ST2, but not LD1 and ST2, then the call has to be between either LD1 and LD2 or ST1 and ST2. In either case, either the call aliases the memory operation (preventing the merging entirely), or we can insert the merged LD or ST on whichever side of the call we prefer avoiding the issue.
I haven't glanced at the code to see if it actually takes advantage of that rescheduling. A couple of additional test cases coming...
See ca0fe95, doesn't look like we do any of the aliasing/reordering games.
I just quickly checked with an assert, and I think it's as you describe, i.e. we can't have one load/store pair with a call and another without. We should probably change this then to just check the first pair.
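A rough sketch of the first-pair-only variant being discussed, assuming the lambda simply drops its outer loop and walks the chain of the first store looking for a CALLSEQ_END before the first load is reached (illustrative only, not the code as merged):

// Sketch: returns true if a call sequence sits on the chain between the
// first load and the first store of the candidate merge.
auto HasCallInLdStChain = [](StoreSDNode *St, LoadSDNode *Ld) {
  SmallPtrSet<const SDNode *, 32> Visited;
  SmallVector<std::pair<const SDNode *, bool>, 8> Worklist;
  Worklist.emplace_back(St->getChain().getNode(), false);

  while (!Worklist.empty()) {
    auto [Node, FoundCall] = Worklist.pop_back_val();
    if (!Visited.insert(Node).second || Node->getNumOperands() == 0)
      continue;

    switch (Node->getOpcode()) {
    case ISD::CALLSEQ_END:
      // Keep walking the chain, but remember that we crossed a call.
      Worklist.emplace_back(Node->getOperand(0).getNode(), true);
      break;
    case ISD::TokenFactor:
      for (SDValue Op : Node->ops())
        Worklist.emplace_back(Op.getNode(), FoundCall);
      break;
    case ISD::LOAD:
      if (Node == Ld)
        return FoundCall;
      [[fallthrough]];
    default:
      Worklist.emplace_back(Node->getOperand(0).getNode(), FoundCall);
      break;
    }
  }
  return false;
};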
LGTM
LGTM to me too. It would be nice to update hasCallInLdStChain to only check the first load/store pair, but that's an optimisation; feel free to leave it to another PR if you want.
It's a small change, so I'll update the PR and re-run the tests just to be sure.
This assert fails because the order of the stores/loads is reversed prior to the check, if we can rotate the loads.
Why does the assert fail?
This line here swaps the nodes: https://github.com/llvm/llvm-project/blob/d6c848db4a97103e19c398848bde0876a2c4c88e/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L21487. It happens when building blender_r.
We don't have a case to test that, as the assertion I removed only failed when building blender_r. I plan to run creduce and add a new test case for it before I merge this PR.
  /// calls between the load and store, since these are more expensive than just
  /// using scalars
  bool shouldMergeStoreOfLoadsOverCall(EVT SrcVT, EVT MergedVT) const override {
    return SrcVT.isScalarInteger() == MergedVT.isScalarInteger();
What about scalar FP?
I think SrcVT can be a scalar FP and MergedVT can be a scalar integer VT. In that case it would still be ok to merge across the call.
Maybe this should be
return !MergedVT.isVector() || SrcVT.isVector();
I glanced at what we do for this today. Quick summary:
2 x half (no zvfh) -> no merge performed
2 x half (zvfh) -> <2 x half> vector result type
2 x float -> <2 x float> vector result type
2 x float + noimplicitfloat --> two integer loads, no merge
So basically, I think the patch as it is should be a net improvement (by disabling cases 2 and 3). I'm not opposed to your suggested change, just pointing out it's not needed to avoid a regression. (Edit: Except on re-read, it doesn't do that. FP is not scalar integer and vector is not scalar integer, so they compare equal under the current check. Yeah, we should do Craig's suggestion here.)
Note that the SDAG code doesn't even try to find a wider legal floating point type to merge to. The merge type will only be integer or vector.
> I glanced at what we do for this today. Quick summary:
> 2 x half (no zvfh) -> no merge performed
> 2 x half (zvfh) -> <2 x half> vector result type
> 2 x float -> <2 x float> vector result type
> 2 x float + noimplicitfloat --> two integer loads, no merge

@mikhailramalho said the case he hit the assert on was 2 x float -> i64.

> Note that the SDAG code doesn't even try to find a wider legal floating point type to merge to. The merge type will only be integer or vector.

A wider legal FP doesn't really make sense. You can't really join them without going through integer.
I think either !MergedVT.isVector() || SrcVT.isVector() or MergedVT.isVector() == SrcVT.isVector() would cover vectors without picking up scalar FP merged into scalar integer, and would allow re-adding the assert that was removed.
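Under that reasoning, the RISC-V override could end up looking roughly like the sketch below (using the first of the two conditions above; whichever form is picked, the intent is the same):

// In RISCVTargetLowering: allow merging across a call unless it would turn
// scalar accesses (integer or FP) into a vector access, since the merged
// vector value would have to be spilled around the call -- RISC-V has no
// callee-saved vector registers.
bool shouldMergeStoreOfLoadsOverCall(EVT SrcVT, EVT MergedVT) const override {
  return !MergedVT.isVector() || SrcVT.isVector();
}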
Test cases (aside from the noimplicitfloat one) added in a5c7f81
If I remember correctly, the NeedRotate flag prevents the code from using vectors, doesn't it? Using a ROTL would only work for scalars.
Indeed, the code that triggers the assert failure is trying to merge two f32 into an i64.
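For context, a hypothetical C++-level reduction of that case (the real reproducer came out of blender_r and isn't shown here): two adjacent f32 values copied across a call, which SDAG can merge into a single i64 access, using a rotate when the pair is visited in reversed order.

void g(); // opaque call between the loads and the stores

struct Pair { float x, y; };

// Hypothetical sketch: scalar FP source type (f32) merged into a scalar
// integer type (i64) across the call -- the combination that tripped the
// removed assert.
void copy_pair_f32(const Pair *src, Pair *dst) {
  float x = src->x;
  float y = src->y;
  g();
  dst->x = x;
  dst->y = y;
}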
LGTM
LGTM
This patch improves DAGCombiner's handling of potential store merges by
detecting function calls between loads and stores. When a function call
exists in the chain between a load and its corresponding store, we avoid
merging these stores as it would require costly register spilling.
We had to implement a hook on TLI, since TTI is unavailable in DAGCombine.
Currently, it's only enabled for riscv.
This is the DAG equivalent of PR #129258