[SelectionDAG][RISCV] Avoid store merging across function calls #130430


Merged: 33 commits into llvm:main on Mar 22, 2025

Conversation

mikhailramalho
Member

@mikhailramalho mikhailramalho commented Mar 8, 2025

This patch improves DAGCombiner's handling of potential store merges by
detecting function calls between loads and stores. When a function call
exists in the chain between a load and its corresponding store, we avoid
merging these stores as it would require costly register spilling.

We had to implement a hook on TLI, since TTI is unavailable in DAGCombine.

Currently, it's only enabled for RISC-V.

This is the DAG equivalent of PR #129258
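The traversal at the heart of the patch can be reduced to a small self-contained sketch. The following is a hypothetical C++ reduction (the `Node` struct and `hasCallInChain` are illustrative stand-ins for `SDNode` and the patch's `HasCallInLdStChain` lambda; only chain edges are modeled): walking up from the store's chain operand, a CALLSEQ_END node marks that a call was crossed, and reaching the source load reports whether that happened.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Illustrative stand-ins for SDNode opcodes and nodes; only chain operands
// (operand 0, or all operands for a TokenFactor) are modeled.
enum class Op { CallSeqEnd, TokenFactor, Load, Other };

struct Node {
  Op Opcode;
  std::vector<Node *> Operands;
};

// Walk the chain upward from the store's chain operand; return true if a
// CALLSEQ_END (the end of a call sequence) is crossed before reaching Ld.
bool hasCallInChain(Node *StoreChain, Node *Ld) {
  std::set<Node *> Visited;
  std::vector<std::pair<Node *, bool>> Worklist{{StoreChain, false}};
  while (!Worklist.empty()) {
    auto [N, FoundCall] = Worklist.back();
    Worklist.pop_back();
    if (!Visited.insert(N).second)
      continue;
    switch (N->Opcode) {
    case Op::CallSeqEnd:
      // Everything further up the chain happens before a call.
      if (!N->Operands.empty())
        Worklist.push_back({N->Operands[0], /*FoundCall=*/true});
      break;
    case Op::TokenFactor:
      // A TokenFactor merges several chains; follow all of them.
      for (Node *OpN : N->Operands)
        Worklist.push_back({OpN, FoundCall});
      break;
    case Op::Load:
      if (N == Ld)
        return FoundCall;
      [[fallthrough]];
    default:
      if (!N->Operands.empty())
        Worklist.push_back({N->Operands[0], FoundCall});
      break;
    }
  }
  return false;
}
```

If merging went ahead in the FoundCall case, the loaded value would have to stay live across the call, which is the register-spilling cost the patch avoids.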

Signed-off-by: Mikhail R. Gadelha <[email protected]>
@llvmbot
Member

llvmbot commented Mar 10, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Mikhail R. Gadelha (mikhailramalho)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/130430.diff

4 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+4)
  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+50)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+7)
  • (modified) llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll (+12-12)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 2089d47e9cbc8..f8dd6cdd6aec8 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3506,6 +3506,10 @@ class TargetLoweringBase {
   /// The default implementation just freezes the set of reserved registers.
   virtual void finalizeLowering(MachineFunction &MF) const;
 
+  /// Returns true if it's profitable to allow merging store of loads when there
+  /// are functions calls between the load and the store.
+  virtual bool shouldMergeStoreOfLoadsOverCall(EVT) const { return true; }
+
   //===----------------------------------------------------------------------===//
   //  GlobalISel Hooks
   //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index ef5f2210573e0..85b3682318e32 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -21553,6 +21553,56 @@ bool DAGCombiner::tryStoreMergeOfLoads(SmallVectorImpl<MemOpLink> &StoreNodes,
       JointMemOpVT = EVT::getIntegerVT(Context, SizeInBits);
     }
 
+    auto HasCallInLdStChain = [](SmallVectorImpl<MemOpLink> &StoreNodes,
+                                 SmallVectorImpl<MemOpLink> &LoadNodes,
+                                 unsigned NumStores) {
+      for (unsigned i = 0; i < NumStores; ++i) {
+        StoreSDNode *St = cast<StoreSDNode>(StoreNodes[i].MemNode);
+        SDValue Val = peekThroughBitcasts(St->getValue());
+        LoadSDNode *Ld = cast<LoadSDNode>(Val);
+        assert(Ld == LoadNodes[i].MemNode && "Load and store mismatch");
+
+        SmallPtrSet<const SDNode *, 32> Visited;
+        SmallVector<std::pair<const SDNode *, bool>, 8> Worklist;
+        Worklist.emplace_back(St->getOperand(0).getNode(), false);
+
+        while (!Worklist.empty()) {
+          auto [Node, FoundCall] = Worklist.pop_back_val();
+          if (!Visited.insert(Node).second || Node->getNumOperands() == 0)
+            continue;
+
+          switch (Node->getOpcode()) {
+          case ISD::CALLSEQ_END:
+            Worklist.emplace_back(Node->getOperand(0).getNode(), true);
+            break;
+          case ISD::TokenFactor:
+            for (SDValue Op : Node->ops())
+              Worklist.emplace_back(Op.getNode(), FoundCall);
+            break;
+          case ISD::LOAD:
+            if (Node == Ld)
+              return FoundCall;
+            [[fallthrough]];
+          default:
+            if (Node->getNumOperands() > 0)
+              Worklist.emplace_back(Node->getOperand(0).getNode(), FoundCall);
+            break;
+          }
+        }
+        return false;
+      }
+      return false;
+    };
+
+    // Check if there is a call in the load/store chain.
+    if (!TLI.shouldMergeStoreOfLoadsOverCall(JointMemOpVT) &&
+        HasCallInLdStChain(StoreNodes, LoadNodes, NumElem)) {
+      StoreNodes.erase(StoreNodes.begin(), StoreNodes.begin() + NumElem);
+      LoadNodes.erase(LoadNodes.begin(), LoadNodes.begin() + NumElem);
+      NumConsecutiveStores -= NumElem;
+      continue;
+    }
+
     SDLoc LoadDL(LoadNodes[0].MemNode);
     SDLoc StoreDL(StoreNodes[0].MemNode);
 
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index ffbc14a29006c..658d1bce2cf6e 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -1070,6 +1070,13 @@ class RISCVTargetLowering : public TargetLowering {
     return false;
   }
 
+  /// Disables storing and loading vectors by default when there are function
+  /// calls between the load and store, since these are more expensive than just
+  /// using scalars
+  bool shouldMergeStoreOfLoadsOverCall(EVT VT) const override {
+    return VT.isScalarInteger();
+  }
+
   /// For available scheduling models FDIV + two independent FMULs are much
   /// faster than two FDIVs.
   unsigned combineRepeatedFPDivisors() const override;
diff --git a/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll b/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
index b2be401b4676f..71bb4d5f41e7d 100644
--- a/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
+++ b/llvm/test/CodeGen/RISCV/stores-of-loads-merging.ll
@@ -13,40 +13,40 @@ define void @f(ptr %m, ptr %n, ptr %p, ptr %q, ptr %r, ptr %s, double %t) {
 ; CHECK-NEXT:    sd s0, 32(sp) # 8-byte Folded Spill
 ; CHECK-NEXT:    sd s1, 24(sp) # 8-byte Folded Spill
 ; CHECK-NEXT:    sd s2, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s3, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s4, 0(sp) # 8-byte Folded Spill
 ; CHECK-NEXT:    .cfi_offset ra, -8
 ; CHECK-NEXT:    .cfi_offset s0, -16
 ; CHECK-NEXT:    .cfi_offset s1, -24
 ; CHECK-NEXT:    .cfi_offset s2, -32
-; CHECK-NEXT:    csrr a6, vlenb
-; CHECK-NEXT:    sub sp, sp, a6
-; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 1 * vlenb
+; CHECK-NEXT:    .cfi_offset s3, -40
+; CHECK-NEXT:    .cfi_offset s4, -48
 ; CHECK-NEXT:    mv s0, a5
 ; CHECK-NEXT:    mv s1, a4
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; CHECK-NEXT:    vle64.v v8, (a0)
 ; CHECK-NEXT:    vse64.v v8, (a1)
-; CHECK-NEXT:    vle64.v v8, (a2)
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vs1r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    ld s3, 0(a2)
+; CHECK-NEXT:    ld s4, 8(a2)
 ; CHECK-NEXT:    mv s2, a3
 ; CHECK-NEXT:    call g
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl1r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    sd s3, 0(s2)
+; CHECK-NEXT:    sd s4, 8(s2)
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
-; CHECK-NEXT:    vse64.v v8, (s2)
 ; CHECK-NEXT:    vle64.v v8, (s1)
 ; CHECK-NEXT:    vse64.v v8, (s0)
-; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    add sp, sp, a0
-; CHECK-NEXT:    .cfi_def_cfa sp, 48
 ; CHECK-NEXT:    ld ra, 40(sp) # 8-byte Folded Reload
 ; CHECK-NEXT:    ld s0, 32(sp) # 8-byte Folded Reload
 ; CHECK-NEXT:    ld s1, 24(sp) # 8-byte Folded Reload
 ; CHECK-NEXT:    ld s2, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s3, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s4, 0(sp) # 8-byte Folded Reload
 ; CHECK-NEXT:    .cfi_restore ra
 ; CHECK-NEXT:    .cfi_restore s0
 ; CHECK-NEXT:    .cfi_restore s1
 ; CHECK-NEXT:    .cfi_restore s2
+; CHECK-NEXT:    .cfi_restore s3
+; CHECK-NEXT:    .cfi_restore s4
 ; CHECK-NEXT:    addi sp, sp, 48
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    ret

@arsenm
Contributor

arsenm commented Mar 17, 2025

This patch improves DAGCombiner's handling of potential store merges by
detecting function calls between loads and stores.

I'm surprised this even happens. What happens if you just do this unconditionally?

@mikhailramalho
Member Author


We still want to allow merging scalars... if we prevent merges unconditionally, we would miss that.

Luke requested a test case for when we merge scalars; I'm working on it now and will add it to the PR soon.

@preames
Collaborator

preames commented Mar 17, 2025

I chatted a bit with @mikhailramalho offline, just recording the key bits.

  1. As mentioned by Luke above, we probably want to avoid inhibiting scalar merging. Similarly, if we already have two small vectors, combining them into one larger vector is probably at least neutral. (These are RISC-V-specific profitability statements.)

  2. I would expect to see something in the check in terms of the callee saved registers for the callee of the call. The reason that scalar to vector merging is likely unprofitable is that we have scalar CSRs in the standard calling convention, but do not have vector CSRs.

  3. A minimal patch might simply check if the proposed merged type has any CSRs. That would cover the most painful case today. A more complicated heuristic would try to reason about whether the merged type is in a different register class than the source types, and then apply the CSR check. We could get fancy when both register classes have CSRs but the destination has fewer, but I suggest we defer that case until someone has an example which cares.

  4. We do appear to have calls to getRegClassFor in DAGCombine already, so that knowledge should be available. I don't see us currently using information about the calling convention in this code, hopefully that's available? I haven't checked.

Edit: Re-reading this, I think I may have accidentally misled here. We could have the callee-saved heuristic in DAG as mentioned, but we could also just extend the TLI hook you added slightly to take the source type, and have this reasoning explained in a comment in the hook. The result is static for the target.

@mikhailramalho
Member Author


I mentioned earlier having this as a follow-up patch, but maybe it's a small change and we can fit it in this PR? I'll take a look at how to extend the TLI hook to check for callee-saved regs.

Signed-off-by: Mikhail R. Gadelha <[email protected]>
break;
}
}
return false;
Contributor

Should this return false be removed? Otherwise we never check anything but the first load/store pair

Collaborator

Hm, thinking about it, I'm not sure we need to check any of the others. (As in, we don't need the loop and can check only the first pair.) My reasoning is that for a call to be before LD2 and ST2, but not LD1 and ST1, the call has to be between either LD1 and LD2 or ST1 and ST2. In either case, either the call aliases the memory operation (preventing the merging entirely), or we can insert the merged LD or ST on whichever side of the call we prefer, avoiding the issue.

I haven't glanced at the code to see if it actually takes advantage of that rescheduling. A couple of additional test cases coming...

Collaborator

See ca0fe95, doesn't look like we do any of the aliasing/reordering games.

Contributor

I just quickly checked with an assert, and I think it's as you describe, i.e. we can't have one load/store pair with a call and another without. We should probably change this to just check the first pair.


Signed-off-by: Mikhail R. Gadelha <[email protected]>
Collaborator

@preames preames left a comment

LGTM


Contributor

@lukel97 lukel97 left a comment

LGTM from me too. It would be nice to update hasCallInLdStChain to check only the first load/store pair, but that's an optimisation; feel free to leave it to another PR if you want.

@mikhailramalho
Member Author


It's a small change, so I'll update the PR and re-run the tests just to be sure

This assert fails because the order of the stores/loads is reversed prior to the check, if we can rotate the loads.
@topperc
Collaborator

topperc commented Mar 20, 2025

Why does the assert fail?

@mikhailramalho
Member Author

Why does the assert fail?

This line here swaps nodes: https://github.com/llvm/llvm-project/blob/d6c848db4a97103e19c398848bde0876a2c4c88e/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L21487

It happens when building 526.blender_r/src/blender/source/blender/blenloader/intern/readfile.c

@mikhailramalho
Member Author

We don't have a case to test that, as the assertion I removed only failed when building blender_r. I plan to run creduce and add a new test case for it before I merge this PR

/// calls between the load and store, since these are more expensive than just
/// using scalars
bool shouldMergeStoreOfLoadsOverCall(EVT SrcVT, EVT MergedVT) const override {
return SrcVT.isScalarInteger() == MergedVT.isScalarInteger();
Collaborator

What about scalar FP?

Collaborator

I think SrcVT can be a scalar FP and MergedVT can be a scalar integer VT. In that case it would still be ok to merge across the call.

Maybe this should be

  return !MergedVT.isVector() || SrcVT.isVector();

Collaborator

@preames preames Mar 20, 2025

I glanced at what we do for this today. Quick summary:
2 x half (no zvfh) -> no merge performed
2 x half (zvfh) -> <2 x half> vector result type
2 x float -> <2 x float> vector result type
2 x float + noimplicitfloat --> two integer loads, no merge

So basically, I think the patch as it is should be a net improvement (by disabling cases 2 and 3). I'm not opposed to your suggested change, just pointing out it's not needed to avoid a regression. (Edit: Except on re-read, it doesn't do that. FP is non-scalar and vector is non-scalar, which is equal under the current check. Yeah, we should go with Craig's suggestion here.)

Note that the SDAG code doesn't even try to find a wider legal floating point type to merge to. The merge type will only be integer or vector.

Collaborator

@topperc topperc Mar 20, 2025


@mikhailramalho said the case he hit the assert on was 2 x float -> i64.

Note that the SDAG code doesn't even try to find a wider legal floating point type to merge to. The merge type will only be integer or vector.

A wider legal FP doesn't really make sense. You can't really join them without going through integer.

I think either !MergedVT.isVector() || SrcVT.isVector() or MergedVT.isVector() == SrcVT.isVector() would cover vector without picking up scalar FP in scalar integer. And allow the assert that was removed.
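To make the difference between the two suggested predicates concrete: only the isVector() bit of each EVT matters here, so plain bools can stand in for the types in a hypothetical reduction (allowMergeA/allowMergeB are illustrative names, not LLVM APIs). The two agree on every combination except the vector-source/scalar-merged one, which the store-merging code doesn't produce:

```cpp
// Illustrative reduction: bools stand in for EVT::isVector().

// First suggestion: block merging only when a scalar source would be
// widened into a vector merged type.
bool allowMergeA(bool SrcIsVec, bool MergedIsVec) {
  return !MergedIsVec || SrcIsVec;
}

// Second suggestion: require the source and merged types to agree on
// vector-ness.
bool allowMergeB(bool SrcIsVec, bool MergedIsVec) {
  return MergedIsVec == SrcIsVec;
}
```

Either way, the scalar-FP-to-scalar-integer case (e.g. 2 x f32 into i64) is allowed and the scalar-to-vector case is blocked.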

Collaborator

Test cases (aside from the noimplicitfloat one) added in a5c7f81

@topperc
Collaborator

topperc commented Mar 20, 2025


If I remember correctly, the NeedRotate flag prevents the code from using vectors, doesn't it? Using a ROTL would only work for scalars.

@mikhailramalho
Member Author


Indeed, the code that triggers the assert fail is trying to merge two f32 into an i64

Collaborator

@preames preames left a comment

LGTM

Collaborator

@topperc topperc left a comment

LGTM

@mikhailramalho mikhailramalho merged commit f138e36 into llvm:main Mar 22, 2025
9 of 11 checks passed
@mikhailramalho mikhailramalho deleted the dag-spillcost-fix branch March 22, 2025 13:35