[ARM] Speedups for CombineBaseUpdate. #129725

davemgreen · 2025-03-04T15:31:31Z

This attempts to put limits onto CombineBaseUpdate for degenerate cases like #127477. The biggest change is to add a limit to the number of base updates to check in CombineBaseUpdate. 64 is hopefully plenty high enough for most runtime unrolled loops to generate postinc where they are beneficial.

It also moves the check for isValidBaseUpdate later so that it only happens if we will generate a valid instruction. The 1024 limit to hasPredecessorHelper comes from the X86 backend, which uses the same limit.

I haven't added a test case as it would need to be very big and my attempts at generating a smaller version did not show anything useful.

Fixes #127477.

llvmbot · 2025-03-04T15:32:13Z

@llvm/pr-subscribers-backend-arm

Author: David Green (davemgreen)

Changes

This attempts to put limits onto CombineBaseUpdate for degenerate cases like #127477. The biggest change is to add a limit to the number of base updates to check in CombineBaseUpdate. 64 is hopefully plenty high enough for most runtime unrolled loops to generate postinc where they are beneficial.

It also moves the check for isValidBaseUpdate later so that it only happens if we will generate a valid instruction. The 1024 limit to hasPredecessorHelper comes from the X86 backend, which uses the same limit.

I haven't added a test case as it would need to be very big and my attempts at generating a smaller version did not show anything useful.

Fixes #127477.

Full diff: https://github.com/llvm/llvm-project/pull/129725.diff

1 Files Affected:

(modified) llvm/lib/Target/ARM/ARMISelLowering.cpp (+36-29)

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index e3dc337bd0843..b5e65d1424644 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -149,6 +149,11 @@ MVEMaxSupportedInterleaveFactor("mve-max-interleave-factor", cl::Hidden,
   cl::desc("Maximum interleave factor for MVE VLDn to generate."),
   cl::init(2));
 
+cl::opt<unsigned> ArmMaxBaseUpdatesToCheck(
+    "arm-max-base-updates-to-check", cl::Hidden,
+    cl::desc("Maximum number of base-updates to check generating postindex."),
+    cl::init(64));
+
 /// Value type used for "flags" operands / results (either CPSR or FPSCR_NZCV).
 constexpr MVT FlagsVT = MVT::i32;
 
@@ -15865,6 +15870,21 @@ struct BaseUpdateUser {
   unsigned ConstInc;
 };
 
+static bool isValidBaseUpdate(SDNode *N, SDNode *User) {
+  // Check that the add is independent of the load/store.
+  // Otherwise, folding it would create a cycle. Search through Addr
+  // as well, since the User may not be a direct user of Addr and
+  // only share a base pointer.
+  SmallPtrSet<const SDNode *, 32> Visited;
+  SmallVector<const SDNode *, 16> Worklist;
+  Worklist.push_back(N);
+  Worklist.push_back(User);
+  if (SDNode::hasPredecessorHelper(N, Visited, Worklist, 1024) ||
+      SDNode::hasPredecessorHelper(User, Visited, Worklist, 1024))
+    return false;
+  return true;
+}
+
 static bool TryCombineBaseUpdate(struct BaseUpdateTarget &Target,
                                  struct BaseUpdateUser &User,
                                  bool SimpleConstIncOnly,
@@ -16066,6 +16086,9 @@ static bool TryCombineBaseUpdate(struct BaseUpdateTarget &Target,
   if (SimpleConstIncOnly && User.ConstInc != NumBytes)
     return false;
 
+  if (!isValidBaseUpdate(N, User.N))
+    return false;
+
   // OK, we found an ADD we can fold into the base update.
   // Now, create a _UPD node, taking care of not breaking alignment.
 
@@ -16214,21 +16237,6 @@ static bool findPointerConstIncrement(SDNode *N, SDValue *Ptr, SDValue *CInc) {
   }
 }
 
-static bool isValidBaseUpdate(SDNode *N, SDNode *User) {
-  // Check that the add is independent of the load/store.
-  // Otherwise, folding it would create a cycle. Search through Addr
-  // as well, since the User may not be a direct user of Addr and
-  // only share a base pointer.
-  SmallPtrSet<const SDNode *, 32> Visited;
-  SmallVector<const SDNode *, 16> Worklist;
-  Worklist.push_back(N);
-  Worklist.push_back(User);
-  if (SDNode::hasPredecessorHelper(N, Visited, Worklist) ||
-      SDNode::hasPredecessorHelper(User, Visited, Worklist))
-    return false;
-  return true;
-}
-
 /// CombineBaseUpdate - Target-specific DAG combine function for VLDDUP,
 /// NEON load/store intrinsics, and generic vector load/stores, to merge
 /// base address updates.
@@ -16242,6 +16250,10 @@ static SDValue CombineBaseUpdate(SDNode *N,
   const unsigned AddrOpIdx = ((isIntrinsic || isStore) ? 2 : 1);
   BaseUpdateTarget Target = {N, isIntrinsic, isStore, AddrOpIdx};
 
+  // Limit the number of possible base-updates we look at to prevent degenerate
+  // cases.
+  unsigned MaxBaseUpdates = ArmMaxBaseUpdatesToCheck;
+
   SDValue Addr = N->getOperand(AddrOpIdx);
 
   SmallVector<BaseUpdateUser, 8> BaseUpdates;
@@ -16256,8 +16268,11 @@ static SDValue CombineBaseUpdate(SDNode *N,
     unsigned ConstInc =
         getPointerConstIncrement(User->getOpcode(), Addr, Inc, DCI.DAG);
 
-    if (ConstInc || User->getOpcode() == ISD::ADD)
+    if (ConstInc || User->getOpcode() == ISD::ADD) {
       BaseUpdates.push_back({User, Inc, ConstInc});
+      if (BaseUpdates.size() > MaxBaseUpdates)
+        break;
+    }
   }
 
   // If the address is a constant pointer increment itself, find
@@ -16284,27 +16299,19 @@ static SDValue CombineBaseUpdate(SDNode *N,
       unsigned NewConstInc = UserOffset - Offset;
       SDValue NewInc = DCI.DAG.getConstant(NewConstInc, SDLoc(N), MVT::i32);
       BaseUpdates.push_back({User, NewInc, NewConstInc});
+      if (BaseUpdates.size() > MaxBaseUpdates)
+        break;
     }
   }
 
   // Try to fold the load/store with an update that matches memory
   // access size. This should work well for sequential loads.
-  //
-  // Filter out invalid updates as well.
   unsigned NumValidUpd = BaseUpdates.size();
-  for (unsigned I = 0; I < NumValidUpd;) {
+  for (unsigned I = 0; I < NumValidUpd; I++) {
     BaseUpdateUser &User = BaseUpdates[I];
-    if (!isValidBaseUpdate(N, User.N)) {
-      --NumValidUpd;
-      std::swap(BaseUpdates[I], BaseUpdates[NumValidUpd]);
-      continue;
-    }
-
     if (TryCombineBaseUpdate(Target, User, /*SimpleConstIncOnly=*/true, DCI))
       return SDValue();
-    ++I;
   }
-  BaseUpdates.resize(NumValidUpd);
 
   // Try to fold with other users. Non-constant updates are considered
   // first, and constant updates are sorted to not break a sequence of
@@ -16360,8 +16367,8 @@ static SDValue PerformMVEVLDCombine(SDNode *N,
     Visited.insert(Addr.getNode());
     Worklist.push_back(N);
     Worklist.push_back(User);
-    if (SDNode::hasPredecessorHelper(N, Visited, Worklist) ||
-        SDNode::hasPredecessorHelper(User, Visited, Worklist))
+    if (SDNode::hasPredecessorHelper(N, Visited, Worklist, 1024) ||
+        SDNode::hasPredecessorHelper(User, Visited, Worklist, 1024))
       continue;
 
     // Find the new opcode for the updating load/store.

smithp35

For a non-expert this looks reasonable to me. I've made a couple of stylistic comments but nothing significant.

I guess the biggest decision point is whether 64 is the right choice. It seems to be the right choice for the problematic test case, and sounds plausible to be enough for unrolled loops.
I'm inclined to say let's give it a try and tune it there's any evidence of regressions.

smithp35 · 2025-03-05T11:11:26Z

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -16284,27 +16299,19 @@ static SDValue CombineBaseUpdate(SDNode *N,
      unsigned NewConstInc = UserOffset - Offset;
      SDValue NewInc = DCI.DAG.getConstant(NewConstInc, SDLoc(N), MVT::i32);
      BaseUpdates.push_back({User, NewInc, NewConstInc});
+      if (BaseUpdates.size() > MaxBaseUpdates)


Being pedantic, wouldn't this take BaseUpdates one over MaxBaseUpdates? Not important in practice, but possibly worth making this == MaxBaseUpdates.

I tend to go with > just in case something else adds a use and we skip the exact bound. I've made it >= now for these two.

smithp35 · 2025-03-05T11:12:28Z

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -16242,6 +16250,10 @@ static SDValue CombineBaseUpdate(SDNode *N,
  const unsigned AddrOpIdx = ((isIntrinsic || isStore) ? 2 : 1);
  BaseUpdateTarget Target = {N, isIntrinsic, isStore, AddrOpIdx};

+  // Limit the number of possible base-updates we look at to prevent degenerate
+  // cases.
+  unsigned MaxBaseUpdates = ArmMaxBaseUpdatesToCheck;


Other than acting as a place to put the comment, couldn't we just use ArmMaxBaseUpdatesToCheck inline in the two places it is used?

Don't have a strong opinion on that, happy to keep this.

Yeah I added this mostly as a place to put the comment. It could also help to not dereference the global multiple times, but that likely isn't a huge issue.

smithp35 · 2025-03-05T11:20:15Z

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -16360,8 +16367,8 @@ static SDValue PerformMVEVLDCombine(SDNode *N,
    Visited.insert(Addr.getNode());
    Worklist.push_back(N);
    Worklist.push_back(User);
-    if (SDNode::hasPredecessorHelper(N, Visited, Worklist) ||
-        SDNode::hasPredecessorHelper(User, Visited, Worklist))
+    if (SDNode::hasPredecessorHelper(N, Visited, Worklist, 1024) ||


We've got 1024 used in two different functions. Would it be worth making it a global constant with a comment explaining it.

The x86 and SystemZ backend seem to use Max uncommented, although it is local to the function.

Added MaxSteps.

This attempts to put limits onto CombineBaseUpdate for degenerate cases like 127477. The biggest change is to add a limit to the number of base updates to check in CombineBaseUpdate. 64 is hopefully plenty high enough for most runtime unrolled loops to generate postinc where they are beneficial. It also moves the check for isValidBaseUpdate later so that it only happens if we will generate a valid instruction. The 1024 limit to hasPredecessorHelper comes from the X86 backend, which uses the same limit. I haven't added a test case as it would need to be very big and my attempts at generating a smaller version did not show anything useful. Fixes llvm#127477.

asavonic · 2025-03-05T12:54:20Z

LGTM, thanks a lot David!

I can confirm that limiting BaseUpdates.size() is critical. Just setting MaxSteps was not enough to fix the issue.

asavonic · 2025-03-05T12:50:40Z

llvm/lib/Target/ARM/ARMISelLowering.cpp

    BaseUpdateUser &User = BaseUpdates[I];
-    if (!isValidBaseUpdate(N, User.N)) {


Good idea. The downside is that there are two calls to TryCombineBaseUpdate, so we potentially call isValidBaseUpdate twice. This only happens if we can convert an instruction to post increment, but there is a dependency on operands that we cannot break. Shouldn't be a common case, and with MaxSteps parameter being set, it is not as bad for compile time.

smithp35

LGTM on my side. Thanks for the updates.

kongy · 2025-03-06T06:07:19Z

Is this change ready to merge? We need this change to fix the Android platform build.

davemgreen · 2025-03-06T09:34:20Z

Is this change ready to merge? We need this change to fix the Android platform build.

Hi - just running some testing to make sure it performs OK. It looks like it should be OK from what I ran.

This attempts to put limits onto CombineBaseUpdate for degenerate cases like llvm#127477. The biggest change is to add a limit to the number of base updates to check in CombineBaseUpdate. 64 is hopefully plenty high enough for most runtime unrolled loops to generate postinc where they are beneficial. It also moves the check for isValidBaseUpdate later so that it only happens if we will generate a valid instruction. The 1024 limit to hasPredecessorHelper comes from the X86 backend, which uses the same limit. I haven't added a test case as it would need to be very big and my attempts at generating a smaller version did not show anything useful. Fixes llvm#127477.

This attempts to put limits onto CombineBaseUpdate for degenerate cases like llvm#127477. The biggest change is to add a limit to the number of base updates to check in CombineBaseUpdate. 64 is hopefully plenty high enough for most runtime unrolled loops to generate postinc where they are beneficial. It also moves the check for isValidBaseUpdate later so that it only happens if we will generate a valid instruction. The 1024 limit to hasPredecessorHelper comes from the X86 backend, which uses the same limit. I haven't added a test case as it would need to be very big and my attempts at generating a smaller version did not show anything useful. Fixes llvm#127477. (cherry picked from commit 86cf4ed)

davemgreen requested review from asavonic, DanielKristofKiss, nasherm and smithp35 March 4, 2025 15:31

llvmbot added the backend:ARM label Mar 4, 2025

davemgreen mentioned this pull request Mar 4, 2025

Takes a long time to compile ARM32 Android libart-disassembler #127477

Closed

smithp35 reviewed Mar 5, 2025

View reviewed changes

davemgreen added 2 commits March 5, 2025 12:53

Address comments

5877ba0

davemgreen force-pushed the gh-arm-slowbaseupdate branch from 38c818d to 5877ba0 Compare March 5, 2025 12:54

asavonic reviewed Mar 5, 2025

View reviewed changes

smithp35 approved these changes Mar 5, 2025

View reviewed changes

DanielKristofKiss approved these changes Mar 5, 2025

View reviewed changes

davemgreen merged commit 86cf4ed into llvm:main Mar 6, 2025
11 checks passed

davemgreen deleted the gh-arm-slowbaseupdate branch March 6, 2025 09:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ARM] Speedups for CombineBaseUpdate. #129725

[ARM] Speedups for CombineBaseUpdate. #129725

Uh oh!

davemgreen commented Mar 4, 2025

Uh oh!

llvmbot commented Mar 4, 2025

Uh oh!

smithp35 left a comment

Uh oh!

smithp35 Mar 5, 2025

Uh oh!

davemgreen Mar 5, 2025

Uh oh!

smithp35 Mar 5, 2025

Uh oh!

davemgreen Mar 5, 2025

Uh oh!

smithp35 Mar 5, 2025

Uh oh!

davemgreen Mar 5, 2025

Uh oh!

asavonic commented Mar 5, 2025

Uh oh!

asavonic Mar 5, 2025

Uh oh!

smithp35 left a comment

Uh oh!

kongy commented Mar 6, 2025

Uh oh!

davemgreen commented Mar 6, 2025

Uh oh!

Uh oh!

Uh oh!

		BaseUpdateUser &User = BaseUpdates[I];
		if (!isValidBaseUpdate(N, User.N)) {

[ARM] Speedups for CombineBaseUpdate. #129725

[ARM] Speedups for CombineBaseUpdate. #129725

Uh oh!

Conversation

davemgreen commented Mar 4, 2025

Uh oh!

llvmbot commented Mar 4, 2025

Uh oh!

smithp35 left a comment

Choose a reason for hiding this comment

Uh oh!

smithp35 Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

davemgreen Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

smithp35 Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

davemgreen Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

smithp35 Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

davemgreen Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

asavonic commented Mar 5, 2025

Uh oh!

asavonic Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

smithp35 left a comment

Choose a reason for hiding this comment

Uh oh!

kongy commented Mar 6, 2025

Uh oh!

davemgreen commented Mar 6, 2025

Uh oh!

Uh oh!

Uh oh!