[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

aemerson · 2024-10-01T18:08:28Z

In some pathological cases this optimization can spend an unreasonable amount of time populating the set for predecessors of the successor block. This change sinks some of that initializing to the point where it's actually necessary so we can take advantage of the existing early-exits.

rdar://137063034

llvmbot · 2024-10-01T18:09:14Z

@llvm/pr-subscribers-llvm-transforms

Author: Amara Emerson (aemerson)

Changes

In some pathological cases this optimization can spend an unreasonable amount of time.
Specifically, there could be a very large number of predecessor blocks of the unconditional
branch target. This change adds a threshold bailout to mitigate this.

rdar://137063034

Full diff: https://github.com/llvm/llvm-project/pull/110715.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Utils/Local.cpp (+10)
(added) llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll (+44)

diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index 7659fc69196151..3e00a63e5f2cf9 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -122,6 +122,11 @@ static cl::opt<unsigned> MaxPhiEntriesIncreaseAfterRemovingEmptyBlock(
 // bitreverse idioms.
 static const unsigned BitPartRecursionMaxDepth = 48;
 
+static cl::opt<unsigned> MaxNumPredsThreshold(
+    "simplifycfg-uncondbr-max-num-preds-threshold", cl::init(128), cl::Hidden,
+    cl::desc("The maximum number of predecessors a block can have to simplify "
+             "uncond-brs from empty blocks."));
+
 //===----------------------------------------------------------------------===//
 //  Local constant propagation.
 //
@@ -1165,6 +1170,11 @@ bool llvm::TryToSimplifyUncondBranchFromEmptyBlock(BasicBlock *BB,
   if (BB == Succ)
     return false;
 
+  // Some pathological IR can have extremely high numbers of predecessor blocks.
+  // Bail out early if so.
+  if (Succ->hasNPredecessorsOrMore(MaxNumPredsThreshold))
+    return false;
+
   SmallPtrSet<BasicBlock *, 16> BBPreds(pred_begin(BB), pred_end(BB));
   SmallPtrSet<BasicBlock *, 16> SuccPreds(pred_begin(Succ), pred_end(Succ));
 
diff --git a/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll b/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll
new file mode 100644
index 00000000000000..3f1c93762e34eb
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -passes=simplifycfg -simplifycfg-uncondbr-max-num-preds-threshold=3 -S | FileCheck %s
+
+; This test checks that we don't optimize the edge from Pred -> BB to Pred -> Succ when
+; the number of predecessors of Succ is greater than the threshold.
+
+define i8 @succ_has_3_preds(i8 noundef %arg, i1 %c1, i1 %c2) {
+; CHECK-LABEL: @succ_has_3_preds(
+; CHECK-NEXT:  Pred:
+; CHECK-NEXT:    call void @dummy()
+; CHECK-NEXT:    br i1 [[C1:%.*]], label [[COMMONPRED:%.*]], label [[EXTRA_BB:%.*]]
+; CHECK:       extra_bb:
+; CHECK-NEXT:    br label [[SUCC:%.*]]
+; CHECK:       CommonPred:
+; CHECK-NEXT:    call void @dummy()
+; CHECK-NEXT:    br i1 [[C2:%.*]], label [[SUCC]], label [[BB:%.*]]
+; CHECK:       BB:
+; CHECK-NEXT:    [[PHI1:%.*]] = phi i8 [ 1, [[COMMONPRED]] ]
+; CHECK-NEXT:    br label [[SUCC]]
+; CHECK:       Succ:
+; CHECK-NEXT:    [[PHI2:%.*]] = phi i8 [ [[PHI1]], [[BB]] ], [ 4, [[COMMONPRED]] ], [ 0, [[EXTRA_BB]] ]
+; CHECK-NEXT:    ret i8 [[PHI2]]
+;
+Pred:
+call void @dummy()
+  br i1 %c1, label %CommonPred, label %extra_bb
+
+extra_bb:
+  br i1 %c1, label %CommonPred, label %Succ
+
+CommonPred:
+call void @dummy()
+  br i1 %c2, label %Succ, label %BB
+
+BB:
+  %phi1 = phi i8 [1, %CommonPred]
+  br label %Succ
+
+Succ:
+  %phi2 = phi i8 [ %phi1, %BB ], [ 4, %CommonPred ], [0, %extra_bb]
+  ret i8 %phi2
+}
+
+declare void @dummy()

nikic · 2024-10-02T08:13:33Z

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:


Top 5 improvements:
  cvc5/kind.cpp.ll 600352836 -> 215689291 -64.07%
  php/parse_date.ll 46278212535 -> 21177805250 -54.24%
  llvm/X86EncodingOptimization.cpp.ll 908626738 -> 436539081 -51.96%
  llvm/TokenKinds.cpp.ll 319626734 -> 156038687 -51.18%
  libquic/net_errors.cc.ll 234434780 -> 126031135 -46.24%
Top 5 regressions:
  php/pcre2_match.ll 10618116865 -> 997582429963 +9295.10%
  llvm/X86FixupInstTuning.cpp.ll 1011415693 -> 1822547362 +80.20%
  tokenizers-rs/4hn9gefsll13qr1r.ll 18284264366 -> 28850854248 +57.79%
  cpython/socketmodule.ll 1538073290 -> 1868852500 +21.51%
  openusd/write.c.ll 1139618620 -> 1299783905 +14.05%

dtcxzyw · 2024-10-02T08:18:46Z

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:

Reproduced locally.
Before:

time bin/opt -O3 ../../llvm-opt-benchmark/bench/php/original/pcre2_match.ll -disable-output

real    0m1.499s
user    0m1.479s
sys     0m0.019s

After:

time bin/opt -O3 ../../llvm-opt-benchmark/bench/php/original/pcre2_match.ll -disable-output

real    0m37.475s
user    0m37.451s
sys     0m0.021s

…FromEmptyBlock optimization. In some pathological cases this optimization can spend an unreasonable amount of time populating the set for predecessors of the successor block. This change sinks some of that initializing to the point where it's actually necessary so we can take advantage of the existing early-exits. rdar://137063034

aemerson · 2024-10-02T19:36:45Z

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:
Top 5 improvements:
  cvc5/kind.cpp.ll 600352836 -> 215689291 -64.07%
  php/parse_date.ll 46278212535 -> 21177805250 -54.24%
  llvm/X86EncodingOptimization.cpp.ll 908626738 -> 436539081 -51.96%
  llvm/TokenKinds.cpp.ll 319626734 -> 156038687 -51.18%
  libquic/net_errors.cc.ll 234434780 -> 126031135 -46.24%
Top 5 regressions:
  php/pcre2_match.ll 10618116865 -> 997582429963 +9295.10%
  llvm/X86FixupInstTuning.cpp.ll 1011415693 -> 1822547362 +80.20%
  tokenizers-rs/4hn9gefsll13qr1r.ll 18284264366 -> 28850854248 +57.79%
  cpython/socketmodule.ll 1538073290 -> 1868852500 +21.51%
  openusd/write.c.ll 1139618620 -> 1299783905 +14.05%

Thanks for the data, I didn't know that benchmark suite existed, very cool!

Our pathological case is quite dumb, it's not particularly any quadratic behavior I think (unless I'm mistaken). The issue was introduced by fc6bdb8 which added an unconditional initialization of the successor pointer set. In our test case that pointer set ends up being filled hundreds of thousands of times with 50k elements each time.

In any case, I realized that this initial fix was the wrong thing to do anyway because we should only be trying to fix the new regression introduced by that change, the optimization before fc6bdb8 was fine by itself. I've changed the fix to instead try to sink the initialization of that set as late as possible and to re-use the predecessor walk done by CanRedirectPredsOfEmptyBBToSucc(). It makes the implementation messier unfortunately, but it's now an NFC change.

aemerson · 2024-10-07T18:08:40Z

ping @nikic does the new approach seem reasonable?

nikic · 2024-10-07T18:59:18Z

llvm/lib/Transforms/Utils/Local.cpp

@@ -1042,7 +1043,8 @@ CanRedirectPredsOfEmptyBBToSucc(BasicBlock *BB, BasicBlock *Succ,

  // Get the single common predecessor of both BB and Succ. Return false
  // when there are more than one common predecessors.
-  for (BasicBlock *SuccPred : SuccPreds) {
+  for (BasicBlock *SuccPred : predecessors(Succ)) {
+    SuccPredsOut.insert(SuccPred);


Which part is important to short-circuit in your case? Is it the checks above or the check in this loop?

I think it was the checks in this loop.

nikic · 2024-10-07T19:00:31Z

llvm/lib/Transforms/Utils/Local.cpp

@@ -1182,6 +1184,10 @@ bool llvm::TryToSimplifyUncondBranchFromEmptyBlock(BasicBlock *BB,
  if ((!BBKillable && !BBPhisMergeable) || introduceTooManyPhiEntries(BB, Succ))
    return false;

+  // SuccPreds may not have been fully filled by CanRedirectPredsOfEmptyBBToSucc
+  // so finish it here.
+  SuccPreds.insert(pred_begin(Succ), pred_end(Succ));


Given that you iterate over all predecessors again here anyway, I think it would be cleaner to just only construct SuccPreds here and not try to partially cache it in CanRedirectPredsOfEmptyBBToSucc. It doesn't seem like we're actually saying any work, but it makes the API contract rather confusing.

Sure, I think you're right.

nikic

LGTM

nikic · 2024-10-08T18:24:31Z

llvm/lib/Transforms/Utils/Local.cpp


  if ((!BBKillable && !BBPhisMergeable) || introduceTooManyPhiEntries(BB, Succ))
    return false;

+  SmallPtrSet<BasicBlock *, 16> SuccPreds(pred_begin(Succ), pred_end(Succ));


I'd move this further down to where it's actually needed now.

aemerson requested a review from nikic October 1, 2024 18:08

llvmbot added the llvm:transforms label Oct 1, 2024

aemerson requested a review from XChy October 1, 2024 18:09

aemerson requested a review from fhahn October 1, 2024 18:11

aemerson added the slow-compile label Oct 1, 2024

nikic mentioned this pull request Oct 2, 2024

Task submission dtcxzyw/llvm-opt-benchmark#1312

Open

dtcxzyw mentioned this pull request Oct 2, 2024

pre-commit: PR110715 dtcxzyw/llvm-opt-benchmark#1418

Closed

aemerson force-pushed the fix-compile-time-simplifycfg branch from 0637e43 to 03d9ec6 Compare October 2, 2024 19:32

aemerson changed the title ~~[SimplifyCFG] Add predecessor threshold for TryToSimplifyUncondBranchFromEmptyBlock optimization.~~ [SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. Oct 2, 2024

nikic reviewed Oct 7, 2024

View reviewed changes

Don't unnecessarily cache.

7a4c521

aemerson force-pushed the fix-compile-time-simplifycfg branch from 89a64fa to 7a4c521 Compare October 7, 2024 20:49

nikic approved these changes Oct 8, 2024

View reviewed changes

Sink SuccPreds further.

1e602db

aemerson merged commit 18d655f into llvm:main Oct 9, 2024
3 of 5 checks passed

aemerson deleted the fix-compile-time-simplifycfg branch October 9, 2024 17:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

Uh oh!

aemerson commented Oct 1, 2024 •

edited

Loading

Uh oh!

llvmbot commented Oct 1, 2024

Uh oh!

nikic commented Oct 2, 2024

Uh oh!

dtcxzyw commented Oct 2, 2024

Uh oh!

aemerson commented Oct 2, 2024

Uh oh!

aemerson commented Oct 7, 2024

Uh oh!

nikic Oct 7, 2024

Uh oh!

aemerson Oct 7, 2024

Uh oh!

nikic Oct 7, 2024

Uh oh!

aemerson Oct 7, 2024

Uh oh!

nikic left a comment

Uh oh!

nikic Oct 8, 2024

Uh oh!

Uh oh!

Uh oh!

[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

Uh oh!

Conversation

aemerson commented Oct 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Oct 1, 2024

Uh oh!

nikic commented Oct 2, 2024

Uh oh!

dtcxzyw commented Oct 2, 2024

Uh oh!

aemerson commented Oct 2, 2024

Uh oh!

aemerson commented Oct 7, 2024

Uh oh!

nikic Oct 7, 2024

Choose a reason for hiding this comment

Uh oh!

aemerson Oct 7, 2024

Choose a reason for hiding this comment

Uh oh!

nikic Oct 7, 2024

Choose a reason for hiding this comment

Uh oh!

aemerson Oct 7, 2024

Choose a reason for hiding this comment

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

nikic Oct 8, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

aemerson commented Oct 1, 2024 •

edited

Loading