Skip to content

[SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. #110715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 9, 2024

Conversation

aemerson
Copy link
Contributor

@aemerson aemerson commented Oct 1, 2024

In some pathological cases this optimization can spend an unreasonable amount of time populating the set for predecessors of the successor block. This change sinks some of that initializing to the point where it's actually necessary so we can take advantage of the existing early-exits.

rdar://137063034

@aemerson aemerson requested a review from nikic October 1, 2024 18:08
@aemerson aemerson requested a review from XChy October 1, 2024 18:09
@llvmbot
Copy link
Member

llvmbot commented Oct 1, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Amara Emerson (aemerson)

Changes

In some pathological cases this optimization can spend an unreasonable amount of time.
Specifically, there could be a very large number of predecessor blocks of the unconditional
branch target. This change adds a threshold bailout to mitigate this.

rdar://137063034


Full diff: https://github.com/llvm/llvm-project/pull/110715.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+10)
  • (added) llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll (+44)
diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index 7659fc69196151..3e00a63e5f2cf9 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -122,6 +122,11 @@ static cl::opt<unsigned> MaxPhiEntriesIncreaseAfterRemovingEmptyBlock(
 // bitreverse idioms.
 static const unsigned BitPartRecursionMaxDepth = 48;
 
+static cl::opt<unsigned> MaxNumPredsThreshold(
+    "simplifycfg-uncondbr-max-num-preds-threshold", cl::init(128), cl::Hidden,
+    cl::desc("The maximum number of predecessors a block can have to simplify "
+             "uncond-brs from empty blocks."));
+
 //===----------------------------------------------------------------------===//
 //  Local constant propagation.
 //
@@ -1165,6 +1170,11 @@ bool llvm::TryToSimplifyUncondBranchFromEmptyBlock(BasicBlock *BB,
   if (BB == Succ)
     return false;
 
+  // Some pathological IR can have extremely high numbers of predecessor blocks.
+  // Bail out early if so.
+  if (Succ->hasNPredecessorsOrMore(MaxNumPredsThreshold))
+    return false;
+
   SmallPtrSet<BasicBlock *, 16> BBPreds(pred_begin(BB), pred_end(BB));
   SmallPtrSet<BasicBlock *, 16> SuccPreds(pred_begin(Succ), pred_end(Succ));
 
diff --git a/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll b/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll
new file mode 100644
index 00000000000000..3f1c93762e34eb
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/branch-common-pred-threshold.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -passes=simplifycfg -simplifycfg-uncondbr-max-num-preds-threshold=3 -S | FileCheck %s
+
+; This test checks that we don't optimize the edge from Pred -> BB to Pred -> Succ when
+; the number of predecessors of Succ is greater than the threshold.
+
+define i8 @succ_has_3_preds(i8 noundef %arg, i1 %c1, i1 %c2) {
+; CHECK-LABEL: @succ_has_3_preds(
+; CHECK-NEXT:  Pred:
+; CHECK-NEXT:    call void @dummy()
+; CHECK-NEXT:    br i1 [[C1:%.*]], label [[COMMONPRED:%.*]], label [[EXTRA_BB:%.*]]
+; CHECK:       extra_bb:
+; CHECK-NEXT:    br label [[SUCC:%.*]]
+; CHECK:       CommonPred:
+; CHECK-NEXT:    call void @dummy()
+; CHECK-NEXT:    br i1 [[C2:%.*]], label [[SUCC]], label [[BB:%.*]]
+; CHECK:       BB:
+; CHECK-NEXT:    [[PHI1:%.*]] = phi i8 [ 1, [[COMMONPRED]] ]
+; CHECK-NEXT:    br label [[SUCC]]
+; CHECK:       Succ:
+; CHECK-NEXT:    [[PHI2:%.*]] = phi i8 [ [[PHI1]], [[BB]] ], [ 4, [[COMMONPRED]] ], [ 0, [[EXTRA_BB]] ]
+; CHECK-NEXT:    ret i8 [[PHI2]]
+;
+Pred:
+call void @dummy()
+  br i1 %c1, label %CommonPred, label %extra_bb
+
+extra_bb:
+  br i1 %c1, label %CommonPred, label %Succ
+
+CommonPred:
+call void @dummy()
+  br i1 %c2, label %Succ, label %BB
+
+BB:
+  %phi1 = phi i8 [1, %CommonPred]
+  br label %Succ
+
+Succ:
+  %phi2 = phi i8 [ %phi1, %BB ], [ 4, %CommonPred ], [0, %extra_bb]
+  ret i8 %phi2
+}
+
+declare void @dummy()

@nikic
Copy link
Contributor

nikic commented Oct 2, 2024

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:


Top 5 improvements:
  cvc5/kind.cpp.ll 600352836 -> 215689291 -64.07%
  php/parse_date.ll 46278212535 -> 21177805250 -54.24%
  llvm/X86EncodingOptimization.cpp.ll 908626738 -> 436539081 -51.96%
  llvm/TokenKinds.cpp.ll 319626734 -> 156038687 -51.18%
  libquic/net_errors.cc.ll 234434780 -> 126031135 -46.24%
Top 5 regressions:
  php/pcre2_match.ll 10618116865 -> 997582429963 +9295.10%
  llvm/X86FixupInstTuning.cpp.ll 1011415693 -> 1822547362 +80.20%
  tokenizers-rs/4hn9gefsll13qr1r.ll 18284264366 -> 28850854248 +57.79%
  cpython/socketmodule.ll 1538073290 -> 1868852500 +21.51%
  openusd/write.c.ll 1139618620 -> 1299783905 +14.05%

@dtcxzyw
Copy link
Member

dtcxzyw commented Oct 2, 2024

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:

Reproduced locally.
Before:

time bin/opt -O3 ../../llvm-opt-benchmark/bench/php/original/pcre2_match.ll -disable-output

real    0m1.499s
user    0m1.479s
sys     0m0.019s

After:

time bin/opt -O3 ../../llvm-opt-benchmark/bench/php/original/pcre2_match.ll -disable-output

real    0m37.475s
user    0m37.451s
sys     0m0.021s

…FromEmptyBlock optimization.

In some pathological cases this optimization can spend an unreasonable amount of
time populating the set for predecessors of the successor block. This change
sinks some of that initializing to the point where it's actually necessary so we
can take advantage of the existing early-exits.

rdar://137063034
@aemerson aemerson force-pushed the fix-compile-time-simplifycfg branch from 0637e43 to 03d9ec6 Compare October 2, 2024 19:32
@aemerson
Copy link
Contributor Author

aemerson commented Oct 2, 2024

Could you please explain in more detail where the compile-time issue comes from? Just many predecessors is not a problem by itself, something must be causing quadratic complexity if this is an issue.

128 is really not particularly high in this context, that's something you'll easily see with a large switch statement.

llvm-opt-benchmark results indicate that this does improve compile-time in some cases, but it also makes it much worse in others:


Top 5 improvements:
  cvc5/kind.cpp.ll 600352836 -> 215689291 -64.07%
  php/parse_date.ll 46278212535 -> 21177805250 -54.24%
  llvm/X86EncodingOptimization.cpp.ll 908626738 -> 436539081 -51.96%
  llvm/TokenKinds.cpp.ll 319626734 -> 156038687 -51.18%
  libquic/net_errors.cc.ll 234434780 -> 126031135 -46.24%
Top 5 regressions:
  php/pcre2_match.ll 10618116865 -> 997582429963 +9295.10%
  llvm/X86FixupInstTuning.cpp.ll 1011415693 -> 1822547362 +80.20%
  tokenizers-rs/4hn9gefsll13qr1r.ll 18284264366 -> 28850854248 +57.79%
  cpython/socketmodule.ll 1538073290 -> 1868852500 +21.51%
  openusd/write.c.ll 1139618620 -> 1299783905 +14.05%

Thanks for the data, I didn't know that benchmark suite existed, very cool!

Our pathological case is quite dumb, it's not particularly any quadratic behavior I think (unless I'm mistaken). The issue was introduced by fc6bdb8 which added an unconditional initialization of the successor pointer set. In our test case that pointer set ends up being filled hundreds of thousands of times with 50k elements each time.

In any case, I realized that this initial fix was the wrong thing to do anyway because we should only be trying to fix the new regression introduced by that change, the optimization before fc6bdb8 was fine by itself. I've changed the fix to instead try to sink the initialization of that set as late as possible and to re-use the predecessor walk done by CanRedirectPredsOfEmptyBBToSucc(). It makes the implementation messier unfortunately, but it's now an NFC change.

@aemerson aemerson changed the title [SimplifyCFG] Add predecessor threshold for TryToSimplifyUncondBranchFromEmptyBlock optimization. [SimplifyCFG][NFC] Improve compile time for TryToSimplifyUncondBranchFromEmptyBlock optimization. Oct 2, 2024
@aemerson
Copy link
Contributor Author

aemerson commented Oct 7, 2024

ping @nikic does the new approach seem reasonable?

@@ -1042,7 +1043,8 @@ CanRedirectPredsOfEmptyBBToSucc(BasicBlock *BB, BasicBlock *Succ,

// Get the single common predecessor of both BB and Succ. Return false
// when there are more than one common predecessors.
for (BasicBlock *SuccPred : SuccPreds) {
for (BasicBlock *SuccPred : predecessors(Succ)) {
SuccPredsOut.insert(SuccPred);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part is important to short-circuit in your case? Is it the checks above or the check in this loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was the checks in this loop.

@@ -1182,6 +1184,10 @@ bool llvm::TryToSimplifyUncondBranchFromEmptyBlock(BasicBlock *BB,
if ((!BBKillable && !BBPhisMergeable) || introduceTooManyPhiEntries(BB, Succ))
return false;

// SuccPreds may not have been fully filled by CanRedirectPredsOfEmptyBBToSucc
// so finish it here.
SuccPreds.insert(pred_begin(Succ), pred_end(Succ));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you iterate over all predecessors again here anyway, I think it would be cleaner to just only construct SuccPreds here and not try to partially cache it in CanRedirectPredsOfEmptyBBToSucc. It doesn't seem like we're actually saying any work, but it makes the API contract rather confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think you're right.

@aemerson aemerson force-pushed the fix-compile-time-simplifycfg branch from 89a64fa to 7a4c521 Compare October 7, 2024 20:49
Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


if ((!BBKillable && !BBPhisMergeable) || introduceTooManyPhiEntries(BB, Succ))
return false;

SmallPtrSet<BasicBlock *, 16> SuccPreds(pred_begin(Succ), pred_end(Succ));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this further down to where it's actually needed now.

@aemerson aemerson merged commit 18d655f into llvm:main Oct 9, 2024
3 of 5 checks passed
@aemerson aemerson deleted the fix-compile-time-simplifycfg branch October 9, 2024 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants