
[Pipelines] Move IPSCCP after inliner pipeline #96620

Open: wants to merge 4 commits into main

Conversation

@sihuan (Contributor) commented Jun 25, 2024

This patch significantly improves LLVM's performance on the SPEC2017:548.exchange2_r benchmark, with a performance uplift of over 40% on rv64gc.

During our investigation into the significant performance disparity between GCC and LLVM on the SPEC2017:548.exchange2_r benchmark on RISC-V, we identified that the primary difference stems from constant propagation optimization.
In GCC, the hotspot function digits_2 is split into several parts:

$ objdump -D exchange2_r_gcc | grep "digits_2.*:$"
000000000001d480 <__brute_force_MOD_digits_2.isra.0>:
000000000001f0f6 <__brute_force_MOD_digits_2.constprop.7.isra.0>:
000000000001fdd0 <__brute_force_MOD_digits_2.constprop.6.isra.0>:
0000000000020900 <__brute_force_MOD_digits_2.constprop.5.isra.0>:
00000000000211c4 <__brute_force_MOD_digits_2.constprop.4.isra.0>:
0000000000022002 <__brute_force_MOD_digits_2.constprop.3.isra.0>:
0000000000022d6a <__brute_force_MOD_digits_2.constprop.2.isra.0>:
0000000000023898 <__brute_force_MOD_digits_2.constprop.1.isra.0>:

However, in LLVM, this function is not split:

$ objdump -D exchange2_r_llvm | grep "digits_2.*:$"
00000000000115a0 <_QMbrute_forcePdigits_2>:

By applying this patch, LLVM now exhibits similar behavior, resulting in a substantial performance uplift.

$ objdump -D exchange2_r_patched_llvm | grep "digits_2.*:$"
0000000000011ab0 <_QMbrute_forcePdigits_2>:
0000000000018a4e <_QMbrute_forcePdigits_2.specialized.1>:
0000000000019820 <_QMbrute_forcePdigits_2.specialized.2>:
000000000001a436 <_QMbrute_forcePdigits_2.specialized.3>:
000000000001ae78 <_QMbrute_forcePdigits_2.specialized.4>:
000000000001ba8e <_QMbrute_forcePdigits_2.specialized.5>:
000000000001c7e6 <_QMbrute_forcePdigits_2.specialized.6>:
000000000001d072 <_QMbrute_forcePdigits_2.specialized.7>:
000000000001dad0 <_QMbrute_forcePdigits_2.specialized.8>:
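
For readers unfamiliar with these *.specialized.N symbols, here is a minimal sketch of the transformation they come from (hypothetical IR, not extracted from exchange2_r): when IPSCCP runs with function specialization enabled (AllowFuncSpec), a call site that passes a constant argument may get its callee cloned, with the constant propagated into the clone so that depth checks and recursive arguments can fold.

; Hypothetical example, for illustration only.
define internal i32 @work(i32 %depth) {
entry:
  %done = icmp eq i32 %depth, 0
  br i1 %done, label %base, label %rec
base:
  ret i32 1
rec:
  %next = add nsw i32 %depth, -1
  %r = call i32 @work(i32 %next)     ; recursive call with a smaller depth
  %sum = add nsw i32 %r, 1
  ret i32 %sum
}

define i32 @driver() {
  ; With the constant visible here, the specializer may emit a clone such as
  ; @work.specialized.1 in which %depth is replaced by 3.
  %r = call i32 @work(i32 3)
  ret i32 %r
}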

We used perf stat to measure the instruction count of exchange2_r 0 on rv64gc, as shown in the table below:

Compiler Instructions
GCC #d28ea8e5 55,965,728,914
LLVM #62d44fbd 105,416,890,241
LLVM #62d44fbd with this patch 62,693,427,761

Additionally, I performed tests on x86_64, yielding similar results:

Compiler cpu_atom instructions
LLVM #62d44fbd 100,147,914,793
LLVM #62d44fbd with this patch 53,077,337,115

Moving the Interprocedural Constant Propagation (IPSCCP) pass to run after the
inliner pipeline can enhance optimization effectiveness. Performance uplift
for SPEC2017:548.exchange2_r on rv64gc is over 40%.
@dtcxzyw (Member) commented Jun 25, 2024

Failed Tests (14):
LLVM :: Other/new-pm-defaults.ll
LLVM :: Other/new-pm-thinlto-postlink-defaults.ll
LLVM :: Other/new-pm-thinlto-postlink-pgo-defaults.ll
LLVM :: Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
LLVM :: Other/new-pm-thinlto-prelink-defaults.ll
LLVM :: Other/new-pm-thinlto-prelink-pgo-defaults.ll
LLVM :: Other/new-pm-thinlto-prelink-samplepgo-defaults.ll
LLVM :: ThinLTO/X86/devirt2.ll
LLVM :: ThinLTO/X86/devirt_promote.ll
LLVM :: ThinLTO/X86/devirt_promote_legacy.ll
LLVM :: Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll
LLVM :: Transforms/PhaseOrdering/dce-after-argument-promotion.ll
LLVM :: Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
LLVM :: Transforms/PhaseOrdering/icmp-ashr-breaking-select-idiom.ll

Please update these tests.

@nikic (Contributor) left a comment

We currently rely on IPSCCP happening early, so it can make use of dominating conditions before they have been converted to selects.
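
As a minimal, hypothetical illustration of this point (not code from this PR): IPSCCP can use PredicateInfo to exploit a branch condition that dominates a call, but once later passes sink the calls and replace the control flow with a select, that per-call-site constant is gone.

declare void @sink(i32)

; Early in the pipeline: the dominating %cmp lets IPSCCP treat %x as 0 in %a,
; so the first call can be rewritten to @sink(i32 0).
define void @early(i32 %x) {
entry:
  %cmp = icmp eq i32 %x, 0
  br i1 %cmp, label %a, label %b
a:
  call void @sink(i32 %x)
  ret void
b:
  call void @sink(i32 1)
  ret void
}

; After the common call has been sunk behind a select, there is only one call
; site and its argument is no longer a constant that SCCP can propagate.
define void @late(i32 %x) {
entry:
  %cmp = icmp eq i32 %x, 0
  %arg = select i1 %cmp, i32 %x, i32 1
  call void @sink(i32 %arg)
  ret void
}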

@dtcxzyw (Member) commented Jun 25, 2024

This patch causes some significant performance regressions on llvm-test-suite (rv64gc-O3-thinlto):

Name Before After Ratio
SingleSource/Benchmarks/Shootout/Shootout-random 2.150161677 3.300161641 +53.5%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv 0.111845159 0.145389494 +30.0%
SingleSource/Benchmarks/Adobe-C++/functionobjects 5.489498263 6.827863965 +24.4%

@llvmbot added the clang and llvm:transforms labels on Jun 27, 2024
@llvmbot (Member) commented Jun 27, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-clang

Author: SiHuaN (sihuan)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/96620.diff

12 Files Affected:

  • (modified) clang/test/CodeGen/attr-counted-by.c (+2-2)
  • (modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+6)
  • (modified) llvm/test/Other/new-pm-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-defaults.ll (+2-1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll (+1)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll (+7-25)
  • (modified) llvm/test/Transforms/PhaseOrdering/dce-after-argument-promotion.ll (+3-4)
  • (modified) llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll (+15-40)
diff --git a/clang/test/CodeGen/attr-counted-by.c b/clang/test/CodeGen/attr-counted-by.c
index 79922eb4159f1..8d0e39d0e3dad 100644
--- a/clang/test/CodeGen/attr-counted-by.c
+++ b/clang/test/CodeGen/attr-counted-by.c
@@ -639,7 +639,7 @@ void test6(struct anon_struct *p, int index) {
   p->array[index] = __builtin_dynamic_object_size(p->array, 1);
 }
 
-// SANITIZE-WITH-ATTR-LABEL: define dso_local i64 @test6_bdos(
+// SANITIZE-WITH-ATTR-LABEL: define dso_local range(i64 0, -3) i64 @test6_bdos(
 // SANITIZE-WITH-ATTR-SAME: ptr nocapture noundef readonly [[P:%.*]]) local_unnamed_addr #[[ATTR2]] {
 // SANITIZE-WITH-ATTR-NEXT:  entry:
 // SANITIZE-WITH-ATTR-NEXT:    [[DOT_COUNTED_BY_GEP:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 8
@@ -649,7 +649,7 @@ void test6(struct anon_struct *p, int index) {
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP1:%.*]] = select i1 [[DOTINV]], i64 0, i64 [[TMP0]]
 // SANITIZE-WITH-ATTR-NEXT:    ret i64 [[TMP1]]
 //
-// NO-SANITIZE-WITH-ATTR-LABEL: define dso_local i64 @test6_bdos(
+// NO-SANITIZE-WITH-ATTR-LABEL: define dso_local range(i64 0, -3) i64 @test6_bdos(
 // NO-SANITIZE-WITH-ATTR-SAME: ptr nocapture noundef readonly [[P:%.*]]) local_unnamed_addr #[[ATTR2]] {
 // NO-SANITIZE-WITH-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[DOT_COUNTED_BY_GEP:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 8
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 926515c9508a9..5659c116e9c95 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1204,6 +1204,12 @@ PassBuilder::buildModuleSimplificationPipeline(OptimizationLevel Level,
   else
     MPM.addPass(buildInlinerPipeline(Level, Phase));
 
+  MPM.addPass(IPSCCPPass(
+              IPSCCPOptions(/*AllowFuncSpec=*/
+                            Level != OptimizationLevel::Os &&
+                            Level != OptimizationLevel::Oz &&
+                            !isLTOPreLink(Phase))));
+
   // Remove any dead arguments exposed by cleanups, constant folding globals,
   // and argument promotion.
   MPM.addPass(DeadArgumentEliminationPass());
diff --git a/llvm/test/Other/new-pm-defaults.ll b/llvm/test/Other/new-pm-defaults.ll
index 489aed40c190b..32fd2e2d3ffbf 100644
--- a/llvm/test/Other/new-pm-defaults.ll
+++ b/llvm/test/Other/new-pm-defaults.ll
@@ -229,6 +229,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
index 064362eabbf83..40468e917bc40 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
@@ -1,4 +1,4 @@
-; The IR below was crafted so as:
+	; The IR below was crafted so as:
 ; 1) To have a loop, so we create a loop pass manager
 ; 2) To be "immutable" in the sense that no pass in the standard
 ;    pipeline will modify it.
@@ -156,6 +156,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
index 19a44867e434a..b4b63d8cb3d6f 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
@@ -140,6 +140,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
index ac80a31d8fd4b..72753a169d288 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
@@ -148,6 +148,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-prelink-defaults.ll b/llvm/test/Other/new-pm-thinlto-prelink-defaults.ll
index 42ef49f8f7c7e..896feeb801f5f 100644
--- a/llvm/test/Other/new-pm-thinlto-prelink-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-prelink-defaults.ll
@@ -188,6 +188,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll
index e74f88c1a3bf9..687e9d08ee31d 100644
--- a/llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll
@@ -187,6 +187,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll
index 210a4ef1f7664..0f87f3a5b1615 100644
--- a/llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll
@@ -152,6 +152,7 @@
 ; CHECK-O-NEXT: Running pass: InvalidateAnalysisPass<{{.*}}ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: ShouldNotRunFunctionPassesAnalysis
 ; CHECK-O-NEXT: Invalidating analysis: InlineAdvisorAnalysis
+; CHECK-O-NEXT: Running pass: IPSCCPPass
 ; CHECK-O-NEXT: Running pass: DeadArgumentEliminationPass
 ; CHECK-O-NEXT: Running pass: CoroCleanupPass
 ; CHECK-O-NEXT: Running pass: GlobalOptPass
diff --git a/llvm/test/Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll b/llvm/test/Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll
index 335a850f1ec68..bd29141786ebb 100644
--- a/llvm/test/Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll
+++ b/llvm/test/Transforms/PhaseOrdering/AArch64/constraint-elimination-placement.ll
@@ -105,29 +105,20 @@ define void @test2(ptr %this) #0 {
 ; CHECK-NEXT:      i64 17, label [[IF_END_I31:%.*]]
 ; CHECK-NEXT:    ]
 ; CHECK:       if.end.i:
-; CHECK-NEXT:    [[CALL8_I_I:%.*]] = tail call fastcc noundef i32 @test2_fn6()
-; CHECK-NEXT:    [[TRUNC_I_I:%.*]] = trunc i32 [[CALL8_I_I]] to i8
-; CHECK-NEXT:    [[CALL1_I1_I:%.*]] = tail call i1 @test2_fn4(i8 [[TRUNC_I_I]])
+; CHECK-NEXT:    [[CALL1_I1_I:%.*]] = tail call i1 @test2_fn4(i8 0)
 ; CHECK-NEXT:    [[TMP0:%.*]] = xor i1 [[CALL1_I1_I]], true
 ; CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP0]])
 ; CHECK-NEXT:    br label [[COMMON_RET]]
 ; CHECK:       test2_fn2.exit12:
-; CHECK-NEXT:    [[CALL8_I_I8:%.*]] = tail call fastcc noundef i32 @test2_fn6()
-; CHECK-NEXT:    [[TRUNC_I_I9:%.*]] = trunc i32 [[CALL8_I_I8]] to i8
-; CHECK-NEXT:    [[CALL1_I1_I10:%.*]] = tail call i1 @test2_fn4(i8 [[TRUNC_I_I9]])
+; CHECK-NEXT:    [[CALL1_I1_I10:%.*]] = tail call i1 @test2_fn4(i8 0)
 ; CHECK-NEXT:    [[TMP1:%.*]] = xor i1 [[CALL1_I1_I10]], true
 ; CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP1]])
-; CHECK-NEXT:    [[CMP4_I11:%.*]] = icmp eq i32 [[CALL8_I_I8]], 0
-; CHECK-NEXT:    br i1 [[CMP4_I11]], label [[TEST2_FN2_EXIT24:%.*]], label [[COMMON_RET]]
-; CHECK:       common.ret:
-; CHECK-NEXT:    ret void
-; CHECK:       test2_fn2.exit24:
 ; CHECK-NEXT:    store i8 0, ptr [[THIS]], align 4
 ; CHECK-NEXT:    br label [[COMMON_RET]]
+; CHECK:       common.ret:
+; CHECK-NEXT:    ret void
 ; CHECK:       if.end.i31:
-; CHECK-NEXT:    [[CALL8_I_I32:%.*]] = tail call fastcc noundef i32 @test2_fn6()
-; CHECK-NEXT:    [[TRUNC_I_I33:%.*]] = trunc i32 [[CALL8_I_I32]] to i8
-; CHECK-NEXT:    [[CALL1_I1_I34:%.*]] = tail call i1 @test2_fn4(i8 [[TRUNC_I_I33]])
+; CHECK-NEXT:    [[CALL1_I1_I34:%.*]] = tail call i1 @test2_fn4(i8 0)
 ; CHECK-NEXT:    [[TMP2:%.*]] = xor i1 [[CALL1_I1_I34]], true
 ; CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP2]])
 ; CHECK-NEXT:    br label [[COMMON_RET]]
@@ -162,16 +153,12 @@ define i1 @test2_fn2(ptr %__rhs) #0 {
 ; CHECK-NEXT:    [[CMP2_NOT:%.*]] = icmp eq i64 [[CALL]], [[COND_I]]
 ; CHECK-NEXT:    br i1 [[CMP2_NOT]], label [[IF_END:%.*]], label [[CLEANUP:%.*]]
 ; CHECK:       if.end:
-; CHECK-NEXT:    [[CALL8_I:%.*]] = tail call fastcc noundef i32 @test2_fn6()
-; CHECK-NEXT:    [[TRUNC_I:%.*]] = trunc i32 [[CALL8_I]] to i8
-; CHECK-NEXT:    [[CALL1_I1:%.*]] = tail call i1 @test2_fn4(i8 [[TRUNC_I]])
+; CHECK-NEXT:    [[CALL1_I1:%.*]] = tail call i1 @test2_fn4(i8 0)
 ; CHECK-NEXT:    [[TMP0:%.*]] = xor i1 [[CALL1_I1]], true
 ; CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP0]])
-; CHECK-NEXT:    [[CMP4:%.*]] = icmp eq i32 [[CALL8_I]], 0
 ; CHECK-NEXT:    br label [[CLEANUP]]
 ; CHECK:       cleanup:
-; CHECK-NEXT:    [[RETVAL_0:%.*]] = phi i1 [ [[CMP4]], [[IF_END]] ], [ false, [[ENTRY:%.*]] ]
-; CHECK-NEXT:    ret i1 [[RETVAL_0]]
+; CHECK-NEXT:    ret i1 [[CMP2_NOT]]
 ;
 entry:
   %call = call i64 @strlen(ptr %__rhs)
@@ -231,11 +218,6 @@ entry:
 }
 
 define internal i32 @test2_fn6() {
-; CHECK-LABEL: define internal fastcc noundef i32 @test2_fn6(
-; CHECK-SAME: ) unnamed_addr #[[ATTR5]] {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    ret i32 0
-;
 entry:
   %call = call i32 @memcmp(ptr @.str.79, ptr @.str.79, i64 2)
   ret i32 %call
diff --git a/llvm/test/Transforms/PhaseOrdering/dce-after-argument-promotion.ll b/llvm/test/Transforms/PhaseOrdering/dce-after-argument-promotion.ll
index c33fcfbe6ed97..cc144a700d63a 100644
--- a/llvm/test/Transforms/PhaseOrdering/dce-after-argument-promotion.ll
+++ b/llvm/test/Transforms/PhaseOrdering/dce-after-argument-promotion.ll
@@ -9,10 +9,9 @@
 
 define internal void @f(ptr byval(%struct.ss) align 8 %b, ptr byval(i32) align 4 %X) noinline nounwind  {
 ; CHECK-LABEL: define {{[^@]+}}@f
-; CHECK-SAME: (i32 [[B_0:%.*]]){{[^#]*}} #[[ATTR0:[0-9]+]] {
+; CHECK-SAME: (){{[^#]*}} #[[ATTR0:[0-9]+]] {
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TEMP:%.*]] = add i32 [[B_0]], 1
-; CHECK-NEXT:    store i32 [[TEMP]], ptr [[DUMMY]], align 4
+; CHECK-NEXT:    store i32 2, ptr [[DUMMY]], align 4
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -27,7 +26,7 @@ define i32 @test(ptr %X) {
 ; CHECK-LABEL: define {{[^@]+}}@test
 ; CHECK-SAME: (ptr {{[^%]*}} [[X:%.*]]){{[^#]*}} #[[ATTR1:[0-9]+]] {
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    tail call {{.*}}void @f(i32 1)
+; CHECK-NEXT:    tail call {{.*}}void @f()
 ; CHECK-NEXT:    ret i32 0
 ;
 entry:
diff --git a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
index 689f4a9798a75..09c7b1cebef8e 100644
--- a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
+++ b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -passes='default<O3>' -S < %s  | FileCheck %s --check-prefixes=ALL,O3
-; RUN: opt -passes='default<O2>' -S < %s  | FileCheck %s --check-prefixes=ALL,O2
-; RUN: opt -passes='default<O1>' -S < %s  | FileCheck %s --check-prefixes=ALL,O1
+; RUN: opt -passes='default<O3>' -S < %s  | FileCheck %s
+; RUN: opt -passes='default<O2>' -S < %s  | FileCheck %s
+; RUN: opt -passes='default<O1>' -S < %s  | FileCheck %s
 
 ; All these tests should optimize to a single comparison
 ; of the original argument with null. There should be no loops.
@@ -9,10 +9,10 @@
 %struct.node = type { ptr, i32 }
 
 define dso_local zeroext i1 @is_not_empty_variant1(ptr %p) {
-; ALL-LABEL: @is_not_empty_variant1(
-; ALL-NEXT:  entry:
-; ALL-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp ne ptr [[P:%.*]], null
-; ALL-NEXT:    ret i1 [[TOBOOL_NOT3_I]]
+; CHECK-LABEL: @is_not_empty_variant1(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp ne ptr [[P:%.*]], null
+; CHECK-NEXT:    ret i1 [[TOBOOL_NOT3_I]]
 ;
 entry:
   %p.addr = alloca ptr, align 8
@@ -51,10 +51,10 @@ while.end:
 }
 
 define dso_local zeroext i1 @is_not_empty_variant2(ptr %p) {
-; ALL-LABEL: @is_not_empty_variant2(
-; ALL-NEXT:  entry:
-; ALL-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne ptr [[P:%.*]], null
-; ALL-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
+; CHECK-LABEL: @is_not_empty_variant2(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne ptr [[P:%.*]], null
+; CHECK-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 entry:
   %p.addr = alloca ptr, align 8
@@ -96,35 +96,10 @@ while.end:
 }
 
 define dso_local zeroext i1 @is_not_empty_variant3(ptr %p) {
-; O3-LABEL: @is_not_empty_variant3(
-; O3-NEXT:  entry:
-; O3-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne ptr [[P:%.*]], null
-; O3-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
-;
-; O2-LABEL: @is_not_empty_variant3(
-; O2-NEXT:  entry:
-; O2-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne ptr [[P:%.*]], null
-; O2-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
-;
-; O1-LABEL: @is_not_empty_variant3(
-; O1-NEXT:  entry:
-; O1-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp eq ptr [[P:%.*]], null
-; O1-NEXT:    br i1 [[TOBOOL_NOT4_I]], label [[COUNT_NODES_VARIANT3_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; O1:       while.body.i:
-; O1-NEXT:    [[SIZE_06_I:%.*]] = phi i64 [ [[INC_I:%.*]], [[WHILE_BODY_I]] ], [ 0, [[ENTRY:%.*]] ]
-; O1-NEXT:    [[P_ADDR_05_I:%.*]] = phi ptr [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY]] ]
-; O1-NEXT:    [[CMP_I:%.*]] = icmp ne i64 [[SIZE_06_I]], -1
-; O1-NEXT:    call void @llvm.assume(i1 [[CMP_I]])
-; O1-NEXT:    [[TMP0]] = load ptr, ptr [[P_ADDR_05_I]], align 8
-; O1-NEXT:    [[INC_I]] = add i64 [[SIZE_06_I]], 1
-; O1-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq ptr [[TMP0]], null
-; O1-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT:%.*]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP0:![0-9]+]]
-; O1:       count_nodes_variant3.exit.loopexit:
-; O1-NEXT:    [[PHI_CMP:%.*]] = icmp ne i64 [[INC_I]], 0
-; O1-NEXT:    br label [[COUNT_NODES_VARIANT3_EXIT]]
-; O1:       count_nodes_variant3.exit:
-; O1-NEXT:    [[SIZE_0_LCSSA_I:%.*]] = phi i1 [ false, [[ENTRY]] ], [ [[PHI_CMP]], [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT]] ]
-; O1-NEXT:    ret i1 [[SIZE_0_LCSSA_I]]
+; CHECK-LABEL: @is_not_empty_variant3(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne ptr [[P:%.*]], null
+; CHECK-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 entry:
   %p.addr = alloca ptr, align 8

@dtcxzyw (Member) commented Jun 27, 2024

This patch causes some significant performance regressions on llvm-test-suite (rv64gc-O3-thinlto):

Name Before After Ratio
SingleSource/Benchmarks/Shootout/Shootout-random 2.150161677 3.300161641 +53.5%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv 0.111845159 0.145389494 +30.0%
SingleSource/Benchmarks/Adobe-C++/functionobjects 5.489498263 6.827863965 +24.4%

It has been fixed. But this patch didn't show a positive net effect :(

@goldsteinn (Contributor) commented:

This patch causes some significant performance regressions on llvm-test-suite (rv64gc-O3-thinlto):
Name Before After Ratio
SingleSource/Benchmarks/Shootout/Shootout-random 2.150161677 3.300161641 +53.5%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv 0.111845159 0.145389494 +30.0%
SingleSource/Benchmarks/Adobe-C++/functionobjects 5.489498263 6.827863965 +24.4%

It has been fixed. But this patch didn't show a positive net effect :(

Does that mean it has a negative net effect, or is it neutral (in which case the original motivating case should be enough)?

@fhahn (Contributor) left a comment

Running IPSCCP twice seems like quite a heavy hammer; I'd expect a noticeable compile-time impact.

I'd recommend trying to extract a reproducer from your motivating use case and checking why IPSCCP cannot perform the desired optimization before inlining. Note that I think we already run SCCP, the non-interprocedural version of IPSCCP, after inlining.
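
As an illustration of the kind of reduction this suggests (a hypothetical sketch, not reduced from exchange2_r): one pattern to look for is a constant that only reaches the interesting call site after inlining, for example because it flows through an externally visible function whose callers the early IPSCCP run cannot enumerate.

; @leaf is internal and could be specialized on a constant %depth.
define internal i32 @leaf(i32 %depth) {
entry:
  %cmp = icmp sgt i32 %depth, 4
  br i1 %cmp, label %deep, label %shallow
deep:
  ret i32 1
shallow:
  ret i32 0
}

; @wrapper is externally visible, so early IPSCCP must assume unknown callers
; and cannot treat %n as a constant at the call to @leaf.
define i32 @wrapper(i32 %n) {
  %r = call i32 @leaf(i32 %n)
  ret i32 %r
}

define i32 @top() {
  ; Once @wrapper is inlined here, @leaf is called with the literal 8, which an
  ; IPSCCP/function-specialization run after the inliner could exploit.
  %r = call i32 @wrapper(i32 8)
  ret i32 %r
}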

@dtcxzyw (Member) commented Jun 27, 2024

Does that mean it has a negative net effect, or is it neutral

It is neutral.

(in which case the original motivating case should be enough).

But this patch may have an impact on compile time.

@nikic (Contributor) commented Jun 27, 2024

@joshua-arch1 (Contributor) commented Jan 16, 2025

@sihuan Are you testing your patch with LTO? Before applying your patch, the function is split into two parts:

0000000000071270 <_QMbrute_forcePdigits_2.specialized.1>:
0000000000074630 <_QMbrute_forcePdigits_2.specialized.5>:

With your patch, it exhibits the same behavior. Do you know how I can get the same behavior as GCC, where digits_2 is split into eight parts? Are you testing with other options?

@sihuan (Contributor, Author) commented Jan 16, 2025

@joshua-arch1
I tested this patch with LTO and it is consistent with what you describe: with the -flto option enabled, digits_2 has two specializations whether or not the patch is applied.
However, with LTO enabled, LLVM's performance is comparable to GCC's (in terms of instruction count). Here are some figures:

On x86_64 Arch Linux, via perf stat ./exchange2_r 0:

Compiler Instructions
gfortran O3 (archlinux gcc-fortran 14.2.1+r134+gab884fffe3fc-2) 54,865,827,741
gfortran O3 lto (archlinux gcc-fortran 14.2.1+r134+gab884fffe3fc-2) 54,186,161,980
flang O3 (#62d44fbd) 107,953,750,439
flang O3 lto (#62d44fbd) 53,391,922,257
flang O3 (patched #62d44fbd) 53,146,581,727
flang O3 lto (patched #62d44fbd) 53,391,760,016

When using LTO, I don't think there is a problem of performance being worse than GCC's, so this patch is not specific to LTO.

@joshua-arch1 (Contributor) commented:

@joshua-arch1 I tested this patch with LTO and it is consistent with what you describe: with the -flto option enabled, digits_2 has two specializations whether or not the patch is applied. However, with LTO enabled, LLVM's performance is comparable to GCC's (in terms of instruction count). Here are some figures:

On x86_64 Arch Linux, via perf stat ./exchange2_r 0:

Compiler Instructions
gfortran O3 (archlinux gcc-fortran 14.2.1+r134+gab884fffe3fc-2) 54,865,827,741
gfortran O3 lto (archlinux gcc-fortran 14.2.1+r134+gab884fffe3fc-2) 54,186,161,980
flang O3 (#62d44fbd) 107,953,750,439
flang O3 lto (#62d44fbd) 53,391,922,257
flang O3 (patched #62d44fbd) 53,146,581,727
flang O3 lto (patched #62d44fbd) 53,391,760,016
When using LTO, I don't think there is a problem of performance being worse than GCC's, so this patch is not specific to LTO.

Hi @sihuan, I have tried applying your initial design, where IPSCCP is moved directly after the inliner. It works with LTO, and digits_2 has eight specializations, just like gfortran.
