Skip to content

[RISCV] Enable merging of external globals by default #117880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

asb
Copy link
Contributor

@asb asb commented Nov 27, 2024

This follows up #115495 by enabling merging of external globals by default, which had been left as a next step in order to make the previous change more incremental and so we can more easily narrow down on any identified regressions.

Enabling merging of external globals matches what Arm does (for non mach-o targets), though AArch64 doesn't as there were some concerns it might cause regressions in some cases.

Here are the results I have from specrate benchmarks (train workload, figures are execution time in seconds):

oldmerge mean oldmerge stdev newmerge mean newmerge stdev abs old vs new change old vs new change as %
500.perlbench_r 159.6 0.4 159.7 0.6 0.1 0.06%
502.gcc_r 408.8 2.3 409.7 2.4 0.9 0.22%
505.mcf_r 619.4 3.6 620.6 4.1 1.2 0.19%
520.omnetpp_r 573.7 5 569.1 2.4 -4.6 -0.80%
525.x264_r 183.9 0.2 184 0.4 0.1 0.05%
531.deepsjeng_r 280.6 0.9 280 0.7 -0.6 -0.21%
541.leela_r 284.6 1.1 284.5 0.9 -0.1 -0.04%
557.xz_r 208.8 3 209.5 3.2 0.7 0.34%

This is on the Spacemit X60, with essentially rva22+v. omnetpp is the main one that sees a potentially positive change, though there was an annoyingly high standard deviation on that benchmark.

Looking at dhyrstone we see a positive codegen change in Proc0 which translates to a reduced instruction count, though not decreased runtime:

$ perf stat -r100 ./dry.oldmerge

 Performance counter stats for './dry.oldmerge' (100 runs):

            815.02 msec task-clock:u                     #    0.999 CPUs utilized            ( +-  0.02% )
                 3      context-switches:u               #    3.681 /sec                     ( +-  2.75% )
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   56.443 /sec
     1,304,052,614      cycles:u                         #    1.600 GHz                      ( +-  0.02% )
     1,601,705,197      instructions:u                   #    1.23  insn per cycle           ( +-  0.00% )
       100,021,180      branches:u                       #  122.728 M/sec                    ( +-  0.00% )
             3,427      branch-misses:u                  #    0.00% of all branches          ( +-  0.68% )

          0.815862 +- 0.000137 seconds time elapsed  ( +-  0.02% )

$ perf stat -r100 ./dry.newmerge

 Performance counter stats for './dry.newmerge' (100 runs):

            815.29 msec task-clock:u                     #    0.999 CPUs utilized            ( +-  0.00% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   56.429 /sec
     1,304,461,619      cycles:u                         #    1.600 GHz                      ( +-  0.00% )
     1,501,837,638      instructions:u                   #    1.15  insn per cycle           ( +-  0.00% )
       100,021,173      branches:u                       #  122.697 M/sec                    ( +-  0.00% )
             2,813      branch-misses:u                  #    0.00% of all branches          ( +-  0.82% )

         0.8161623 +- 0.0000465 seconds time elapsed  ( +-  0.01% )

EDIT: I should elaborate on my overall view here / what feedback I'm looking for. At least for the tests I've run, the differences seem mainly in the noise and the reduction in instructions in dhrystone is meaningful enough (even if execution time isn't affected for the spacemit x60 microarch) that it seems sensible to move forwards. But I'd welcome any opposing thoughts / concerns, or indeed further input from people who've enabled globals merging downstream. I'd understand this probably included external globals merging too as enabled in this patch, but it would be helpful to confirm.

@llvmbot
Copy link
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Alex Bradbury (asb)

Changes

This follows up #115495 by enabling merging of external globals by default, which had been left as a next step in order to make the previous change more incremental and so we can more easily narrow down on any identified regressions.

Enabling merging of external globals matches what Arm does (for non mach-o targets), though AArch64 doesn't as there were some concerns it might cause regressions in some cases.

Here are the results I have from specrate benchmarks:

oldmerge mean oldmerge stdev newmerge mean newmerge stdev abs old vs new change old vs new change as %
500.perlbench_r 159.6 0.4 159.7 0.6 0.1 0.06%
502.gcc_r 408.8 2.3 409.7 2.4 0.9 0.22%
505.mcf_r 619.4 3.6 620.6 4.1 1.2 0.19%
520.omnetpp_r 573.7 5 569.1 2.4 -4.6 -0.80%
525.x264_r 183.9 0.2 184 0.4 0.1 0.05%
531.deepsjeng_r 280.6 0.9 280 0.7 -0.6 -0.21%
541.leela_r 284.6 1.1 284.5 0.9 -0.1 -0.04%
557.xz_r 208.8 3 209.5 3.2 0.7 0.34%

This is on the Spacemit X60, with essentially rva22+v. omnetpp is the main one that sees a potentially positive change, though there was an annoyingly high standard deviation on that benchmark.

Looking at dhyrstone we see a positive codegen change in Proc0 which translates to a reduced instruction count, though not decreased runtime:

$ perf stat -r100 ./dry.oldmerge

 Performance counter stats for './dry.oldmerge' (100 runs):

            815.02 msec task-clock:u                     #    0.999 CPUs utilized            ( +-  0.02% )
                 3      context-switches:u               #    3.681 /sec                     ( +-  2.75% )
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   56.443 /sec
     1,304,052,614      cycles:u                         #    1.600 GHz                      ( +-  0.02% )
     1,601,705,197      instructions:u                   #    1.23  insn per cycle           ( +-  0.00% )
       100,021,180      branches:u                       #  122.728 M/sec                    ( +-  0.00% )
             3,427      branch-misses:u                  #    0.00% of all branches          ( +-  0.68% )

          0.815862 +- 0.000137 seconds time elapsed  ( +-  0.02% )

$ perf stat -r100 ./dry.newmerge

 Performance counter stats for './dry.newmerge' (100 runs):

            815.29 msec task-clock:u                     #    0.999 CPUs utilized            ( +-  0.00% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   56.429 /sec
     1,304,461,619      cycles:u                         #    1.600 GHz                      ( +-  0.00% )
     1,501,837,638      instructions:u                   #    1.15  insn per cycle           ( +-  0.00% )
       100,021,173      branches:u                       #  122.697 M/sec                    ( +-  0.00% )
             2,813      branch-misses:u                  #    0.00% of all branches          ( +-  0.82% )

         0.8161623 +- 0.0000465 seconds time elapsed  ( +-  0.01% )

Full diff: https://github.com/llvm/llvm-project/pull/117880.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+4-7)
  • (modified) llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-nonzero.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-zero.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/global-merge-minsize.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/global-merge-offset.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/global-merge.ll (+6-6)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index fa507653264ccd..7a2e2a90699553 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -51,12 +51,9 @@ static cl::opt<cl::boolOrDefault>
     EnableGlobalMerge("riscv-enable-global-merge", cl::Hidden,
                       cl::desc("Enable the global merge pass"));
 
-static cl::opt<bool> ForceEnableGlobalMergeExternalGlobals(
-    "riscv-force-enable-global-merge-external-globals", cl::Hidden,
-    cl::init(false),
-    cl::desc(
-        "If the global merge pass is enabled, force enable global merging of "
-        "external globals (overriding any logic that might disable it)"));
+static cl::opt<bool> DisableGlobalMergeExternalGlobals(
+    "riscv-disable-global-merge-external-globals", cl::Hidden, cl::init(false),
+    cl::desc("Disable global merging of external globals"));
 
 static cl::opt<bool>
     EnableMachineCombiner("riscv-enable-machine-combiner",
@@ -484,7 +481,7 @@ bool RISCVPassConfig::addPreISel() {
     addPass(createGlobalMergePass(TM, /* MaxOffset */ 2047,
                                   /* OnlyOptimizeForSize */ false,
                                   /* MergeExternalByDefault */
-                                  ForceEnableGlobalMergeExternalGlobals));
+                                  !DisableGlobalMergeExternalGlobals));
   }
 
   return false;
diff --git a/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-nonzero.ll b/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-nonzero.ll
index 156a34db827456..77fe1783bb5d5e 100644
--- a/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-nonzero.ll
+++ b/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-nonzero.ll
@@ -1,8 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv64 -riscv-force-enable-global-merge-external-globals \
-; RUN:   -verify-machineinstrs < %s | FileCheck %s -check-prefix=SMALL-DATA
-; RUN: llc -mtriple=riscv64 -global-merge-min-data-size=0 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s -check-prefix=SMALL-DATA
+; RUN: llc -mtriple=riscv64 -global-merge-min-data-size=0 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=MINSIZE
 
 @ig1 = internal global i32 0, align 4
diff --git a/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-zero.ll b/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-zero.ll
index e41f14d394b7ca..c29749c17a5b5c 100644
--- a/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-zero.ll
+++ b/llvm/test/CodeGen/RISCV/global-merge-minsize-smalldata-zero.ll
@@ -1,8 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -riscv-force-enable-global-merge-external-globals \
-; RUN:   -verify-machineinstrs < %s | FileCheck %s -check-prefix=SMALL-DATA
-; RUN: llc -mtriple=riscv64 -global-merge-min-data-size=5 \
-; RUN:   -riscv-force-enable-global-merge-external-globals  -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s -check-prefix=SMALL-DATA
+; RUN: llc -mtriple=riscv64 -global-merge-min-data-size=5 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=MINSIZE
 
 @ig1 = internal global i32 0, align 4
diff --git a/llvm/test/CodeGen/RISCV/global-merge-minsize.ll b/llvm/test/CodeGen/RISCV/global-merge-minsize.ll
index 1d65d9d1732ba3..915dde388cffdc 100644
--- a/llvm/test/CodeGen/RISCV/global-merge-minsize.ll
+++ b/llvm/test/CodeGen/RISCV/global-merge-minsize.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -riscv-force-enable-global-merge-external-globals \
+; RUN: llc -mtriple=riscv32 \
 ; RUN:   -verify-machineinstrs < %s | FileCheck %s -check-prefix=RV32
 ; RUN: llc -mtriple=riscv32 -global-merge-min-data-size=5 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs < %s | FileCheck %s -check-prefix=RV32-MINSIZE
+; RUN:   -verify-machineinstrs < %s | FileCheck %s -check-prefix=RV32-MINSIZE
 
 @ig1 = internal global i32 0, align 4
 @ig2 = internal global i32 0, align 4
diff --git a/llvm/test/CodeGen/RISCV/global-merge-offset.ll b/llvm/test/CodeGen/RISCV/global-merge-offset.ll
index bb8264ee438545..c1074bc8ca97ed 100644
--- a/llvm/test/CodeGen/RISCV/global-merge-offset.ll
+++ b/llvm/test/CodeGen/RISCV/global-merge-offset.ll
@@ -1,12 +1,12 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: sed 's/ArrSize/100/g' %s | llc -mtriple=riscv32 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs | FileCheck %s
+; RUN:   -verify-machineinstrs | FileCheck %s
 ; RUN: sed 's/ArrSize/100/g' %s | llc -mtriple=riscv64 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs | FileCheck %s
+; RUN:   -verify-machineinstrs | FileCheck %s
 ; RUN: sed 's/ArrSize/101/g' %s | llc -mtriple=riscv32 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs | FileCheck %s -check-prefix=CHECK-TOOBIG
+; RUN:   -verify-machineinstrs | FileCheck %s -check-prefix=CHECK-TOOBIG
 ; RUN: sed 's/ArrSize/101/g' %s | llc -mtriple=riscv64 \
-; RUN:   -riscv-force-enable-global-merge-external-globals -verify-machineinstrs | FileCheck %s -check-prefix=CHECK-TOOBIG
+; RUN:   -verify-machineinstrs | FileCheck %s -check-prefix=CHECK-TOOBIG
 
 ; This test demonstrates that the MaxOffset is set correctly for RISC-V by
 ; constructing an input that is at the limit and comparing.
diff --git a/llvm/test/CodeGen/RISCV/global-merge.ll b/llvm/test/CodeGen/RISCV/global-merge.ll
index 327a6b54f4be25..fa0dda3736906b 100644
--- a/llvm/test/CodeGen/RISCV/global-merge.ll
+++ b/llvm/test/CodeGen/RISCV/global-merge.ll
@@ -1,11 +1,11 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s | FileCheck %s
-; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s | FileCheck %s
-; RUN: llc -mtriple=riscv32 \
-; RUN:     -riscv-force-enable-global-merge-external-globals -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -riscv-disable-global-merge-external-globals \
+; RUN:     -verify-machineinstrs < %s | FileCheck %s
+; RUN: llc -mtriple=riscv64 -riscv-disable-global-merge-external-globals \
+; RUN:     -verify-machineinstrs < %s | FileCheck %s
+; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefix=CHECK-WEXTERN %s
-; RUN: llc -mtriple=riscv64 \
-; RUN:     -riscv-force-enable-global-merge-external-globals -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefix=CHECK-WEXTERN %s
 
 @ig1 = internal global i32 0, align 4

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thank you for the performance data. I agree the performance is pretty much in the noise, but this makes sense in concept. Given that, I think moving forward here is the right answer.

Copy link
Collaborator

@topperc topperc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asb asb merged commit 3787fbf into llvm:main Dec 11, 2024
5 of 7 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 11, 2024

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 6 "test-openmp".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/10066

Here is the relevant piece of the build log for the reference
Step 6 (test-openmp) failure: test (failure)
******************** TEST 'libomp :: tasking/issue-94260-2.c' FAILED ********************
Exit Code: -11

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp   -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -fno-omit-frame-pointer -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/ompt /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic && /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -fno-omit-frame-pointer -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/ompt /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11

--

********************


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants