[AArch64] Disable Pre-RA Scheduler for Neoverse V2 #127784

Closed

Conversation

sjoerdmeijer
Collaborator

We would like to disable the pre-RA machine scheduler for the Neoverse V2 because we have a key workload that massively benefits from this (25% uplift). Despite the machine scheduler being register pressure aware, it results in spills for this workload. Disabling the scheduler seems a lot more attractive than trying to tweak regalloc heuristics:

  • We see no benefit from scheduling on this big core anyway, and never have. I.e., when we added the V2 scheduling model, it wasn't for perf reasons, only to enable LLVM-MCA.
  • Scheduling can consume significant compile time without delivering any perf gains. This is a bad deal.

FWIW: the GCC folks reached the same conclusion not that long ago and did exactly the same, see also:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/667074.html

I guess other big cores could benefit from this too, but I would like to leave that decision to folks with more experience on those cores, which is why I propose changing this only for the V2 here.

Numbers:

  • We know the Eigen library is somewhat sensitive to scheduling, but I found one kernel that regresses by ~2% and another that improves by ~2%. They cancel each other out, and overall the result is neutral.
  • SPEC FP and INT seem totally unaffected.
  • LLVM test-suite: a little bit up and down, all within noise levels I think, so it is neutral.
  • Compile-time numbers: I see a geomean 3% improvement for the LLVM test-suite, and a very decent one for the sqlite amalgamation version.

I haven't looked at post-RA scheduling; maybe that's interesting as a follow-up.

@llvmbot
Member

llvmbot commented Feb 19, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

Patch is 31.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127784.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Features.td (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll (+142-142)
  • (modified) llvm/test/CodeGen/AArch64/selectopt-const.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-zext.ll (+6-6)
diff --git a/llvm/lib/Target/AArch64/AArch64Features.td b/llvm/lib/Target/AArch64/AArch64Features.td
index 357f526d5e308..13c1a386149b7 100644
--- a/llvm/lib/Target/AArch64/AArch64Features.td
+++ b/llvm/lib/Target/AArch64/AArch64Features.td
@@ -669,6 +669,9 @@ def FeatureExynosCheapAsMoveHandling : SubtargetFeature<"exynos-cheap-as-move",
     "HasExynosCheapAsMoveHandling", "true",
     "Use Exynos specific handling of cheap instructions">;
 
+def FeatureDisablePreRAScheduler : SubtargetFeature<"use-prera-scheduler",
+    "DisablePreRAScheduler", "true", "Disable scheduling before register allocation">;
+
 def FeaturePostRAScheduler : SubtargetFeature<"use-postra-scheduler",
     "UsePostRAScheduler", "true", "Schedule again after register allocation">;
 
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index b977b6aaaf619..401a2637fa9f3 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -540,6 +540,7 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
                                       FeatureCmpBccFusion,
                                       FeatureFuseAdrpAdd,
                                       FeatureALULSLFast,
+                                      FeatureDisablePreRAScheduler,
                                       FeaturePostRAScheduler,
                                       FeatureEnableSelectOptimize,
                                       FeatureUseFixedOverScalableIfEqualCost,
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index c6eb77e3bc3ba..9e8c50e376272 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -156,7 +156,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   const LegalizerInfo *getLegalizerInfo() const override;
   const RegisterBankInfo *getRegBankInfo() const override;
   const Triple &getTargetTriple() const { return TargetTriple; }
-  bool enableMachineScheduler() const override { return true; }
+  bool enableMachineScheduler() const override { return !disablePreRAScheduler(); }
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
   bool enableSubRegLiveness() const override { return EnableSubregLiveness; }
 
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
index 02ff12c27fcda..d29fdd23863a6 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
@@ -48,12 +48,12 @@ define i8 @load_atomic_i8_aligned_acquire(ptr %ptr) {
 ; GISEL:    add x8, x0, #4
 ; GISEL:    ldaprb w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #4
 ; SDAG-AVOIDLDAPUR:    ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i8, ptr %gep acquire, align 1
     ret i8 %r
@@ -64,12 +64,12 @@ define i8 @load_atomic_i8_aligned_acquire_const(ptr readonly %ptr) {
 ; GISEL:    add x8, x0, #4
 ; GISEL:    ldaprb w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #4
 ; SDAG-AVOIDLDAPUR:    ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i8, ptr %gep acquire, align 1
     ret i8 %r
@@ -130,12 +130,12 @@ define i16 @load_atomic_i16_aligned_acquire(ptr %ptr) {
 ; GISEL:    add x8, x0, #8
 ; GISEL:    ldaprh w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR:    ldapurh w0, [x0, #8]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #8
 ; SDAG-AVOIDLDAPUR:    ldaprh w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    ldapurh w0, [x0, #8]
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep acquire, align 2
     ret i16 %r
@@ -146,12 +146,12 @@ define i16 @load_atomic_i16_aligned_acquire_const(ptr readonly %ptr) {
 ; GISEL:    add x8, x0, #8
 ; GISEL:    ldaprh w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR:    ldapurh w0, [x0, #8]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #8
 ; SDAG-AVOIDLDAPUR:    ldaprh w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    ldapurh w0, [x0, #8]
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep acquire, align 2
     ret i16 %r
@@ -211,12 +211,12 @@ define i32 @load_atomic_i32_aligned_acquire(ptr %ptr) {
 ; GISEL-LABEL: load_atomic_i32_aligned_acquire:
 ; GISEL:    ldapur w0, [x0, #16]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR:    ldapur w0, [x0, #16]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #16
 ; SDAG-AVOIDLDAPUR:    ldapr w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    ldapur w0, [x0, #16]
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep acquire, align 4
     ret i32 %r
@@ -226,12 +226,12 @@ define i32 @load_atomic_i32_aligned_acquire_const(ptr readonly %ptr) {
 ; GISEL-LABEL: load_atomic_i32_aligned_acquire_const:
 ; GISEL:    ldapur w0, [x0, #16]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR:    ldapur w0, [x0, #16]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #16
 ; SDAG-AVOIDLDAPUR:    ldapr w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    ldapur w0, [x0, #16]
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep acquire, align 4
     ret i32 %r
@@ -291,12 +291,12 @@ define i64 @load_atomic_i64_aligned_acquire(ptr %ptr) {
 ; GISEL-LABEL: load_atomic_i64_aligned_acquire:
 ; GISEL:    ldapur x0, [x0, #32]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR:    ldapur x0, [x0, #32]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #32
 ; SDAG-AVOIDLDAPUR:    ldapr x0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    ldapur x0, [x0, #32]
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep acquire, align 8
     ret i64 %r
@@ -306,12 +306,12 @@ define i64 @load_atomic_i64_aligned_acquire_const(ptr readonly %ptr) {
 ; GISEL-LABEL: load_atomic_i64_aligned_acquire_const:
 ; GISEL:    ldapur x0, [x0, #32]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR:    ldapur x0, [x0, #32]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #32
 ; SDAG-AVOIDLDAPUR:    ldapr x0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    ldapur x0, [x0, #32]
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep acquire, align 8
     ret i64 %r
@@ -440,12 +440,12 @@ define i8 @load_atomic_i8_unaligned_acquire(ptr %ptr) {
 ; GISEL:    add x8, x0, #4
 ; GISEL:    ldaprb w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
-; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #4
 ; SDAG-AVOIDLDAPUR:    ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i8, ptr %gep acquire, align 1
     ret i8 %r
@@ -456,12 +456,12 @@ define i8 @load_atomic_i8_unaligned_acquire_const(ptr readonly %ptr) {
 ; GISEL:    add x8, x0, #4
 ; GISEL:    ldaprb w0, [x8]
 ;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
-;
 ; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
 ; SDAG-AVOIDLDAPUR:    add x8, x0, #4
 ; SDAG-AVOIDLDAPUR:    ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    ldapurb w0, [x0, #4]
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i8, ptr %gep acquire, align 1
     ret i8 %r
@@ -490,9 +490,9 @@ define i16 @load_atomic_i16_unaligned_unordered(ptr %ptr) {
 ; GISEL:    add x1, x8, #4
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_unordered:
-; SDAG:    add x1, x0, #4
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #4
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep unordered, align 1
     ret i16 %r
@@ -503,9 +503,9 @@ define i16 @load_atomic_i16_unaligned_unordered_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #4
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_unordered_const:
-; SDAG:    add x1, x0, #4
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #4
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i8, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep unordered, align 1
     ret i16 %r
@@ -516,9 +516,9 @@ define i16 @load_atomic_i16_unaligned_monotonic(ptr %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_monotonic:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep monotonic, align 1
     ret i16 %r
@@ -529,9 +529,9 @@ define i16 @load_atomic_i16_unaligned_monotonic_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_monotonic_const:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep monotonic, align 1
     ret i16 %r
@@ -542,9 +542,9 @@ define i16 @load_atomic_i16_unaligned_acquire(ptr %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_acquire:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep acquire, align 1
     ret i16 %r
@@ -555,9 +555,9 @@ define i16 @load_atomic_i16_unaligned_acquire_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_acquire_const:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep acquire, align 1
     ret i16 %r
@@ -568,9 +568,9 @@ define i16 @load_atomic_i16_unaligned_seq_cst(ptr %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_seq_cst:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_seq_cst:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep seq_cst, align 1
     ret i16 %r
@@ -581,9 +581,9 @@ define i16 @load_atomic_i16_unaligned_seq_cst_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #8
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i16_unaligned_seq_cst_const:
-; SDAG:    add x1, x0, #8
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_seq_cst_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i16, ptr %ptr, i32 4
     %r = load atomic i16, ptr %gep seq_cst, align 1
     ret i16 %r
@@ -594,9 +594,9 @@ define i32 @load_atomic_i32_unaligned_unordered(ptr %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_unordered:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep unordered, align 1
     ret i32 %r
@@ -607,9 +607,9 @@ define i32 @load_atomic_i32_unaligned_unordered_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_unordered_const:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep unordered, align 1
     ret i32 %r
@@ -620,9 +620,9 @@ define i32 @load_atomic_i32_unaligned_monotonic(ptr %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_monotonic:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep monotonic, align 1
     ret i32 %r
@@ -633,9 +633,9 @@ define i32 @load_atomic_i32_unaligned_monotonic_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_monotonic_const:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep monotonic, align 1
     ret i32 %r
@@ -646,9 +646,9 @@ define i32 @load_atomic_i32_unaligned_acquire(ptr %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_acquire:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep acquire, align 1
     ret i32 %r
@@ -659,9 +659,9 @@ define i32 @load_atomic_i32_unaligned_acquire_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_acquire_const:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep acquire, align 1
     ret i32 %r
@@ -672,9 +672,9 @@ define i32 @load_atomic_i32_unaligned_seq_cst(ptr %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_seq_cst:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_seq_cst:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep seq_cst, align 1
     ret i32 %r
@@ -685,9 +685,9 @@ define i32 @load_atomic_i32_unaligned_seq_cst_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #16
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i32_unaligned_seq_cst_const:
-; SDAG:    add x1, x0, #16
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_seq_cst_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i32, ptr %ptr, i32 4
     %r = load atomic i32, ptr %gep seq_cst, align 1
     ret i32 %r
@@ -698,9 +698,9 @@ define i64 @load_atomic_i64_unaligned_unordered(ptr %ptr) {
 ; GISEL:    add x1, x8, #32
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i64_unaligned_unordered:
-; SDAG:    add x1, x0, #32
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep unordered, align 1
     ret i64 %r
@@ -711,9 +711,9 @@ define i64 @load_atomic_i64_unaligned_unordered_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #32
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i64_unaligned_unordered_const:
-; SDAG:    add x1, x0, #32
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep unordered, align 1
     ret i64 %r
@@ -724,9 +724,9 @@ define i64 @load_atomic_i64_unaligned_monotonic(ptr %ptr) {
 ; GISEL:    add x1, x8, #32
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i64_unaligned_monotonic:
-; SDAG:    add x1, x0, #32
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep monotonic, align 1
     ret i64 %r
@@ -737,9 +737,9 @@ define i64 @load_atomic_i64_unaligned_monotonic_const(ptr readonly %ptr) {
 ; GISEL:    add x1, x8, #32
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i64_unaligned_monotonic_const:
-; SDAG:    add x1, x0, #32
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load atomic i64, ptr %gep monotonic, align 1
     ret i64 %r
@@ -750,9 +750,9 @@ define i64 @load_atomic_i64_unaligned_acquire(ptr %ptr) {
 ; GISEL:    add x1, x8, #32
 ; GISEL:    bl __atomic_load
 ;
-; SDAG-LABEL: load_atomic_i64_unaligned_acquire:
-; SDAG:    add x1, x0, #32
-; SDAG:    bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR:    add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR:    bl __atomic_load
     %gep = getelementptr inbounds i64, ptr %ptr, i32 4
     %r = load a...
[truncated]
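
For readers skimming past the truncated diff (most of it is CHECK-line reordering in the affected tests), the functional change is small: a new tuning feature wired into the TuneNeoverseV2 feature list, plus an override of the subtarget hook that the generic MachineScheduler pass consults. Condensed from the diff above into one C++ sketch (class body abbreviated, not standalone code):

class AArch64Subtarget final : public AArch64GenSubtargetInfo {
public:
  // FeatureDisablePreRAScheduler (added in AArch64Features.td and listed under
  // TuneNeoverseV2 in AArch64Processors.td) sets DisablePreRAScheduler, so the
  // pre-RA machine scheduler is skipped when tuning for the Neoverse V2.
  bool enableMachineScheduler() const override {
    return !disablePreRAScheduler();
  }
  // Post-RA scheduling stays controlled by its own, pre-existing feature.
  bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
};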

github-actions bot commented Feb 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@sjoerdmeijer
Collaborator Author

Ah, now that #127620 was merged, I need to set this for Grace too.

@sjoerdmeijer
Collaborator Author

Ah, now that #127620 was merged, I need to set this for Grace too.

Ignore this; it isn't necessary given the way Grace is defined.

Collaborator

@davemgreen davemgreen left a comment

This doesn't feel like the right approach to me - in that if it was randomly right or wrong before, it will still be randomly right or wrong, not always right (or more likely to be right, heuristics being what they are). I agree that scheduling for register pressure is the most important thing to consider, but there are times when scheduling on an OoO core can be helpful (coming out of a mispredict, for example, where you need to get instructions through the pipeline as efficiently as possible).

But it appears this might be running into problems at the moment. Disabling enableMachineScheduler() starts to enable some scheduling out of SDAG, and that might be hitting crashes now. (You could say that if it is scheduling out of SDAG for register pressure then we don't need to alter the scheduling later on; it looked like it was trying Hybrid at the moment. But this wouldn't help with GISel.) There is a chance this works differently to specifying the option, can you give it another test?

The best way to test perf might be to check the codesize (at -O3). It is the number of dynamically executed spills that is likely the most important, but codesize can be a proxy for the number of spills. Is it possible to add a test case for the problem you saw where no scheduling was better than with? We have seen a case recently where in-order scheduling was accidentally better than OoO, but I don't believe that the default order worked well there either.
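
As a hedged illustration of the codesize/spill proxy suggested above (this helper is made up for illustration; it is not code from the patch or an existing LLVM utility): one way to approximate the static spill/reload count is to walk the MachineFunction after register allocation and count simple stack-slot accesses via the TargetInstrInfo hooks.

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"

// Hypothetical helper: count static stack-slot loads/stores, a rough
// over-approximation of spill/reload instructions after register allocation.
static unsigned countStaticStackSlotAccesses(const llvm::MachineFunction &MF) {
  const llvm::TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
  unsigned Count = 0;
  for (const llvm::MachineBasicBlock &MBB : MF)
    for (const llvm::MachineInstr &MI : MBB) {
      int FI;
      // Both hooks return the register involved (non-zero) when MI is a
      // simple load from / store to a stack slot.
      if (TII.isStoreToStackSlot(MI, FI) != 0 ||
          TII.isLoadFromStackSlot(MI, FI) != 0)
        ++Count;
    }
  return Count;
}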

@davemgreen davemgreen requested a review from c-rhodes February 20, 2025 09:56
@c-rhodes
Collaborator

But it appears this might be running into problems at the moment. Disabling enableMachineScheduler() starts to enable some scheduling out of SDAG, and that might be hitting crashes now.

Could you clarify what you mean by "might be hitting crashes"?

@sjoerdmeijer
Collaborator Author

but there are times when scheduling on an OoO core can be helpful

But we don't have the evidence for this. At least, not in the workloads that we see. Now, we might find one such case, but would that be representative? The whole hypothesis is that scheduling isn't a win overall and is just eating up compile time (this idea is shared by the GCC folks, fwiw).

Disabling enableMachineScheduler() starts to enable some scheduling out of SDAG, and that might be hitting crashes now.

I also don't know what this crash is.

There is a chance this works differently to specifying the option, can you give it another test?

Sorry, the experiment is unclear to me, what is it exactly?

Is it possible to add a test case for the problem you saw where no scheduling was better than with?

I am afraid it will be difficult to reduce this and come up with something small, but I can give it a try.

@davemgreen
Collaborator

I also don't know what this crash is.

I was running the llvm-test-suite (+SPECs) with, I think, -stdlib=libc++ --rtlib=compiler-rt -fuse-ld=lld -flto -O3 -mcpu=native. It is hitting a crash that looks like the one below; it looks like the scheduling out of SDAG. It was hitting it in enough cases that I assumed it was easy enough to reproduce; let me know if it isn't and I can try to check further.

clang: error: unable to execute command: Segmentation fault
clang: error: linker command failed due to signal (use -v to see invocation)
 #0 0x00000000008e0040 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (ld.lld+0x8e0040)
 #1 0x00000000008ddc5c llvm::sys::RunSignalHandlers() (ld.lld+0x8ddc5c)
 #2 0x00000000008dddb4 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x0000ffffb5bc67f0 (linux-vdso.so.1+0x7f0)
 #4 0x00000000017b34ec GetCostForDef(llvm::ScheduleDAGSDNodes::RegDefIter const&, llvm::TargetLowering const*, llvm::TargetInstrInfo const*, llvm::TargetRegisterInfo const*, unsigned int&, unsigned int&, llvm::MachineFunction const&) ScheduleDAGRRList.cpp:0:0
 #5 0x00000000017b4c74 (anonymous namespace)::RegReductionPQBase::HighRegPressure(llvm::SUnit const*) const ScheduleDAGRRList.cpp:0:0
 #6 0x00000000017b68d8 (anonymous namespace)::hybrid_ls_rr_sort::operator()(llvm::SUnit*, llvm::SUnit*) const (.part.0) ScheduleDAGRRList.cpp:0:0
 #7 0x00000000017b75f4 llvm::SUnit* (anonymous namespace)::popFromQueueImpl<(anonymous namespace)::hybrid_ls_rr_sort>(std::vector<llvm::SUnit*, std::allocator<llvm::SUnit*>>&, (anonymous namespace)::hybrid_ls_rr_sort&) ScheduleDAGRRList.cpp:0:0
 #8 0x00000000017b7684 (anonymous namespace)::RegReductionPriorityQueue<(anonymous namespace)::hybrid_ls_rr_sort>::pop() ScheduleDAGRRList.cpp:0:0
 #9 0x00000000017c0394 (anonymous namespace)::ScheduleDAGRRList::Schedule() ScheduleDAGRRList.cpp:0:0
#10 0x00000000016e17f4 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (ld.lld+0x16e17f4)
#11 0x00000000016e4398 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (ld.lld+0x16e4398)
#12 0x00000000016e582c llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (ld.lld+0x16e582c)
#13 0x00000000016d7308 llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&) (ld.lld+0x16d7308)
#14 0x0000000001c55208 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.0) MachineFunctionPass.cpp:0:0
#15 0x000000000383476c llvm::FPPassManager::runOnFunction(llvm::Function&) (ld.lld+0x383476c)
#16 0x0000000003834a20 llvm::FPPassManager::runOnModule(llvm::Module&) (ld.lld+0x3834a20)
#17 0x00000000038353e8 llvm::legacy::PassManagerImpl::run(llvm::Module&) (ld.lld+0x38353e8)
#18 0x0000000001915800 codegen(llvm::lto::Config const&, llvm::TargetMachine*, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex const&) LTOBackend.cpp:0:0
#19 0x0000000001916d70 llvm::lto::backend(llvm::lto::Config const&, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (ld.lld+0x1916d70)

There is a chance this works differently to specifying the option, can you give it another test?

Sorry, the experiment is unclear to me, what is it exactly?

Sorry, I just meant test it doesn't crash. I was guessing that you might have been testing with -enable-misched=0, but that might have worked a little differently.

@sjoerdmeijer
Collaborator Author

Sorry, I just meant test it doesn't crash. I was guessing that you might have been testing with -enable-misched=0, but that might have worked a little differently.

Ohhh, I see, thanks.
Yes, I have done all my testing with -mllvm -enable-misched=false, then made this change and didn't rerun things with it. So you mean that with this patch you see crashes; that's interesting! Will look.

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Feb 26, 2025

This should fix the crash. There were some interesting things going on: despite the machine scheduler being disabled, there was scheduling for register pressure going on for the SelectionDAG. The reason was that the default scheduling policy was set to Hybrid, which causes this. If the MachineScheduler is disabled, this now changes the default sched policy to Source. This is the behaviour that we want.

There is something to fix for the combination "disabled MIScheduler + Hybrid SelectionDAG scheduling" but I can't trigger that (with options); I do have a fix, but can't add a test.

Anyway, with this fix, I will take these next steps:

  • I need to rerun perf numbers,
  • I will see if I can create a reproducer that isn't too big for the code that starts spilling.
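
For context on the Hybrid vs. Source policies mentioned in the comment above: the SelectionDAG list scheduler chooses its strategy from TargetLowering's scheduling preference, where llvm::Sched::Hybrid is the register-pressure-aware mode (the one in the crash backtrace earlier) and llvm::Sched::Source keeps the source order. A minimal sketch of the kind of switch being described, assuming the disablePreRAScheduler() accessor from the diff; this is an illustration, not necessarily the exact code in the updated patch:

// In the AArch64 TargetLowering constructor (sketch): fall back to
// source-order SelectionDAG scheduling when the pre-RA MachineScheduler is
// disabled, instead of the register-pressure-aware Hybrid list scheduler
// that appears in the backtrace above.
if (Subtarget->disablePreRAScheduler())
  setSchedulingPreference(Sched::Source);
else
  setSchedulingPreference(Sched::Hybrid);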

@sjoerdmeijer
Collaborator Author

I am going to abandon this because it is not going to work. I still haven't seen any evidence that instruction scheduling in itself is beneficial, but it turns out it is the other things that the scheduler is also doing that make the difference:

We found regressions when disabling the scheduler, and the reason I have seen so far is MOV instructions that are no longer eliminated. I am not entirely sure yet about the loads/stores and whether they influence performance, but it is a difference.

One idea, therefore, for some lightweight scheduling is to separate these optimisations out from the rest of the scheduling work.

To address the regression that we are seeing, I will now pursue this direction:

  • a heuristic to bail out of the machine scheduler when the function is extremely large. I forget the exact number, but the function we are looking at has more than 7000 instructions, so the intuition is that for such large functions we eat up compile time with a very high risk of getting it wrong.
  • Better would be to skip scheduling but still perform copy elimination, but that needs some investigation into how easy it is to run these things separately.

@davemgreen: let me know what you think or if you have any objections.

@davemgreen
Collaborator

Hi - another thing that is possibly quite important is scheduling for register pressure. That might be related to the copy elimination you mention; I wasn't sure, it might be different. It's one of those things that could happen to be right from the original order of the instructions (people tend to write code which uses variables close together), but if it is needed then something should be improving the order if it can.

@c-rhodes has been looking at a couple of cases recently where we have found scheduling has not been performing as well as it could. Either the in-order scheduling model or no scheduling was better for specific cases, but when we tried that on a larger selection of benchmarks the results ended up as a wash, I believe, with some things getting better and some worse in about equal proportions. The GCC team recently tried this (they have very old scheduling models too), but ended up going back on it, I believe. It still worries me to remove it entirely, but at the end of the day it is the data that matters, if we can show it reliably.

sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request May 12, 2025
… vector intrinsic codes

Skip the pre-RA MachineScheduler for large hand-written vector intrinsic
code when targeting the Neoverse V2. The motivation to skip the
scheduler is the same as in this abandoned patch:

llvm#127784

But this reimplementation is much more focused and fine-grained and
based on the following heuristic:
- only skip the pre-RA machine scheduler for large (hand-written) vector
  intrinsic code,
- do this only for the Neoverse V2 (a wide micro-architecture).

The intuition of this patch is that:
- scheduling based on instruction latency isn't useful for a very wide
  micro-architecture (which is why GCC also partly stopped doing this),
- however, the machine scheduler also performs some optimisations:
  i) load/store clustering, and ii) copy elimination. These are useful
  optimisations, and that's why disabling the machine scheduler in
  general isn't a good idea, i.e. it results in some regressions.
- but the function where the machine scheduler and register allocator
  are not working well together is large, hand-written vector code.
  Thus, one could argue that scheduling this kind of code goes against
  the programmer's intent, so let's not do that, which avoids
  complications later down the optimisation pipeline.

The heuristic tries to recognise large hand-written intrinsic code by
calculating the percentage of vector instructions in a function and
skips the machine scheduler if certain threshold values are exceeded.
I.e., if a function is more than 70% vector code, contains more than
2800 IR instructions, and uses more than 425 intrinsics, don't schedule
this function.

This obviously is a heuristic, but is hopefully narrow enough to not
cause regressions (I haven't found any). The alternative is to look
into regalloc, which is where the problems occur with the placement
of spill/reload code. However, there will be heuristics involved there
too, and so this seems like a valid heuristic and looking into
regalloc is an orthogonal exercise.
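
A rough IR-level sketch of what such a threshold check could look like (the function name and the exact definition of "vector code" here are illustrative; only the thresholds are taken from the message above):

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"

// Illustrative only: decide whether a function looks like large hand-written
// vector intrinsic code, per the heuristic described above.
static bool looksLikeLargeHandWrittenVectorCode(const llvm::Function &F) {
  unsigned NumInsts = 0, NumVecInsts = 0, NumIntrinsics = 0;
  for (const llvm::BasicBlock &BB : F)
    for (const llvm::Instruction &I : BB) {
      ++NumInsts;
      if (llvm::isa<llvm::IntrinsicInst>(I))
        ++NumIntrinsics;
      if (I.getType()->isVectorTy() ||
          llvm::any_of(I.operands(), [](const llvm::Use &U) {
            return U->getType()->isVectorTy();
          }))
        ++NumVecInsts;
    }
  // Thresholds quoted above: >70% vector code, >2800 IR instructions,
  // >425 intrinsics.
  return NumInsts > 2800 && NumIntrinsics > 425 &&
         NumVecInsts * 100 > NumInsts * 70;
}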
sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request May 21, 2025
This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core
tuning description. This gives us a 20% improvement on a key workload,
some other minor improvements here and there, and no real regressions;
nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI
scheduler entirely (llvm#127784), and llvm#139557 was about a heuristic
to not schedule hand-written vector code. This solution is preferred
because it avoids another heuristic and achieves what we want, and for
what it's worth, there is a lot of precedent for setting this feature.

Thanks to:
- Ricardo Jesus for pointing out this subtarget feature, and
- Cameron McInally for the extensive performance testing.
sjoerdmeijer added a commit that referenced this pull request Jun 6, 2025
rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025
DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025