[AArch64] Disable Pre-RA Scheduler for Neoverse V2 #127784
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

We would like to disable the pre-RA machine scheduler for the Neoverse V2 because we have a key workload that massively benefits from this (25% uplift). Despite the machine scheduler being register pressure aware, it results in spills for this workload. Disabling the scheduler seems a lot more attractive than trying to tweak regalloc heuristics:
- We see no benefit of scheduling anyway on this big core, and have never seen this. I.e., when we added the V2 scheduling model, this wasn't for perf reasons, only to enable LLVM-MCA.
- Scheduling can consume significant compile time without resulting in any perf gains. This is a bad deal.

FWIW: the GCC folks realised the same not that long ago, and did exactly the same, see also: https://gcc.gnu.org/pipermail/gcc-patches/2024-October/667074.html

I guess other big cores could benefit from this too, but I would like to leave that decision to folks with more experience on those cores, so that's why I propose to change this for the V2 here only.

Numbers:
- We know the Eigen library is somewhat sensitive to scheduling, but I found one kernel that regresses by ~2% and another that improves by ~2%. They cancel each other out, and overall the result is neutral.
- SPEC FP and INT seem totally unaffected.
- LLVM test-suite: a little bit up and down, all within noise levels I think, so neutral.
- Compile-time numbers: I see a geomean 3% improvement for the LLVM test-suite, and a very decent one for the sqlite amalgamation version.

I haven't looked at the post-RA scheduling, maybe that's interesting as a follow up.

Patch is 31.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127784.diff

6 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Features.td b/llvm/lib/Target/AArch64/AArch64Features.td
index 357f526d5e308..13c1a386149b7 100644
--- a/llvm/lib/Target/AArch64/AArch64Features.td
+++ b/llvm/lib/Target/AArch64/AArch64Features.td
@@ -669,6 +669,9 @@ def FeatureExynosCheapAsMoveHandling : SubtargetFeature<"exynos-cheap-as-move",
"HasExynosCheapAsMoveHandling", "true",
"Use Exynos specific handling of cheap instructions">;
+def FeatureDisablePreRAScheduler : SubtargetFeature<"use-prera-scheduler",
+ "DisablePreRAScheduler", "true", "Disable scheduling before register allocation">;
+
def FeaturePostRAScheduler : SubtargetFeature<"use-postra-scheduler",
"UsePostRAScheduler", "true", "Schedule again after register allocation">;
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index b977b6aaaf619..401a2637fa9f3 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -540,6 +540,7 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
FeatureCmpBccFusion,
FeatureFuseAdrpAdd,
FeatureALULSLFast,
+ FeatureDisablePreRAScheduler,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeatureUseFixedOverScalableIfEqualCost,
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index c6eb77e3bc3ba..9e8c50e376272 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -156,7 +156,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
const LegalizerInfo *getLegalizerInfo() const override;
const RegisterBankInfo *getRegBankInfo() const override;
const Triple &getTargetTriple() const { return TargetTriple; }
- bool enableMachineScheduler() const override { return true; }
+ bool enableMachineScheduler() const override { return !disablePreRAScheduler(); }
bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
bool enableSubRegLiveness() const override { return EnableSubregLiveness; }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
index 02ff12c27fcda..d29fdd23863a6 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
@@ -48,12 +48,12 @@ define i8 @load_atomic_i8_aligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
; SDAG-AVOIDLDAPUR: add x8, x0, #4
; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -64,12 +64,12 @@ define i8 @load_atomic_i8_aligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
; SDAG-AVOIDLDAPUR: add x8, x0, #4
; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -130,12 +130,12 @@ define i16 @load_atomic_i16_aligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #8
; GISEL: ldaprh w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
; SDAG-AVOIDLDAPUR: add x8, x0, #8
; SDAG-AVOIDLDAPUR: ldaprh w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 2
ret i16 %r
@@ -146,12 +146,12 @@ define i16 @load_atomic_i16_aligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #8
; GISEL: ldaprh w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
; SDAG-AVOIDLDAPUR: add x8, x0, #8
; SDAG-AVOIDLDAPUR: ldaprh w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 2
ret i16 %r
@@ -211,12 +211,12 @@ define i32 @load_atomic_i32_aligned_acquire(ptr %ptr) {
; GISEL-LABEL: load_atomic_i32_aligned_acquire:
; GISEL: ldapur w0, [x0, #16]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
; SDAG-AVOIDLDAPUR: add x8, x0, #16
; SDAG-AVOIDLDAPUR: ldapr w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 4
ret i32 %r
@@ -226,12 +226,12 @@ define i32 @load_atomic_i32_aligned_acquire_const(ptr readonly %ptr) {
; GISEL-LABEL: load_atomic_i32_aligned_acquire_const:
; GISEL: ldapur w0, [x0, #16]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
; SDAG-AVOIDLDAPUR: add x8, x0, #16
; SDAG-AVOIDLDAPUR: ldapr w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 4
ret i32 %r
@@ -291,12 +291,12 @@ define i64 @load_atomic_i64_aligned_acquire(ptr %ptr) {
; GISEL-LABEL: load_atomic_i64_aligned_acquire:
; GISEL: ldapur x0, [x0, #32]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
-; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
; SDAG-AVOIDLDAPUR: add x8, x0, #32
; SDAG-AVOIDLDAPUR: ldapr x0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep acquire, align 8
ret i64 %r
@@ -306,12 +306,12 @@ define i64 @load_atomic_i64_aligned_acquire_const(ptr readonly %ptr) {
; GISEL-LABEL: load_atomic_i64_aligned_acquire_const:
; GISEL: ldapur x0, [x0, #32]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
; SDAG-AVOIDLDAPUR: add x8, x0, #32
; SDAG-AVOIDLDAPUR: ldapr x0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep acquire, align 8
ret i64 %r
@@ -440,12 +440,12 @@ define i8 @load_atomic_i8_unaligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
-; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
; SDAG-AVOIDLDAPUR: add x8, x0, #4
; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -456,12 +456,12 @@ define i8 @load_atomic_i8_unaligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
-; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
-;
; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
; SDAG-AVOIDLDAPUR: add x8, x0, #4
; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -490,9 +490,9 @@ define i16 @load_atomic_i16_unaligned_unordered(ptr %ptr) {
; GISEL: add x1, x8, #4
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_unordered:
-; SDAG: add x1, x0, #4
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #4
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep unordered, align 1
ret i16 %r
@@ -503,9 +503,9 @@ define i16 @load_atomic_i16_unaligned_unordered_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #4
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_unordered_const:
-; SDAG: add x1, x0, #4
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #4
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep unordered, align 1
ret i16 %r
@@ -516,9 +516,9 @@ define i16 @load_atomic_i16_unaligned_monotonic(ptr %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_monotonic:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep monotonic, align 1
ret i16 %r
@@ -529,9 +529,9 @@ define i16 @load_atomic_i16_unaligned_monotonic_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_monotonic_const:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep monotonic, align 1
ret i16 %r
@@ -542,9 +542,9 @@ define i16 @load_atomic_i16_unaligned_acquire(ptr %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_acquire:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 1
ret i16 %r
@@ -555,9 +555,9 @@ define i16 @load_atomic_i16_unaligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_acquire_const:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 1
ret i16 %r
@@ -568,9 +568,9 @@ define i16 @load_atomic_i16_unaligned_seq_cst(ptr %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_seq_cst:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_seq_cst:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep seq_cst, align 1
ret i16 %r
@@ -581,9 +581,9 @@ define i16 @load_atomic_i16_unaligned_seq_cst_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #8
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i16_unaligned_seq_cst_const:
-; SDAG: add x1, x0, #8
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_unaligned_seq_cst_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #8
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep seq_cst, align 1
ret i16 %r
@@ -594,9 +594,9 @@ define i32 @load_atomic_i32_unaligned_unordered(ptr %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_unordered:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep unordered, align 1
ret i32 %r
@@ -607,9 +607,9 @@ define i32 @load_atomic_i32_unaligned_unordered_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_unordered_const:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep unordered, align 1
ret i32 %r
@@ -620,9 +620,9 @@ define i32 @load_atomic_i32_unaligned_monotonic(ptr %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_monotonic:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep monotonic, align 1
ret i32 %r
@@ -633,9 +633,9 @@ define i32 @load_atomic_i32_unaligned_monotonic_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_monotonic_const:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep monotonic, align 1
ret i32 %r
@@ -646,9 +646,9 @@ define i32 @load_atomic_i32_unaligned_acquire(ptr %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_acquire:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 1
ret i32 %r
@@ -659,9 +659,9 @@ define i32 @load_atomic_i32_unaligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_acquire_const:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 1
ret i32 %r
@@ -672,9 +672,9 @@ define i32 @load_atomic_i32_unaligned_seq_cst(ptr %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_seq_cst:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_seq_cst:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep seq_cst, align 1
ret i32 %r
@@ -685,9 +685,9 @@ define i32 @load_atomic_i32_unaligned_seq_cst_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #16
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i32_unaligned_seq_cst_const:
-; SDAG: add x1, x0, #16
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_unaligned_seq_cst_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #16
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep seq_cst, align 1
ret i32 %r
@@ -698,9 +698,9 @@ define i64 @load_atomic_i64_unaligned_unordered(ptr %ptr) {
; GISEL: add x1, x8, #32
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i64_unaligned_unordered:
-; SDAG: add x1, x0, #32
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_unordered:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep unordered, align 1
ret i64 %r
@@ -711,9 +711,9 @@ define i64 @load_atomic_i64_unaligned_unordered_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #32
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i64_unaligned_unordered_const:
-; SDAG: add x1, x0, #32
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_unordered_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep unordered, align 1
ret i64 %r
@@ -724,9 +724,9 @@ define i64 @load_atomic_i64_unaligned_monotonic(ptr %ptr) {
; GISEL: add x1, x8, #32
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i64_unaligned_monotonic:
-; SDAG: add x1, x0, #32
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_monotonic:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep monotonic, align 1
ret i64 %r
@@ -737,9 +737,9 @@ define i64 @load_atomic_i64_unaligned_monotonic_const(ptr readonly %ptr) {
; GISEL: add x1, x8, #32
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i64_unaligned_monotonic_const:
-; SDAG: add x1, x0, #32
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_monotonic_const:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep monotonic, align 1
ret i64 %r
@@ -750,9 +750,9 @@ define i64 @load_atomic_i64_unaligned_acquire(ptr %ptr) {
; GISEL: add x1, x8, #32
; GISEL: bl __atomic_load
;
-; SDAG-LABEL: load_atomic_i64_unaligned_acquire:
-; SDAG: add x1, x0, #32
-; SDAG: bl __atomic_load
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR: add x1, x0, #32
+; SDAG-NOAVOIDLDAPUR: bl __atomic_load
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load a...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
Ah, now that #127620 was merged, I need to set this for Grace too.
Ignore this; it isn't necessary given the way Grace is defined.
Force-pushed from 3c9df63 to d38b98f
This doesn't feel to me like the right approach - in that if it was randomly right or wrong before, it will still be randomly right or wrong, not always right (or more likely to be right, heuristics being what they are). I agree that scheduling for register pressure is the most important thing to consider, but there are times when scheduling on an OoO core can be helpful (coming out of a mispredict, for example, where you need to get instructions through the pipeline as efficiently as possible).
But it appears this might be running into problems at the moment. Disabling enableMachineScheduler() starts to enable some scheduling out of SDAG, and that might be hitting crashes now. (You could say that if it was scheduling out of SDAG for register pressure then we wouldn't need to alter the scheduling later on; it looked like it was trying Hybrid at the moment. But this wouldn't help with GISel.) There is a chance this works differently from specifying the option, so can you give it another test?
The best way to test perf might be to check the codesize (at -O3). It is the number of dynamically executed spills that is likely the most important, but codesize can be a proxy for the number of spills. Is it possible to add a test case for the problem you saw where no scheduling was better than with it? We have seen a case recently where in-order scheduling was accidentally better than OoO, but I don't believe that the default order worked well there either.
Could you clarify what you mean by "might be hitting crashes"?
But we don't have the evidence for this. At least, not in the workloads that we see. Now, we might find one such case, but is that then representative? The whole hypothesis is that scheduling isn't a win overall and is just eating up compile time (this idea is shared by the GCC folks, FWIW).
I also don't know what this crash is.
Sorry, the experiment is unclear to me, what is it exactly?
I am afraid it will be difficult to reduce and come up with something small, but I can give it a try.
I was running the llvm-test-suite (+ SPECs) with... I think
Sorry, I just meant test that it doesn't crash. I was guessing that you might have been testing with
Ohhh, I see, thanks.
We would like to disable the pre-RA machine scheduler for the Neoverse V2 because we have a key workload that massively benefits from this (25% uplift). Despite the machine scheduler being register pressure aware, it results in spills for this workload. Disabling the scheduler seems a lot more attractive than trying to tweak regalloc heuristics: - We see no benefit of scheduling anyway on this big core, and have never seen this. I.e., when we added the V2 scheduling model, this wasn't for perf reasons, only to enable LLVM-MCA. - Scheduling can consume significant compile-time, not resulting in any perf gains. This is a bad deal. FWIW: the GCC folks realised the same not that long ago, and did exactly the same, see also: https://gcc.gnu.org/pipermail/gcc-patches/2024-October/667074.html I guess other big cores could benefit from this too, but I would like to leave that decision to folks with more experience on those cores, so that's why I propose to change this for the V2 here only. Numbers: * We know the Eigen library is somewhat sensitive to scheduling, but I found one kernel to regress with ~2%, and another to improve with ~2%. They cancel each other out, and overall the result is neutral. * SPEC FP and INT seem totally unaffected. * LLVM test-suite: a little bit up and down, all within noise levels I think, so is neutral. * Compile-time numbers: I see a geomean 3% improvement for the LLVM test-suite, and a very decent one for the sqlite amalgamation version. I haven't looked at the post-RA scheduling, maybe that's interesting as a follow up.
Force-pushed from d38b98f to f641df8
This should fix the crash. There were some interesting things going on: despite the machine scheduler being disabled, there was scheduling for register pressure going on in the SelectionDAG. The reason was that the default scheduling policy was set to Hybrid, which causes this. If the MachineScheduler is disabled, this now changes the default sched policy to Source. This is the behaviour that we want. There is something to fix for the combination "disabled MIScheduler + Hybrid SelectionDAG scheduling", but I can't trigger that (with options); I do have a fix, but can't add a test. Anyway, with this fix, I will take these next steps:
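For context, a minimal sketch of the kind of change described above, written against the generic TargetLowering API; the exact hook and its placement in the patch are assumed here, not copied from it:

```cpp
// Sketch, assuming the SelectionDAG preference is chosen in the AArch64
// lowering setup: fall back from the register-pressure-aware Hybrid list
// scheduler to plain Source order when the MachineScheduler is disabled
// for the subtarget.
if (Subtarget->enableMachineScheduler())
  setSchedulingPreference(Sched::Hybrid);
else
  setSchedulingPreference(Sched::Source);
```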
I am going to abandon this because this is not going to work. I still haven't seen any evidence that instruction scheduling by itself is beneficial, but it turns out it is the other things that the scheduler is also doing that are making a difference:
- copy (MOV) elimination, and
- load/store clustering.

We found regressions when disabling the scheduler, and the reason I have seen so far is MOV instructions that are no longer eliminated. I am not entirely sure yet whether the loads/stores influence performance, but it is a difference. One idea therefore, for some lightweight scheduling, is to separate out these optimisations from the other scheduling business. To address the regressions that we are seeing, I will now pursue this direction: skip the pre-RA scheduler only for large, hand-written vector intrinsic code, and only for the Neoverse V2 (see the follow-up referenced below).
@davemgreen: let me know what you think or if you have any objections.
Hi - Another of the things that is possibly quite important is to schedule for register pressure. That might be related to the copy elimination you mention; I wasn't sure, it might be different. It's one of those things that could happen to be right from the original order of the instructions (people tend to write code which uses variables close together), but if it is needed then something should be improving the order if it can. @c-rhodes has been looking at a couple of cases recently where we have found scheduling has not been performing as well as it could. Either the in-order scheduling model or no scheduling was better for specific cases, but when we tried that on a larger selection of benchmarks the results ended up as a wash I believe, with some things getting better and some worse in about equal proportions. The GCC team recently tried this (they have very old scheduling models too), but ended up going back on it I believe. It still worries me to remove it entirely, but at the end of the day it is the data that matters if we can show it reliably.
… vector intrinsic codes

Skip the pre-RA MachineScheduler for large hand-written vector intrinsic code when targeting the Neoverse V2. The motivation to skip the scheduler is the same as this abandoned patch: llvm#127784. But this reimplementation is much more focused and fine-grained, and based on the following heuristic:
- only skip the pre-RA machine scheduler for large (hand-written) vector intrinsic code,
- do this only for the Neoverse V2 (a wide micro-architecture).

The intuition of this patch is that:
- scheduling based on instruction latency isn't useful for a very wide micro-architecture (which is why GCC also partly stopped doing this),
- however, the machine scheduler also performs some optimisations: i) load/store clustering, and ii) copy elimination. These are useful optimisations, and that's why disabling the machine scheduler in general isn't a good idea, i.e. it results in some regressions,
- but the function where the machine scheduler and register allocator are not working well together is large, hand-written vector code. Thus, one could argue that scheduling this kind of code is against the programmer's intent, so let's not do that, which avoids complications further down in the optimisation pipeline.

The heuristic tries to recognise large hand-written intrinsic code by calculating the percentage of vector instructions versus other instructions in a function, and skips the machine scheduler if certain threshold values are exceeded. I.e., if a function is more than 70% vector code, contains more than 2800 IR instructions and 425 intrinsics, don't schedule this function. This obviously is a heuristic, but it is hopefully narrow enough not to cause regressions (I haven't found any). The alternative is to look into regalloc, which is where the problems occur with the placement of spill/reload code. However, there will be heuristics involved there too, so this seems like a valid heuristic, and looking into regalloc is an orthogonal exercise.
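As a rough illustration of the thresholds mentioned above, here is a standalone sketch of such a per-function check; the function and parameter names are invented for illustration and are not the actual implementation:

```cpp
// Hypothetical helper: returns true if a function looks like large,
// hand-written vector intrinsic code, per the thresholds quoted in the
// commit message (>70% vector instructions, >2800 IR instructions,
// >425 intrinsic calls).
static bool looksLikeLargeHandWrittenVectorCode(unsigned NumInstrs,
                                                unsigned NumVectorInstrs,
                                                unsigned NumIntrinsics) {
  if (NumInstrs == 0)
    return false;
  double VectorFraction = double(NumVectorInstrs) / double(NumInstrs);
  return VectorFraction > 0.70 && NumInstrs > 2800 && NumIntrinsics > 425;
}
```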
This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core tuning description. This gives us a 20% improvement on a key workload, some other minor improvements here and there, and no real regressions; nothing outside the noise levels. Earlier attempts to solve this problem included disabling the MI scheduler entirely (llvm#127784), and llvm#139557 was about a heuristic to not schedule hand-written vector code. This solution is preferred because it avoids another heuristic and achieves what we want, and for what it's worth, there is a lot of precedent for setting this feature. Thanks to:
- Ricardo Jesus for pointing out this subtarget feature, and
- Cameron McInally for the extensive performance testing.
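A minimal sketch of how a tuning flag like this typically takes effect, via the generic MachineSchedPolicy override hook; this is an assumption based on how other subtargets wire up similar features, not a copy of the actual change, and the member name is assumed to be generated from the subtarget feature:

```cpp
// Sketch (assumed member DisableLatencySchedHeuristic, generated from the
// "disable-latency-sched-heuristic" subtarget feature): tell the generic
// MachineScheduler to drop its latency heuristic for this core.
void AArch64Subtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
                                           unsigned NumRegionInstrs) const {
  Policy.DisableLatencyHeuristic = DisableLatencySchedHeuristic;
}
```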