Skip to content

[RISCV] Enable bidirectional scheduling and tracking register pressure #115445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

wangpc-pp
Copy link
Contributor

@wangpc-pp wangpc-pp commented Nov 8, 2024

This is based on other targets like PPC/AArch64 and some experiments.

This PR will only enable bidirectional scheduling and tracking register
pressure.

Disclaimer: I haven't tested it on many cores, maybe we should make
some options being features. I believe downstreams must have tried
this before, so feedbacks are welcome.

@llvmbot
Copy link
Member

llvmbot commented Nov 8, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-risc-v

Author: Pengcheng Wang (wangpc-pp)

Changes

This is based on other targets like PPC/AArch64 and some experiments.

Disclaimer: I haven't tested it on many cores, maybe we should make
some options being features. I believe downstreams must have tried
this before, so feedbacks are welcome.


Full diff: https://github.com/llvm/llvm-project/pull/115445.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.cpp (+23)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.h (+3)
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index e7db1ededf383b..f43c520422f13d 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -16,6 +16,7 @@
 #include "RISCV.h"
 #include "RISCVFrameLowering.h"
 #include "RISCVTargetMachine.h"
+#include "llvm/CodeGen/MachineScheduler.h"
 #include "llvm/CodeGen/MacroFusion.h"
 #include "llvm/CodeGen/ScheduleDAGMutation.h"
 #include "llvm/MC/TargetRegistry.h"
@@ -199,3 +200,25 @@ unsigned RISCVSubtarget::getMinimumJumpTableEntries() const {
              ? RISCVMinimumJumpTableEntries
              : TuneInfo->MinimumJumpTableEntries;
 }
+
+void RISCVSubtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
+                                         unsigned NumRegionInstrs) const {
+  // Do bidirectional scheduling since it provides a more balanced scheduling
+  // leading to better performance. This will increase compile time.
+  Policy.OnlyTopDown = false;
+  Policy.OnlyBottomUp = false;
+
+  // Enabling or Disabling the latency heuristic is a close call: It seems to
+  // help nearly no benchmark on out-of-order architectures, on the other hand
+  // it regresses register pressure on a few benchmarking.
+  // FIXME: This is from AArch64, but we haven't evaluated it on RISC-V.
+  Policy.DisableLatencyHeuristic = true;
+
+  // Spilling is generally expensive on all RISC-V cores, so always enable
+  // register-pressure tracking. This will increase compile time.
+  Policy.ShouldTrackPressure = true;
+
+  // Enabling ShouldTrackLaneMasks when vector instructions are supported.
+  // TODO: Add extensions that need register pairs as well?
+  Policy.ShouldTrackLaneMasks = hasVInstructions();
+}
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index f59a3737ae76f9..f2c0a3d85c998a 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -327,6 +327,9 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
   unsigned getTailDupAggressiveThreshold() const {
     return TuneInfo->TailDupAggressiveThreshold;
   }
+
+  void overrideSchedPolicy(MachineSchedPolicy &Policy,
+                           unsigned NumRegionInstrs) const override;
 };
 } // End llvm namespace
 

@wangpc-pp
Copy link
Contributor Author

This causes more spills for some cases, which may be fixed by #113675.

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - This looks very clearly like a good idea - to the point where I am mildly shocked we haven't done this before. I think, in fact, that this probably fixes a bug I was chasing just yesterday. :)

I am not too worried about the noted increase in spills as any perturbation to scheduling is going to cause changes in register allocation. Looking through the test diffs, I see both improvements and regressions. I don't see evidence of systematic regressions.

It'd be nice if we had perf data (and/or stats) from the BP3 (@lukel97 , @mikhailramalho ), but I don't consider that blocking as this is so clearly a good idea from first principles.

Please do wait a day or so in case @topperc , @asb , @lukel97 or others have comments.

// help nearly no benchmark on out-of-order architectures, on the other hand
// it regresses register pressure on a few benchmarking.
// FIXME: This is from AArch64, but we haven't evaluated it on RISC-V.
Policy.DisableLatencyHeuristic = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do on in order cores?

Copy link
Member

@mshockwave mshockwave Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericScheduler picks the next SU by inspecting heuristics (not an exhaustive list) in the following order:

  1. register pressure
  2. acyclic critical path in a loop
  3. load/store cluster
  4. resource pressure (probably the most important for out-of-order core)
  5. latency (critical path)
  6. program order

DisableLatencyHeuristic basically turns off (5), which means that at that point we're relying on program order. I don't think using program order will be more favorable than reducing critical path, especially for in-order cores. Disabling it won't save compile time either, turning on bidirectional scheduler has more impact on compile time I believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should enable this for in-oder cores?

Copy link
Member

@mshockwave mshockwave Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should enable this for in-oder cores?

Personally I would not disabling it at all because I still don't think program order is better than reducing critical path, regardless of in-order or out-of-order cores.

That said, I'm fine with enabling it only for in-order cores for now and see how it goes.

// help nearly no benchmark on out-of-order architectures, on the other hand
// it regresses register pressure on a few benchmarking.
// FIXME: This is from AArch64, but we haven't evaluated it on RISC-V.
Policy.DisableLatencyHeuristic = true;
Copy link
Member

@mshockwave mshockwave Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericScheduler picks the next SU by inspecting heuristics (not an exhaustive list) in the following order:

  1. register pressure
  2. acyclic critical path in a loop
  3. load/store cluster
  4. resource pressure (probably the most important for out-of-order core)
  5. latency (critical path)
  6. program order

DisableLatencyHeuristic basically turns off (5), which means that at that point we're relying on program order. I don't think using program order will be more favorable than reducing critical path, especially for in-order cores. Disabling it won't save compile time either, turning on bidirectional scheduler has more impact on compile time I believe.

@asb
Copy link
Contributor

asb commented Nov 11, 2024

Good spot! I just wonder about ShouldTrackLaneMasks and whether that might make sense as a follow-up rather than being enabled at first. It looks like of the in-tree targets, only AMDGPU is enabling it right now. Not a strong view though, so if you've delved into how it's used and put a reasonable amount of code through it I wouldn't object leaving it as-is.

Copy link
Member

@mshockwave mshockwave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wangpc-pp wangpc-pp force-pushed the main-riscv-override-sched-policy branch from 7d36dea to fef37f2 Compare November 12, 2024 08:44
@wangpc-pp
Copy link
Contributor Author

This PR will only enable bidirectional scheduling and tracking register pressure. And ShouldTrackLaneMasks/DisableLatencyHeuristic will be follow-ups (need more inputs).

I will land this in a few days if there is no further more discusstion.

@lukel97
Copy link
Contributor

lukel97 commented Nov 12, 2024

Non blocking, can you check that #107532 still spills? I have a feeling we still need to set microOpBufferSize otherwise latency always overrides register pressure

@wangpc-pp
Copy link
Contributor Author

Non blocking, can you check that #107532 still spills? I have a feeling we still need to set microOpBufferSize otherwise latency always overrides register pressure

Yeah, it still spills. I think we can add generic models as what I commented before. :-)

@mshockwave
Copy link
Member

mshockwave commented Nov 12, 2024

This PR will only enable bidirectional scheduling and tracking register pressure. And ShouldTrackLaneMasks/DisableLatencyHeuristic will be follow-ups (need more inputs).

Agree. Could you update the patch title to reflect this change?

@lukel97
Copy link
Contributor

lukel97 commented Nov 14, 2024

I've finished a run comparing this PR with tip of tree @ e887f82, and there's some really nice performance gains: 11% on 538.imagick_r and 10% on 508.namd_r! On the Banana Pi F3 with -mcpu=spacemit-x60 -O3 -flto.

https://lnt.lukelau.me/db_default/v4/nts/34?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=35&submit=Update

I'm re-running imagick and namd just to double check, but otherwise this is great!

EDIT: I can reproduce the imagick and namd results on 9358905 vs fef37f2

@wangpc-pp
Copy link
Contributor Author

I will land this at the end of today (18:00 UTC+8, Friday) if there is no objection.

@wangpc-pp wangpc-pp changed the title [RISCV] Override default sched policy [RISCV] Enable bidirectional scheduling and tracking register pressure Nov 15, 2024
This is based on other targets like PPC/AArch64 and some experiments.

According to the evaluation, we can see 2-3% performance gain on
spacemit-x60.

Disclaimer: I haven't tested it on many cores, maybe we should make
some options being features. I believe downstreams must have tried
this before, so feedbacks are welcome.
@wangpc-pp wangpc-pp force-pushed the main-riscv-override-sched-policy branch from fef37f2 to 0f8ed88 Compare November 15, 2024 09:34
@wangpc-pp wangpc-pp merged commit 9122c52 into llvm:main Nov 15, 2024
5 of 7 checks passed
@wangpc-pp wangpc-pp deleted the main-riscv-override-sched-policy branch November 15, 2024 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants