-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[RISCV] Enable bidirectional scheduling and tracking register pressure #115445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RISCV] Enable bidirectional scheduling and tracking register pressure #115445
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-risc-v Author: Pengcheng Wang (wangpc-pp) ChangesThis is based on other targets like PPC/AArch64 and some experiments. Disclaimer: I haven't tested it on many cores, maybe we should make Full diff: https://github.com/llvm/llvm-project/pull/115445.diff 2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index e7db1ededf383b..f43c520422f13d 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -16,6 +16,7 @@
#include "RISCV.h"
#include "RISCVFrameLowering.h"
#include "RISCVTargetMachine.h"
+#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/MacroFusion.h"
#include "llvm/CodeGen/ScheduleDAGMutation.h"
#include "llvm/MC/TargetRegistry.h"
@@ -199,3 +200,25 @@ unsigned RISCVSubtarget::getMinimumJumpTableEntries() const {
? RISCVMinimumJumpTableEntries
: TuneInfo->MinimumJumpTableEntries;
}
+
+void RISCVSubtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
+ unsigned NumRegionInstrs) const {
+ // Do bidirectional scheduling since it provides a more balanced scheduling
+ // leading to better performance. This will increase compile time.
+ Policy.OnlyTopDown = false;
+ Policy.OnlyBottomUp = false;
+
+ // Enabling or Disabling the latency heuristic is a close call: It seems to
+ // help nearly no benchmark on out-of-order architectures, on the other hand
+ // it regresses register pressure on a few benchmarking.
+ // FIXME: This is from AArch64, but we haven't evaluated it on RISC-V.
+ Policy.DisableLatencyHeuristic = true;
+
+ // Spilling is generally expensive on all RISC-V cores, so always enable
+ // register-pressure tracking. This will increase compile time.
+ Policy.ShouldTrackPressure = true;
+
+ // Enabling ShouldTrackLaneMasks when vector instructions are supported.
+ // TODO: Add extensions that need register pairs as well?
+ Policy.ShouldTrackLaneMasks = hasVInstructions();
+}
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index f59a3737ae76f9..f2c0a3d85c998a 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -327,6 +327,9 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
unsigned getTailDupAggressiveThreshold() const {
return TuneInfo->TailDupAggressiveThreshold;
}
+
+ void overrideSchedPolicy(MachineSchedPolicy &Policy,
+ unsigned NumRegionInstrs) const override;
};
} // End llvm namespace
|
074c3c3
to
6017bfc
Compare
This causes more spills for some cases, which may be fixed by #113675. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - This looks very clearly like a good idea - to the point where I am mildly shocked we haven't done this before. I think, in fact, that this probably fixes a bug I was chasing just yesterday. :)
I am not too worried about the noted increase in spills as any perturbation to scheduling is going to cause changes in register allocation. Looking through the test diffs, I see both improvements and regressions. I don't see evidence of systematic regressions.
It'd be nice if we had perf data (and/or stats) from the BP3 (@lukel97 , @mikhailramalho ), but I don't consider that blocking as this is so clearly a good idea from first principles.
Please do wait a day or so in case @topperc , @asb , @lukel97 or others have comments.
// help nearly no benchmark on out-of-order architectures, on the other hand | ||
// it regresses register pressure on a few benchmarking. | ||
// FIXME: This is from AArch64, but we haven't evaluated it on RISC-V. | ||
Policy.DisableLatencyHeuristic = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do on in order cores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenericScheduler picks the next SU by inspecting heuristics (not an exhaustive list) in the following order:
- register pressure
- acyclic critical path in a loop
- load/store cluster
- resource pressure (probably the most important for out-of-order core)
- latency (critical path)
- program order
DisableLatencyHeuristic basically turns off (5), which means that at that point we're relying on program order. I don't think using program order will be more favorable than reducing critical path, especially for in-order cores. Disabling it won't save compile time either, turning on bidirectional scheduler has more impact on compile time I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should enable this for in-oder cores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should enable this for in-oder cores?
Personally I would not disabling it at all because I still don't think program order is better than reducing critical path, regardless of in-order or out-of-order cores.
That said, I'm fine with enabling it only for in-order cores for now and see how it goes.
// help nearly no benchmark on out-of-order architectures, on the other hand | ||
// it regresses register pressure on a few benchmarking. | ||
// FIXME: This is from AArch64, but we haven't evaluated it on RISC-V. | ||
Policy.DisableLatencyHeuristic = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenericScheduler picks the next SU by inspecting heuristics (not an exhaustive list) in the following order:
- register pressure
- acyclic critical path in a loop
- load/store cluster
- resource pressure (probably the most important for out-of-order core)
- latency (critical path)
- program order
DisableLatencyHeuristic basically turns off (5), which means that at that point we're relying on program order. I don't think using program order will be more favorable than reducing critical path, especially for in-order cores. Disabling it won't save compile time either, turning on bidirectional scheduler has more impact on compile time I believe.
Good spot! I just wonder about ShouldTrackLaneMasks and whether that might make sense as a follow-up rather than being enabled at first. It looks like of the in-tree targets, only AMDGPU is enabling it right now. Not a strong view though, so if you've delved into how it's used and put a reasonable amount of code through it I wouldn't object leaving it as-is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
7d36dea
to
fef37f2
Compare
This PR will only enable bidirectional scheduling and tracking register pressure. And I will land this in a few days if there is no further more discusstion. |
Non blocking, can you check that #107532 still spills? I have a feeling we still need to set microOpBufferSize otherwise latency always overrides register pressure |
Yeah, it still spills. I think we can add generic models as what I commented before. :-) |
Agree. Could you update the patch title to reflect this change? |
I've finished a run comparing this PR with tip of tree @ e887f82, and there's some really nice performance gains: 11% on 538.imagick_r and 10% on 508.namd_r! On the Banana Pi F3 with -mcpu=spacemit-x60 -O3 -flto. I'm re-running imagick and namd just to double check, but otherwise this is great! EDIT: I can reproduce the imagick and namd results on 9358905 vs fef37f2 |
I will land this at the end of today (18:00 UTC+8, Friday) if there is no objection. |
This is based on other targets like PPC/AArch64 and some experiments. According to the evaluation, we can see 2-3% performance gain on spacemit-x60. Disclaimer: I haven't tested it on many cores, maybe we should make some options being features. I believe downstreams must have tried this before, so feedbacks are welcome.
fef37f2
to
0f8ed88
Compare
This is based on other targets like PPC/AArch64 and some experiments.
This PR will only enable bidirectional scheduling and tracking register
pressure.
Disclaimer: I haven't tested it on many cores, maybe we should make
some options being features. I believe downstreams must have tried
this before, so feedbacks are welcome.