-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write If you have received no comments on your PR for a week, you can request a review If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
✅ With the latest revision this PR passed the C/C++ code formatter. |
65f08ed
to
d39fd61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes make sense to me. @davemgreen might want to take a look as well.
Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future. |
I have an IR derived from the benchmark which is pretty big. I reduced it down to the point when only one spill reload still appears but yet it's 685 lines big. I'll see if I can make anything prettier. |
@llvm/pr-subscribers-backend-amdgpu Author: None (michaelselehov) Changesbefa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either:
Full diff: https://github.com/llvm/llvm-project/pull/81735.diff 4 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachineLoopInfo.h b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
index ae075bee1daf76..445c9b1c3bc00f 100644
--- a/llvm/include/llvm/CodeGen/MachineLoopInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
@@ -79,7 +79,10 @@ class MachineLoop : public LoopBase<MachineBasicBlock, MachineLoop> {
/// I.e., all virtual register operands are defined outside of the loop,
/// physical registers aren't accessed explicitly, and there are no side
/// effects that aren't captured by the operands or other flags.
- bool isLoopInvariant(MachineInstr &I) const;
+ /// ExcludeReg can be used to exclude the given register from the check
+ /// i.e. when we're considering hoisting it's definition but not hoisted it
+ /// yet
+ bool isLoopInvariant(MachineInstr &I, const Register ExcludeReg = 0) const;
void dump() const;
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index efc19f8fdbf8c2..a3bfd97e5a90cb 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -1264,13 +1264,24 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
// If we have a COPY with other uses in the loop, hoist to allow the users to
// also be hoisted.
+ Register defReg;
if (MI.isCopy() && MI.getOperand(0).isReg() &&
- MI.getOperand(0).getReg().isVirtual() && MI.getOperand(1).isReg() &&
- MI.getOperand(1).getReg().isVirtual() &&
+ (defReg = MI.getOperand(0).getReg()).isVirtual() &&
+ MI.getOperand(1).isReg() && MI.getOperand(1).getReg().isVirtual() &&
IsLoopInvariantInst(MI, CurLoop) &&
any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
- [&CurLoop](MachineInstr &UseMI) {
- return CurLoop->contains(&UseMI);
+ [&CurLoop, this, defReg, Cost](MachineInstr &UseMI) {
+ if (!CurLoop->contains(&UseMI))
+ return false;
+
+ // COPY is a cheap instruction, but if moving it won't cause high
+ // RP we're fine to hoist it even if the user can't be hoisted
+ // later Otherwise we want to check the user if it's hoistable
+ if (CanCauseHighRegPressure(Cost, false) &&
+ !CurLoop->isLoopInvariant(UseMI, defReg))
+ return false;
+
+ return true;
}))
return true;
diff --git a/llvm/lib/CodeGen/MachineLoopInfo.cpp b/llvm/lib/CodeGen/MachineLoopInfo.cpp
index bdbc57099aa8d3..1492c8c366fb41 100644
--- a/llvm/lib/CodeGen/MachineLoopInfo.cpp
+++ b/llvm/lib/CodeGen/MachineLoopInfo.cpp
@@ -198,7 +198,8 @@ MDNode *MachineLoop::getLoopID() const {
return LoopID;
}
-bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
+bool MachineLoop::isLoopInvariant(MachineInstr &I,
+ const Register ExcludeReg) const {
MachineFunction *MF = I.getParent()->getParent();
MachineRegisterInfo *MRI = &MF->getRegInfo();
const TargetSubtargetInfo &ST = MF->getSubtarget();
@@ -213,6 +214,9 @@ bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
Register Reg = MO.getReg();
if (Reg == 0) continue;
+ if (ExcludeReg == Reg)
+ continue;
+
// An instruction that uses or defines a physical register can't e.g. be
// hoisted, so mark this as not invariant.
if (Reg.isPhysical()) {
diff --git a/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
new file mode 100644
index 00000000000000..55904f44082ec9
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -vectorize-loops -vectorize-slp -amdgpu-early-inline-all=true -amdgpu-function-calls=false < %s | FileCheck %s
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9-p9:192:256:256:32"
+target triple = "amdgcn-amd-amdhsa"
+
+define amdgpu_kernel void @_ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_(ptr %.sroa.1.0.copyload, ptr %0, ptr %1, ptr %2, ptr %3, ptr %4, ptr %5, ptr %6, ptr %7, ptr %8, ptr %9, ptr %10, ptr %11, ptr %12, ptr %13, ptr %14, ptr %15, ptr %16, ptr %17, ptr %18, ptr %19, ptr %20, ptr %21, ptr %22, ptr %23, ptr %24, ptr %25, ptr %26, ptr %27, ptr %28, ptr %29, ptr %30, ptr %31, ptr %32, ptr %33, double %34, double %35, double %36, float %37, float %38, float %39, float %40, ptr %41) {
+; CHECK-LABEL: _ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_:
+; CHECK-LABEL: .LBB0_1
+; CHECK-NOT: buffer_load_dword {{v[0-9]+}}
+
+.lr.ph:
+ %.pre = load double, ptr null, align 8
+ br label %42
+
+42: ; preds = %42, %.lr.ph
+ %.0.i4402 = phi i32 [ 1, %.lr.ph ], [ 0, %42 ]
+ %43 = zext i32 %.0.i4402 to i64
+ %44 = load double, ptr %2, align 8
+ %45 = load double, ptr %4, align 8
+ %46 = load double, ptr %7, align 8
+ %47 = load double, ptr %13, align 8
+ %48 = load double, ptr %15, align 8
+ %49 = load double, ptr %17, align 8
+ %50 = load double, ptr %19, align 8
+ %51 = load double, ptr %18, align 8
+ %52 = load double, ptr %27, align 8
+ %53 = load double, ptr %23, align 8
+ %54 = load double, ptr %31, align 8
+ %55 = load double, ptr %33, align 8
+ %56 = load double, ptr %25, align 8
+ %57 = load double, ptr %16, align 8
+ %58 = fpext float %40 to double
+ %59 = fmul double %52, %58
+ %60 = fadd double %59, %51
+ %61 = fsub double %60, %48
+ %62 = fmul double 0.000000e+00, %36
+ %63 = fsub double %61, %62
+ %64 = fadd double %49, %63
+ %65 = fptrunc double %64 to float
+ %66 = fsub double 0.000000e+00, %34
+ %67 = fpext float %39 to double
+ %68 = fmul double %53, %67
+ %69 = fsub double %66, %68
+ %70 = fadd double %50, %69
+ %71 = fptrunc double %70 to float
+ store float 0.000000e+00, ptr %30, align 4
+ store float 0.000000e+00, ptr %26, align 4
+ %72 = getelementptr float, ptr %41, i64 %43
+ store float %38, ptr %72, align 4
+ store float %65, ptr %29, align 4
+ store float %71, ptr %14, align 4
+ store float %39, ptr %3, align 4
+ store float %39, ptr %11, align 4
+ %73 = fsub double %46, %44
+ %74 = fptrunc double %73 to float
+ %75 = fsub double %47, %45
+ %76 = fptrunc double %75 to float
+ %77 = fadd float %74, %76
+ %78 = fpext float %37 to double
+ %79 = fmul contract double %56, 0.000000e+00
+ %80 = fsub contract double %34, %79
+ %81 = fpext float %77 to double
+ %82 = fmul double %.pre, %81
+ %83 = fsub double %80, %82
+ %84 = fpext float %38 to double
+ %85 = fmul double %57, %84
+ %86 = fsub double %83, %85
+ %87 = fptrunc double %86 to float
+ %88 = fmul double %34, 0.000000e+00
+ %89 = fmul double %54, %78
+ %90 = fadd double %89, %88
+ %91 = fsub double %90, %55
+ %92 = fmul double 0.000000e+00, %35
+ %93 = fsub double %91, %92
+ %94 = fmul double %34, %34
+ %95 = fadd double %93, %94
+ %96 = fptrunc double %95 to float
+ store float %87, ptr %1, align 4
+ store float %37, ptr %21, align 4
+ store float %96, ptr %0, align 4
+ store float 0.000000e+00, ptr %9, align 4
+ store float 0.000000e+00, ptr %32, align 4
+ store float 0.000000e+00, ptr %20, align 4
+ store float 0.000000e+00, ptr %22, align 4
+ store float 0.000000e+00, ptr %5, align 4
+ store float 0.000000e+00, ptr %28, align 4
+ store float 0.000000e+00, ptr %12, align 4
+ store float 0.000000e+00, ptr %6, align 4
+ store float 0.000000e+00, ptr %8, align 4
+ store float 0.000000e+00, ptr %.sroa.1.0.copyload, align 4
+ store float %37, ptr %10, align 4
+ store float 0.000000e+00, ptr %24, align 4
+ br label %42
+}
|
I managed to reduce the test as much as I could. It's still 80+ lines of IR but the check is simple: there should be no spill reload in the loop. |
@@ -0,0 +1,94 @@ | |||
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check lines do not look auto-generated, but they might be a bit fragile as-is. Would it be possible to auto-generate them? Or is that too large?
An explanation of what is going wrong in the test (some-such thing should not be hoisted and spilled?) might be useful in the long-run too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with autogenerated test, hence the artefact in the NOTE.
But I liked my way of checking much more than the autogeneration. The latter is going to fail from the every little sneeze in the compiler that changes sequence of instructions of register names. Instead my way of checking checks exactly what I want: spill reload in the loop. And when the test fails the error is very clear to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The code looks OK to me
3071728
to
47cb073
Compare
@Rin18 @davemgreen if you're fine with the change, would you please do me a favor and merge it? Thanks. |
There seems to be a weird error with windows/flang test that must not be related to my PR. Waiting for a fix. |
befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greately increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either: - User of COPY can be hoisted (other args are invariant) or - Hoisting COPY doesn't bring high register pressure
cdc5e34
to
5790f23
Compare
The windows tests look like they should be unrelated. Lets get this in - thanks. |
@michaelselehov Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested Please check whether problems have been caused by your change specifically, as How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. If you don't get any reports, no action is required from you. Your changes are working as expected, well done! |
Thank you for merging! |
…lvm#81735) befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either: - User of COPY can be hoisted (other args are invariant) or - Hoisting COPY doesn't bring high register pressure Change-Id: Ie60752951af599f8f3d6cf90c093e2307736d1ae
befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target.
This is intended to fix the regression by hoisting COPY instruction only if either:
or