[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735

michaelselehov · 2024-02-14T13:53:53Z

befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only if either:

User of COPY can be hoisted (other args are invariant)
or
Hoisting COPY doesn't bring high register pressure

github-actions · 2024-02-14T13:54:13Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

github-actions · 2024-02-14T13:56:31Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Asher8118

The changes make sense to me. @davemgreen might want to take a look as well.

davemgreen · 2024-02-15T13:25:48Z

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

michaelselehov · 2024-02-19T12:28:03Z

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

I have an IR derived from the benchmark which is pretty big. I reduced it down to the point when only one spill reload still appears but yet it's 685 lines big. I'll see if I can make anything prettier.

llvmbot · 2024-02-23T13:29:45Z

@llvm/pr-subscribers-backend-amdgpu

Author: None (michaelselehov)

Changes

befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only if either:

User of COPY can be hoisted (other args are invariant)
or
Hoisting COPY doesn't bring high register pressure

Full diff: https://github.com/llvm/llvm-project/pull/81735.diff

4 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachineLoopInfo.h (+4-1)
(modified) llvm/lib/CodeGen/MachineLICM.cpp (+15-4)
(modified) llvm/lib/CodeGen/MachineLoopInfo.cpp (+5-1)
(added) llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll (+94)

diff --git a/llvm/include/llvm/CodeGen/MachineLoopInfo.h b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
index ae075bee1daf76..445c9b1c3bc00f 100644
--- a/llvm/include/llvm/CodeGen/MachineLoopInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
@@ -79,7 +79,10 @@ class MachineLoop : public LoopBase<MachineBasicBlock, MachineLoop> {
   /// I.e., all virtual register operands are defined outside of the loop,
   /// physical registers aren't accessed explicitly, and there are no side
   /// effects that aren't captured by the operands or other flags.
-  bool isLoopInvariant(MachineInstr &I) const;
+  /// ExcludeReg can be used to exclude the given register from the check
+  /// i.e. when we're considering hoisting it's definition but not hoisted it
+  /// yet
+  bool isLoopInvariant(MachineInstr &I, const Register ExcludeReg = 0) const;
 
   void dump() const;
 
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index efc19f8fdbf8c2..a3bfd97e5a90cb 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -1264,13 +1264,24 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
 
   // If we have a COPY with other uses in the loop, hoist to allow the users to
   // also be hoisted.
+  Register defReg;
   if (MI.isCopy() && MI.getOperand(0).isReg() &&
-      MI.getOperand(0).getReg().isVirtual() && MI.getOperand(1).isReg() &&
-      MI.getOperand(1).getReg().isVirtual() &&
+      (defReg = MI.getOperand(0).getReg()).isVirtual() &&
+      MI.getOperand(1).isReg() && MI.getOperand(1).getReg().isVirtual() &&
       IsLoopInvariantInst(MI, CurLoop) &&
       any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
-             [&CurLoop](MachineInstr &UseMI) {
-               return CurLoop->contains(&UseMI);
+             [&CurLoop, this, defReg, Cost](MachineInstr &UseMI) {
+               if (!CurLoop->contains(&UseMI))
+                 return false;
+
+               // COPY is a cheap instruction, but if moving it won't cause high
+               // RP we're fine to hoist it even if the user can't be hoisted
+               // later Otherwise we want to check the user if it's hoistable
+               if (CanCauseHighRegPressure(Cost, false) &&
+                   !CurLoop->isLoopInvariant(UseMI, defReg))
+                 return false;
+
+               return true;
              }))
     return true;
 
diff --git a/llvm/lib/CodeGen/MachineLoopInfo.cpp b/llvm/lib/CodeGen/MachineLoopInfo.cpp
index bdbc57099aa8d3..1492c8c366fb41 100644
--- a/llvm/lib/CodeGen/MachineLoopInfo.cpp
+++ b/llvm/lib/CodeGen/MachineLoopInfo.cpp
@@ -198,7 +198,8 @@ MDNode *MachineLoop::getLoopID() const {
   return LoopID;
 }
 
-bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
+bool MachineLoop::isLoopInvariant(MachineInstr &I,
+                                  const Register ExcludeReg) const {
   MachineFunction *MF = I.getParent()->getParent();
   MachineRegisterInfo *MRI = &MF->getRegInfo();
   const TargetSubtargetInfo &ST = MF->getSubtarget();
@@ -213,6 +214,9 @@ bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
     Register Reg = MO.getReg();
     if (Reg == 0) continue;
 
+    if (ExcludeReg == Reg)
+      continue;
+
     // An instruction that uses or defines a physical register can't e.g. be
     // hoisted, so mark this as not invariant.
     if (Reg.isPhysical()) {
diff --git a/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
new file mode 100644
index 00000000000000..55904f44082ec9
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -vectorize-loops -vectorize-slp -amdgpu-early-inline-all=true -amdgpu-function-calls=false < %s | FileCheck %s
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9-p9:192:256:256:32"
+target triple = "amdgcn-amd-amdhsa"
+
+define amdgpu_kernel void @_ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_(ptr %.sroa.1.0.copyload, ptr %0, ptr %1, ptr %2, ptr %3, ptr %4, ptr %5, ptr %6, ptr %7, ptr %8, ptr %9, ptr %10, ptr %11, ptr %12, ptr %13, ptr %14, ptr %15, ptr %16, ptr %17, ptr %18, ptr %19, ptr %20, ptr %21, ptr %22, ptr %23, ptr %24, ptr %25, ptr %26, ptr %27, ptr %28, ptr %29, ptr %30, ptr %31, ptr %32, ptr %33, double %34, double %35, double %36, float %37, float %38, float %39, float %40, ptr %41) {
+; CHECK-LABEL: _ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_:
+; CHECK-LABEL: .LBB0_1
+; CHECK-NOT: buffer_load_dword {{v[0-9]+}}
+
+.lr.ph:
+  %.pre = load double, ptr null, align 8
+  br label %42
+
+42:                                               ; preds = %42, %.lr.ph
+  %.0.i4402 = phi i32 [ 1, %.lr.ph ], [ 0, %42 ]
+  %43 = zext i32 %.0.i4402 to i64
+  %44 = load double, ptr %2, align 8
+  %45 = load double, ptr %4, align 8
+  %46 = load double, ptr %7, align 8
+  %47 = load double, ptr %13, align 8
+  %48 = load double, ptr %15, align 8
+  %49 = load double, ptr %17, align 8
+  %50 = load double, ptr %19, align 8
+  %51 = load double, ptr %18, align 8
+  %52 = load double, ptr %27, align 8
+  %53 = load double, ptr %23, align 8
+  %54 = load double, ptr %31, align 8
+  %55 = load double, ptr %33, align 8
+  %56 = load double, ptr %25, align 8
+  %57 = load double, ptr %16, align 8
+  %58 = fpext float %40 to double
+  %59 = fmul double %52, %58
+  %60 = fadd double %59, %51
+  %61 = fsub double %60, %48
+  %62 = fmul double 0.000000e+00, %36
+  %63 = fsub double %61, %62
+  %64 = fadd double %49, %63
+  %65 = fptrunc double %64 to float
+  %66 = fsub double 0.000000e+00, %34
+  %67 = fpext float %39 to double
+  %68 = fmul double %53, %67
+  %69 = fsub double %66, %68
+  %70 = fadd double %50, %69
+  %71 = fptrunc double %70 to float
+  store float 0.000000e+00, ptr %30, align 4
+  store float 0.000000e+00, ptr %26, align 4
+  %72 = getelementptr float, ptr %41, i64 %43
+  store float %38, ptr %72, align 4
+  store float %65, ptr %29, align 4
+  store float %71, ptr %14, align 4
+  store float %39, ptr %3, align 4
+  store float %39, ptr %11, align 4
+  %73 = fsub double %46, %44
+  %74 = fptrunc double %73 to float
+  %75 = fsub double %47, %45
+  %76 = fptrunc double %75 to float
+  %77 = fadd float %74, %76
+  %78 = fpext float %37 to double
+  %79 = fmul contract double %56, 0.000000e+00
+  %80 = fsub contract double %34, %79
+  %81 = fpext float %77 to double
+  %82 = fmul double %.pre, %81
+  %83 = fsub double %80, %82
+  %84 = fpext float %38 to double
+  %85 = fmul double %57, %84
+  %86 = fsub double %83, %85
+  %87 = fptrunc double %86 to float
+  %88 = fmul double %34, 0.000000e+00
+  %89 = fmul double %54, %78
+  %90 = fadd double %89, %88
+  %91 = fsub double %90, %55
+  %92 = fmul double 0.000000e+00, %35
+  %93 = fsub double %91, %92
+  %94 = fmul double %34, %34
+  %95 = fadd double %93, %94
+  %96 = fptrunc double %95 to float
+  store float %87, ptr %1, align 4
+  store float %37, ptr %21, align 4
+  store float %96, ptr %0, align 4
+  store float 0.000000e+00, ptr %9, align 4
+  store float 0.000000e+00, ptr %32, align 4
+  store float 0.000000e+00, ptr %20, align 4
+  store float 0.000000e+00, ptr %22, align 4
+  store float 0.000000e+00, ptr %5, align 4
+  store float 0.000000e+00, ptr %28, align 4
+  store float 0.000000e+00, ptr %12, align 4
+  store float 0.000000e+00, ptr %6, align 4
+  store float 0.000000e+00, ptr %8, align 4
+  store float 0.000000e+00, ptr %.sroa.1.0.copyload, align 4
+  store float %37, ptr %10, align 4
+  store float 0.000000e+00, ptr %24, align 4
+  br label %42
+}

michaelselehov · 2024-02-23T13:31:23Z

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

I managed to reduce the test as much as I could. It's still 80+ lines of IR but the check is simple: there should be no spill reload in the loop.

llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll

llvm/lib/CodeGen/MachineLICM.cpp

davemgreen · 2024-02-25T16:32:16Z

llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll

@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4


The check lines do not look auto-generated, but they might be a bit fragile as-is. Would it be possible to auto-generate them? Or is that too large?
An explanation of what is going wrong in the test (some-such thing should not be hoisted and spilled?) might be useful in the long-run too.

I started with autogenerated test, hence the artefact in the NOTE.
But I liked my way of checking much more than the autogeneration. The latter is going to fail from the every little sneeze in the compiler that changes sequence of instructions of register names. Instead my way of checking checks exactly what I want: spill reload in the loop. And when the test fails the error is very clear to read.

davemgreen

Thanks. The code looks OK to me

michaelselehov · 2024-02-26T13:05:08Z

@Rin18 @davemgreen if you're fine with the change, would you please do me a favor and merge it? Thanks.

llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll

michaelselehov · 2024-02-26T16:50:41Z

There seems to be a weird error with windows/flang test that must not be related to my PR. Waiting for a fix.

befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greately increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either: - User of COPY can be hoisted (other args are invariant) or - Hoisting COPY doesn't bring high register pressure

davemgreen · 2024-02-27T12:31:20Z

The windows tests look like they should be unrelated. Lets get this in - thanks.

github-actions · 2024-02-27T12:31:46Z

@michaelselehov Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may recieve a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

michaelselehov · 2024-02-27T12:34:45Z

The windows tests look like they should be unrelated. Lets get this in - thanks.

Thank you for merging!

…lvm#81735) befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either: - User of COPY can be hoisted (other args are invariant) or - Hoisting COPY doesn't bring high register pressure Change-Id: Ie60752951af599f8f3d6cf90c093e2307736d1ae

michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from 65f08ed to d39fd61 Compare February 14, 2024 14:01

alex-t requested a review from Asher8118 February 14, 2024 14:03

Asher8118 requested a review from davemgreen February 15, 2024 13:15

Asher8118 approved these changes Feb 15, 2024

View reviewed changes

michaelselehov mentioned this pull request Feb 19, 2024

[MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop #71403

Merged

llvmbot added the backend:AMDGPU label Feb 23, 2024

Asher8118 reviewed Feb 23, 2024

View reviewed changes

llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll Outdated Show resolved Hide resolved

davemgreen reviewed Feb 25, 2024

View reviewed changes

davemgreen approved these changes Feb 25, 2024

View reviewed changes

michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from 3071728 to 47cb073 Compare February 26, 2024 13:03

arsenm reviewed Feb 26, 2024

View reviewed changes

llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll Outdated Show resolved Hide resolved

michaelselehov added 5 commits February 27, 2024 11:21

Added test

10dbe5e

Reduced function name in the test

5e99192

Addressed PR comments

67ccbd1

Addressed PR comments

5790f23

michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from cdc5e34 to 5790f23 Compare February 27, 2024 10:21

davemgreen merged commit 56ad6d1 into llvm:main Feb 27, 2024

michaelselehov deleted the mselehov/fix-copy-hoisting branch February 27, 2024 12:34

		@@ -0,0 +1,94 @@
		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4

[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735

[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735

Uh oh!

Conversation

michaelselehov commented Feb 14, 2024

Uh oh!

github-actions bot commented Feb 14, 2024

Uh oh!

github-actions bot commented Feb 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Asher8118 left a comment

Choose a reason for hiding this comment

Uh oh!

davemgreen commented Feb 15, 2024

Uh oh!

michaelselehov commented Feb 19, 2024

Uh oh!

llvmbot commented Feb 23, 2024

Uh oh!

michaelselehov commented Feb 23, 2024

Uh oh!

Uh oh!

Uh oh!

davemgreen Feb 25, 2024

Choose a reason for hiding this comment

Uh oh!

michaelselehov Feb 26, 2024

Choose a reason for hiding this comment

Uh oh!

davemgreen left a comment

Choose a reason for hiding this comment

Uh oh!

michaelselehov commented Feb 26, 2024

Uh oh!

Uh oh!

michaelselehov commented Feb 26, 2024

Uh oh!

davemgreen commented Feb 27, 2024

Uh oh!

github-actions bot commented Feb 27, 2024

Uh oh!

michaelselehov commented Feb 27, 2024

Uh oh!

Uh oh!

github-actions bot commented Feb 14, 2024 •

edited

Loading