Skip to content

[MachineLICM] Hoist COPY instruction only when user can be hoisted #81735

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 27, 2024

Conversation

michaelselehov
Copy link
Contributor

befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only if either:

  • User of COPY can be hoisted (other args are invariant)
    or
  • Hoisting COPY doesn't bring high register pressure

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

Copy link

github-actions bot commented Feb 14, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@michaelselehov michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from 65f08ed to d39fd61 Compare February 14, 2024 14:01
@alex-t alex-t requested a review from Asher8118 February 14, 2024 14:03
@Asher8118 Asher8118 requested a review from davemgreen February 15, 2024 13:15
Copy link

@Asher8118 Asher8118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense to me. @davemgreen might want to take a look as well.

@davemgreen
Copy link
Collaborator

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

@michaelselehov
Copy link
Contributor Author

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

I have an IR derived from the benchmark which is pretty big. I reduced it down to the point when only one spill reload still appears but yet it's 685 lines big. I'll see if I can make anything prettier.

@llvmbot
Copy link
Member

llvmbot commented Feb 23, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: None (michaelselehov)

Changes

befa925 introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only if either:

  • User of COPY can be hoisted (other args are invariant)
    or
  • Hoisting COPY doesn't bring high register pressure

Full diff: https://github.com/llvm/llvm-project/pull/81735.diff

4 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineLoopInfo.h (+4-1)
  • (modified) llvm/lib/CodeGen/MachineLICM.cpp (+15-4)
  • (modified) llvm/lib/CodeGen/MachineLoopInfo.cpp (+5-1)
  • (added) llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll (+94)
diff --git a/llvm/include/llvm/CodeGen/MachineLoopInfo.h b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
index ae075bee1daf76..445c9b1c3bc00f 100644
--- a/llvm/include/llvm/CodeGen/MachineLoopInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineLoopInfo.h
@@ -79,7 +79,10 @@ class MachineLoop : public LoopBase<MachineBasicBlock, MachineLoop> {
   /// I.e., all virtual register operands are defined outside of the loop,
   /// physical registers aren't accessed explicitly, and there are no side
   /// effects that aren't captured by the operands or other flags.
-  bool isLoopInvariant(MachineInstr &I) const;
+  /// ExcludeReg can be used to exclude the given register from the check
+  /// i.e. when we're considering hoisting it's definition but not hoisted it
+  /// yet
+  bool isLoopInvariant(MachineInstr &I, const Register ExcludeReg = 0) const;
 
   void dump() const;
 
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index efc19f8fdbf8c2..a3bfd97e5a90cb 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -1264,13 +1264,24 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
 
   // If we have a COPY with other uses in the loop, hoist to allow the users to
   // also be hoisted.
+  Register defReg;
   if (MI.isCopy() && MI.getOperand(0).isReg() &&
-      MI.getOperand(0).getReg().isVirtual() && MI.getOperand(1).isReg() &&
-      MI.getOperand(1).getReg().isVirtual() &&
+      (defReg = MI.getOperand(0).getReg()).isVirtual() &&
+      MI.getOperand(1).isReg() && MI.getOperand(1).getReg().isVirtual() &&
       IsLoopInvariantInst(MI, CurLoop) &&
       any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
-             [&CurLoop](MachineInstr &UseMI) {
-               return CurLoop->contains(&UseMI);
+             [&CurLoop, this, defReg, Cost](MachineInstr &UseMI) {
+               if (!CurLoop->contains(&UseMI))
+                 return false;
+
+               // COPY is a cheap instruction, but if moving it won't cause high
+               // RP we're fine to hoist it even if the user can't be hoisted
+               // later Otherwise we want to check the user if it's hoistable
+               if (CanCauseHighRegPressure(Cost, false) &&
+                   !CurLoop->isLoopInvariant(UseMI, defReg))
+                 return false;
+
+               return true;
              }))
     return true;
 
diff --git a/llvm/lib/CodeGen/MachineLoopInfo.cpp b/llvm/lib/CodeGen/MachineLoopInfo.cpp
index bdbc57099aa8d3..1492c8c366fb41 100644
--- a/llvm/lib/CodeGen/MachineLoopInfo.cpp
+++ b/llvm/lib/CodeGen/MachineLoopInfo.cpp
@@ -198,7 +198,8 @@ MDNode *MachineLoop::getLoopID() const {
   return LoopID;
 }
 
-bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
+bool MachineLoop::isLoopInvariant(MachineInstr &I,
+                                  const Register ExcludeReg) const {
   MachineFunction *MF = I.getParent()->getParent();
   MachineRegisterInfo *MRI = &MF->getRegInfo();
   const TargetSubtargetInfo &ST = MF->getSubtarget();
@@ -213,6 +214,9 @@ bool MachineLoop::isLoopInvariant(MachineInstr &I) const {
     Register Reg = MO.getReg();
     if (Reg == 0) continue;
 
+    if (ExcludeReg == Reg)
+      continue;
+
     // An instruction that uses or defines a physical register can't e.g. be
     // hoisted, so mark this as not invariant.
     if (Reg.isPhysical()) {
diff --git a/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
new file mode 100644
index 00000000000000..55904f44082ec9
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/copy-hoist-no-spills.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -vectorize-loops -vectorize-slp -amdgpu-early-inline-all=true -amdgpu-function-calls=false < %s | FileCheck %s
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9-p9:192:256:256:32"
+target triple = "amdgcn-amd-amdhsa"
+
+define amdgpu_kernel void @_ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_(ptr %.sroa.1.0.copyload, ptr %0, ptr %1, ptr %2, ptr %3, ptr %4, ptr %5, ptr %6, ptr %7, ptr %8, ptr %9, ptr %10, ptr %11, ptr %12, ptr %13, ptr %14, ptr %15, ptr %16, ptr %17, ptr %18, ptr %19, ptr %20, ptr %21, ptr %22, ptr %23, ptr %24, ptr %25, ptr %26, ptr %27, ptr %28, ptr %29, ptr %30, ptr %31, ptr %32, ptr %33, double %34, double %35, double %36, float %37, float %38, float %39, float %40, ptr %41) {
+; CHECK-LABEL: _ZN6Kokkos4ImplL32hip_parallel_launch_local_memoryINS0_11ParallelForIN4Test7HexGradINS_3HIPEdfEENS_11RangePolicyIJS5_EEES5_EELj1024ELj1EEEvT_:
+; CHECK-LABEL: .LBB0_1
+; CHECK-NOT: buffer_load_dword {{v[0-9]+}}
+
+.lr.ph:
+  %.pre = load double, ptr null, align 8
+  br label %42
+
+42:                                               ; preds = %42, %.lr.ph
+  %.0.i4402 = phi i32 [ 1, %.lr.ph ], [ 0, %42 ]
+  %43 = zext i32 %.0.i4402 to i64
+  %44 = load double, ptr %2, align 8
+  %45 = load double, ptr %4, align 8
+  %46 = load double, ptr %7, align 8
+  %47 = load double, ptr %13, align 8
+  %48 = load double, ptr %15, align 8
+  %49 = load double, ptr %17, align 8
+  %50 = load double, ptr %19, align 8
+  %51 = load double, ptr %18, align 8
+  %52 = load double, ptr %27, align 8
+  %53 = load double, ptr %23, align 8
+  %54 = load double, ptr %31, align 8
+  %55 = load double, ptr %33, align 8
+  %56 = load double, ptr %25, align 8
+  %57 = load double, ptr %16, align 8
+  %58 = fpext float %40 to double
+  %59 = fmul double %52, %58
+  %60 = fadd double %59, %51
+  %61 = fsub double %60, %48
+  %62 = fmul double 0.000000e+00, %36
+  %63 = fsub double %61, %62
+  %64 = fadd double %49, %63
+  %65 = fptrunc double %64 to float
+  %66 = fsub double 0.000000e+00, %34
+  %67 = fpext float %39 to double
+  %68 = fmul double %53, %67
+  %69 = fsub double %66, %68
+  %70 = fadd double %50, %69
+  %71 = fptrunc double %70 to float
+  store float 0.000000e+00, ptr %30, align 4
+  store float 0.000000e+00, ptr %26, align 4
+  %72 = getelementptr float, ptr %41, i64 %43
+  store float %38, ptr %72, align 4
+  store float %65, ptr %29, align 4
+  store float %71, ptr %14, align 4
+  store float %39, ptr %3, align 4
+  store float %39, ptr %11, align 4
+  %73 = fsub double %46, %44
+  %74 = fptrunc double %73 to float
+  %75 = fsub double %47, %45
+  %76 = fptrunc double %75 to float
+  %77 = fadd float %74, %76
+  %78 = fpext float %37 to double
+  %79 = fmul contract double %56, 0.000000e+00
+  %80 = fsub contract double %34, %79
+  %81 = fpext float %77 to double
+  %82 = fmul double %.pre, %81
+  %83 = fsub double %80, %82
+  %84 = fpext float %38 to double
+  %85 = fmul double %57, %84
+  %86 = fsub double %83, %85
+  %87 = fptrunc double %86 to float
+  %88 = fmul double %34, 0.000000e+00
+  %89 = fmul double %54, %78
+  %90 = fadd double %89, %88
+  %91 = fsub double %90, %55
+  %92 = fmul double 0.000000e+00, %35
+  %93 = fsub double %91, %92
+  %94 = fmul double %34, %34
+  %95 = fadd double %93, %94
+  %96 = fptrunc double %95 to float
+  store float %87, ptr %1, align 4
+  store float %37, ptr %21, align 4
+  store float %96, ptr %0, align 4
+  store float 0.000000e+00, ptr %9, align 4
+  store float 0.000000e+00, ptr %32, align 4
+  store float 0.000000e+00, ptr %20, align 4
+  store float 0.000000e+00, ptr %22, align 4
+  store float 0.000000e+00, ptr %5, align 4
+  store float 0.000000e+00, ptr %28, align 4
+  store float 0.000000e+00, ptr %12, align 4
+  store float 0.000000e+00, ptr %6, align 4
+  store float 0.000000e+00, ptr %8, align 4
+  store float 0.000000e+00, ptr %.sroa.1.0.copyload, align 4
+  store float %37, ptr %10, align 4
+  store float 0.000000e+00, ptr %24, align 4
+  br label %42
+}

@michaelselehov
Copy link
Contributor Author

Do you have a test demonstrating the problem you are running into? We usually prefer that code is accompanied by a test case to make sure we don't accidentally break it again in the future.

I managed to reduce the test as much as I could. It's still 80+ lines of IR but the check is simple: there should be no spill reload in the loop.

@@ -0,0 +1,94 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check lines do not look auto-generated, but they might be a bit fragile as-is. Would it be possible to auto-generate them? Or is that too large?
An explanation of what is going wrong in the test (some-such thing should not be hoisted and spilled?) might be useful in the long-run too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with autogenerated test, hence the artefact in the NOTE.
But I liked my way of checking much more than the autogeneration. The latter is going to fail from the every little sneeze in the compiler that changes sequence of instructions of register names. Instead my way of checking checks exactly what I want: spill reload in the loop. And when the test fails the error is very clear to read.

Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The code looks OK to me

@michaelselehov michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from 3071728 to 47cb073 Compare February 26, 2024 13:03
@michaelselehov
Copy link
Contributor Author

@Rin18 @davemgreen if you're fine with the change, would you please do me a favor and merge it? Thanks.

@michaelselehov
Copy link
Contributor Author

There seems to be a weird error with windows/flang test that must not be related to my PR. Waiting for a fix.

befa925 introduced preliminary hoisting of COPY
instructions when the user of the COPY is inside the same loop.
That optimization appeared to be too aggressive and hoisted too many
COPY's greately increasing register pressure causing performance
regressions for AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only
if either:
 - User of COPY can be hoisted (other args are invariant)
or
 - Hoisting COPY doesn't bring high register pressure
@michaelselehov michaelselehov force-pushed the mselehov/fix-copy-hoisting branch from cdc5e34 to 5790f23 Compare February 27, 2024 10:21
@davemgreen
Copy link
Collaborator

The windows tests look like they should be unrelated. Lets get this in - thanks.

@davemgreen davemgreen merged commit 56ad6d1 into llvm:main Feb 27, 2024
Copy link

@michaelselehov Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may recieve a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@michaelselehov
Copy link
Contributor Author

The windows tests look like they should be unrelated. Lets get this in - thanks.

Thank you for merging!

@michaelselehov michaelselehov deleted the mselehov/fix-copy-hoisting branch February 27, 2024 12:34
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 22, 2024
…lvm#81735)

befa925 introduced preliminary hoisting of COPY
instructions when the user of the COPY is inside the same loop. That
optimization appeared to be too aggressive and hoisted too many COPY's
greatly increasing register pressure causing performance regressions for
AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only
if either:
 - User of COPY can be hoisted (other args are invariant)
 or
 - Hoisting COPY doesn't bring high register pressure

Change-Id: Ie60752951af599f8f3d6cf90c093e2307736d1ae
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants