[InstCombine] Optimistically allow multiple shufflevector uses in foldOpPhi #114278

MatzeB · 2024-10-30T17:35:33Z

We would like to optimize situations of the form that happen after loop vectorization+SROA:

loop:
    %phi = phi zeroinitializer, %interleaved

    %deinterleave_a = shufflevector %phi, poison ; pick half of the lanes
    %deinterleave_b = shufflevector %phi, posion ; pick remaining lanes

    ... %a = ... %b = ...

    %interleaved = shufflevector %a, %b ; interleave lanes of a+b

where the interleave and de-interleave shuffle operations cancel each other out.
This could be handled by foldOpPhi but does not currently work because it does
not proceed when there are multiple uses of the Phi operation.

This extends foldOpPhi to allow multiple shufflevector uses when they are
shown to simplify for all Phi input values.

llvmbot · 2024-10-30T17:36:08Z

@llvm/pr-subscribers-llvm-transforms

Author: Matthias Braun (MatzeB)

Changes

We would like to optimize situations of the form that happen after loop vectorization+SROA:

loop:
    %phi = phi zeroinitializer, %interleaved

    %deinterleave_a = shufflevector %phi, poison ; pick half of the lanes
    %deinterleave_b = shufflevector %phi, posion ; pick remaining lanes

    ... %a = ... %b = ...

    %interleaved = shufflevector %a, %b ; interleave lanes of a+b

where the interleave and de-interleave shuffle operations cancel each other out.
This could be handled by foldOpPhi but does not currently work because it does
not proceed when there are multiple uses of the Phi operation.

This extends foldOpPhi to allow multiple shufflevector uses when they are
shown to simplify for all Phi input values.

Full diff: https://github.com/llvm/llvm-project/pull/114278.diff

2 Files Affected:

(modified) llvm/lib/Transforms/InstCombine/InstructionCombining.cpp (+40-15)
(added) llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll (+57)

diff --git a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
index 2a54390c0f1882..9b2074e481f54f 100644
--- a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -1773,17 +1773,33 @@ Instruction *InstCombinerImpl::foldOpIntoPhi(Instruction &I, PHINode *PN) {
   if (NumPHIValues == 0)
     return nullptr;
 
-  // We normally only transform phis with a single use.  However, if a PHI has
-  // multiple uses and they are all the same operation, we can fold *all* of the
-  // uses into the PHI.
+  // We normally only transform phis with a single use.
+  bool AllUsesIdentical = false;
+  bool MultipleUses = false;
   if (!PN->hasOneUse()) {
-    // Walk the use list for the instruction, comparing them to I.
+    // Exceptions:
+    //   - All uses are identical.
+    //   - All uses are shufflevector instructions that fully simplify; this
+    //     helps interleave -> phi -> 2x de-interleave+de patterns.
+    if (isa<ShuffleVectorInst>(I)) {
+      MultipleUses = true;
+    }
+    AllUsesIdentical = true;
+    unsigned NumUses = 0;
     for (User *U : PN->users()) {
+      ++NumUses;
       Instruction *UI = cast<Instruction>(U);
-      if (UI != &I && !I.isIdenticalTo(UI))
+      if (UI == &I)
+        continue;
+
+      if (!I.isIdenticalTo(UI))
+        AllUsesIdentical = false;
+      // Only inspect first 4 uses to avoid quadratic complexity.
+      if (!isa<ShuffleVectorInst>(UI) || NumUses > 4)
+        MultipleUses = false;
+      if (!AllUsesIdentical && !MultipleUses)
         return nullptr;
     }
-    // Otherwise, we can replace *all* users with the new PHI we form.
   }
 
   // Check that all operands are phi-translatable.
@@ -1834,6 +1850,11 @@ Instruction *InstCombinerImpl::foldOpIntoPhi(Instruction &I, PHINode *PN) {
       continue;
     }
 
+    // Be conservative in MultipleUses case and do not allow non-simplified
+    // vals.
+    if (MultipleUses)
+      return nullptr;
+
     if (SeenNonSimplifiedInVal)
       return nullptr; // More than one non-simplified value.
     SeenNonSimplifiedInVal = true;
@@ -1895,17 +1916,21 @@ Instruction *InstCombinerImpl::foldOpIntoPhi(Instruction &I, PHINode *PN) {
   for (unsigned i = 0; i != NumPHIValues; ++i)
     NewPN->addIncoming(NewPhiValues[i], PN->getIncomingBlock(i));
 
-  for (User *U : make_early_inc_range(PN->users())) {
-    Instruction *User = cast<Instruction>(U);
-    if (User == &I)
-      continue;
-    replaceInstUsesWith(*User, NewPN);
-    eraseInstFromFunction(*User);
+  if (AllUsesIdentical) {
+    for (User *U : make_early_inc_range(PN->users())) {
+      Instruction *User = cast<Instruction>(U);
+      if (User == &I)
+        continue;
+      replaceInstUsesWith(*User, NewPN);
+      eraseInstFromFunction(*User);
+    }
   }
 
-  replaceAllDbgUsesWith(const_cast<PHINode &>(*PN),
-                        const_cast<PHINode &>(*NewPN),
-                        const_cast<PHINode &>(*PN), DT);
+  if (!MultipleUses || AllUsesIdentical) {
+    replaceAllDbgUsesWith(const_cast<PHINode &>(*PN),
+                          const_cast<PHINode &>(*NewPN),
+                          const_cast<PHINode &>(*PN), DT);
+  }
   return replaceInstUsesWith(I, NewPN);
 }
 
diff --git a/llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll b/llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll
new file mode 100644
index 00000000000000..597eae101e9f2d
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll
@@ -0,0 +1,57 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -S -passes=instcombine | FileCheck %s
+
+define void @f(ptr %p_begin, ptr %p_end, ptr %out) {
+; CHECK-LABEL: define void @f(
+; CHECK-SAME: ptr [[P_BEGIN:%.*]], ptr [[P_END:%.*]], ptr [[OUT:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[ACC:%.*]] = phi <4 x float> [ zeroinitializer, %[[ENTRY]] ], [ [[SUM_LOWS:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[ODDS:%.*]] = phi <4 x float> [ zeroinitializer, %[[ENTRY]] ], [ [[SUM_HIGHS:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_BEGIN]], %[[ENTRY]] ], [ [[P_INC:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[VAL:%.*]] = load <4 x i8>, ptr [[P]], align 4
+; CHECK-NEXT:    [[HIGHS:%.*]] = ashr <4 x i8> [[VAL]], <i8 4, i8 4, i8 4, i8 4>
+; CHECK-NEXT:    [[LOWS:%.*]] = and <4 x i8> [[VAL]], <i8 15, i8 15, i8 15, i8 15>
+; CHECK-NEXT:    [[HIGHS_F:%.*]] = sitofp <4 x i8> [[HIGHS]] to <4 x float>
+; CHECK-NEXT:    [[LOWS_F:%.*]] = uitofp nneg <4 x i8> [[LOWS]] to <4 x float>
+; CHECK-NEXT:    [[SUM_LOWS]] = fadd <4 x float> [[ACC]], [[LOWS_F]]
+; CHECK-NEXT:    [[SUM_HIGHS]] = fadd <4 x float> [[ODDS]], [[HIGHS_F]]
+; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i8, ptr [[P]], i64 4
+; CHECK-NEXT:    [[C:%.*]] = icmp eq ptr [[P_INC]], [[P_END]]
+; CHECK-NEXT:    br i1 [[C]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[INTERLEAVE:%.*]] = shufflevector <4 x float> [[SUM_LOWS]], <4 x float> [[SUM_HIGHS]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+; CHECK-NEXT:    store <8 x float> [[INTERLEAVE]], ptr [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %acc = phi <8 x float> [ zeroinitializer, %entry ], [ %interleave, %loop ]
+  %p = phi ptr [%p_begin, %entry ], [%p_inc, %loop]
+
+  %evens = shufflevector <8 x float> %acc, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  %odds = shufflevector <8 x float> %acc, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+
+  %val = load <4 x i8>, ptr %p, align 4
+  %highs = ashr <4 x i8> %val, <i8 4, i8 4, i8 4, i8 4>
+  %lows = and <4 x i8> %val, <i8 15, i8 15, i8 15, i8 15>
+
+  %highs_f = sitofp <4 x i8> %highs to <4 x float>
+  %lows_f = sitofp <4 x i8> %lows to <4 x float>
+
+  %sum_lows = fadd <4 x float> %evens, %lows_f
+  %sum_highs = fadd <4 x float> %odds, %highs_f
+
+  %interleave = shufflevector <4 x float> %sum_lows, <4 x float> %sum_highs, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+
+  %p_inc = getelementptr inbounds i8, ptr %p, i32 4
+  %c = icmp eq ptr %p_inc, %p_end
+  br i1 %c, label %exit, label %loop
+
+exit:
+  store <8 x float> %interleave, ptr %out, align 4
+  ret void
+}

MatzeB · 2024-10-30T17:38:19Z

This pattern happens when autovectorizing an important inner loop of fbgemm ( https://github.com/pytorch/FBGEMM/blob/119052f4dddc08064ea32d058c05055f1a25ec74/src/EmbeddingSpMDMAutovec.cc#L418 ); a simplified version of this is shown in the new test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll.

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp

dtcxzyw

LGTM. Please wait for additional approval from other reviewers.

nikic

If I'm reading this correctly, then nothing here actually guarantees that all shufflevectors are going to simplify? You just check that all uses are shufflevectors, but then simplify each individually.

If we say that we consider it profitable to replace a shufflevector with a phi in general, then we shouldn't need the check on the uses for shufflevectors at all.

dtcxzyw · 2024-11-08T14:16:10Z

If I'm reading this correctly, then nothing here actually guarantees that all shufflevectors are going to simplify? You just check that all uses are shufflevectors, but then simplify each individually.

See L1851-1855:

     // Be conservative in cases with multiple uses and require all inputs to
     // simplify.

     if (MultipleShuffleVectorUses)
       return nullptr;

nikic · 2024-11-08T14:36:35Z

If I'm reading this correctly, then nothing here actually guarantees that all shufflevectors are going to simplify? You just check that all uses are shufflevectors, but then simplify each individually.

See L1851-1855:
     // Be conservative in cases with multiple uses and require all inputs to
     // simplify.

     if (MultipleShuffleVectorUses)
       return nullptr;

Right, but doesn't think just check that this shufflevector simplifies (for all phi inputs), not that all shufflevectors simplify?

llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp

MatzeB · 2024-11-11T20:27:44Z

Right, but doesn't think just check that this shufflevector simplifies (for all phi inputs), not that all shufflevectors simplify?

new versions checks that all phi users simplify.

MatzeB · 2024-11-11T20:30:50Z

If we say that we consider it profitable to replace a shufflevector with a phi in general, then we shouldn't need the check on the uses for shufflevectors at all.

My assumption is that turning one Phi into multiple Phi instructions is not necessarily an improvement.

I could see this being an improvement on average, but it's hard to measure make statements about average code. Given it's not a clear improvement I am not sure I want to deal with the complaints for the regressions and rather would be more conservative...

nikic · 2024-11-15T14:54:40Z

If we say that we consider it profitable to replace a shufflevector with a phi in general, then we shouldn't need the check on the uses for shufflevectors at all.

My assumption is that turning one Phi into multiple Phi instructions is not necessarily an improvement.

I could see this being an improvement on average, but it's hard to measure make statements about average code. Given it's not a clear improvement I am not sure I want to deal with the complaints for the regressions and rather would be more conservative...

I'd rather start with a simple implementation and then make it more complex if needed, rather than start with a complex implementation from the start "just in case".

If we really want to check all shufflevectors, I think the transform should be further separated from the existing one and simplify all the shufflevectors at once. It makes little sense to check that all of them simplify and compute the replacement values, then discard all of that and replace a single one of them, and then repeat the whole process again for the next shufflevector.

MatzeB · 2024-11-18T19:55:38Z

If we really want to check all shufflevectors, I think the transform should be further separated from the existing one and simplify all the shufflevectors at once. It makes little sense to check that all of them simplify and compute the replacement values, then discard all of that and replace a single one of them, and then repeat the whole process again for the next shufflevector.

@nikic

Would you rather see this logic in a visitPHINode function that would then check all the users and optimize them all at once. I was considering doing that, but wasn't sure if people would apreciate the change in style as the transform rules would no longer be within visitShuffleVector where people may intuitively look for them.
Or are you saying that we should rather just drop all the checks and just do the simple thing and always optimize regardless of amount of uses. In that case, what sort of qualification/testing would convince you that we don't regress on average?

nikic · 2024-11-19T11:16:53Z

I'm saying we should try the latter first and if that fails go to the former.

MatzeB · 2024-11-20T00:15:19Z

Here's the simpler version just optimistically transforming shufflevector(phi(shufflevector())) ignoring number of uses on the phi.

I did llvm-test-suite builds with/without this change and it only affected codegen in 6 benchmarks[1]. I was not able to measure any performance difference above noise level and inspecting the assembly diff I saw a couple shuffles moving out of loops (good), a bunch of churn that I couldn't immediately tell whether it's better/worse, and did not see anything obviously bad.

So worth a try optimistically enabling this.

[1]
External/SPEC/CFP2006/447.dealII/447.dealII.test
External/SPEC/CFP2006/453.povray/453.povray.test
External/SPEC/CINT2006/464.h264ref/464.h264ref.test
MultiSource/Benchmarks/Bullet/bullet.test
MultiSource/Applications/JM/lencod/lencod.test
MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test

nikic

Sorry for being unclear, I do think that you should still keep the !SeenNonSimplifiedInVal check for the multi-use case. Without that check we could end up with both the original phis and extra new shufflevectors (rather than replacing shufflevectors with phis).

MatzeB · 2024-12-06T22:47:19Z

updated to require all phi inputs to simplify if there were multiple uses.

MatzeB · 2024-12-12T00:27:32Z

@nikic now good to go?

nikic

LGTM

MatzeB requested review from LebedevRI, RKSimon, dtcxzyw, WenleiHe, goldsteinn, efriedma-quic and spatel-gh October 30, 2024 17:35

MatzeB requested a review from nikic as a code owner October 30, 2024 17:35

llvmbot added the llvm:transforms label Oct 30, 2024

dtcxzyw changed the title ~~Optimistically allow multiple shufflevector uses in foldOpPhi~~ [InstCombine] Optimistically allow multiple shufflevector uses in foldOpPhi Oct 30, 2024

This was referenced Oct 30, 2024

Task submission dtcxzyw/llvm-opt-benchmark#1312

Open

pre-commit: PR114278 dtcxzyw/llvm-opt-benchmark#1570

Closed

dtcxzyw reviewed Nov 7, 2024

View reviewed changes

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp Outdated Show resolved Hide resolved

dtcxzyw approved these changes Nov 8, 2024

View reviewed changes

nikic reviewed Nov 8, 2024

View reviewed changes

dmpots approved these changes Nov 9, 2024

View reviewed changes

llvm/test/Transforms/InstCombine/vec_shuffle-phi-multiuse.ll Outdated Show resolved Hide resolved

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp Outdated Show resolved Hide resolved

MatzeB force-pushed the shuffle_phi_multi_use branch from 745fac7 to 95dd2c5 Compare November 20, 2024 00:12

llvmbot added the llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes label Nov 20, 2024

Allow multiple uses in foldOpIntoPhi for shufflevector instructions

6f74d52

MatzeB force-pushed the shuffle_phi_multi_use branch from 95dd2c5 to 6f74d52 Compare November 20, 2024 00:25

nikic reviewed Nov 20, 2024

View reviewed changes

MatzeB added 3 commits December 6, 2024 13:39

Require all phi inputs to simplify when there are multiple users

a48da43

Merge remote-tracking branch 'github/main' into shuffle_phi_multi_use

b2032da

adjust test to rebase

8696cf6

nikic approved these changes Dec 12, 2024

View reviewed changes

MatzeB merged commit 7687548 into llvm:main Dec 13, 2024
8 checks passed

MatzeB deleted the shuffle_phi_multi_use branch April 7, 2025 22:39

[InstCombine] Optimistically allow multiple shufflevector uses in foldOpPhi #114278

[InstCombine] Optimistically allow multiple shufflevector uses in foldOpPhi #114278

Uh oh!

Conversation

MatzeB commented Oct 30, 2024

Uh oh!

llvmbot commented Oct 30, 2024

Uh oh!

MatzeB commented Oct 30, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dtcxzyw left a comment

Choose a reason for hiding this comment

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

dtcxzyw commented Nov 8, 2024

Uh oh!

nikic commented Nov 8, 2024

Uh oh!

Uh oh!

Uh oh!

MatzeB commented Nov 11, 2024

Uh oh!

MatzeB commented Nov 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Nov 15, 2024

Uh oh!

MatzeB commented Nov 18, 2024

Uh oh!

nikic commented Nov 19, 2024

Uh oh!

MatzeB commented Nov 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

MatzeB commented Dec 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MatzeB commented Dec 12, 2024

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

MatzeB commented Nov 11, 2024 •

edited

Loading

MatzeB commented Nov 20, 2024 •

edited

Loading

MatzeB commented Dec 6, 2024 •

edited

Loading