Skip to content

[SimplifyCFG] Treat umul + extract pattern as cheap single instruction. #124933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

spaits
Copy link
Contributor

@spaits spaits commented Jan 29, 2025

Fix issue #115683 .

Overflow arithmetic instruction plus extract value are usually generated when a division is being replaced, but the zero check may still be there. In that case hoist these two instructions out of this basic block, and let later optimizations take care of the unnecessary zero checks.

An example:

define i1 @src(i8 %x, i8 %y) zeroext {
entry:
  %cmp.not = icmp eq i8 %y, 0
  br i1 %cmp.not, label %land.end, label %land.rhs

land.rhs:
  %mul = umul_overflow i8 %y, %x
  %mul.ov = extractvalue {i8, i1} %mul, 1
  br label %land.end

land.end:
  %#0 = phi i1 [ 0, %entry ], [ %mul.ov, %land.rhs ]
  ret i1 %#0
}
=>
define i1 @tgt(i8 %x, i8 %y) zeroext {
entry:
  %fx = freeze i8 %x
  %mul = umul_overflow i8 %y, %fx
  %mul.ov = extractvalue {i8, i1} %mul, 1
  ret i1 %mul.ov
}

@llvmbot
Copy link
Member

llvmbot commented Jan 29, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Gábor Spaits (spaits)

Changes

This pr was created to fix #115683 .

In the comments @dtcxzyw has suggested, to treat the the pattern extract oneuse(op.with.overflow),1 as one instruction.
If we want the speculativelyExecuteBB function to speculatively execute this block, then we must do that, since it only speculatively executes a block, if that has at most two instructions.

I was also thinking, maybe I should create a mechanism to add more cheap patterns like this one, and in the feature we just provide the pattern and don't have to create a add a new if (pattern && len(thenBlock) == 1 + (len(pattern))) then break to the speculativelyExecuteBB function.

What do you think? Is this solution correct? Or should I try to do this in a different way?
Any input is welcomed.


Full diff: https://github.com/llvm/llvm-project/pull/124933.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+17-5)
  • (added) llvm/test/Transforms/SimplifyCFG/umul-extract-pattern.ll (+27)
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index 12dd49da279b9c..5197db70285cec 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -3290,9 +3290,11 @@ bool SimplifyCFGOpt::speculativelyExecuteBB(BranchInst *BI,
   bool HoistLoadsStores = HoistLoadsStoresWithCondFaulting &&
                           Options.HoistLoadsStoresWithCondFaulting;
   SmallVector<Instruction *, 2> SpeculatedConditionalLoadsStores;
+  InstructionCost BlockCostSoFar = 0;
   Value *SpeculatedStoreValue = nullptr;
   StoreInst *SpeculatedStore = nullptr;
   EphemeralValueTracker EphTracker;
+  bool PatternFound = false;
   for (Instruction &I : reverse(drop_end(*ThenBB))) {
     // Skip debug info.
     if (isa<DbgInfoIntrinsic>(I)) {
@@ -3329,9 +3331,6 @@ bool SimplifyCFGOpt::speculativelyExecuteBB(BranchInst *BI,
     else
       ++SpeculatedInstructions;
 
-    if (SpeculatedInstructions > 1)
-      return false;
-
     // Don't hoist the instruction if it's unsafe or expensive.
     if (!IsSafeCheapLoadStore &&
         !isSafeToSpeculativelyExecute(&I, BI, Options.AC) &&
@@ -3339,10 +3338,23 @@ bool SimplifyCFGOpt::speculativelyExecuteBB(BranchInst *BI,
           (SpeculatedStoreValue =
                isSafeToSpeculateStore(&I, BB, ThenBB, EndBB))))
       return false;
-    if (!IsSafeCheapLoadStore && !SpeculatedStoreValue &&
-        computeSpeculationCost(&I, TTI) >
+
+    if (match(&I,
+              m_ExtractValue<1>(m_OneUse(
+                  m_Intrinsic<Intrinsic::umul_with_overflow>(m_Value())))) &&
+        ThenBB->size() <= 3) {
+      PatternFound = true;
+    }
+
+    BlockCostSoFar += computeSpeculationCost(&I, TTI);
+    if (! PatternFound && !IsSafeCheapLoadStore && !SpeculatedStoreValue &&
+        BlockCostSoFar >
             PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic)
       return false;
+    // If we don't find any pattern, that must be cheap, then only speculatively
+    // execute a single instruction (not counting the terminator).
+    if (!PatternFound && SpeculatedInstructions > 1)
+      return false;
 
     // Store the store speculation candidate.
     if (!SpeculatedStore && SpeculatedStoreValue)
diff --git a/llvm/test/Transforms/SimplifyCFG/umul-extract-pattern.ll b/llvm/test/Transforms/SimplifyCFG/umul-extract-pattern.ll
new file mode 100644
index 00000000000000..72610218c314ec
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/umul-extract-pattern.ll
@@ -0,0 +1,27 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -passes=simplifycfg -simplifycfg-require-and-preserve-domtree=1 -S < %s | FileCheck %s
+
+define dso_local signext range(i16 0, 2) i16 @func2(i64 noundef %x, i64 noundef %y) local_unnamed_addr #0 {
+; CHECK-LABEL: @func2(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i64 [[Y:%.*]], 0
+; CHECK-NEXT:    [[MUL:%.*]] = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 [[Y]], i64 [[X:%.*]])
+; CHECK-NEXT:    [[MUL_OV:%.*]] = extractvalue { i64, i1 } [[MUL]], 1
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_NOT]], i1 false, i1 [[MUL_OV]]
+; CHECK-NEXT:    [[CONV:%.*]] = zext i1 [[SPEC_SELECT]] to i16
+; CHECK-NEXT:    ret i16 [[CONV]]
+;
+entry:
+  %cmp.not = icmp eq i64 %y, 0
+  br i1 %cmp.not, label %land.end, label %land.rhs
+
+land.rhs:                                         ; preds = %entry
+  %mul = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %y, i64 %x)
+  %mul.ov = extractvalue { i64, i1 } %mul, 1
+  br label %land.end
+
+land.end:                                         ; preds = %land.rhs, %entry
+  %0 = phi i1 [ false, %entry ], [ %mul.ov, %land.rhs ]
+  %conv = zext i1 %0 to i16
+  ret i16 %conv
+}

Copy link

github-actions bot commented Jan 29, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@spaits spaits changed the title [Transform] Treat umul + extract pattern as cheap single instruction. [SimplifyCFG] Treat umul + extract pattern as cheap single instruction. Jan 29, 2025
@spaits
Copy link
Contributor Author

spaits commented Jan 30, 2025

Also related to:
https://reviews.llvm.org/D65143
And fixes the problem pointed out in:
6b248fc

// If we don't find any pattern, that must be cheap, then only speculatively
// execute a single instruction (not counting the terminator).
if (!PatternFound && SpeculatedInstructions > 1)
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be easier to just add a MaxSpeculatedInstructionsToHoist variable and increment it if you find an overflow intrinsic whose only user is an extract.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, I was thinking something a bit different, particularly I don't really see the reason for accumulating BlockCostSoFar for the extract and mul.overflow if the expressed intent is to treat it as a single instruction.

What about something like:

    bool PartialInst = false;
    ...
    // Not count load/store into cost if target supports conditional faulting
    // b/c it's cheap to speculate it.
    if (IsSafeCheapLoadStore)
      SpeculatedConditionalLoadsStores.push_back(&I);
    else if (match(&I, m_ExtractValue<1>(
                           m_OneUse(m_WithOverflowInst(m_Value())))) &&
             I.getOperand(0)->getParent() == ThenBB)
      // We are treating (extract (mul/sub/add.overflow), 1) as a single
      // instruction
      PartialInst = true;
    else
      ++SpeculatedInstructions;
     ...
    if (!IsSafeCheapLoadStore && !SpeculatedStoreValue && !PartialInst &&
        computeSpeculationCost(&I, TTI) >
            PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic)
      return false;

Although I'm not really sure if the PartialInst part is necessary at all in fact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the accumulating part of the PR. It was not needed. I have just added it because it looked more generic to me(#124933 (comment) later we would want to hoist more then one instruction but we would still care about the cost then we would need it, but not now). I know it is a generally bad practice to include code in a commit, that is not related to the main goal of the commit. Sorry for that.

If we don't have thePartialInst variable then the hoisting would not be done, because of the high cost.
On rv64, the value of PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic is 2 and the cost of the pattern is 5. So one of the instructions cost must be 3 and that would be over the threshold.

I was also thinking of extending the pattern. Maybe we should check, if the condition is truly a zero checking for one of the operand of the overflow operator. This hoist is probably beneficial in that case only, because later on the redundant check can be dismissed.
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also MaxSpeculatedInstructionsToHoist is still needed, because the match only finds the last instruction of the pattern, the others are still checked and therefore they increase the instruction count.

@spaits
Copy link
Contributor Author

spaits commented Jan 31, 2025

@goldsteinn Thank you very much for your review.
I have added the MaxSpeculatedInstructionsToHoist variable as you have suggested.
I kept the PatternFound. It is used to decide whether we should take in account the calculated cost of the basic block. I could use the MaxSpeculatedInstructionsToHoist variable to do this now: if the value of this is one, then there is no cheap patter, else there is. But, what if we would want to hoist more instructions than one, even if there is no patter? For that reason I keep an own variable for patterns.

Copy link
Member

@dtcxzyw dtcxzyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you! Can you adjust the PR description to make it ready for merge?
Please wait for additional approval from other reviewers :)

@spaits
Copy link
Contributor Author

spaits commented Feb 2, 2025

@dtcxzyw @nikic Thank you very much for your review.
I have made some changes on the PR description. Is it proper for a commit message?
I have also fixes the pattern length and the instruction counting issue.
Also added a test with some more add instructions in the basic block, to see that we are not hoisting, if there are more instructions.
The tests still need a deeper look. I will do more work on them soon. When is use %add2 in the %99 = phi i32 ... the hoisting will happen. This is interesting, because when I do debugging I see the return false, when the instruction count hits 3. Maybe these instructions are being hoisted by another part of the SimplifyCFG pass?

@spaits spaits force-pushed the llvm-issue-115683 branch from 6dade29 to 838e617 Compare February 4, 2025 15:38
@spaits
Copy link
Contributor Author

spaits commented Feb 4, 2025

@nikic Sorry for the slow response. I finally had some some time.
The problem with my test was that, I forgot to use the target triple = "riscv64-unknown-unknown-elf" attribute for the file.
When this is not there (or any other target) then the hoisting happens without the patch.
The function doing that is the following (If I understand everything correctly):

if (foldTwoEntryPHINode(PN, TTI, DTU, Options.AC, DL,

Maybe I should look more into that function and try to move this optimization there?

@spaits
Copy link
Contributor Author

spaits commented Feb 12, 2025

Gentle ping @nikic .
I have checked foldTwoEntryPHINode and dominatesMergePoint.

It turns out, that the cost of

 %mul = tail call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %add, i64 %x)

is 1, when there is no target specified. When using the riscv64 target then this value is 5.

dominatesMergePoint works recursively so I think it would be more complex, to "change" the cost of the pattern there. I would personally keep things the way they are in the current version of the patch, but if you would want this to move elsewhere the I am going to do that.

spaits and others added 5 commits February 12, 2025 12:47
@spaits
Copy link
Contributor Author

spaits commented Feb 12, 2025

@antoniofrighetto Thank you very much for your review.
I think I have addressed all the reviews. Sorry for the amount of typos.
In the case of your last comment about MaxSpeculatedInstructionsToHoist I think we should wait for other reviewers opinions too.

@antoniofrighetto
Copy link
Contributor

@spaits Looks like the CI is reporting a failure in PhaseOrdering/X86/unsigned-multiply-overflow-check.ll test, which should require an update, could you look at it please?

@spaits
Copy link
Contributor Author

spaits commented Feb 20, 2025

@antoniofrighetto I have fixed the failing test case. It seems like we have a regression, compared to some earlier state of the PR, that has been lost to squashing. Now we get back the original state of llvm/test/Transforms/PhaseOrdering/X86/unsigned-multiply-overflow-check.ll. I will check why it no longer fixes that X86 test case.

@antoniofrighetto
Copy link
Contributor

@antoniofrighetto I have fixed the failing test case. It seems like we have a regression, compared to some earlier state of the PR, that has been lost to squashing. Now we get back the original state of llvm/test/Transforms/PhaseOrdering/X86/unsigned-multiply-overflow-check.ll. I will check why it no longer fixes that X86 test case.

Yeah, this started regressing since 838e617 (and unfortunately previous 6dade290797ae1852c8c1f816a3c856b1493dfdd seems not to be available anymore).

@spaits
Copy link
Contributor Author

spaits commented Feb 20, 2025

Okay I found out, that when running this command:

bin/opt -passes=instcombine,simplifycfg -simplifycfg-require-and-preserve-domtree=1 -S testfile.ll

The input BasicBlock (ThenBB) for spewill look like this:

bb2:                                              ; preds = %bb
  %mul = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %arg, i64 %arg1)
  %mul.ov = extractvalue { i64, i1 } %mul, 1
  %0 = xor i1 %mul.ov, true
  br label %bb5

That xor is added into the basic block by InstCombinePass.

@spaits
Copy link
Contributor Author

spaits commented Feb 20, 2025

So, the only reason it has worked before was the omission of the checking how many instructions will be hoisted.

@spaits
Copy link
Contributor Author

spaits commented Feb 20, 2025

I think I will try to move this optimization to to foldTwoEntryPHINode. That function is calling the dominatesMergePoint function. In that I will match to the pattern and set a low cost value for this pattern.

@antoniofrighetto
Copy link
Contributor

Perhaps we could include !extract oneuse(op.with.overflow),1 in the pattern, have something like (and adjust the limit):

WithOverflowInst *UnusedWO;
if (match(&I, m_CombineOr(m_ExtractValue<1>(m_OneUse(m_WithOverflowInst(UnusedWO))),
                          m_Not(m_ExtractValue<1>(m_OneUse(m_WithOverflowInst(UnusedWO)))))))
  Found = true;

But TBH, I believe the current code should be fine (regardless of the test).

@spaits
Copy link
Contributor Author

spaits commented Feb 20, 2025

I have created another solution for this problem. It is in #128021. I think that is a better one. Please check that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants