Skip to content

[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem #120533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 31, 2024

Conversation

PhilippvK
Copy link
Contributor

For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup.

Without heuristic:

       13688: 001b838b      cv.lb   t2, (s7), 0x1
       1368c: 09cdbcab      cv.lb   s9, t3(s11)
       13690: 089db62b      cv.lb   a2, s1(s11)
       13694: 092dbdab      cv.lb   s11, s2(s11)
       13698: 001d028b      cv.lb   t0, (s10), 0x1
       1369c: 00f282b3      add     t0, t0, a5
       136a0: 9072b52b      cv.mac  a0, t0, t2
       136a4: 9192bfab      cv.mac  t6, t0, s9
       136a8: 90c2beab      cv.mac  t4, t0, a2
       136ac: 91b2bf2b      cv.mac  t5, t0, s11
       136b0: fffc0c13      addi    s8, s8, -0x1
       136b4: 018e0633      add     a2, t3, s8
       136b8: 91b2b0ab      cv.mac  ra, t0, s11
       136bc: 000b8d93      mv      s11, s7
       136c0: fc0614e3      bnez    a2, 0x13688 <muriscv_nn_vec_mat_mult_t_s8+0x2f0>

       #instrs = 15

With heuristic:

        7bc0: 001c860b      cv.lb   a2, (s9), 0x1
        7bc4: 001e0d0b      cv.lb   s10, (t3), 0x1
        7bc8: 001e808b      cv.lb   ra, (t4), 0x1
        7bcc: 0015038b      cv.lb   t2, (a0), 0x1
        7bd0: 001c028b      cv.lb   t0, (s8), 0x1
        7bd4: 00f282b3      add     t0, t0, a5
        7bd8: 90c2bfab      cv.mac  t6, t0, a2
        7bdc: 91a2b92b      cv.mac  s2, t0, s10
        7be0: 9012b5ab      cv.mac  a1, t0, ra
        7be4: 9072b9ab      cv.mac  s3, t0, t2
        7be8: 9072b72b      cv.mac  a4, t0, t2
        7bec: fc851ae3      bne     a0, s0, 0x7bc0 <muriscv_nn_vec_mat_mult_t_s8+0x338>

        #instrs = 12

        improvement = 1 - 12/15 = 0.2 = 20%

@llvmbot
Copy link
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Philipp van Kempen (PhilippvK)

Changes

For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup.

Without heuristic:

       13688: 001b838b      cv.lb   t2, (s7), 0x1
       1368c: 09cdbcab      cv.lb   s9, t3(s11)
       13690: 089db62b      cv.lb   a2, s1(s11)
       13694: 092dbdab      cv.lb   s11, s2(s11)
       13698: 001d028b      cv.lb   t0, (s10), 0x1
       1369c: 00f282b3      add     t0, t0, a5
       136a0: 9072b52b      cv.mac  a0, t0, t2
       136a4: 9192bfab      cv.mac  t6, t0, s9
       136a8: 90c2beab      cv.mac  t4, t0, a2
       136ac: 91b2bf2b      cv.mac  t5, t0, s11
       136b0: fffc0c13      addi    s8, s8, -0x1
       136b4: 018e0633      add     a2, t3, s8
       136b8: 91b2b0ab      cv.mac  ra, t0, s11
       136bc: 000b8d93      mv      s11, s7
       136c0: fc0614e3      bnez    a2, 0x13688 &lt;muriscv_nn_vec_mat_mult_t_s8+0x2f0&gt;

       #instrs = 15

With heuristic:

        7bc0: 001c860b      cv.lb   a2, (s9), 0x1
        7bc4: 001e0d0b      cv.lb   s10, (t3), 0x1
        7bc8: 001e808b      cv.lb   ra, (t4), 0x1
        7bcc: 0015038b      cv.lb   t2, (a0), 0x1
        7bd0: 001c028b      cv.lb   t0, (s8), 0x1
        7bd4: 00f282b3      add     t0, t0, a5
        7bd8: 90c2bfab      cv.mac  t6, t0, a2
        7bdc: 91a2b92b      cv.mac  s2, t0, s10
        7be0: 9012b5ab      cv.mac  a1, t0, ra
        7be4: 9072b9ab      cv.mac  s3, t0, t2
        7be8: 9072b72b      cv.mac  a4, t0, t2
        7bec: fc851ae3      bne     a0, s0, 0x7bc0 &lt;muriscv_nn_vec_mat_mult_t_s8+0x338&gt;

        #instrs = 12

        improvement = 1 - 12/15 = 0.2 = 20%

Full diff: https://github.com/llvm/llvm-project/pull/120533.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+9)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+3)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 49192bd6380223..662649b5b276b1 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -2329,6 +2329,15 @@ unsigned RISCVTTIImpl::getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
   return std::max<unsigned>(1U, RegWidth.getFixedValue() / ElemWidth);
 }
 
+TTI::AddressingModeKind
+RISCVTTIImpl::getPreferredAddressingMode(const Loop *L,
+                                         ScalarEvolution *SE) const {
+  if (ST->hasVendorXCVmem())
+    return TTI::AMK_PostIndexed;
+
+  return TTI::AMK_None;
+}
+
 bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                                  const TargetTransformInfo::LSRCost &C2) {
   // RISC-V specific here are "instruction number 1st priority".
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index bd90bfed6e2c95..9b364391f0fa47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -388,6 +388,9 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     llvm_unreachable("unknown register class");
   }
 
+  TTI::AddressingModeKind getPreferredAddressingMode(const Loop *L,
+                                                     ScalarEvolution *SE) const;
+
   unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr) const {
     if (Vector)
       return RISCVRegisterClass::VRRC;

@PhilippvK
Copy link
Contributor Author

@ChunyuLiao I think this patch was missed when upstreaming codegen support for XCVmem.

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test?

@PhilippvK
Copy link
Contributor Author

PhilippvK commented Dec 19, 2024

Can you add a test?

@wangpc-pp I looked for existing tests which perform a similar check but was not able to find anything useful. Hence I've tried to come up with a "minimal" example which shows that the heuristic is being applied.

I am not very happy with it and am happy about any feedback on how to make the test more compact / less flaky:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -O3 -mtriple=riscv32 -mattr=+m,+xcvmem -verify-machineinstrs < %s \
; RUN:   | FileCheck %s --check-prefixes=CHECK

define i32 @test_heuristic(ptr %a, ptr %b, i32 %c, i32 %d, i32 %e) nounwind {
; CHECK-LABEL: test_heuristic:
; CHECK:       # %bb.0: # %entry
; CHECK-NEXT:    beqz a0, .LBB0_4
; CHECK-NEXT:  # %bb.1: # %prepare
; CHECK-NEXT:    add a5, a1, a4
; CHECK-NEXT:    add a3, a4, a3
; CHECK-NEXT:    add a1, a1, a3
; CHECK-NEXT:  .LBB0_2: # %loop
; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
; CHECK-NEXT:    cv.lbu a3, (a5), 1
; CHECK-NEXT:    cv.lbu a4, (a0), 1
; CHECK-NEXT:    bne a5, a1, .LBB0_2
; CHECK-NEXT:  # %bb.3:
; CHECK-NEXT:    add a2, a4, a2
; CHECK-NEXT:    mul a0, a2, a3
; CHECK-NEXT:  .LBB0_4: # %exit
; CHECK-NEXT:    ret
entry:
  %4 = icmp eq ptr %a, null
  br i1 %4, label %exit, label %prepare

prepare:
  %20 = shl nsw i32 %e, 2
  %21 = mul nuw nsw i32 %e, 3
  %22 = shl nuw nsw i32 %e, 1
  %23 = getelementptr inbounds i8, ptr %b, i32 %e
  %24 = getelementptr inbounds i8, ptr %b, i32 %22
  %25 = getelementptr inbounds i8, ptr %b, i32 %21
  %26 = getelementptr inbounds i8, ptr %b, i32 %20
  br label %loop

loop:
  %28 = phi ptr [ %a, %prepare ], [ %68, %loop ]
  %29 = phi ptr [ %b, %prepare ], [ %63, %loop ]
  %30 = phi ptr [ %23, %prepare ], [ %64, %loop ]
  %32 = phi i32 [ 0, %prepare ], [ %69, %loop ]
  %37 = phi i32 [ 0, %prepare ], [ %54, %loop ]
  %40 = load i8, ptr %29, align 1
  %41 = zext i8 %40 to i32
  %42 = load i8, ptr %30, align 1
  %43 = zext i8 %42 to i32
  %50 = load i8, ptr %28, align 1
  %51 = zext i8 %50 to i32
  %52 = add i32 %51, %c
  %53 = mul i32 %52, %41
  %54 = add i32 %53, %37
  %55 = mul i32 %52, %43
  %63 = getelementptr inbounds i8, ptr %29, i32 1
  %64 = getelementptr inbounds i8, ptr %30, i32 1
  %68 = getelementptr inbounds i8, ptr %28, i32 1
  %69 = add nuw nsw i32 %32, 1
  %70 = icmp eq i32 %69, %d
  br i1 %70, label %exit, label %loop

exit:
  %result = phi i32 [ 0, %entry ], [ %55, %loop ]
  ret i32 %result
}

@wangpc-pp
Copy link
Contributor

I am not very happy with it and am happy about any feedback on how to make the test more compact / less flaky.

@PhilippvK I want to share a method that I learnt from the community: you can add an assert("catch you!") to the place you care about, for example, the code that applys some optimizations, the code that selects your instructions, etc. And then you can use llvm-reduce to reduce your test via grepping the assertion.

As for your case, I think you can add an assertion here?

if (Subtarget->hasVendorXCVmem() && !Subtarget->is64Bit()) {
// We match post-incrementing load here
LoadSDNode *Load = cast<LoadSDNode>(Node);
if (Load->getAddressingMode() != ISD::POST_INC)
break;
SDValue Chain = Node->getOperand(0);
SDValue Base = Node->getOperand(1);
SDValue Offset = Node->getOperand(2);
bool Simm12 = false;
bool SignExtend = Load->getExtensionType() == ISD::SEXTLOAD;
if (auto ConstantOffset = dyn_cast<ConstantSDNode>(Offset)) {
int ConstantVal = ConstantOffset->getSExtValue();
Simm12 = isInt<12>(ConstantVal);
if (Simm12)
Offset = CurDAG->getTargetConstant(ConstantVal, SDLoc(Offset),
Offset.getValueType());
}
unsigned Opcode = 0;
switch (Load->getMemoryVT().getSimpleVT().SimpleTy) {
case MVT::i8:
if (Simm12 && SignExtend)
Opcode = RISCV::CV_LB_ri_inc;
else if (Simm12 && !SignExtend)
Opcode = RISCV::CV_LBU_ri_inc;
else if (!Simm12 && SignExtend)
Opcode = RISCV::CV_LB_rr_inc;
else
Opcode = RISCV::CV_LBU_rr_inc;
break;
case MVT::i16:
if (Simm12 && SignExtend)
Opcode = RISCV::CV_LH_ri_inc;
else if (Simm12 && !SignExtend)
Opcode = RISCV::CV_LHU_ri_inc;
else if (!Simm12 && SignExtend)
Opcode = RISCV::CV_LH_rr_inc;
else
Opcode = RISCV::CV_LHU_rr_inc;
break;
case MVT::i32:
if (Simm12)
Opcode = RISCV::CV_LW_ri_inc;
else
Opcode = RISCV::CV_LW_rr_inc;
break;
default:
break;
}
if (!Opcode)
break;
ReplaceNode(Node, CurDAG->getMachineNode(Opcode, DL, XLenVT, XLenVT,
Chain.getSimpleValueType(), Base,
Offset, Chain));
return;
}

@PhilippvK
Copy link
Contributor Author

@wangpc-pp Thanks to your hint, I could reduce the complexity of the test case: https://github.com/llvm/llvm-project/pull/120533/files#diff-ef5be54f9e88c12d2f5283d2cf71802e3ccdfbe601f89c0e8c2d9eda7c01a986

…dorXCVmem

For a simple matmult kernel this heuristic reduces the length of the critical basic
block from 15 to 20 instructions, resulting in a 20% speedup.

[RISCV] Address PR comment

[RISCV] Add !ST->is64Bit() check
@PhilippvK PhilippvK force-pushed the feature-xcvmem-heuristic branch from b6e7722 to 342ea51 Compare December 20, 2024 10:36
@PhilippvK
Copy link
Contributor Author

I realized that getPreferredAddressingMode -> POST_INC is not always beneficial (see provided test case where one extra add instruction is added.) However I could not identify the scenarios where the heruristic is actually useful.

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@wangpc-pp
Copy link
Contributor

wangpc-pp commented Dec 23, 2024

I realized that getPreferredAddressingMode -> POST_INC is not always beneficial (see provided test case where one extra add instruction is added.) However I could not identify the scenarios where the heruristic is actually useful.

The reduced test may not be accurate and similar to real-world code.

@PhilippvK
Copy link
Contributor Author

@wangpc-pp Can this be merged?

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangpc-pp Can this be merged?

Happy to go forward.

@PhilippvK
Copy link
Contributor Author

@wangpc-pp Can this be merged?

Happy to go forward.

I do not have sufficient rights to perform the merge by myself. Could you please do a squash & merge @wangpc-pp?

@wangpc-pp wangpc-pp merged commit f590963 into llvm:main Dec 31, 2024
8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 31, 2024

LLVM Buildbot has detected a new failure on builder clang-aarch64-sve-vls-2stage running on linaro-g3-01 while building llvm at step 11 "build stage 2".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/4/builds/4442

Here is the relevant piece of the build log for the reference
Step 11 (build stage 2) failure: 'ninja' (failure)

qiaojbao pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Feb 7, 2025
Local branch amd-gfx 23e4c3a Merged main:9abcca5e2529 into amd-gfx:387d39f6081c
Remote branch main f590963 [RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem (llvm#120533)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants