[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem #120533

PhilippvK · 2024-12-19T08:22:25Z

For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup.

Without heuristic:

       13688: 001b838b      cv.lb   t2, (s7), 0x1
       1368c: 09cdbcab      cv.lb   s9, t3(s11)
       13690: 089db62b      cv.lb   a2, s1(s11)
       13694: 092dbdab      cv.lb   s11, s2(s11)
       13698: 001d028b      cv.lb   t0, (s10), 0x1
       1369c: 00f282b3      add     t0, t0, a5
       136a0: 9072b52b      cv.mac  a0, t0, t2
       136a4: 9192bfab      cv.mac  t6, t0, s9
       136a8: 90c2beab      cv.mac  t4, t0, a2
       136ac: 91b2bf2b      cv.mac  t5, t0, s11
       136b0: fffc0c13      addi    s8, s8, -0x1
       136b4: 018e0633      add     a2, t3, s8
       136b8: 91b2b0ab      cv.mac  ra, t0, s11
       136bc: 000b8d93      mv      s11, s7
       136c0: fc0614e3      bnez    a2, 0x13688 <muriscv_nn_vec_mat_mult_t_s8+0x2f0>

       #instrs = 15

With heuristic:

        7bc0: 001c860b      cv.lb   a2, (s9), 0x1
        7bc4: 001e0d0b      cv.lb   s10, (t3), 0x1
        7bc8: 001e808b      cv.lb   ra, (t4), 0x1
        7bcc: 0015038b      cv.lb   t2, (a0), 0x1
        7bd0: 001c028b      cv.lb   t0, (s8), 0x1
        7bd4: 00f282b3      add     t0, t0, a5
        7bd8: 90c2bfab      cv.mac  t6, t0, a2
        7bdc: 91a2b92b      cv.mac  s2, t0, s10
        7be0: 9012b5ab      cv.mac  a1, t0, ra
        7be4: 9072b9ab      cv.mac  s3, t0, t2
        7be8: 9072b72b      cv.mac  a4, t0, t2
        7bec: fc851ae3      bne     a0, s0, 0x7bc0 <muriscv_nn_vec_mat_mult_t_s8+0x338>

        #instrs = 12

        improvement = 1 - 12/15 = 0.2 = 20%

llvmbot · 2024-12-19T08:22:59Z

@llvm/pr-subscribers-backend-risc-v

Author: Philipp van Kempen (PhilippvK)

Changes

For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup.

Without heuristic:

       13688: 001b838b      cv.lb   t2, (s7), 0x1
       1368c: 09cdbcab      cv.lb   s9, t3(s11)
       13690: 089db62b      cv.lb   a2, s1(s11)
       13694: 092dbdab      cv.lb   s11, s2(s11)
       13698: 001d028b      cv.lb   t0, (s10), 0x1
       1369c: 00f282b3      add     t0, t0, a5
       136a0: 9072b52b      cv.mac  a0, t0, t2
       136a4: 9192bfab      cv.mac  t6, t0, s9
       136a8: 90c2beab      cv.mac  t4, t0, a2
       136ac: 91b2bf2b      cv.mac  t5, t0, s11
       136b0: fffc0c13      addi    s8, s8, -0x1
       136b4: 018e0633      add     a2, t3, s8
       136b8: 91b2b0ab      cv.mac  ra, t0, s11
       136bc: 000b8d93      mv      s11, s7
       136c0: fc0614e3      bnez    a2, 0x13688 &lt;muriscv_nn_vec_mat_mult_t_s8+0x2f0&gt;

       #instrs = 15

With heuristic:

        7bc0: 001c860b      cv.lb   a2, (s9), 0x1
        7bc4: 001e0d0b      cv.lb   s10, (t3), 0x1
        7bc8: 001e808b      cv.lb   ra, (t4), 0x1
        7bcc: 0015038b      cv.lb   t2, (a0), 0x1
        7bd0: 001c028b      cv.lb   t0, (s8), 0x1
        7bd4: 00f282b3      add     t0, t0, a5
        7bd8: 90c2bfab      cv.mac  t6, t0, a2
        7bdc: 91a2b92b      cv.mac  s2, t0, s10
        7be0: 9012b5ab      cv.mac  a1, t0, ra
        7be4: 9072b9ab      cv.mac  s3, t0, t2
        7be8: 9072b72b      cv.mac  a4, t0, t2
        7bec: fc851ae3      bne     a0, s0, 0x7bc0 &lt;muriscv_nn_vec_mat_mult_t_s8+0x338&gt;

        #instrs = 12

        improvement = 1 - 12/15 = 0.2 = 20%

Full diff: https://github.com/llvm/llvm-project/pull/120533.diff

2 Files Affected:

(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+9)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+3)

diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 49192bd6380223..662649b5b276b1 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -2329,6 +2329,15 @@ unsigned RISCVTTIImpl::getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
   return std::max<unsigned>(1U, RegWidth.getFixedValue() / ElemWidth);
 }
 
+TTI::AddressingModeKind
+RISCVTTIImpl::getPreferredAddressingMode(const Loop *L,
+                                         ScalarEvolution *SE) const {
+  if (ST->hasVendorXCVmem())
+    return TTI::AMK_PostIndexed;
+
+  return TTI::AMK_None;
+}
+
 bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                                  const TargetTransformInfo::LSRCost &C2) {
   // RISC-V specific here are "instruction number 1st priority".
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index bd90bfed6e2c95..9b364391f0fa47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -388,6 +388,9 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     llvm_unreachable("unknown register class");
   }
 
+  TTI::AddressingModeKind getPreferredAddressingMode(const Loop *L,
+                                                     ScalarEvolution *SE) const;
+
   unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr) const {
     if (Vector)
       return RISCVRegisterClass::VRRC;

PhilippvK · 2024-12-19T08:24:04Z

@ChunyuLiao I think this patch was missed when upstreaming codegen support for XCVmem.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

wangpc-pp

Can you add a test?

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

PhilippvK · 2024-12-19T14:30:59Z

Can you add a test?

@wangpc-pp I looked for existing tests which perform a similar check but was not able to find anything useful. Hence I've tried to come up with a "minimal" example which shows that the heuristic is being applied.

I am not very happy with it and am happy about any feedback on how to make the test more compact / less flaky:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -O3 -mtriple=riscv32 -mattr=+m,+xcvmem -verify-machineinstrs < %s \
; RUN:   | FileCheck %s --check-prefixes=CHECK

define i32 @test_heuristic(ptr %a, ptr %b, i32 %c, i32 %d, i32 %e) nounwind {
; CHECK-LABEL: test_heuristic:
; CHECK:       # %bb.0: # %entry
; CHECK-NEXT:    beqz a0, .LBB0_4
; CHECK-NEXT:  # %bb.1: # %prepare
; CHECK-NEXT:    add a5, a1, a4
; CHECK-NEXT:    add a3, a4, a3
; CHECK-NEXT:    add a1, a1, a3
; CHECK-NEXT:  .LBB0_2: # %loop
; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
; CHECK-NEXT:    cv.lbu a3, (a5), 1
; CHECK-NEXT:    cv.lbu a4, (a0), 1
; CHECK-NEXT:    bne a5, a1, .LBB0_2
; CHECK-NEXT:  # %bb.3:
; CHECK-NEXT:    add a2, a4, a2
; CHECK-NEXT:    mul a0, a2, a3
; CHECK-NEXT:  .LBB0_4: # %exit
; CHECK-NEXT:    ret
entry:
  %4 = icmp eq ptr %a, null
  br i1 %4, label %exit, label %prepare

prepare:
  %20 = shl nsw i32 %e, 2
  %21 = mul nuw nsw i32 %e, 3
  %22 = shl nuw nsw i32 %e, 1
  %23 = getelementptr inbounds i8, ptr %b, i32 %e
  %24 = getelementptr inbounds i8, ptr %b, i32 %22
  %25 = getelementptr inbounds i8, ptr %b, i32 %21
  %26 = getelementptr inbounds i8, ptr %b, i32 %20
  br label %loop

loop:
  %28 = phi ptr [ %a, %prepare ], [ %68, %loop ]
  %29 = phi ptr [ %b, %prepare ], [ %63, %loop ]
  %30 = phi ptr [ %23, %prepare ], [ %64, %loop ]
  %32 = phi i32 [ 0, %prepare ], [ %69, %loop ]
  %37 = phi i32 [ 0, %prepare ], [ %54, %loop ]
  %40 = load i8, ptr %29, align 1
  %41 = zext i8 %40 to i32
  %42 = load i8, ptr %30, align 1
  %43 = zext i8 %42 to i32
  %50 = load i8, ptr %28, align 1
  %51 = zext i8 %50 to i32
  %52 = add i32 %51, %c
  %53 = mul i32 %52, %41
  %54 = add i32 %53, %37
  %55 = mul i32 %52, %43
  %63 = getelementptr inbounds i8, ptr %29, i32 1
  %64 = getelementptr inbounds i8, ptr %30, i32 1
  %68 = getelementptr inbounds i8, ptr %28, i32 1
  %69 = add nuw nsw i32 %32, 1
  %70 = icmp eq i32 %69, %d
  br i1 %70, label %exit, label %loop

exit:
  %result = phi i32 [ 0, %entry ], [ %55, %loop ]
  ret i32 %result
}

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

wangpc-pp · 2024-12-20T03:33:29Z

I am not very happy with it and am happy about any feedback on how to make the test more compact / less flaky.

@PhilippvK I want to share a method that I learnt from the community: you can add an assert("catch you!") to the place you care about, for example, the code that applys some optimizations, the code that selects your instructions, etc. And then you can use llvm-reduce to reduce your test via grepping the assertion.

As for your case, I think you can add an assertion here?

llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp

Lines 1569 to 1628 in d6e8ab1

    
           if (Subtarget->hasVendorXCVmem() && !Subtarget->is64Bit()) { 
        
             // We match post-incrementing load here 
        
             LoadSDNode *Load = cast<LoadSDNode>(Node); 
        
             if (Load->getAddressingMode() != ISD::POST_INC) 
        
               break; 
        
             SDValue Chain = Node->getOperand(0); 
        
             SDValue Base = Node->getOperand(1); 
        
             SDValue Offset = Node->getOperand(2); 
        
             bool Simm12 = false; 
        
             bool SignExtend = Load->getExtensionType() == ISD::SEXTLOAD; 
        
             if (auto ConstantOffset = dyn_cast<ConstantSDNode>(Offset)) { 
        
               int ConstantVal = ConstantOffset->getSExtValue(); 
        
               Simm12 = isInt<12>(ConstantVal); 
        
               if (Simm12) 
        
                 Offset = CurDAG->getTargetConstant(ConstantVal, SDLoc(Offset), 
        
                                                    Offset.getValueType()); 
        
             } 
        
             unsigned Opcode = 0; 
        
             switch (Load->getMemoryVT().getSimpleVT().SimpleTy) { 
        
             case MVT::i8: 
        
               if (Simm12 && SignExtend) 
        
                 Opcode = RISCV::CV_LB_ri_inc; 
        
               else if (Simm12 && !SignExtend) 
        
                 Opcode = RISCV::CV_LBU_ri_inc; 
        
               else if (!Simm12 && SignExtend) 
        
                 Opcode = RISCV::CV_LB_rr_inc; 
        
               else 
        
                 Opcode = RISCV::CV_LBU_rr_inc; 
        
               break; 
        
             case MVT::i16: 
        
               if (Simm12 && SignExtend) 
        
                 Opcode = RISCV::CV_LH_ri_inc; 
        
               else if (Simm12 && !SignExtend) 
        
                 Opcode = RISCV::CV_LHU_ri_inc; 
        
               else if (!Simm12 && SignExtend) 
        
                 Opcode = RISCV::CV_LH_rr_inc; 
        
               else 
        
                 Opcode = RISCV::CV_LHU_rr_inc; 
        
               break; 
        
             case MVT::i32: 
        
               if (Simm12) 
        
                 Opcode = RISCV::CV_LW_ri_inc; 
        
               else 
        
                 Opcode = RISCV::CV_LW_rr_inc; 
        
               break; 
        
             default: 
        
               break; 
        
             } 
        
             if (!Opcode) 
        
               break; 
        
             ReplaceNode(Node, CurDAG->getMachineNode(Opcode, DL, XLenVT, XLenVT, 
        
                                                      Chain.getSimpleValueType(), Base, 
        
                                                      Offset, Chain)); 
        
             return; 
        
           }

PhilippvK · 2024-12-20T10:08:47Z

@wangpc-pp Thanks to your hint, I could reduce the complexity of the test case: https://github.com/llvm/llvm-project/pull/120533/files#diff-ef5be54f9e88c12d2f5283d2cf71802e3ccdfbe601f89c0e8c2d9eda7c01a986

llvm/test/CodeGen/RISCV/xcvmem-heuristic.ll

…dorXCVmem For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup. [RISCV] Address PR comment [RISCV] Add !ST->is64Bit() check

PhilippvK · 2024-12-20T10:42:05Z

I realized that getPreferredAddressingMode -> POST_INC is not always beneficial (see provided test case where one extra add instruction is added.) However I could not identify the scenarios where the heruristic is actually useful.

wangpc-pp

LGTM.

wangpc-pp · 2024-12-23T04:08:25Z

I realized that getPreferredAddressingMode -> POST_INC is not always beneficial (see provided test case where one extra add instruction is added.) However I could not identify the scenarios where the heruristic is actually useful.

The reduced test may not be accurate and similar to real-world code.

PhilippvK · 2024-12-30T09:53:39Z

@wangpc-pp Can this be merged?

wangpc-pp

@wangpc-pp Can this be merged?

Happy to go forward.

PhilippvK · 2024-12-30T14:43:32Z

@wangpc-pp Can this be merged?

Happy to go forward.

I do not have sufficient rights to perform the merge by myself. Could you please do a squash & merge @wangpc-pp?

llvm-ci · 2024-12-31T08:41:15Z

LLVM Buildbot has detected a new failure on builder clang-aarch64-sve-vls-2stage running on linaro-g3-01 while building llvm at step 11 "build stage 2".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/4/builds/4442

Here is the relevant piece of the build log for the reference

Step 11 (build stage 2) failure: 'ninja' (failure)

Local branch amd-gfx 23e4c3a Merged main:9abcca5e2529 into amd-gfx:387d39f6081c Remote branch main f590963 [RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem (llvm#120533)

llvmbot added the backend:RISC-V label Dec 19, 2024

PhilippvK commented Dec 19, 2024

View reviewed changes

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp Outdated Show resolved Hide resolved

wangpc-pp reviewed Dec 19, 2024

View reviewed changes

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp Outdated Show resolved Hide resolved

topperc reviewed Dec 19, 2024

View reviewed changes

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp Outdated Show resolved Hide resolved

wangpc-pp reviewed Dec 20, 2024

View reviewed changes

llvm/test/CodeGen/RISCV/xcvmem-heuristic.ll Show resolved Hide resolved

PhilippvK added 2 commits December 20, 2024 11:34

[RISCV] add test case for XCVmem addressing mode heuristic

75a5f8c

PhilippvK force-pushed the feature-xcvmem-heuristic branch from b6e7722 to 342ea51 Compare December 20, 2024 10:36

wangpc-pp approved these changes Dec 23, 2024

View reviewed changes

wangpc-pp approved these changes Dec 30, 2024

View reviewed changes

Merge branch 'main' into feature-xcvmem-heuristic

dda54eb

wangpc-pp merged commit f590963 into llvm:main Dec 31, 2024
8 checks passed

[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem #120533

[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem #120533

Uh oh!

Conversation

PhilippvK commented Dec 19, 2024

Uh oh!

llvmbot commented Dec 19, 2024

Uh oh!

PhilippvK commented Dec 19, 2024

Uh oh!

Uh oh!

wangpc-pp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

PhilippvK commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

wangpc-pp commented Dec 20, 2024

Uh oh!

PhilippvK commented Dec 20, 2024

Uh oh!

Uh oh!

PhilippvK commented Dec 20, 2024

Uh oh!

wangpc-pp left a comment

Choose a reason for hiding this comment

Uh oh!

wangpc-pp commented Dec 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PhilippvK commented Dec 30, 2024

Uh oh!

wangpc-pp left a comment

Choose a reason for hiding this comment

Uh oh!

PhilippvK commented Dec 30, 2024

Uh oh!

Uh oh!

llvm-ci commented Dec 31, 2024

Uh oh!

Uh oh!

PhilippvK commented Dec 19, 2024 •

edited

Loading

wangpc-pp commented Dec 23, 2024 •

edited

Loading