-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem #120533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-backend-risc-v Author: Philipp van Kempen (PhilippvK) ChangesFor a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup. Without heuristic:
With heuristic:
Full diff: https://github.com/llvm/llvm-project/pull/120533.diff 2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 49192bd6380223..662649b5b276b1 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -2329,6 +2329,15 @@ unsigned RISCVTTIImpl::getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
return std::max<unsigned>(1U, RegWidth.getFixedValue() / ElemWidth);
}
+TTI::AddressingModeKind
+RISCVTTIImpl::getPreferredAddressingMode(const Loop *L,
+ ScalarEvolution *SE) const {
+ if (ST->hasVendorXCVmem())
+ return TTI::AMK_PostIndexed;
+
+ return TTI::AMK_None;
+}
+
bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
const TargetTransformInfo::LSRCost &C2) {
// RISC-V specific here are "instruction number 1st priority".
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index bd90bfed6e2c95..9b364391f0fa47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -388,6 +388,9 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
llvm_unreachable("unknown register class");
}
+ TTI::AddressingModeKind getPreferredAddressingMode(const Loop *L,
+ ScalarEvolution *SE) const;
+
unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr) const {
if (Vector)
return RISCVRegisterClass::VRRC;
|
@ChunyuLiao I think this patch was missed when upstreaming codegen support for XCVmem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test?
@wangpc-pp I looked for existing tests which perform a similar check but was not able to find anything useful. Hence I've tried to come up with a "minimal" example which shows that the heuristic is being applied. I am not very happy with it and am happy about any feedback on how to make the test more compact / less flaky: ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -O3 -mtriple=riscv32 -mattr=+m,+xcvmem -verify-machineinstrs < %s \
; RUN: | FileCheck %s --check-prefixes=CHECK
define i32 @test_heuristic(ptr %a, ptr %b, i32 %c, i32 %d, i32 %e) nounwind {
; CHECK-LABEL: test_heuristic:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: beqz a0, .LBB0_4
; CHECK-NEXT: # %bb.1: # %prepare
; CHECK-NEXT: add a5, a1, a4
; CHECK-NEXT: add a3, a4, a3
; CHECK-NEXT: add a1, a1, a3
; CHECK-NEXT: .LBB0_2: # %loop
; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
; CHECK-NEXT: cv.lbu a3, (a5), 1
; CHECK-NEXT: cv.lbu a4, (a0), 1
; CHECK-NEXT: bne a5, a1, .LBB0_2
; CHECK-NEXT: # %bb.3:
; CHECK-NEXT: add a2, a4, a2
; CHECK-NEXT: mul a0, a2, a3
; CHECK-NEXT: .LBB0_4: # %exit
; CHECK-NEXT: ret
entry:
%4 = icmp eq ptr %a, null
br i1 %4, label %exit, label %prepare
prepare:
%20 = shl nsw i32 %e, 2
%21 = mul nuw nsw i32 %e, 3
%22 = shl nuw nsw i32 %e, 1
%23 = getelementptr inbounds i8, ptr %b, i32 %e
%24 = getelementptr inbounds i8, ptr %b, i32 %22
%25 = getelementptr inbounds i8, ptr %b, i32 %21
%26 = getelementptr inbounds i8, ptr %b, i32 %20
br label %loop
loop:
%28 = phi ptr [ %a, %prepare ], [ %68, %loop ]
%29 = phi ptr [ %b, %prepare ], [ %63, %loop ]
%30 = phi ptr [ %23, %prepare ], [ %64, %loop ]
%32 = phi i32 [ 0, %prepare ], [ %69, %loop ]
%37 = phi i32 [ 0, %prepare ], [ %54, %loop ]
%40 = load i8, ptr %29, align 1
%41 = zext i8 %40 to i32
%42 = load i8, ptr %30, align 1
%43 = zext i8 %42 to i32
%50 = load i8, ptr %28, align 1
%51 = zext i8 %50 to i32
%52 = add i32 %51, %c
%53 = mul i32 %52, %41
%54 = add i32 %53, %37
%55 = mul i32 %52, %43
%63 = getelementptr inbounds i8, ptr %29, i32 1
%64 = getelementptr inbounds i8, ptr %30, i32 1
%68 = getelementptr inbounds i8, ptr %28, i32 1
%69 = add nuw nsw i32 %32, 1
%70 = icmp eq i32 %69, %d
br i1 %70, label %exit, label %loop
exit:
%result = phi i32 [ 0, %entry ], [ %55, %loop ]
ret i32 %result
} |
@PhilippvK I want to share a method that I learnt from the community: you can add an As for your case, I think you can add an assertion here? llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp Lines 1569 to 1628 in d6e8ab1
|
@wangpc-pp Thanks to your hint, I could reduce the complexity of the test case: https://github.com/llvm/llvm-project/pull/120533/files#diff-ef5be54f9e88c12d2f5283d2cf71802e3ccdfbe601f89c0e8c2d9eda7c01a986 |
…dorXCVmem For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup. [RISCV] Address PR comment [RISCV] Add !ST->is64Bit() check
b6e7722
to
342ea51
Compare
I realized that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
The reduced test may not be accurate and similar to real-world code. |
@wangpc-pp Can this be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wangpc-pp Can this be merged?
Happy to go forward.
I do not have sufficient rights to perform the merge by myself. Could you please do a squash & merge @wangpc-pp? |
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/4/builds/4442 Here is the relevant piece of the build log for the reference
|
Local branch amd-gfx 23e4c3a Merged main:9abcca5e2529 into amd-gfx:387d39f6081c Remote branch main f590963 [RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem (llvm#120533)
For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 20 instructions, resulting in a 20% speedup.
Without heuristic:
With heuristic: