[RISCV] Prefer VLS over VLA if costs are equal #100564
base: main
Conversation
This is inspired by llvm#95819. Some kernels like s000 show some improvements: we can reduce the code for calculating the vector length and fully unroll the tail epilogue. Currently, we add a SubtargetFeature for this, and processors can add it if needed.
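For readers new to the hook, below is a minimal sketch of how a tie-break like this can work inside a vectorizer's cost comparison. It is not the actual LoopVectorize.cpp logic; CandidateVF and pickBetter are invented for illustration, and only the hook name preferFixedOverScalableIfEqualCost comes from this patch and #95819.

```cpp
#include <cstdint>

// Illustrative stand-ins for the vectorizer's per-VF cost summary;
// these are not LLVM's real types.
struct CandidateVF {
  uint64_t Cost;   // cost-model estimate for this vectorization factor
  bool IsScalable; // true for a vscale-based (VLA) VF, false for a fixed (VLS) VF
};

// Hypothetical tie-break mirroring the intent of preferFixedOverScalableIfEqualCost():
// on unequal costs pick the cheaper plan; on equal costs keep the scalable plan
// by default, but prefer the fixed plan when the target opts in.
static CandidateVF pickBetter(const CandidateVF &Fixed,
                              const CandidateVF &Scalable,
                              bool PreferFixedIfEqual) {
  if (Fixed.Cost != Scalable.Cost)
    return Fixed.Cost < Scalable.Cost ? Fixed : Scalable;
  return PreferFixedIfEqual ? Fixed : Scalable;
}
```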
@llvm/pr-subscribers-llvm-transforms
Author: Pengcheng Wang (wangpc-pp)
Changes: This is inspired by #95819. Some kernels like s000 show some improvements: we can reduce the code for calculating the vector length and fully unroll the tail epilogue. Currently, we add a SubtargetFeature for this, and processors can add it if needed.
Full diff: https://github.com/llvm/llvm-project/pull/100564.diff 3 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index 3c868dbbf8b3a..96ec2dcbb715b 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1324,6 +1324,12 @@ def FeaturePredictableSelectIsExpensive
: SubtargetFeature<"predictable-select-expensive", "PredictableSelectIsExpensive", "true",
"Prefer likely predicted branches over selects">;
+def FeatureUseFixedOverScalableIfEqualCost
+ : SubtargetFeature<"use-fixed-over-scalable-if-equal-cost",
+ "UseFixedOverScalableIfEqualCost", "true",
+ "Prefer fixed width loop vectorization over scalable "
+ "if the cost-model assigns equal costs">;
+
def TuneOptimizedZeroStrideLoad
: SubtargetFeature<"optimized-zero-stride-load", "HasOptimizedZeroStrideLoad",
"true", "Optimized (perform fewer memory operations)"
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 9c37a4f6ec2d0..fffae92e78b2f 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -342,6 +342,10 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
bool enableInterleavedAccessVectorization() { return true; }
+ bool preferFixedOverScalableIfEqualCost() const {
+ return ST->useFixedOverScalableIfEqualCost();
+ }
+
enum RISCVRegisterClass { GPRRC, FPRRC, VRRC };
unsigned getNumberOfRegisters(unsigned ClassID) const {
switch (ClassID) {
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/prefer-fixed-if-equal-to-scalable.ll b/llvm/test/Transforms/LoopVectorize/RISCV/prefer-fixed-if-equal-to-scalable.ll
new file mode 100644
index 0000000000000..eebd34958905c
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/prefer-fixed-if-equal-to-scalable.ll
@@ -0,0 +1,165 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple riscv64 -S -passes=loop-vectorize -force-target-instruction-cost=1 < %s \
+; RUN: -mattr=+v | FileCheck %s -check-prefix=SCALABLE
+; RUN: opt -mtriple riscv64 -S -passes=loop-vectorize -force-target-instruction-cost=1 < %s \
+; RUN: -mattr=+v,+use-fixed-over-scalable-if-equal-cost \
+; RUN: | FileCheck %s -check-prefix=FIXED
+
+define void @s000(ptr %a, ptr %b, i32 %n) {
+; SCALABLE-LABEL: define void @s000(
+; SCALABLE-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; SCALABLE-NEXT: [[ENTRY:.*:]]
+; SCALABLE-NEXT: [[B2:%.*]] = ptrtoint ptr [[B]] to i64
+; SCALABLE-NEXT: [[A1:%.*]] = ptrtoint ptr [[A]] to i64
+; SCALABLE-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N]], 0
+; SCALABLE-NEXT: br i1 [[CMP6]], label %[[FOR_BODY_PREHEADER:.*]], label %[[FOR_COND_CLEANUP:.*]]
+; SCALABLE: [[FOR_BODY_PREHEADER]]:
+; SCALABLE-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+; SCALABLE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; SCALABLE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; SCALABLE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 8, i64 [[TMP1]])
+; SCALABLE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], [[TMP2]]
+; SCALABLE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; SCALABLE: [[VECTOR_MEMCHECK]]:
+; SCALABLE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; SCALABLE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; SCALABLE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; SCALABLE-NEXT: [[TMP6:%.*]] = sub i64 [[A1]], [[B2]]
+; SCALABLE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; SCALABLE-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; SCALABLE: [[VECTOR_PH]]:
+; SCALABLE-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; SCALABLE-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4
+; SCALABLE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], [[TMP8]]
+; SCALABLE-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
+; SCALABLE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; SCALABLE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; SCALABLE-NEXT: br label %[[VECTOR_BODY:.*]]
+; SCALABLE: [[VECTOR_BODY]]:
+; SCALABLE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; SCALABLE-NEXT: [[TMP11:%.*]] = add i64 [[INDEX]], 0
+; SCALABLE-NEXT: [[TMP12:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP11]]
+; SCALABLE-NEXT: [[TMP13:%.*]] = getelementptr inbounds float, ptr [[TMP12]], i32 0
+; SCALABLE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP13]], align 4
+; SCALABLE-NEXT: [[TMP14:%.*]] = fadd <vscale x 4 x float> [[WIDE_LOAD]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 1.000000e+00, i64 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
+; SCALABLE-NEXT: [[TMP15:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP11]]
+; SCALABLE-NEXT: [[TMP16:%.*]] = getelementptr inbounds float, ptr [[TMP15]], i32 0
+; SCALABLE-NEXT: store <vscale x 4 x float> [[TMP14]], ptr [[TMP16]], align 4
+; SCALABLE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP10]]
+; SCALABLE-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; SCALABLE-NEXT: br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; SCALABLE: [[MIDDLE_BLOCK]]:
+; SCALABLE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
+; SCALABLE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; SCALABLE: [[SCALAR_PH]]:
+; SCALABLE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; SCALABLE-NEXT: br label %[[FOR_BODY:.*]]
+; SCALABLE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; SCALABLE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; SCALABLE: [[FOR_COND_CLEANUP]]:
+; SCALABLE-NEXT: ret void
+; SCALABLE: [[FOR_BODY]]:
+; SCALABLE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; SCALABLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[INDVARS_IV]]
+; SCALABLE-NEXT: [[TMP18:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; SCALABLE-NEXT: [[ADD:%.*]] = fadd float [[TMP18]], 1.000000e+00
+; SCALABLE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[INDVARS_IV]]
+; SCALABLE-NEXT: store float [[ADD]], ptr [[ARRAYIDX2]], align 4
+; SCALABLE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; SCALABLE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; SCALABLE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+;
+; FIXED-LABEL: define void @s000(
+; FIXED-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; FIXED-NEXT: [[ENTRY:.*:]]
+; FIXED-NEXT: [[B2:%.*]] = ptrtoint ptr [[B]] to i64
+; FIXED-NEXT: [[A1:%.*]] = ptrtoint ptr [[A]] to i64
+; FIXED-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N]], 0
+; FIXED-NEXT: br i1 [[CMP6]], label %[[FOR_BODY_PREHEADER:.*]], label %[[FOR_COND_CLEANUP:.*]]
+; FIXED: [[FOR_BODY_PREHEADER]]:
+; FIXED-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+; FIXED-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 16
+; FIXED-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; FIXED: [[VECTOR_MEMCHECK]]:
+; FIXED-NEXT: [[TMP0:%.*]] = sub i64 [[A1]], [[B2]]
+; FIXED-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 64
+; FIXED-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; FIXED: [[VECTOR_PH]]:
+; FIXED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 16
+; FIXED-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
+; FIXED-NEXT: br label %[[VECTOR_BODY:.*]]
+; FIXED: [[VECTOR_BODY]]:
+; FIXED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; FIXED-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0
+; FIXED-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8
+; FIXED-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP1]]
+; FIXED-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP2]]
+; FIXED-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, ptr [[TMP3]], i32 0
+; FIXED-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, ptr [[TMP3]], i32 8
+; FIXED-NEXT: [[WIDE_LOAD:%.*]] = load <8 x float>, ptr [[TMP5]], align 4
+; FIXED-NEXT: [[WIDE_LOAD3:%.*]] = load <8 x float>, ptr [[TMP6]], align 4
+; FIXED-NEXT: [[TMP7:%.*]] = fadd <8 x float> [[WIDE_LOAD]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
+; FIXED-NEXT: [[TMP8:%.*]] = fadd <8 x float> [[WIDE_LOAD3]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
+; FIXED-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP1]]
+; FIXED-NEXT: [[TMP10:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP2]]
+; FIXED-NEXT: [[TMP11:%.*]] = getelementptr inbounds float, ptr [[TMP9]], i32 0
+; FIXED-NEXT: [[TMP12:%.*]] = getelementptr inbounds float, ptr [[TMP9]], i32 8
+; FIXED-NEXT: store <8 x float> [[TMP7]], ptr [[TMP11]], align 4
+; FIXED-NEXT: store <8 x float> [[TMP8]], ptr [[TMP12]], align 4
+; FIXED-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; FIXED-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; FIXED-NEXT: br i1 [[TMP13]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; FIXED: [[MIDDLE_BLOCK]]:
+; FIXED-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
+; FIXED-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; FIXED: [[SCALAR_PH]]:
+; FIXED-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; FIXED-NEXT: br label %[[FOR_BODY:.*]]
+; FIXED: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; FIXED-NEXT: br label %[[FOR_COND_CLEANUP]]
+; FIXED: [[FOR_COND_CLEANUP]]:
+; FIXED-NEXT: ret void
+; FIXED: [[FOR_BODY]]:
+; FIXED-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; FIXED-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[INDVARS_IV]]
+; FIXED-NEXT: [[TMP14:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; FIXED-NEXT: [[ADD:%.*]] = fadd float [[TMP14]], 1.000000e+00
+; FIXED-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[INDVARS_IV]]
+; FIXED-NEXT: store float [[ADD]], ptr [[ARRAYIDX2]], align 4
+; FIXED-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; FIXED-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; FIXED-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+;
+entry:
+ %cmp6 = icmp sgt i32 %n, 0
+ br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:
+ %wide.trip.count = zext nneg i32 %n to i64
+ br label %for.body
+
+for.cond.cleanup:
+ ret void
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds float, ptr %b, i64 %indvars.iv
+ %0 = load float, ptr %arrayidx, align 4
+ %add = fadd float %0, 1.000000e+00
+ %arrayidx2 = getelementptr inbounds float, ptr %a, i64 %indvars.iv
+ store float %add, ptr %arrayidx2, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+;.
+; SCALABLE: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; SCALABLE: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; SCALABLE: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; SCALABLE: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
+; FIXED: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; FIXED: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; FIXED: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; FIXED: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
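For reference, the @s000 function in this test corresponds to the TSVC s000 kernel; a simplified sketch of the source loop (not the exact TSVC code) is:

```cpp
// Simplified s000: each iteration loads b[i], adds 1.0f, and stores the result
// to a[i]. This is the loop the SCALABLE and FIXED check lines above vectorize.
void s000(float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + 1.0f;
}
```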
@llvm/pr-subscribers-backend-risc-v
Author: Pengcheng Wang (wangpc-pp)
(Same summary and full diff as the comment above.)
Makes sense to me
I would prefer this not land without some evidence of profitability on at least some hardware. I tried something like this (though I keyed mine off having an exactly known VLEN from -mrvv-vector-bits=zvl) and glanced at some test deltas. I didn't see much, and what I did see appeared to be easily addressable in the scalable lowering. I'm not really opposed to the idea; I just want to make sure we're not carrying complexity for no value.
Is the cost affected by Zvl*b? I see in the test case that SCALABLE uses LMUL=2 vectorization. For FIXED, we use LMUL=2 registers, but VL=8. If the runtime VLEN is greater than 128, the scalable code will be able to process more elements per iteration. If the microarchitecture doesn't dynamically reduce LMUL based on VL, then the FIXED code will waste resources when VLEN > 128, since all 8 elements will fit in half of an LMUL=2 register.
Doesn't the vectorizer estimate the scalable VF costs based on Zvl*b via getVScaleForTuning to begin with? This would only make a difference as a tie-breaker. For other cases today, the vectorizer might still choose a fixed VF even if the runtime VLEN would make the scalable VF more profitable.
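To make the VLEN concern concrete, here is a small worked comparison of elements processed per vector-loop iteration, assuming RVV's vscale = VLEN/64 and the shapes from the test (a single <vscale x 4 x float> operation for SCALABLE versus two <8 x float> operations for FIXED); the numbers are an illustration, not measurements:

```cpp
#include <cstdio>

int main() {
  // Shapes taken from the test above: SCALABLE processes vscale*4 floats per
  // iteration; FIXED processes 8 floats per op with an interleave count of 2.
  const unsigned FixedElemsPerIter = 8 * 2;
  const unsigned VLENs[] = {128, 256, 512};
  for (unsigned VLEN : VLENs) {
    unsigned VScale = VLEN / 64; // assumes RVV's vscale = VLEN / 64
    unsigned ScalableElemsPerIter = VScale * 4;
    std::printf("VLEN=%u: scalable=%u elems/iter, fixed=%u elems/iter\n", VLEN,
                ScalableElemsPerIter, FixedElemsPerIter);
  }
  // At VLEN=128 the fixed body covers 16 elements per iteration versus 8 for
  // the scalable body; at VLEN=256 they tie at 16; at VLEN=512 the scalable
  // body pulls ahead, which is the trade-off discussed above.
  return 0;
}
```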
Compile TSVC with
I was thinking that we are just compiling for one CPU with |
Can you share actual perf results? And maybe an example where we actually save an instruction? I'm wondering if we have an opportunity to improve the scalable lowering.
I'm glad to, but I just found that the results have some variability (some kernels have large runtime diffs even when there is no binary diff). Do you have a method to make the results stable?
Yeah, please have a look at https://godbolt.org/z/5fshc7Enq. I disabled loop unrolling there to make the assembly comparable.
Looks maybe LSR-related? With LSR disabled, both loops have the same number of instructions.
First impression is that this is a deficiency in LSR term-folding. If you want to file a separate bug for this, I can take a look and see if it's easy to fix. My initial guess is that we're struggling to prove the step non-zero given the slightly complicated expression there. Worth noting: adding -mrvv-vector-bits=zvl to both sides results in exactly the same code, and that code appears reasonable.
I think it is because we are using fixed-vectors with |
In the Compiler Explorer example you posted, we still use scalable types. We just know the exact size of the scalable type.
So that SCEV can analyse the bound of the loop count. This can fix the issue found in llvm#100564.