[RISCV] Reverse default assumption about performance of vlseN.v vd, (rs1), x0 #98205


Merged: 2 commits into llvm:main on Jul 10, 2024

Conversation

@preames (Collaborator) commented Jul 9, 2024

Some cores implement an optimization for a strided load with an x0 stride, which results in fewer memory operations being performed than implied by VL, since all of the addresses are the same. This appears to be true for only a minority of available implementations: we know that sifive-x280 optimizes this case, but sifive-p670 and spacemit-x60 both do not.

(To be more precise, measurements on the x60 appear to indicate that a stride of x0 has similar latency to a non-zero stride, and that both are about twice the latency of a vleN.v. I'm taking this to mean the x0 case is not optimized.)

We had an existing flag by which a processor could opt out of this assumption, but it had no upstream users. Instead of adding that flag to the p670 and x60, this patch reverses the default and adds the opt-in flag only to the x280.
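
For illustration, here is the kind of lowering difference this tune flag controls for a broadcast of a loaded scalar. This is a sketch based on the gather_const_v8f16 test updated below; the registers and the 10-byte offset come from that test.

    # With +optimized-zero-stride-load (e.g. sifive-x280): keep the zero-stride strided load.
    addi     a1, a0, 10
    vsetivli zero, 8, e16, m1, ta, ma
    vlse16.v v8, (a1), zero        # stride register is x0, so every element reads the same address

    # New default (feature not set): load the scalar once, then splat it.
    flh      fa5, 10(a0)
    vsetivli zero, 8, e16, m1, ta, ma
    vfmv.v.f v8, fa5               # broadcast the scalar into all 8 elements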

@llvmbot (Member) commented Jul 9, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes



Patch is 128.75 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98205.diff

21 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVFeatures.td (+3-3)
  • (modified) llvm/lib/Target/RISCV/RISCVProcessors.td (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll (+150-96)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll (+39-18)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-vrgather.ll (+12-11)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll (+271-216)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-vrgather.ll (+52-36)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-mask-buildvec.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+190-174)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll (+42-36)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-store-asm.ll (+26-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwadd.ll (+12-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwaddu.ll (+12-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmulsu.ll (+2-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwsub.ll (+18-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwsubu.ll (+18-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfma-vp-combine.ll (+10-10)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vreductions-fp-sdnode.ll (+3-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-insert-crossbb.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsplats-fp.ll (+8-8)
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index e2a8fb485850f..b96465bbe3e50 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1264,9 +1264,9 @@ def FeaturePredictableSelectIsExpensive
     : SubtargetFeature<"predictable-select-expensive", "PredictableSelectIsExpensive", "true",
                        "Prefer likely predicted branches over selects">;
 
-def TuneNoOptimizedZeroStrideLoad
-   : SubtargetFeature<"no-optimized-zero-stride-load", "HasOptimizedZeroStrideLoad",
-                      "false", "Hasn't optimized (perform fewer memory operations)"
+def TuneOptimizedZeroStrideLoad
+   : SubtargetFeature<"optimized-zero-stride-load", "HasOptimizedZeroStrideLoad",
+                      "true", "optimized (perform fewer memory operations)"
                       "zero-stride vector load">;
 
 def Experimental
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 13a2491116b5d..6eed2ae01f646 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -231,7 +231,8 @@ def SIFIVE_X280 : RISCVProcessorModel<"sifive-x280", SiFive7Model,
                                        FeatureStdExtZbb],
                                       [TuneSiFive7,
                                        FeaturePostRAScheduler,
-                                       TuneDLenFactor2]>;
+                                       TuneDLenFactor2,
+                                       TuneOptimizedZeroStrideLoad]>;
 
 def SIFIVE_P450 : RISCVProcessorModel<"sifive-p450", SiFiveP400Model,
                                       [Feature64Bit,
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
index eb7f6b1bb6540..26ed4595ca758 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
@@ -1137,37 +1137,67 @@ define <32 x double> @buildvec_v32f64(double %e0, double %e1, double %e2, double
 define <32 x double> @buildvec_v32f64_exact_vlen(double %e0, double %e1, double %e2, double %e3, double %e4, double %e5, double %e6, double %e7, double %e8, double %e9, double %e10, double %e11, double %e12, double %e13, double %e14, double %e15, double %e16, double %e17, double %e18, double %e19, double %e20, double %e21, double %e22, double %e23, double %e24, double %e25, double %e26, double %e27, double %e28, double %e29, double %e30, double %e31) vscale_range(2,2) {
 ; RV32-LABEL: buildvec_v32f64_exact_vlen:
 ; RV32:       # %bb.0:
-; RV32-NEXT:    addi sp, sp, -32
-; RV32-NEXT:    .cfi_def_cfa_offset 32
-; RV32-NEXT:    fsd fs0, 24(sp) # 8-byte Folded Spill
-; RV32-NEXT:    fsd fs1, 16(sp) # 8-byte Folded Spill
+; RV32-NEXT:    addi sp, sp, -112
+; RV32-NEXT:    .cfi_def_cfa_offset 112
+; RV32-NEXT:    fsd fs0, 104(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs1, 96(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs2, 88(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs3, 80(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs4, 72(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs5, 64(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs6, 56(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs7, 48(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs8, 40(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs9, 32(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs10, 24(sp) # 8-byte Folded Spill
+; RV32-NEXT:    fsd fs11, 16(sp) # 8-byte Folded Spill
 ; RV32-NEXT:    .cfi_offset fs0, -8
 ; RV32-NEXT:    .cfi_offset fs1, -16
+; RV32-NEXT:    .cfi_offset fs2, -24
+; RV32-NEXT:    .cfi_offset fs3, -32
+; RV32-NEXT:    .cfi_offset fs4, -40
+; RV32-NEXT:    .cfi_offset fs5, -48
+; RV32-NEXT:    .cfi_offset fs6, -56
+; RV32-NEXT:    .cfi_offset fs7, -64
+; RV32-NEXT:    .cfi_offset fs8, -72
+; RV32-NEXT:    .cfi_offset fs9, -80
+; RV32-NEXT:    .cfi_offset fs10, -88
+; RV32-NEXT:    .cfi_offset fs11, -96
 ; RV32-NEXT:    sw a6, 8(sp)
 ; RV32-NEXT:    sw a7, 12(sp)
-; RV32-NEXT:    fld ft4, 8(sp)
+; RV32-NEXT:    fld ft6, 8(sp)
 ; RV32-NEXT:    sw a4, 8(sp)
 ; RV32-NEXT:    sw a5, 12(sp)
-; RV32-NEXT:    fld ft5, 8(sp)
+; RV32-NEXT:    fld ft7, 8(sp)
 ; RV32-NEXT:    sw a2, 8(sp)
 ; RV32-NEXT:    sw a3, 12(sp)
-; RV32-NEXT:    fld ft6, 8(sp)
+; RV32-NEXT:    fld ft8, 8(sp)
 ; RV32-NEXT:    sw a0, 8(sp)
 ; RV32-NEXT:    sw a1, 12(sp)
-; RV32-NEXT:    fld ft7, 8(sp)
-; RV32-NEXT:    fld ft0, 184(sp)
-; RV32-NEXT:    fld ft1, 168(sp)
-; RV32-NEXT:    fld ft2, 152(sp)
-; RV32-NEXT:    fld ft3, 136(sp)
-; RV32-NEXT:    fld ft8, 120(sp)
-; RV32-NEXT:    fld ft9, 104(sp)
-; RV32-NEXT:    fld ft10, 72(sp)
-; RV32-NEXT:    fld ft11, 88(sp)
-; RV32-NEXT:    fld fs0, 56(sp)
-; RV32-NEXT:    fld fs1, 40(sp)
+; RV32-NEXT:    fld ft9, 8(sp)
+; RV32-NEXT:    fld ft0, 264(sp)
+; RV32-NEXT:    fld ft1, 256(sp)
+; RV32-NEXT:    fld ft2, 248(sp)
+; RV32-NEXT:    fld ft3, 240(sp)
+; RV32-NEXT:    fld ft4, 232(sp)
+; RV32-NEXT:    fld ft5, 224(sp)
+; RV32-NEXT:    fld ft10, 216(sp)
+; RV32-NEXT:    fld ft11, 208(sp)
+; RV32-NEXT:    fld fs0, 200(sp)
+; RV32-NEXT:    fld fs1, 192(sp)
+; RV32-NEXT:    fld fs2, 184(sp)
+; RV32-NEXT:    fld fs3, 176(sp)
+; RV32-NEXT:    fld fs4, 152(sp)
+; RV32-NEXT:    fld fs5, 144(sp)
+; RV32-NEXT:    fld fs6, 168(sp)
+; RV32-NEXT:    fld fs7, 160(sp)
+; RV32-NEXT:    fld fs8, 136(sp)
+; RV32-NEXT:    fld fs9, 128(sp)
+; RV32-NEXT:    fld fs10, 120(sp)
+; RV32-NEXT:    fld fs11, 112(sp)
 ; RV32-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
-; RV32-NEXT:    vfmv.v.f v8, ft7
-; RV32-NEXT:    vfslide1down.vf v12, v8, ft6
+; RV32-NEXT:    vfmv.v.f v8, ft9
+; RV32-NEXT:    vfslide1down.vf v12, v8, ft8
 ; RV32-NEXT:    vfmv.v.f v8, fa2
 ; RV32-NEXT:    vfslide1down.vf v9, v8, fa3
 ; RV32-NEXT:    vfmv.v.f v8, fa0
@@ -1176,55 +1206,71 @@ define <32 x double> @buildvec_v32f64_exact_vlen(double %e0, double %e1, double
 ; RV32-NEXT:    vfslide1down.vf v10, v10, fa5
 ; RV32-NEXT:    vfmv.v.f v11, fa6
 ; RV32-NEXT:    vfslide1down.vf v11, v11, fa7
-; RV32-NEXT:    addi a0, sp, 32
-; RV32-NEXT:    vlse64.v v14, (a0), zero
-; RV32-NEXT:    addi a0, sp, 48
-; RV32-NEXT:    vlse64.v v15, (a0), zero
-; RV32-NEXT:    vfmv.v.f v13, ft5
-; RV32-NEXT:    vfslide1down.vf v13, v13, ft4
-; RV32-NEXT:    vfslide1down.vf v14, v14, fs1
-; RV32-NEXT:    vfslide1down.vf v15, v15, fs0
-; RV32-NEXT:    addi a0, sp, 80
-; RV32-NEXT:    vlse64.v v16, (a0), zero
-; RV32-NEXT:    addi a0, sp, 64
-; RV32-NEXT:    vlse64.v v18, (a0), zero
-; RV32-NEXT:    addi a0, sp, 96
-; RV32-NEXT:    vlse64.v v19, (a0), zero
-; RV32-NEXT:    addi a0, sp, 112
-; RV32-NEXT:    vlse64.v v20, (a0), zero
-; RV32-NEXT:    vfslide1down.vf v17, v16, ft11
-; RV32-NEXT:    vfslide1down.vf v16, v18, ft10
-; RV32-NEXT:    vfslide1down.vf v18, v19, ft9
-; RV32-NEXT:    vfslide1down.vf v19, v20, ft8
-; RV32-NEXT:    addi a0, sp, 128
-; RV32-NEXT:    vlse64.v v20, (a0), zero
-; RV32-NEXT:    addi a0, sp, 144
-; RV32-NEXT:    vlse64.v v21, (a0), zero
-; RV32-NEXT:    addi a0, sp, 160
-; RV32-NEXT:    vlse64.v v22, (a0), zero
-; RV32-NEXT:    addi a0, sp, 176
-; RV32-NEXT:    vlse64.v v23, (a0), zero
-; RV32-NEXT:    vfslide1down.vf v20, v20, ft3
-; RV32-NEXT:    vfslide1down.vf v21, v21, ft2
-; RV32-NEXT:    vfslide1down.vf v22, v22, ft1
+; RV32-NEXT:    vfmv.v.f v13, ft7
+; RV32-NEXT:    vfslide1down.vf v13, v13, ft6
+; RV32-NEXT:    vfmv.v.f v14, fs11
+; RV32-NEXT:    vfslide1down.vf v14, v14, fs10
+; RV32-NEXT:    vfmv.v.f v15, fs9
+; RV32-NEXT:    vfslide1down.vf v15, v15, fs8
+; RV32-NEXT:    vfmv.v.f v16, fs7
+; RV32-NEXT:    vfslide1down.vf v17, v16, fs6
+; RV32-NEXT:    vfmv.v.f v16, fs5
+; RV32-NEXT:    vfslide1down.vf v16, v16, fs4
+; RV32-NEXT:    vfmv.v.f v18, fs3
+; RV32-NEXT:    vfslide1down.vf v18, v18, fs2
+; RV32-NEXT:    vfmv.v.f v19, fs1
+; RV32-NEXT:    vfslide1down.vf v19, v19, fs0
+; RV32-NEXT:    vfmv.v.f v20, ft11
+; RV32-NEXT:    vfslide1down.vf v20, v20, ft10
+; RV32-NEXT:    vfmv.v.f v21, ft5
+; RV32-NEXT:    vfslide1down.vf v21, v21, ft4
+; RV32-NEXT:    vfmv.v.f v22, ft3
+; RV32-NEXT:    vfslide1down.vf v22, v22, ft2
+; RV32-NEXT:    vfmv.v.f v23, ft1
 ; RV32-NEXT:    vfslide1down.vf v23, v23, ft0
-; RV32-NEXT:    fld fs0, 24(sp) # 8-byte Folded Reload
-; RV32-NEXT:    fld fs1, 16(sp) # 8-byte Folded Reload
-; RV32-NEXT:    addi sp, sp, 32
+; RV32-NEXT:    fld fs0, 104(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs1, 96(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs2, 88(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs3, 80(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs4, 72(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs5, 64(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs6, 56(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs7, 48(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs8, 40(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs9, 32(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs10, 24(sp) # 8-byte Folded Reload
+; RV32-NEXT:    fld fs11, 16(sp) # 8-byte Folded Reload
+; RV32-NEXT:    addi sp, sp, 112
 ; RV32-NEXT:    ret
 ;
 ; RV64-LABEL: buildvec_v32f64_exact_vlen:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    addi sp, sp, -32
-; RV64-NEXT:    .cfi_def_cfa_offset 32
-; RV64-NEXT:    fsd fs0, 24(sp) # 8-byte Folded Spill
-; RV64-NEXT:    fsd fs1, 16(sp) # 8-byte Folded Spill
-; RV64-NEXT:    fsd fs2, 8(sp) # 8-byte Folded Spill
-; RV64-NEXT:    fsd fs3, 0(sp) # 8-byte Folded Spill
+; RV64-NEXT:    addi sp, sp, -96
+; RV64-NEXT:    .cfi_def_cfa_offset 96
+; RV64-NEXT:    fsd fs0, 88(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs1, 80(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs2, 72(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs3, 64(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs4, 56(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs5, 48(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs6, 40(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs7, 32(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs8, 24(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs9, 16(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs10, 8(sp) # 8-byte Folded Spill
+; RV64-NEXT:    fsd fs11, 0(sp) # 8-byte Folded Spill
 ; RV64-NEXT:    .cfi_offset fs0, -8
 ; RV64-NEXT:    .cfi_offset fs1, -16
 ; RV64-NEXT:    .cfi_offset fs2, -24
 ; RV64-NEXT:    .cfi_offset fs3, -32
+; RV64-NEXT:    .cfi_offset fs4, -40
+; RV64-NEXT:    .cfi_offset fs5, -48
+; RV64-NEXT:    .cfi_offset fs6, -56
+; RV64-NEXT:    .cfi_offset fs7, -64
+; RV64-NEXT:    .cfi_offset fs8, -72
+; RV64-NEXT:    .cfi_offset fs9, -80
+; RV64-NEXT:    .cfi_offset fs10, -88
+; RV64-NEXT:    .cfi_offset fs11, -96
 ; RV64-NEXT:    fmv.d.x ft4, a7
 ; RV64-NEXT:    fmv.d.x ft5, a6
 ; RV64-NEXT:    fmv.d.x ft6, a5
@@ -1233,14 +1279,22 @@ define <32 x double> @buildvec_v32f64_exact_vlen(double %e0, double %e1, double
 ; RV64-NEXT:    fmv.d.x ft9, a2
 ; RV64-NEXT:    fmv.d.x ft10, a1
 ; RV64-NEXT:    fmv.d.x ft11, a0
-; RV64-NEXT:    fld ft0, 152(sp)
-; RV64-NEXT:    fld ft1, 136(sp)
-; RV64-NEXT:    fld ft2, 120(sp)
-; RV64-NEXT:    fld ft3, 104(sp)
-; RV64-NEXT:    fld fs0, 88(sp)
-; RV64-NEXT:    fld fs1, 72(sp)
-; RV64-NEXT:    fld fs2, 40(sp)
-; RV64-NEXT:    fld fs3, 56(sp)
+; RV64-NEXT:    fld ft0, 216(sp)
+; RV64-NEXT:    fld ft1, 208(sp)
+; RV64-NEXT:    fld ft2, 200(sp)
+; RV64-NEXT:    fld ft3, 192(sp)
+; RV64-NEXT:    fld fs0, 184(sp)
+; RV64-NEXT:    fld fs1, 176(sp)
+; RV64-NEXT:    fld fs2, 168(sp)
+; RV64-NEXT:    fld fs3, 160(sp)
+; RV64-NEXT:    fld fs4, 152(sp)
+; RV64-NEXT:    fld fs5, 144(sp)
+; RV64-NEXT:    fld fs6, 136(sp)
+; RV64-NEXT:    fld fs7, 128(sp)
+; RV64-NEXT:    fld fs8, 104(sp)
+; RV64-NEXT:    fld fs9, 96(sp)
+; RV64-NEXT:    fld fs10, 120(sp)
+; RV64-NEXT:    fld fs11, 112(sp)
 ; RV64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; RV64-NEXT:    vfmv.v.f v8, fa2
 ; RV64-NEXT:    vfslide1down.vf v9, v8, fa3
@@ -1258,35 +1312,35 @@ define <32 x double> @buildvec_v32f64_exact_vlen(double %e0, double %e1, double
 ; RV64-NEXT:    vfslide1down.vf v14, v14, ft6
 ; RV64-NEXT:    vfmv.v.f v15, ft5
 ; RV64-NEXT:    vfslide1down.vf v15, v15, ft4
-; RV64-NEXT:    addi a0, sp, 48
-; RV64-NEXT:    vlse64.v v16, (a0), zero
-; RV64-NEXT:    addi a0, sp, 32
-; RV64-NEXT:    vlse64.v v18, (a0), zero
-; RV64-NEXT:    addi a0, sp, 64
-; RV64-NEXT:    vlse64.v v19, (a0), zero
-; RV64-NEXT:    addi a0, sp, 80
-; RV64-NEXT:    vlse64.v v20, (a0), zero
-; RV64-NEXT:    vfslide1down.vf v17, v16, fs3
-; RV64-NEXT:    vfslide1down.vf v16, v18, fs2
-; RV64-NEXT:    vfslide1down.vf v18, v19, fs1
-; RV64-NEXT:    vfslide1down.vf v19, v20, fs0
-; RV64-NEXT:    addi a0, sp, 96
-; RV64-NEXT:    vlse64.v v20, (a0), zero
-; RV64-NEXT:    addi a0, sp, 112
-; RV64-NEXT:    vlse64.v v21, (a0), zero
-; RV64-NEXT:    addi a0, sp, 128
-; RV64-NEXT:    vlse64.v v22, (a0), zero
-; RV64-NEXT:    addi a0, sp, 144
-; RV64-NEXT:    vlse64.v v23, (a0), zero
-; RV64-NEXT:    vfslide1down.vf v20, v20, ft3
-; RV64-NEXT:    vfslide1down.vf v21, v21, ft2
-; RV64-NEXT:    vfslide1down.vf v22, v22, ft1
+; RV64-NEXT:    vfmv.v.f v16, fs11
+; RV64-NEXT:    vfslide1down.vf v17, v16, fs10
+; RV64-NEXT:    vfmv.v.f v16, fs9
+; RV64-NEXT:    vfslide1down.vf v16, v16, fs8
+; RV64-NEXT:    vfmv.v.f v18, fs7
+; RV64-NEXT:    vfslide1down.vf v18, v18, fs6
+; RV64-NEXT:    vfmv.v.f v19, fs5
+; RV64-NEXT:    vfslide1down.vf v19, v19, fs4
+; RV64-NEXT:    vfmv.v.f v20, fs3
+; RV64-NEXT:    vfslide1down.vf v20, v20, fs2
+; RV64-NEXT:    vfmv.v.f v21, fs1
+; RV64-NEXT:    vfslide1down.vf v21, v21, fs0
+; RV64-NEXT:    vfmv.v.f v22, ft3
+; RV64-NEXT:    vfslide1down.vf v22, v22, ft2
+; RV64-NEXT:    vfmv.v.f v23, ft1
 ; RV64-NEXT:    vfslide1down.vf v23, v23, ft0
-; RV64-NEXT:    fld fs0, 24(sp) # 8-byte Folded Reload
-; RV64-NEXT:    fld fs1, 16(sp) # 8-byte Folded Reload
-; RV64-NEXT:    fld fs2, 8(sp) # 8-byte Folded Reload
-; RV64-NEXT:    fld fs3, 0(sp) # 8-byte Folded Reload
-; RV64-NEXT:    addi sp, sp, 32
+; RV64-NEXT:    fld fs0, 88(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs1, 80(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs2, 72(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs3, 64(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs4, 56(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs5, 48(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs6, 40(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs7, 32(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs8, 24(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs9, 16(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs10, 8(sp) # 8-byte Folded Reload
+; RV64-NEXT:    fld fs11, 0(sp) # 8-byte Folded Reload
+; RV64-NEXT:    addi sp, sp, 96
 ; RV64-NEXT:    ret
   %v0 = insertelement <32 x double> poison, double %e0, i64 0
   %v1 = insertelement <32 x double> %v0, double %e1, i64 1
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll
index 6408402ef787f..958321f6c46d3 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zvfhmin,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zvfhmin,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
+; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32,RV32-ZVFH
+; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64,RV64-ZVFH
+; RUN: llc -mtriple=riscv32 -mattr=+d,+zvfhmin,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32,RV32-ZVFHMIN
+; RUN: llc -mtriple=riscv64 -mattr=+d,+zvfhmin,+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64,RV64-ZVFHMIN
 
 define <4 x half> @shuffle_v4f16(<4 x half> %x, <4 x half> %y) {
 ; CHECK-LABEL: shuffle_v4f16:
@@ -110,13 +110,13 @@ define <4 x double> @vrgather_shuffle_xv_v4f64(<4 x double> %x) {
 ; CHECK-LABEL: vrgather_shuffle_xv_v4f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    lui a0, %hi(.LCPI7_0)
-; CHECK-NEXT:    addi a0, a0, %lo(.LCPI7_0)
+; CHECK-NEXT:    fld fa5, %lo(.LCPI7_0)(a0)
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
-; CHECK-NEXT:    vlse64.v v10, (a0), zero
-; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vid.v v10
+; CHECK-NEXT:    vrsub.vi v12, v10, 4
 ; CHECK-NEXT:    vmv.v.i v0, 12
-; CHECK-NEXT:    vrsub.vi v12, v12, 4
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m2, ta, mu
+; CHECK-NEXT:    vfmv.v.f v10, fa5
 ; CHECK-NEXT:    vrgatherei16.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -128,14 +128,14 @@ define <4 x double> @vrgather_shuffle_vx_v4f64(<4 x double> %x) {
 ; CHECK-LABEL: vrgather_shuffle_vx_v4f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
-; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vid.v v10
 ; CHECK-NEXT:    lui a0, %hi(.LCPI8_0)
-; CHECK-NEXT:    addi a0, a0, %lo(.LCPI8_0)
-; CHECK-NEXT:    vlse64.v v10, (a0), zero
+; CHECK-NEXT:    fld fa5, %lo(.LCPI8_0)(a0)
 ; CHECK-NEXT:    li a0, 3
+; CHECK-NEXT:    vmul.vx v12, v10, a0
 ; CHECK-NEXT:    vmv.v.i v0, 3
-; CHECK-NEXT:    vmul.vx v12, v12, a0
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m2, ta, mu
+; CHECK-NEXT:    vfmv.v.f v10, fa5
 ; CHECK-NEXT:    vrgatherei16.vv v10, v8, v12, v0.t
 ; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
@@ -298,12 +298,33 @@ define <4 x half> @vrgather_shuffle_vv_v4f16(<4 x half> %x, <4 x half> %y) {
 }
 
 define <4 x half> @vrgather_shuffle_vx_v4f16_load(ptr %p) {
-; CHECK-LABEL: vrgather_shuffle_vx_v4f16_load:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, 2
-; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
-; CHECK-NEXT:    vlse16.v v8, (a0), zero
-; CHECK-NEXT:    ret
+; RV32-ZVFH-LABEL: vrgather_shuffle_vx_v4f16_load:
+; RV32-ZVFH:       # %bb.0:
+; RV32-ZVFH-NEXT:    flh fa5, 2(a0)
+; RV32-ZVFH-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; RV32-ZVFH-NEXT:    vfmv.v.f v8, fa5
+; RV32-ZVFH-NEXT:    ret
+;
+; RV64-ZVFH-LABEL: vrgather_shuffle_vx_v4f16_load:
+; RV64-ZVFH:       # %bb.0:
+; RV64-ZVFH-NEXT:    flh fa5, 2(a0)
+; RV64-ZVFH-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; RV64-ZVFH-NEXT:    vfmv.v.f v8, fa5
+; RV64-ZVFH-NEXT:    ret
+;
+; RV32-ZVFHMIN-LABEL: vrgather_shuffle_vx_v4f16_load:
+; RV32-ZVFHMIN:       # %bb.0:
+; RV32-ZVFHMIN-NEXT:    lh a0, 2(a0)
+; RV32-ZVFHMIN-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; RV32-ZVFHMIN-NEXT:    vmv.v.x v8, a0
+; RV32-ZVFHMIN-NEXT:    ret
+;
+; RV64-ZVFHMIN-LABEL: vrgather_shuffle_vx_v4f16_load:
+; RV64-ZVFHMIN:       # %bb.0:
+; RV64-ZVFHMIN-NEXT:    lh a0, 2(a0)
+; RV64-ZVFHMIN-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; RV64-ZVFHMIN-NEXT:    vmv.v.x v8, a0
+; RV64-ZVFHMIN-NEXT:    ret
   %v = load <4 x half>, ptr %p
   %s = shufflevector <4 x half> %v, <4 x half> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
   ret <4 x half> %s
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-vrgather.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-vrgather.ll
index de7dfab1dfcff..58b0a17cdccd6 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-vrgather.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-vrgather.ll
@@ -5,9 +5,9 @@
 define void @gather_const_v8f16(ptr %x) {
 ; CHECK-LABEL: gather_const_v8f16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a1, a0, 10
+; CHECK-NEXT:    flh fa5, 10(a0)
 ; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; CHECK-NEXT:    vlse16.v v8, (a1), zero
+; CHECK-NEXT:    vfmv.v.f v8, fa5
 ; CHECK-NEXT:    vse16.v v8, (a0)
 ; CHECK-NEXT:    ret
   %a = load <8 x half>, ptr %x
@@ -21,9 +21,9 @@ define void @gather_const_v8f16(ptr %x) {
 define void @gather_const_v4f32(ptr %x) {
 ; CHECK-LABEL: gather_const_v4f32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a1, a0, 8
+; CHECK-NEXT:    flw fa5, 8(a0)
 ; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT: ...
[truncated]

@topperc (Collaborator) left a comment

LGTM

"false", "Hasn't optimized (perform fewer memory operations)"
def TuneOptimizedZeroStrideLoad
: SubtargetFeature<"optimized-zero-stride-load", "HasOptimizedZeroStrideLoad",
"true", "optimized (perform fewer memory operations)"
An inline comment from a contributor on the hunk quoted above:

optimized -> Optimized

@lukel97 (Contributor) left a comment

I can also reproduce the spacemit-x60 not having optimised x0-stride vlses.
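
For reference, a rough way to check this on hardware is to time a loop of zero-stride loads and compare it with the same loop using a unit-stride vleN.v. The snippet below is a hypothetical micro-benchmark sketch, not part of this PR; it assumes a0 holds a valid, readable buffer address and a1 an iteration count.

    vsetivli zero, 4, e64, m2, ta, ma
    rdcycle  t0                    # cycle counter before the loop
1:
    vlse64.v v8, (a0), zero        # zero-stride load under test
    addi     a1, a1, -1
    bnez     a1, 1b
    rdcycle  t1
    sub      t2, t1, t0            # t2 = cycles spent on the zero-stride variant
    # Run the same loop with "vle64.v v8, (a0)" instead and compare the two cycle counts.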

@preames merged commit b5657d6 into llvm:main on Jul 10, 2024
4 of 6 checks passed
@preames deleted the pr-optimized-strided-load-default branch on July 10, 2024 at 14:36
yetingk pushed a commit to yetingk/llvm-project that referenced this pull request Jul 12, 2024
This is a recommit of llvm#98140. It should be based on llvm#98205, which changes the feature controlling the hardware zero-stride optimization.

It is a patch similar to a214c52 for vp.stride.load. Some targets prefer the pattern (vmv.v.x (load)) over a vlse with zero stride.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
…rs1), x0 (llvm#98205)

yetingk added a commit that referenced this pull request Jul 15, 2024
…98579)

This is a recommit of #98140. The old commit should be rebased on #98205, which changes the feature controlling the hardware zero-stride optimization.

It is a patch similar to a214c52 for vp.stride.load. Some targets prefer the pattern (vmv.v.x (load)) over a vlse with zero stride.
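
The (vmv.v.x (load)) pattern mentioned here corresponds to sequences like the following sketch, matching the ZVFHMIN output in this PR's tests; the registers are illustrative.

    # Zero-stride strided-load form:
    vsetivli zero, 4, e16, mf2, ta, ma
    vlse16.v v8, (a0), zero

    # Scalar load + splat form, preferred on targets without the zero-stride optimization:
    lh       a1, 0(a0)
    vsetivli zero, 4, e16, mf2, ta, ma
    vmv.v.x  v8, a1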