[RISCV] Account for factor in interleave memory op costs (llvm#111511)
Currently we cost an interleaved memory op as if it were a load/store of
the widened vector type, but this undercosts it in all cases when
compared to the measured performance of today's hardware.
On the x280 at NF=2, and on the spacemit-x60 at NF=2, 3, and 4, a
segmented load is carried out as a wide load plus NF LMUL shuffle ops:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput
All other NFs go through a slow path. On the spacemit-x60 this is
proportional to VLMAX * NF, and on the x280 it is proportional to the
number of segments.
This patch increases the cost by implementing a wide load + NF LMUL
shuffle op cost for the lowest common denominator NF=2, and then a
slower cost proportional to VL for the other NFs.
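As a rough sketch of the cost shape described above (the names, constants, and
standalone form are illustrative, not the actual code in
RISCVTargetTransformInfo.cpp):

```cpp
// Hedged model of the new interleaved memory op cost, not the LLVM
// implementation. All parameters are assumptions for illustration.
#include <cstdio>

// nf:              interleave factor (number of fields per segment)
// wideMemOpCost:   cost of one load/store of the widened vector type
// lmulShuffleCost: cost of one LMUL-wide shuffle op
// vl:              number of elements processed (stands in for VLMAX)
unsigned segmentedMemOpCost(unsigned nf, unsigned wideMemOpCost,
                            unsigned lmulShuffleCost, unsigned vl) {
  if (nf == 2)
    // Fast path (the lowest common denominator across the measured
    // cores): one wide memory op plus NF LMUL shuffle ops.
    return wideMemOpCost + nf * lmulShuffleCost;
  // Slow path for the other NFs: a cost proportional to VL, scaled by
  // the factor (assumed form, matching the VLMAX * NF observation).
  return wideMemOpCost + nf * vl;
}

int main() {
  // Compare the modeled cost of the NF=2 fast path against NF=4.
  std::printf("NF=2: %u\n", segmentedMemOpCost(2, 4, 1, 16));
  std::printf("NF=4: %u\n", segmentedMemOpCost(4, 4, 1, 16));
}
```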
In a follow-up patch we can add a tuning flag to use the faster cost
model for NF=3 and 4 on the spacemit-x60.
Note that the FIXME about illegal vectors seems to have been fixed in
llvm#100436.
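The updated test checks below show the strided-access induction setup followed
by the NF=2 shape the cost model now accounts for: a single wide load
deinterleaved into its two fields, which are then extracted, zero-extended,
and scattered.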
; CHECK-NEXT: [[TMP9:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
; CHECK-NEXT: [[TMP10:%.*]] = add <vscale x 4 x i64> [[TMP9]], zeroinitializer
; CHECK-NEXT: [[TMP11:%.*]] = mul <vscale x 4 x i64> [[TMP10]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 2, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP11]]
; CHECK-NEXT: [[TMP12:%.*]] = mul i64 2, [[TMP8]]
; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP12]], i64 0
; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer

; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <vscale x 8 x i8>, ptr [[TMP14]], align 1
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = call { <vscale x 4 x i8>, <vscale x 4 x i8> } @llvm.vector.deinterleave2.nxv8i8(<vscale x 8 x i8> [[WIDE_VEC]])
; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { <vscale x 4 x i8>, <vscale x 4 x i8> } [[STRIDED_VEC]], 0
; CHECK-NEXT: [[TMP16:%.*]] = extractvalue { <vscale x 4 x i8>, <vscale x 4 x i8> } [[STRIDED_VEC]], 1
; CHECK-NEXT: [[TMP17:%.*]] = zext <vscale x 4 x i8> [[TMP16]] to <vscale x 4 x i32>
; CHECK-NEXT: [[TMP18:%.*]] = getelementptr i32, ptr [[DST]], <vscale x 4 x i64> [[VEC_IND]]
; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP17]], <vscale x 4 x ptr> [[TMP18]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))