LV: clamp VF with TC only when scalar epilogue is needed #91253
@@ -252,36 +252,46 @@ for.end: ; preds = %for.body | |
define void @trip16_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 { | ||
; CHECK-LABEL: @trip16_i8( | ||
; CHECK-NEXT: entry: | ||
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8 | ||
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 16, [[TMP1]] | ||
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK: vector.ph: | ||
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 8 | ||
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 16, [[TMP3]] | ||
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 16, [[N_MOD_VF]] | ||
; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 8 | ||
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]] | ||
; CHECK: vector.body: | ||
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ] | ||
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP2]], align 1 | ||
; CHECK-NEXT: [[TMP3:%.*]] = shl <16 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1> | ||
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <16 x i8>, ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[TMP6:%.*]] = add <16 x i8> [[TMP3]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <16 x i8> [[TMP6]], ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16 | ||
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 8 x i8>, ptr [[TMP8]], align 1 | ||
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 8 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 8 x i8> insertelement (<vscale x 8 x i8> poison, i8 1, i64 0), <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer) | ||
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x i8>, ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 8 x i8> [[TMP9]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <vscale x 8 x i8> [[TMP12]], ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]] | ||
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]] | ||
; CHECK: middle.block: | ||
; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 16, [[N_VEC]] | ||
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK: scalar.ph: | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: br label [[FOR_BODY:%.*]] | ||
; CHECK: for.body: | ||
; CHECK-NEXT: [[I_08:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ] | ||
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP7:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP7]], 1 | ||
; CHECK-NEXT: [[TMP13:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP13]], 1 | ||
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP8]] | ||
; CHECK-NEXT: [[TMP14:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP14]] | ||
; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[INC]] = add nuw nsw i64 [[I_08]], 1 | ||
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 16 | ||
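For reference, a minimal sketch (my reconstruction from the CHECK lines above, not the test file's actual source) of the scalar loop these trip-count tests exercise; trip16_i8 runs it for 16 iterations, while trip32_i8 and trip24_i8 below use 32 and 24:

```cpp
#include <cstdint>

// Each iteration loads src[i], shifts it left by one (i.e. multiplies by 2),
// adds dst[i], and stores the result back to dst[i]; the noalias/readonly
// attributes in the IR correspond to the restrict/const qualifiers here.
void trip16_i8(uint8_t *__restrict dst, const uint8_t *__restrict src) {
  for (int64_t i = 0; i < 16; ++i)
    dst[i] = static_cast<uint8_t>((src[i] << 1) + dst[i]);
}
```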
@@ -313,36 +323,46 @@ for.end: ; preds = %for.body | |
define void @trip32_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 { | ||
; CHECK-LABEL: @trip32_i8( | ||
; CHECK-NEXT: entry: | ||
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 16 | ||
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 32, [[TMP1]] | ||
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK: vector.ph: | ||
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 16 | ||
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 32, [[TMP3]] | ||
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 32, [[N_MOD_VF]] | ||
; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]] | ||
; CHECK: vector.body: | ||
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ] | ||
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <32 x i8>, ptr [[TMP2]], align 1 | ||
; CHECK-NEXT: [[TMP3:%.*]] = shl <32 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1> | ||
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <32 x i8>, ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[TMP6:%.*]] = add <32 x i8> [[TMP3]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <32 x i8> [[TMP6]], ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32 | ||
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 16 x i8>, ptr [[TMP8]], align 1 | ||
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 16 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer) | ||
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 16 x i8>, ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 16 x i8> [[TMP9]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <vscale x 16 x i8> [[TMP12]], ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]] | ||
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]] | ||
; CHECK: middle.block: | ||
; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 32, [[N_VEC]] | ||
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK: scalar.ph: | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 32, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: br label [[FOR_BODY:%.*]] | ||
; CHECK: for.body: | ||
; CHECK-NEXT: [[I_08:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ] | ||
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP7:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP7]], 1 | ||
; CHECK-NEXT: [[TMP13:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP13]], 1 | ||
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP8]] | ||
; CHECK-NEXT: [[TMP14:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP14]] | ||
; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[INC]] = add nuw nsw i64 [[I_08]], 1 | ||
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 32 | ||
@@ -373,37 +393,47 @@ for.end: ; preds = %for.body | |
define void @trip24_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 { | ||
; CHECK-LABEL: @trip24_i8( | ||
; CHECK-NEXT: entry: | ||
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4 | ||
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 24, [[TMP1]] | ||
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]] | ||
; CHECK: vector.ph: | ||
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4 | ||
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 24, [[TMP3]] | ||
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 24, [[N_MOD_VF]] | ||
; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4 | ||
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]] | ||
; CHECK: vector.body: | ||
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ] | ||
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1 | ||
; CHECK-NEXT: [[TMP3:%.*]] = shl <8 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1> | ||
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]] | ||
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i8>, ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[TMP6:%.*]] = add <8 x i8> [[TMP3]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <8 x i8> [[TMP6]], ptr [[TMP5]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8 | ||
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24 | ||
; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]] | ||
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0 | ||
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP8]], align 1 | ||
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 4 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer) | ||
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]] | ||
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0 | ||
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x i8>, ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i8> [[TMP9]], [[WIDE_LOAD1]] | ||
; CHECK-NEXT: store <vscale x 4 x i8> [[TMP12]], ptr [[TMP11]], align 1 | ||
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]] | ||
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]] | ||
; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]] | ||
Comment on lines -381 to +422

So, for instance, this test change is profitable: we changed a fixed-width VF of 8 (requiring three iterations) to a scalable VF of 4 (with vscale computed as 6, hence requiring one iteration).

RISC-V uses
That said, specifically this case is not clear:
I'm talking from the hardware we have, but on some targets it might even be better to choose a full vector plus a scalar epilogue.
; CHECK: middle.block: | ||
; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 24, [[N_VEC]] | ||
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]] | ||
; CHECK: scalar.ph: | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 24, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ] | ||
; CHECK-NEXT: br label [[FOR_BODY:%.*]] | ||
; CHECK: for.body: | ||
; CHECK-NEXT: [[I_08:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ] | ||
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP8:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP8]], 1 | ||
; CHECK-NEXT: [[TMP14:%.*]] = load i8, ptr [[ARRAYIDX]], align 1 | ||
; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP14]], 1 | ||
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[I_08]] | ||
; CHECK-NEXT: [[TMP9:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP9]] | ||
; CHECK-NEXT: [[TMP15:%.*]] = load i8, ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP15]] | ||
; CHECK-NEXT: store i8 [[ADD]], ptr [[ARRAYIDX1]], align 1 | ||
; CHECK-NEXT: [[INC]] = add nuw nsw i64 [[I_08]], 1 | ||
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 24 | ||
Are any of the test changes profitable? I'd assume that, at least for this one, vectorizing with VF 16 as done previously is more profitable, as we know that we execute exactly one vector iteration?
Isn't the number of iterations the same, since the fixed-width 16 has now been changed to vscale x 8, where vscale is computed as 2? On targets supporting scalable vectors, I think scalable-vector vectorization is slightly more profitable than fixed-width vectorization (see the profitability function in LV, which slightly prefers scalable over fixed).
Also, consider that this patch has removed an ugly and unnecessary special case: this test output is now in line with the rest of the test outputs in this file.
But we need an extra minimum-iterations check and code to compute the runtime VF. If 16 x i8 fits in a vector register, I'd assume that there's no benefit from using scalable vectors here. It's not my area of expertise; maybe @preames or @nikolaypanchenko could chime in on whether this is desirable.
On RISC-V this minimum-iterations check is going to be expensive, since the register size is unknown at compile time. For instance, this is what is generated before that change:
and after that change:
However, the long-term plan is to enable EVL vectorization for RISC-V. As it doesn't require this extra check, the generated code will be good:
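As a rough illustration of why EVL-style vectorization avoids the guard (my own sketch, not the reviewer's elided RISC-V output; lanesAvailable and processChunk are hypothetical stand-ins): the active vector length is recomputed on every iteration, so any trip count is handled without a separate minimum-iterations check or a scalar epilogue.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for querying the active vector length
// (e.g. what vsetvli provides on RISC-V).
static std::size_t lanesAvailable() { return 16; }

// Scalar stand-in for the vector body: dst[j] += 2 * src[j] for evl elements.
static void processChunk(const unsigned char *src, unsigned char *dst,
                         std::size_t evl) {
  for (std::size_t j = 0; j < evl; ++j)
    dst[j] = static_cast<unsigned char>(dst[j] + (src[j] << 1));
}

// EVL-style loop structure: each iteration processes min(remaining, lanes)
// elements, so no up-front minimum-iterations guard is needed.
void evlLoop(const unsigned char *src, unsigned char *dst, std::size_t n) {
  for (std::size_t i = 0; i < n;) {
    std::size_t evl = std::min(n - i, lanesAvailable());
    processChunk(src + i, dst + i, evl);
    i += evl;
  }
}
```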
Thanks for the explanation! The current code with isPowerOf2_32 seemed wrong to me, purely from a logical standpoint: it is technically the job of isMoreProfitableThan to determine the best VPlan with the best VF, with access to the cost model; the code I removed seemed to be doing an ad-hoc override of the VF, and its existence seems to be an accident of history (see the commit message). However, as EVL isn't enabled for RISC-V today, this change is a regression, and LV isn't ready for this cleanup yet. I will close this PR.
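For context, a paraphrased sketch of the kind of trip-count clamping being discussed (an assumption about its general shape, not the exact LLVM source; clampVFWithTripCount and the helper below are my own names): when the known trip count fits in one vector, the clamp overrides the target's maximum VF with a fixed power-of-two VF no larger than the trip count, instead of leaving the decision to isMoreProfitableThan.

```cpp
#include <cstdint>

// Illustrative helper: largest power of two not exceeding X (X > 0).
static uint64_t bitFloor(uint64_t X) {
  while (X & (X - 1))
    X &= X - 1;
  return X;
}

// MaxVFLanes is the target's widest profitable VF (the known-minimum lane
// count for a scalable VF). The clamp kicks in when the whole trip count fits
// in a single vector iteration, forcing a fixed VF of bitFloor(TC); this PR
// proposed applying the clamp only when a scalar epilogue is actually needed.
uint64_t clampVFWithTripCount(uint64_t MaxTripCount, uint64_t MaxVFLanes,
                              bool FoldTailByMasking) {
  bool TCIsPowerOf2 =
      MaxTripCount != 0 && (MaxTripCount & (MaxTripCount - 1)) == 0;
  if (MaxTripCount != 0 && MaxTripCount <= MaxVFLanes &&
      (!FoldTailByMasking || TCIsPowerOf2))
    return bitFloor(MaxTripCount);
  return MaxVFLanes; // otherwise the cost model picks among the candidate VFs
}
```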