[LV] Disable fold tail by masking - when induction vars used outside #81609
@llvm/pr-subscribers-llvm-transforms
Author: Niwin Anto (niwinanto)
Changes
When induction variables are used outside the loop body, tail folding by masking miscompiles.
Full diff: https://github.com/llvm/llvm-project/pull/81609.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 37a356c43e29a4..d33743e74cbe31 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1552,6 +1552,19 @@ bool LoopVectorizationLegality::prepareToFoldTailByMasking() {
     }
   }
+  for (const auto &Entry : getInductionVars()) {
+    PHINode *OrigPhi = Entry.first;
+    for (User *U : OrigPhi->users()) {
+      auto *UI = cast<Instruction>(U);
+      if (!TheLoop->contains(UI)) {
+        LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking, loop IV has an "
+                             "outside user for "
+                          << *UI << "\n");
+        return false;
+      }
+    }
+  }
+
   // The list of pointers that we can safely read and write to remains empty.
   SmallPtrSet<Value *, 8> SafePointers;
diff --git a/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll b/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll
new file mode 100644
index 00000000000000..f7379df934bd77
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll
@@ -0,0 +1,85 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -passes=loop-vectorize -S | FileCheck %s
+
+
+; #include <stdio.h>
+; #define SIZE 17
+;
+; unsigned char result;
+; unsigned char arr_1[SIZE];
+;
+; __attribute__((__noinline__))
+; void test(int limit, unsigned char val, int arr_2[SIZE][SIZE][SIZE]) {
+; #pragma clang loop vectorize_predicate(enable)
+; for (short i_5 = 0; i_5 < limit; i_5++) {
+; arr_1 [i_5] = val;
+; result = arr_2[0][0][i_5] != arr_2[i_5][i_5][0];
+; }
+; }
+;
+;int main(void) {
+; int arr_2[SIZE][SIZE][SIZE];
+;
+; __builtin_memset(arr_2, 1, sizeof(arr_2));
+;
+; test(SIZE, 0, arr_2);
+; printf("%hu \n", result);
+;}
+; clang miss-compiles the above code
+; with vectorize_predicate(enable), result is 0 and 1 without.
+
+
+@result = global i8 0, align 1
+@arr_17 = global [17 x i8] zeroinitializer, align 1
+@a = external global i8, align 1
+
+define void @test(i32 %limit, i8 zeroext %val, ptr readonly %arr_14) {
+; CHECK-LABEL: @test(
+; CHECK-NOT: pred.store.if:
+; CHECK-NOT: pred.store.continue:
+;
+entry:
+ %cmp18 = icmp sgt i32 %limit, 0
+ br i1 %cmp18, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader: ; preds = %entry
+ br label %for.body
+
+for.cond.for.cond.cleanup_crit_edge: ; preds = %for.body
+ %conv20.lcssa = phi i32 [ %conv20, %for.body ]
+ %arrayidx4 = getelementptr inbounds [17 x i32], ptr %arr_14, i32 0, i32 %conv20.lcssa
+ %0 = load i32, ptr %arrayidx4, align 4, !tbaa !4
+ %arrayidx8 = getelementptr inbounds [17 x [17 x i32]], ptr %arr_14, i32 %conv20.lcssa, i32 %conv20.lcssa
+ %1 = load i32, ptr %arrayidx8, align 4, !tbaa !4
+ %cmp10 = icmp ne i32 %0, %1
+ %conv11 = zext i1 %cmp10 to i8
+ store i8 %conv11, ptr @result, align 1, !tbaa !8
+ br label %for.cond.cleanup
+
+for.cond.cleanup: ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
+ ret void
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
+ %i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+ %arrayidx = getelementptr inbounds [17 x i8], ptr @arr_17, i32 0, i32 %conv20
+ store i8 %val, ptr %arrayidx, align 1, !tbaa !8
+ %inc = add i16 %i_5.019, 1
+ %conv = sext i16 %inc to i32
+ %cmp = icmp slt i32 %conv, %limit
+ br i1 %cmp, label %for.body, label %for.cond.for.cond.cleanup_crit_edge, !llvm.loop !9
+}
+
+
+
+!4 = !{!5, !5, i64 0}
+!5 = !{!"int", !6, i64 0}
+!6 = !{!"omnipotent char", !7, i64 0}
+!7 = !{!"Simple C++ TBAA"}
+!8 = !{!6, !6, i64 0}
+!9 = distinct !{!9, !10, !11, !12, !13, !14}
+!10 = !{!"llvm.loop.mustprogress"}
+!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
+!12 = !{!"llvm.loop.vectorize.width", i32 2}
+!13 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
+!14 = !{!"llvm.loop.vectorize.enable", i1 true}
Thanks for the patch!
Could you add the test as a separate PR (with a FIXME)? This patch would then just adjust the test, so the diff shows only the change in the test.
Previously there was a patch shared here https://reviews.llvm.org/D115109 by @rickyz (hope it's the same as on Phabricator), but the patch never got pushed through. It would be good to look at the comments there and potentially pick it up.
br label %for.body

for.cond.for.cond.cleanup_crit_edge: ; preds = %for.body
%conv20.lcssa = phi i32 [ %conv20, %for.body ]
I think the test can be simplified by just returning %conv20.lcssa here.
ret void

for.body: ; preds = %for.body.preheader, %for.body
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
Does the issue reproduce if all uses of %conv20 are replaced by i_5.019?
for.body: ; preds = %for.body.preheader, %for.body
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
%i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
Can the phi be changed to i32, so the sext in the loop isn't needed?
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
%i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
%arrayidx = getelementptr inbounds [17 x i8], ptr @arr_17, i32 0, i32 %conv20
store i8 %val, ptr %arrayidx, align 1, !tbaa !8
nit: remove !tbaa metadata
; CHECK-NOT: pred.store.continue:
;
entry:
%cmp18 = icmp sgt i32 %limit, 0
nit: the check and branch shouldn't be needed.
;int main(void) {
; int arr_2[SIZE][SIZE][SIZE];
;
; __builtin_memset(arr_2, 1, sizeof(arr_2));
Usually we don't include C/C++ source code, as the IR usually needs to stand on its own. Below are a few suggestions to further simplify the IR and make it more readable.
It would be helpful if you could instead add a brief comment explaining the issue.
define void @test(i32 %limit, i8 zeroext %val, ptr readonly %arr_14) {
; CHECK-LABEL: @test(
; CHECK-NOT: pred.store.if:
This is quite fragile; some existing tests use CHECK-NOT: vector.body: to check for not vectorizing.
!4 = !{!5, !5, i64 0}
nodes used by tbaa shouldn't be needed after dropping !tbaa
@@ -0,0 +1,85 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -passes=loop-vectorize -S | FileCheck %s
As this is added as a target-independent test, it probably needs something like -force-vector-width=4 -force-vector-interleave=1 to make sure the vectorizer tries to vectorize independent of the cost-model.
Thanks @fhahn for the reviews. Great that you mentioned the Phabricator patch; the test there looks good and I copied it here. As you suggested, I created a new PR for the test case with the default behavior (niwinanto@33ec308) and then updated this PR. However, I think I messed up the git workflow. Could you please take a look and check whether this is what you intended?
Thank you @niwinanto for picking this up (and apologies for letting the change languish for so long despite @fhahn's helpful comments!)
Yeah that looks good, I'll add a few small additional comments. But best to create a separate PR to just add the test case showing the issue first.
@fhahn I am trying to create exactly such a separate PR: niwinanto#2. Maybe you can help me figure out what I am doing wrong. I am extremely sorry; I am still getting used to the new workflow. As you suggested, I created a new commit on a different branch and created a new PR (for the test, as mentioned above). For some reason it contains the commit from this PR, which I tried to remove by dropping it in an interactive rebase and force-pushing. I also addressed the feedback regarding the tests.
Looking at https://github.com/niwinanto/llvm-project/pull/2/commits, it looks like there's a single commit adding the test, so that looks good I think? Could you update the destination branch to be upstream llvm-project's |
@fhahn Updated the PR to adjust the changes after merging the test early.
LGTM, thanks!
I adjusted the description of the PR a bit to add a few more details.
@niwinanto Congratulations on having your first Pull Request (PR) merged into the LLVM Project! If your change does cause a problem, it may be reverted, or you can revert it yourself. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
When induction variables are used outside the loop body, tail folding
by masking miscompiles, because for users outside of the loop the
final value of the induction is computed separately from the vector
loop.
Fixes #76069
Fixes #51677
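
The reasoning in the description above can be illustrated with a small scalar model. This is a sketch under stated assumptions, not LLVM code: `scalarLastIV` and `foldedEscapingIV` are hypothetical helpers, and the model assumes the value handed to outside users is derived from the vector trip count (the scalar trip count rounded up to a multiple of the vectorization factor) rather than from the last active lane.

```cpp
#include <cassert>

// Hypothetical model, not an LLVM API: value of the induction phi on the
// last executed iteration of the scalar loop `for (i = 0; i < n; ++i)`.
int scalarLastIV(int n) {
  assert(n > 0 && "model only covers loops that execute at least once");
  int last = 0;
  for (int i = 0; i < n; ++i)
    last = i;
  return last;
}

// Hypothetical model of a tail-folded vector loop with vectorization factor
// vf. The loop runs ceil(n / vf) vector iterations and masks off lanes >= n,
// so stores inside the loop match the scalar loop. ASSUMPTION: the escaping
// induction value is computed separately, from the rounded-up trip count.
int foldedEscapingIV(int n, int vf) {
  assert(n > 0 && vf > 0 && "trip count and vector width must be positive");
  int vectorTripCount = (n + vf - 1) / vf * vf; // n rounded up to a multiple of vf
  return vectorTripCount - 1;                   // end value minus one step
}
```

With the trip count of 17 from the test and the vectorization width of 2 from its loop metadata, the scalar loop leaves the induction phi at 16 while this model yields 17, so an outside user would see a different value. When the trip count is already a multiple of the width (for example 16), both agree at 15, which is consistent with the problem only showing up when a tail actually has to be folded.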