
[SimplifyCFG] Not folding branch in loop header with constant iterations #74268


Closed

Conversation

xiangzh1 (Contributor) commented Dec 4, 2023

[SimplifyCFG] Do not fold branches in constant-trip-count loops that are expected to be unrolled

A loop with a constant iteration count and an unroll hint is usually
expected to be unrolled by later passes; folding branches in such a
loop header during SimplifyCFG breaks that unroll optimization.

For example:

#pragma unroll
for (int I = 0; I < ConstNum; ++I) { // ConstNum > 1
  if (Cond2) {
    break;
  }
  // ... loop body ...
}

Folding these conditional branches can prevent the loop from being unrolled.

@llvmbot added the clang (Clang issues not falling into any other category) and llvm:transforms labels on Dec 4, 2023
llvmbot (Member) commented Dec 4, 2023

@llvm/pr-subscribers-clang

Author: None (xiangzh1)

Changes

A loop header with a constant trip count can usually be optimized by unrolling; folding a branch in such a loop header during SimplifyCFG breaks that unroll optimization.

For example, avoid folding "I < ConstNum" with "Cond2", because loops with a constant iteration count can easily be optimized (e.g. unrolled):

for (int I = 0; I < ConstNum; ++I) { // ConstNum > 1
  if (Cond2) {
    break;
  }
  // ... loop body ...
}

Folding these conditional branches may break such loop optimizations.


Patch is 48.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74268.diff

3 Files Affected:

  • (added) clang/test/CodeGenCUDA/simplify-cfg-unroll.cu (+364)
  • (modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+43)
  • (modified) llvm/test/Transforms/LoopVectorize/if-pred-non-void.ll (+46-45)
diff --git a/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu b/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu
new file mode 100644
index 0000000000000..ecb421f9fc85c
--- /dev/null
+++ b/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu
@@ -0,0 +1,364 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
+// REQUIRES: amdgpu-registered-target
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -O2 "-aux-triple" "x86_64-unknown-linux-gnu" "-triple" "amdgcn-amd-amdhsa" \
+// RUN:    -fcuda-is-device "-aux-target-cpu" "x86-64" -emit-llvm -o - %s | FileCheck %s
+
+#include "Inputs/cuda.h"
+
+// CHECK-LABEL: define dso_local void @_Z4funciPPiiS_(
+// CHECK-SAME: i32 noundef [[IDX:%.*]], ptr nocapture noundef readonly [[ARR:%.*]], i32 noundef [[DIMS:%.*]], ptr nocapture noundef [[OUT:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP1:%.*]] = icmp eq i32 [[DIMS]], 0
+// CHECK-NEXT:    br i1 [[CMP1]], label [[CLEANUP:%.*]], label [[IF_END:%.*]]
+// CHECK:       if.end:
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR]], align 8, !tbaa [[TBAA3:![0-9]+]]
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4, !tbaa [[TBAA7:![0-9]+]]
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14:%.*]] = add nsw i32 [[TMP2]], [[TMP1]]
+// CHECK-NEXT:    store i32 [[ADD14]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 1
+// CHECK-NEXT:    [[TMP3:%.*]] = load i32, ptr [[ARRAYIDX11_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_1:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+// CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1:%.*]] = add nsw i32 [[TMP4]], [[TMP3]]
+// CHECK-NEXT:    store i32 [[ADD14_1]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 2
+// CHECK-NEXT:    [[TMP5:%.*]] = load i32, ptr [[ARRAYIDX11_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_2:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load i32, ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2:%.*]] = add nsw i32 [[TMP6]], [[TMP5]]
+// CHECK-NEXT:    store i32 [[ADD14_2]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 3
+// CHECK-NEXT:    [[TMP7:%.*]] = load i32, ptr [[ARRAYIDX11_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_3:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+// CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3:%.*]] = add nsw i32 [[TMP8]], [[TMP7]]
+// CHECK-NEXT:    store i32 [[ADD14_3]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_1:%.*]] = icmp eq i32 [[DIMS]], 1
+// CHECK-NEXT:    br i1 [[CMP1_1]], label [[CLEANUP]], label [[IF_END_1:%.*]]
+// CHECK:       if.end.1:
+// CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 1
+// CHECK-NEXT:    [[TMP9:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP10:%.*]] = load i32, ptr [[TMP9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_129:%.*]] = add nsw i32 [[ADD14]], [[TMP10]]
+// CHECK-NEXT:    store i32 [[ADD14_129]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 1
+// CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[ARRAYIDX11_1_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_1:%.*]] = add nsw i32 [[ADD14_1]], [[TMP11]]
+// CHECK-NEXT:    store i32 [[ADD14_1_1]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 2
+// CHECK-NEXT:    [[TMP12:%.*]] = load i32, ptr [[ARRAYIDX11_2_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_1:%.*]] = add nsw i32 [[ADD14_2]], [[TMP12]]
+// CHECK-NEXT:    store i32 [[ADD14_2_1]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 3
+// CHECK-NEXT:    [[TMP13:%.*]] = load i32, ptr [[ARRAYIDX11_3_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_1:%.*]] = add nsw i32 [[ADD14_3]], [[TMP13]]
+// CHECK-NEXT:    store i32 [[ADD14_3_1]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_2:%.*]] = icmp eq i32 [[DIMS]], 2
+// CHECK-NEXT:    br i1 [[CMP1_2]], label [[CLEANUP]], label [[IF_END_2:%.*]]
+// CHECK:       if.end.2:
+// CHECK-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 2
+// CHECK-NEXT:    [[TMP14:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP15:%.*]] = load i32, ptr [[TMP14]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_230:%.*]] = add nsw i32 [[ADD14_129]], [[TMP15]]
+// CHECK-NEXT:    store i32 [[ADD14_230]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 1
+// CHECK-NEXT:    [[TMP16:%.*]] = load i32, ptr [[ARRAYIDX11_1_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_2:%.*]] = add nsw i32 [[ADD14_1_1]], [[TMP16]]
+// CHECK-NEXT:    store i32 [[ADD14_1_2]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 2
+// CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr [[ARRAYIDX11_2_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_2:%.*]] = add nsw i32 [[ADD14_2_1]], [[TMP17]]
+// CHECK-NEXT:    store i32 [[ADD14_2_2]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 3
+// CHECK-NEXT:    [[TMP18:%.*]] = load i32, ptr [[ARRAYIDX11_3_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_2:%.*]] = add nsw i32 [[ADD14_3_1]], [[TMP18]]
+// CHECK-NEXT:    store i32 [[ADD14_3_2]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_3:%.*]] = icmp eq i32 [[DIMS]], 3
+// CHECK-NEXT:    br i1 [[CMP1_3]], label [[CLEANUP]], label [[IF_END_3:%.*]]
+// CHECK:       if.end.3:
+// CHECK-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 3
+// CHECK-NEXT:    [[TMP19:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP20:%.*]] = load i32, ptr [[TMP19]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_331:%.*]] = add nsw i32 [[ADD14_230]], [[TMP20]]
+// CHECK-NEXT:    store i32 [[ADD14_331]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 1
+// CHECK-NEXT:    [[TMP21:%.*]] = load i32, ptr [[ARRAYIDX11_1_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_3:%.*]] = add nsw i32 [[ADD14_1_2]], [[TMP21]]
+// CHECK-NEXT:    store i32 [[ADD14_1_3]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 2
+// CHECK-NEXT:    [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX11_2_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_3:%.*]] = add nsw i32 [[ADD14_2_2]], [[TMP22]]
+// CHECK-NEXT:    store i32 [[ADD14_2_3]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 3
+// CHECK-NEXT:    [[TMP23:%.*]] = load i32, ptr [[ARRAYIDX11_3_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_3:%.*]] = add nsw i32 [[ADD14_3_2]], [[TMP23]]
+// CHECK-NEXT:    store i32 [[ADD14_3_3]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_4:%.*]] = icmp eq i32 [[DIMS]], 4
+// CHECK-NEXT:    br i1 [[CMP1_4]], label [[CLEANUP]], label [[IF_END_4:%.*]]
+// CHECK:       if.end.4:
+// CHECK-NEXT:    [[ARRAYIDX_4:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 4
+// CHECK-NEXT:    [[TMP24:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP25:%.*]] = load i32, ptr [[TMP24]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_4:%.*]] = add nsw i32 [[ADD14_331]], [[TMP25]]
+// CHECK-NEXT:    store i32 [[ADD14_4]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 1
+// CHECK-NEXT:    [[TMP26:%.*]] = load i32, ptr [[ARRAYIDX11_1_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_4:%.*]] = add nsw i32 [[ADD14_1_3]], [[TMP26]]
+// CHECK-NEXT:    store i32 [[ADD14_1_4]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 2
+// CHECK-NEXT:    [[TMP27:%.*]] = load i32, ptr [[ARRAYIDX11_2_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_4:%.*]] = add nsw i32 [[ADD14_2_3]], [[TMP27]]
+// CHECK-NEXT:    store i32 [[ADD14_2_4]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 3
+// CHECK-NEXT:    [[TMP28:%.*]] = load i32, ptr [[ARRAYIDX11_3_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_4:%.*]] = add nsw i32 [[ADD14_3_3]], [[TMP28]]
+// CHECK-NEXT:    store i32 [[ADD14_3_4]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_5:%.*]] = icmp eq i32 [[DIMS]], 5
+// CHECK-NEXT:    br i1 [[CMP1_5]], label [[CLEANUP]], label [[IF_END_5:%.*]]
+// CHECK:       if.end.5:
+// CHECK-NEXT:    [[ARRAYIDX_5:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 5
+// CHECK-NEXT:    [[TMP29:%.*]] = load ptr, ptr [[ARRAYIDX_5]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP30:%.*]] = load i32, ptr [[TMP29]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_5:%.*]] = add nsw i32 [[ADD14_4]], [[TMP30]]
+// CHECK-NEXT:    store i32 [[ADD14_5]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 1
+// CHECK-NEXT:    [[TMP31:%.*]] = load i32, ptr [[ARRAYIDX11_1_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_5:%.*]] = add nsw i32 [[ADD14_1_4]], [[TMP31]]
+// CHECK-NEXT:    store i32 [[ADD14_1_5]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 2
+// CHECK-NEXT:    [[TMP32:%.*]] = load i32, ptr [[ARRAYIDX11_2_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_5:%.*]] = add nsw i32 [[ADD14_2_4]], [[TMP32]]
+// CHECK-NEXT:    store i32 [[ADD14_2_5]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 3
+// CHECK-NEXT:    [[TMP33:%.*]] = load i32, ptr [[ARRAYIDX11_3_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_5:%.*]] = add nsw i32 [[ADD14_3_4]], [[TMP33]]
+// CHECK-NEXT:    store i32 [[ADD14_3_5]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_6:%.*]] = icmp eq i32 [[DIMS]], 6
+// CHECK-NEXT:    br i1 [[CMP1_6]], label [[CLEANUP]], label [[IF_END_6:%.*]]
+// CHECK:       if.end.6:
+// CHECK-NEXT:    [[ARRAYIDX_6:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 6
+// CHECK-NEXT:    [[TMP34:%.*]] = load ptr, ptr [[ARRAYIDX_6]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP35:%.*]] = load i32, ptr [[TMP34]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_6:%.*]] = add nsw i32 [[ADD14_5]], [[TMP35]]
+// CHECK-NEXT:    store i32 [[ADD14_6]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 1
+// CHECK-NEXT:    [[TMP36:%.*]] = load i32, ptr [[ARRAYIDX11_1_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_6:%.*]] = add nsw i32 [[ADD14_1_5]], [[TMP36]]
+// CHECK-NEXT:    store i32 [[ADD14_1_6]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 2
+// CHECK-NEXT:    [[TMP37:%.*]] = load i32, ptr [[ARRAYIDX11_2_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_6:%.*]] = add nsw i32 [[ADD14_2_5]], [[TMP37]]
+// CHECK-NEXT:    store i32 [[ADD14_2_6]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 3
+// CHECK-NEXT:    [[TMP38:%.*]] = load i32, ptr [[ARRAYIDX11_3_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_6:%.*]] = add nsw i32 [[ADD14_3_5]], [[TMP38]]
+// CHECK-NEXT:    store i32 [[ADD14_3_6]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_7:%.*]] = icmp eq i32 [[DIMS]], 7
+// CHECK-NEXT:    br i1 [[CMP1_7]], label [[CLEANUP]], label [[IF_END_7:%.*]]
+// CHECK:       if.end.7:
+// CHECK-NEXT:    [[ARRAYIDX_7:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 7
+// CHECK-NEXT:    [[TMP39:%.*]] = load ptr, ptr [[ARRAYIDX_7]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP40:%.*]] = load i32, ptr [[TMP39]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_7:%.*]] = add nsw i32 [[ADD14_6]], [[TMP40]]
+// CHECK-NEXT:    store i32 [[ADD14_7]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 1
+// CHECK-NEXT:    [[TMP41:%.*]] = load i32, ptr [[ARRAYIDX11_1_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_7:%.*]] = add nsw i32 [[ADD14_1_6]], [[TMP41]]
+// CHECK-NEXT:    store i32 [[ADD14_1_7]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 2
+// CHECK-NEXT:    [[TMP42:%.*]] = load i32, ptr [[ARRAYIDX11_2_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_7:%.*]] = add nsw i32 [[ADD14_2_6]], [[TMP42]]
+// CHECK-NEXT:    store i32 [[ADD14_2_7]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 3
+// CHECK-NEXT:    [[TMP43:%.*]] = load i32, ptr [[ARRAYIDX11_3_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_7:%.*]] = add nsw i32 [[ADD14_3_6]], [[TMP43]]
+// CHECK-NEXT:    store i32 [[ADD14_3_7]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_8:%.*]] = icmp eq i32 [[DIMS]], 8
+// CHECK-NEXT:    br i1 [[CMP1_8]], label [[CLEANUP]], label [[IF_END_8:%.*]]
+// CHECK:       if.end.8:
+// CHECK-NEXT:    [[ARRAYIDX_8:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 8
+// CHECK-NEXT:    [[TMP44:%.*]] = load ptr, ptr [[ARRAYIDX_8]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP45:%.*]] = load i32, ptr [[TMP44]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_8:%.*]] = add nsw i32 [[ADD14_7]], [[TMP45]]
+// CHECK-NEXT:    store i32 [[ADD14_8]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 1
+// CHECK-NEXT:    [[TMP46:%.*]] = load i32, ptr [[ARRAYIDX11_1_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_8:%.*]] = add nsw i32 [[ADD14_1_7]], [[TMP46]]
+// CHECK-NEXT:    store i32 [[ADD14_1_8]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 2
+// CHECK-NEXT:    [[TMP47:%.*]] = load i32, ptr [[ARRAYIDX11_2_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_8:%.*]] = add nsw i32 [[ADD14_2_7]], [[TMP47]]
+// CHECK-NEXT:    store i32 [[ADD14_2_8]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 3
+// CHECK-NEXT:    [[TMP48:%.*]] = load i32, ptr [[ARRAYIDX11_3_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_8:%.*]] = add nsw i32 [[ADD14_3_7]], [[TMP48]]
+// CHECK-NEXT:    store i32 [[ADD14_3_8]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_9:%.*]] = icmp eq i32 [[DIMS]], 9
+// CHECK-NEXT:    br i1 [[CMP1_9]], label [[CLEANUP]], label [[IF_END_9:%.*]]
+// CHECK:       if.end.9:
+// CHECK-NEXT:    [[ARRAYIDX_9:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 9
+// CHECK-NEXT:    [[TMP49:%.*]] = load ptr, ptr [[ARRAYIDX_9]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP50:%.*]] = load i32, ptr [[TMP49]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_9:%.*]] = add nsw i32 [[ADD14_8]], [[TMP50]]
+// CHECK-NEXT:    store i32 [[ADD14_9]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 1
+// CHECK-NEXT:    [[TMP51:%.*]] = load i32, ptr [[ARRAYIDX11_1_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_9:%.*]] = add nsw i32 [[ADD14_1_8]], [[TMP51]]
+// CHECK-NEXT:    store i32 [[ADD14_1_9]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 2
+// CHECK-NEXT:    [[TMP52:%.*]] = load i32, ptr [[ARRAYIDX11_2_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_9:%.*]] = add nsw i32 [[ADD14_2_8]], [[TMP52]]
+// CHECK-NEXT:    store i32 [[ADD14_2_9]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 3
+// CHECK-NEXT:    [[TMP53:%.*]] = load i32, ptr [[ARRAYIDX11_3_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_9:%.*]] = add nsw i32 [[ADD14_3_8]], [[TMP53]]
+// CHECK-NEXT:    store i32 [[ADD14_3_9]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_10:%.*]] = icmp eq i32 [[DIMS]], 10
+// CHECK-NEXT:    br i1 [[CMP1_10]], label [[CLEANUP]], label [[IF_END_10:%.*]]
+// CHECK:       if.end.10:
+// CHECK-NEXT:    [[ARRAYIDX_10:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 10
+// CHECK-NEXT:    [[TMP54:%.*]] = load ptr, ptr [[ARRAYIDX_10]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP55:%.*]] = load i32, ptr [[TMP54]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_10:%.*]] = add nsw i32 [[ADD14_9]], [[TMP55]]
+// CHECK-NEXT:    store i32 [[ADD14_10]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 1
+// CHECK-NEXT:    [[TMP56:%.*]] = load i32, ptr [[ARRAYIDX11_1_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_10:%.*]] = add nsw i32 [[ADD14_1_9]], [[TMP56]]
+// CHECK-NEXT:    store i32 [[ADD14_1_10]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 2
+// CHECK-NEXT:    [[TMP57:%.*]] = load i32, ptr [[ARRAYIDX11_2_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_10:%.*]] = add nsw i32 [[ADD14_2_9]], [[TMP57]]
+// CHECK-NEXT:    store i32 [[ADD14_2_10]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 3
+// CHECK-NEXT:    [[TMP58:%.*]] = load i32, ptr [[ARRAYIDX11_3_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_10:%.*]] = add nsw i32 [[ADD14_3_9]], [[TMP58]]
+// CHECK-NEXT:    store i32 [[ADD14_3_10]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_11:%.*]] = icmp eq i32 [[DIMS]], 11
+// CHECK-NEXT:    br i1 [[CMP1_11]], label [[CLEANUP]], label [[IF_END_11:%.*]]
+// CHECK:       if.end.11:
+// CHECK-NEXT:    [[ARRAYIDX_11:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 11
+// CHECK-NEXT:    [[TMP59:%.*]] = load ptr, ptr [[ARRAYIDX_11]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP60:%.*]] = load i32, ptr [[TMP59]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_11:%.*]] = add nsw i32 [[ADD14_10]], [[TMP60]]
+// CHECK-NEXT:    store i32 [[ADD14_11]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_11:%.*]] = getelementptr inbounds i32, ptr [[TMP59]], i64 1
+// CHECK-NEXT:    [[TMP61:%.*]] = load i32, ptr [[ARRAYIDX11_1_11]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_11:%.*]] = add nsw i32 [[ADD14_1_10]], [[TMP61]]
+// CHECK-NEXT:    store i32 [[ADD14_1_11]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_11:%.*]] = getelement...
[truncated]

llvmbot (Member) commented Dec 4, 2023

@llvm/pr-subscribers-llvm-transforms

(Same comment body as above.)
+// CHECK-NEXT:    [[ADD14_2_5:%.*]] = add nsw i32 [[ADD14_2_4]], [[TMP32]]
+// CHECK-NEXT:    store i32 [[ADD14_2_5]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 3
+// CHECK-NEXT:    [[TMP33:%.*]] = load i32, ptr [[ARRAYIDX11_3_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_5:%.*]] = add nsw i32 [[ADD14_3_4]], [[TMP33]]
+// CHECK-NEXT:    store i32 [[ADD14_3_5]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_6:%.*]] = icmp eq i32 [[DIMS]], 6
+// CHECK-NEXT:    br i1 [[CMP1_6]], label [[CLEANUP]], label [[IF_END_6:%.*]]
+// CHECK:       if.end.6:
+// CHECK-NEXT:    [[ARRAYIDX_6:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 6
+// CHECK-NEXT:    [[TMP34:%.*]] = load ptr, ptr [[ARRAYIDX_6]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP35:%.*]] = load i32, ptr [[TMP34]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_6:%.*]] = add nsw i32 [[ADD14_5]], [[TMP35]]
+// CHECK-NEXT:    store i32 [[ADD14_6]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 1
+// CHECK-NEXT:    [[TMP36:%.*]] = load i32, ptr [[ARRAYIDX11_1_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_6:%.*]] = add nsw i32 [[ADD14_1_5]], [[TMP36]]
+// CHECK-NEXT:    store i32 [[ADD14_1_6]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 2
+// CHECK-NEXT:    [[TMP37:%.*]] = load i32, ptr [[ARRAYIDX11_2_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_6:%.*]] = add nsw i32 [[ADD14_2_5]], [[TMP37]]
+// CHECK-NEXT:    store i32 [[ADD14_2_6]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 3
+// CHECK-NEXT:    [[TMP38:%.*]] = load i32, ptr [[ARRAYIDX11_3_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_6:%.*]] = add nsw i32 [[ADD14_3_5]], [[TMP38]]
+// CHECK-NEXT:    store i32 [[ADD14_3_6]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_7:%.*]] = icmp eq i32 [[DIMS]], 7
+// CHECK-NEXT:    br i1 [[CMP1_7]], label [[CLEANUP]], label [[IF_END_7:%.*]]
+// CHECK:       if.end.7:
+// CHECK-NEXT:    [[ARRAYIDX_7:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 7
+// CHECK-NEXT:    [[TMP39:%.*]] = load ptr, ptr [[ARRAYIDX_7]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP40:%.*]] = load i32, ptr [[TMP39]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_7:%.*]] = add nsw i32 [[ADD14_6]], [[TMP40]]
+// CHECK-NEXT:    store i32 [[ADD14_7]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 1
+// CHECK-NEXT:    [[TMP41:%.*]] = load i32, ptr [[ARRAYIDX11_1_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_7:%.*]] = add nsw i32 [[ADD14_1_6]], [[TMP41]]
+// CHECK-NEXT:    store i32 [[ADD14_1_7]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 2
+// CHECK-NEXT:    [[TMP42:%.*]] = load i32, ptr [[ARRAYIDX11_2_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_7:%.*]] = add nsw i32 [[ADD14_2_6]], [[TMP42]]
+// CHECK-NEXT:    store i32 [[ADD14_2_7]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 3
+// CHECK-NEXT:    [[TMP43:%.*]] = load i32, ptr [[ARRAYIDX11_3_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_7:%.*]] = add nsw i32 [[ADD14_3_6]], [[TMP43]]
+// CHECK-NEXT:    store i32 [[ADD14_3_7]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_8:%.*]] = icmp eq i32 [[DIMS]], 8
+// CHECK-NEXT:    br i1 [[CMP1_8]], label [[CLEANUP]], label [[IF_END_8:%.*]]
+// CHECK:       if.end.8:
+// CHECK-NEXT:    [[ARRAYIDX_8:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 8
+// CHECK-NEXT:    [[TMP44:%.*]] = load ptr, ptr [[ARRAYIDX_8]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP45:%.*]] = load i32, ptr [[TMP44]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_8:%.*]] = add nsw i32 [[ADD14_7]], [[TMP45]]
+// CHECK-NEXT:    store i32 [[ADD14_8]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 1
+// CHECK-NEXT:    [[TMP46:%.*]] = load i32, ptr [[ARRAYIDX11_1_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_8:%.*]] = add nsw i32 [[ADD14_1_7]], [[TMP46]]
+// CHECK-NEXT:    store i32 [[ADD14_1_8]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 2
+// CHECK-NEXT:    [[TMP47:%.*]] = load i32, ptr [[ARRAYIDX11_2_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_8:%.*]] = add nsw i32 [[ADD14_2_7]], [[TMP47]]
+// CHECK-NEXT:    store i32 [[ADD14_2_8]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 3
+// CHECK-NEXT:    [[TMP48:%.*]] = load i32, ptr [[ARRAYIDX11_3_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_8:%.*]] = add nsw i32 [[ADD14_3_7]], [[TMP48]]
+// CHECK-NEXT:    store i32 [[ADD14_3_8]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_9:%.*]] = icmp eq i32 [[DIMS]], 9
+// CHECK-NEXT:    br i1 [[CMP1_9]], label [[CLEANUP]], label [[IF_END_9:%.*]]
+// CHECK:       if.end.9:
+// CHECK-NEXT:    [[ARRAYIDX_9:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 9
+// CHECK-NEXT:    [[TMP49:%.*]] = load ptr, ptr [[ARRAYIDX_9]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP50:%.*]] = load i32, ptr [[TMP49]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_9:%.*]] = add nsw i32 [[ADD14_8]], [[TMP50]]
+// CHECK-NEXT:    store i32 [[ADD14_9]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 1
+// CHECK-NEXT:    [[TMP51:%.*]] = load i32, ptr [[ARRAYIDX11_1_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_9:%.*]] = add nsw i32 [[ADD14_1_8]], [[TMP51]]
+// CHECK-NEXT:    store i32 [[ADD14_1_9]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 2
+// CHECK-NEXT:    [[TMP52:%.*]] = load i32, ptr [[ARRAYIDX11_2_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_9:%.*]] = add nsw i32 [[ADD14_2_8]], [[TMP52]]
+// CHECK-NEXT:    store i32 [[ADD14_2_9]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 3
+// CHECK-NEXT:    [[TMP53:%.*]] = load i32, ptr [[ARRAYIDX11_3_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_9:%.*]] = add nsw i32 [[ADD14_3_8]], [[TMP53]]
+// CHECK-NEXT:    store i32 [[ADD14_3_9]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_10:%.*]] = icmp eq i32 [[DIMS]], 10
+// CHECK-NEXT:    br i1 [[CMP1_10]], label [[CLEANUP]], label [[IF_END_10:%.*]]
+// CHECK:       if.end.10:
+// CHECK-NEXT:    [[ARRAYIDX_10:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 10
+// CHECK-NEXT:    [[TMP54:%.*]] = load ptr, ptr [[ARRAYIDX_10]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP55:%.*]] = load i32, ptr [[TMP54]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_10:%.*]] = add nsw i32 [[ADD14_9]], [[TMP55]]
+// CHECK-NEXT:    store i32 [[ADD14_10]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 1
+// CHECK-NEXT:    [[TMP56:%.*]] = load i32, ptr [[ARRAYIDX11_1_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_10:%.*]] = add nsw i32 [[ADD14_1_9]], [[TMP56]]
+// CHECK-NEXT:    store i32 [[ADD14_1_10]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 2
+// CHECK-NEXT:    [[TMP57:%.*]] = load i32, ptr [[ARRAYIDX11_2_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_10:%.*]] = add nsw i32 [[ADD14_2_9]], [[TMP57]]
+// CHECK-NEXT:    store i32 [[ADD14_2_10]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 3
+// CHECK-NEXT:    [[TMP58:%.*]] = load i32, ptr [[ARRAYIDX11_3_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_10:%.*]] = add nsw i32 [[ADD14_3_9]], [[TMP58]]
+// CHECK-NEXT:    store i32 [[ADD14_3_10]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_11:%.*]] = icmp eq i32 [[DIMS]], 11
+// CHECK-NEXT:    br i1 [[CMP1_11]], label [[CLEANUP]], label [[IF_END_11:%.*]]
+// CHECK:       if.end.11:
+// CHECK-NEXT:    [[ARRAYIDX_11:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 11
+// CHECK-NEXT:    [[TMP59:%.*]] = load ptr, ptr [[ARRAYIDX_11]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP60:%.*]] = load i32, ptr [[TMP59]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_11:%.*]] = add nsw i32 [[ADD14_10]], [[TMP60]]
+// CHECK-NEXT:    store i32 [[ADD14_11]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_11:%.*]] = getelementptr inbounds i32, ptr [[TMP59]], i64 1
+// CHECK-NEXT:    [[TMP61:%.*]] = load i32, ptr [[ARRAYIDX11_1_11]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_11:%.*]] = add nsw i32 [[ADD14_1_10]], [[TMP61]]
+// CHECK-NEXT:    store i32 [[ADD14_1_11]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_11:%.*]] = getelement...
[truncated]

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

The check failure "dr2xx.cpp Line 1297: conversion from 'T' to 'unsigned long long' is ambiguous" seems unrelated to this change.

@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.

And I think you need to precommit tests first.

@xiangzh1 xiangzh1 requested a review from bcl5980 December 4, 2023 06:07
@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.

And I think you need to precommit tests first.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.
And I think you need to precommit tests first.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

Yeah, the question is whether we unroll or not. But in this case, generally, I believe we don't want to unroll it.

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from 9f3ff6f to e963223 Compare December 4, 2023 08:34
@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

And I think you need to precommit tests first.

Done.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

Yeah, the question is whether we unroll or not. But in this case, generally, I believe we don't want to unroll it.

Updated to handle the unroll hint.
Thanks for reviewing!

@bcl5980 bcl5980 requested a review from nikic December 4, 2023 10:30
@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We need at least one more IR test.

@bcl5980 bcl5980 requested a review from goldsteinn December 4, 2023 10:33
Contributor

@nikic nikic left a comment


LoopUnroll supports upper bound unrolling. Why is it not working in this case?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

We need at least one more IR test.

Let me try

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

LoopUnroll supports upper bound unrolling. Why is it not working in this case?
for example:

    #pragma unroll
    for (int I = 0; I < LoopCount; ++I) { // LoopCount is a constant > 1
      if (Cond2) {
        break;
      }
      xxx loop body;
    }

After the branch folding, the old loop condition "I < LoopCount" is changed/disappears, so I don't think unroll can still determine the upper bound.

@nikic
Contributor

nikic commented Dec 4, 2023

Can you please share the IR before the unroll pass?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

Can you please share the IR before the unroll pass?
Sure:

with this patch:

; *** IR Dump After LoopDeletionPass on for.body ***

; Preheader:
entry:
  br label %for.body

; Loop:
for.body:                                         ; preds = %entry, %for.body7
  %Dim.027 = phi i32 [ 0, %entry ], [ %inc16, %for.body7 ]
  %cmp1 = icmp eq i32 %Dim.027, %Dims
  br i1 %cmp1, label %cleanup, label %if.end

if.end:                                           ; preds = %for.body
  %idxprom = zext nneg i32 %Dim.027 to i64
  %arrayidx = getelementptr inbounds ptr, ptr %Arr, i64 %idxprom
  br label %for.body7

for.body7:                                        ; preds = %if.end
  %0 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %1 = load i32, ptr %0, align 4, !tbaa !7
  %2 = load i32, ptr %Out, align 4, !tbaa !7
  %add14 = add nsw i32 %2, %1
  store i32 %add14, ptr %Out, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %3 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.1 = getelementptr inbounds i32, ptr %3, i64 1
  %4 = load i32, ptr %arrayidx11.1, align 4, !tbaa !7
  %arrayidx13.1 = getelementptr inbounds i32, ptr %Out, i64 1
  %5 = load i32, ptr %arrayidx13.1, align 4, !tbaa !7
  %add14.1 = add nsw i32 %5, %4
  store i32 %add14.1, ptr %arrayidx13.1, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %6 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.2 = getelementptr inbounds i32, ptr %6, i64 2
  %7 = load i32, ptr %arrayidx11.2, align 4, !tbaa !7
  %arrayidx13.2 = getelementptr inbounds i32, ptr %Out, i64 2
  %8 = load i32, ptr %arrayidx13.2, align 4, !tbaa !7
  %add14.2 = add nsw i32 %8, %7
  store i32 %add14.2, ptr %arrayidx13.2, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %9 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.3 = getelementptr inbounds i32, ptr %9, i64 3
  %10 = load i32, ptr %arrayidx11.3, align 4, !tbaa !7
  %arrayidx13.3 = getelementptr inbounds i32, ptr %Out, i64 3
  %11 = load i32, ptr %arrayidx13.3, align 4, !tbaa !7
  %add14.3 = add nsw i32 %11, %10
  store i32 %add14.3, ptr %arrayidx13.3, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %inc16 = add nuw nsw i32 %Dim.027, 1
  %exitcond = icmp ne i32 %inc16, 16
  br i1 %exitcond, label %for.body, label %cleanup, !llvm.loop !9

; Exit blocks
cleanup:                                          ; preds = %for.body, %for.body7
  ret void

; *** IR Dump After LoopFullUnrollPass on for.body (invalidated) ***

without this patch:
; *** IR Dump After LoopDeletionPass on if.end ***

; Preheader:
if.end.preheader:                                 ; preds = %entry
  %0 = add i32 %Dims, -1
  %umin = call i32 @llvm.umin.i32(i32 %0, i32 15)
  %1 = add nuw nsw i32 %umin, 1
  %wide.trip.count = zext i32 %1 to i64
  br label %if.end

; Loop:
if.end:                                           ; preds = %if.end.preheader, %for.body7
  %indvars.iv = phi i64 [ 0, %if.end.preheader ], [ %indvars.iv.next, %for.body7 ]
  %arrayidx = getelementptr inbounds ptr, ptr %Arr, i64 %indvars.iv
  br label %for.body7

for.body7:                                        ; preds = %if.end
  %2 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %3 = load i32, ptr %2, align 4, !tbaa !7
  %4 = load i32, ptr %Out, align 4, !tbaa !7
  %add14 = add nsw i32 %4, %3
  store i32 %add14, ptr %Out, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %5 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.1 = getelementptr inbounds i32, ptr %5, i64 1
  %6 = load i32, ptr %arrayidx11.1, align 4, !tbaa !7
  %arrayidx13.1 = getelementptr inbounds i32, ptr %Out, i64 1
  %7 = load i32, ptr %arrayidx13.1, align 4, !tbaa !7
  %add14.1 = add nsw i32 %7, %6
  store i32 %add14.1, ptr %arrayidx13.1, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %8 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.2 = getelementptr inbounds i32, ptr %8, i64 2
  %9 = load i32, ptr %arrayidx11.2, align 4, !tbaa !7
  %arrayidx13.2 = getelementptr inbounds i32, ptr %Out, i64 2
  %10 = load i32, ptr %arrayidx13.2, align 4, !tbaa !7
  %add14.2 = add nsw i32 %10, %9
  store i32 %add14.2, ptr %arrayidx13.2, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %11 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.3 = getelementptr inbounds i32, ptr %11, i64 3
  %12 = load i32, ptr %arrayidx11.3, align 4, !tbaa !7
  %arrayidx13.3 = getelementptr inbounds i32, ptr %Out, i64 3
  %13 = load i32, ptr %arrayidx13.3, align 4, !tbaa !7
  %add14.3 = add nsw i32 %13, %12
  store i32 %add14.3, ptr %arrayidx13.3, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %cleanup.loopexit, label %if.end, !llvm.loop !9

; Exit blocks
cleanup.loopexit:                                 ; preds = %for.body7
  br label %cleanup
; *** IR Dump After LoopFullUnrollPass on if.end ***

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

I first guessed the trip count might be too small, but I changed the iteration count from 16 to 1600, and without this patch it still does not unroll.

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

AMDGPU cannot unroll this case:

https://godbolt.org/z/4Pq3bnzTT

But the same code on X86 apparently can be unrolled:

https://godbolt.org/z/zr8aTG1KW

We may need to keep debugging this.

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

AMDGPU cannot unroll this case:

https://godbolt.org/z/4Pq3bnzTT

But the same code on X86 apparently can be unrolled:

https://godbolt.org/z/zr8aTG1KW

We may need to keep debugging this.

X86 does very conservative unrolling too; its upper bound is set to 4 (the default is 8). If we do not fold the loop branch, it can fully unroll (16 iterations).

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I think we should follow this principle:
if a loop is expected to be unrolled later, we should not destroy its loop count info.

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

I think we should follow this principle: if a loop is expected to be unrolled later, we should not destroy its loop count info.

The idea is right. But I think what nikic is saying is that loop unroll should handle this case (upper-bound unrolling), yet it doesn't work. We need to find out why loop unroll doesn't work (maybe in UnrollRuntimeLoopRemainder); then we can decide whether this can be done in loop unroll or whether to stop SimplifyCFG's transform.

AMDGPU cannot unroll this case:
https://godbolt.org/z/4Pq3bnzTT
But the same code on X86 apparently can be unrolled:
https://godbolt.org/z/zr8aTG1KW
We may need to keep debugging this.

X86 does very conservative unrolling too; its upper bound is set to 4 (the default is 8). If we do not fold the loop branch, it can fully unroll (16 iterations).

So where is the difference that lets X86 partially unroll while AMDGPU cannot unroll at all?

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

https://godbolt.org/z/cMeE61bhf
Loop unroll with -unroll-runtime can partially unroll the case.
@nikic It looks like, if we don't avoid the transform, it becomes a runtime unroll. The case before SimplifyCFG is https://godbolt.org/z/5MoYM8rGn.
@xiangzh1's solution looks fine to me if we do not involve LoopInfo in SimplifyCFG. And we still need a minimal IR test for it.

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

So where is the difference that lets X86 partially unroll while AMDGPU cannot unroll at all?
https://godbolt.org/z/cMeE61bhf Loop unroll with -unroll-runtime can partially unroll the case. @nikic It looks like, if we don't avoid the transform, it becomes a runtime unroll. The case before SimplifyCFG is https://godbolt.org/z/5MoYM8rGn. @xiangzh1's solution looks fine to me if we do not involve LoopInfo in SimplifyCFG. And we still need a minimal IR test for it.

1. In fact, I don't much care about the unrolling differences between targets. The loop unroll pass consults the TTI port, so "doing a partial unroll or not", or "partial unroll with a different unroll count", makes sense to me.
What I care more about is a known loop count becoming unknown. That is a big change for unroll (even when it still succeeds). For example, a loop with a small known count can usually be fully unrolled, which greatly simplifies the address (offset) calculations across the old iterations (and then we can apply many other optimizations, e.g. SROA, to those simplified calculations). None of this works for an unknown loop count.

2. I am creating the minimal IR test. (I'll replace the current .cu test with it, since the current test uses -O2.)

Thanks again!

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from e963223 to 6a80c39 Compare December 5, 2023 08:55
@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

Update: added IR test llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll.
(Not sure it is minimal; llvm-reduce doesn't work well on it, so I created it manually.)

A constant-iteration loop with an unroll hint is usually expected to be unrolled by consumers; folding branches in such a loop header in SimplifyCFG will break the unroll optimization.
@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from 6a80c39 to cbcc7f3 Compare December 5, 2023 09:04
@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

rebase

@nikic
Contributor

nikic commented Dec 5, 2023

I checked, and for your test case, LoopUnroll recognizes the loop as an UpperBound unrolling candidate, but does not perform unrolling due to cost model.

The pragma unroll metadata currently only takes effect if there is an exact trip count, but not if there is an upper bound trip count. Making it work with an upper bound trip count as well should fix your case. See the code in shouldPragmaUnroll().

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I checked, and for your test case, LoopUnroll recognizes the loop as an UpperBound unrolling candidate, but does not perform unrolling due to cost model.

The pragma unroll metadata currently only takes effect if there is an exact trip count, but not if there is an upper bound trip count. Making it work with an upper bound trip count as well should fix your case. See the code in shouldPragmaUnroll().

First, many thanks for checking the test.
Yes, that's another way to unroll too, but branch folding affected the cost model, which did not anticipate the later unroll. In fact, we don't particularly want to unroll with UpperBound. Even when an UpperBound unroll succeeds, it is no better than unrolling with a known loop count, especially for a small known loop count, which can usually be fully unrolled; on GPU this is very helpful for simplifying address (offset) calculations, which matter for the SROA optimization and local memory use.

@nikic
Contributor

nikic commented Dec 5, 2023

I think you are confusing upper bound and runtime unrolling. An upper bound unroll is a type of full unroll. In this case it would unroll it to 16 iterations, which is what you want, no?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I think you are confusing upper bound and runtime unrolling. An upper bound unroll is a type of full unroll. In this case it would unroll it to 16 iterations, which is what you want, no?

Yes! :) I thought an upper bound meant "partial unroll", and I am surprised it can fully unroll to 16 iterations even with the branch folding. Anyway, let me check shouldPragmaUnroll(). Thank you a lot!

@xiangzh1
Contributor Author

xiangzh1 commented Dec 7, 2023

Hi friends, I created a new PR at #74703, many thanks for reviewing!!

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch 2 times, most recently from c5043e5 to cbcc7f3 Compare December 8, 2023 01:52
@xiangzh1 xiangzh1 closed this Dec 8, 2023