
[SimplifyCFG] Not folding branch in loop header with constant iterations #74268


Closed

Conversation

xiangzh1 (Contributor) commented Dec 4, 2023

[SimplifyCFG] Do not fold branches in constant-trip-count loops that are expected to be unrolled

A loop with a constant iteration count and an unroll hint is usually
expected to be unrolled by later passes; folding branches in such a
loop header during SimplifyCFG breaks that unroll optimization.

For example:

#pragma unroll
for (int I = 0; I < ConstNum; ++I) { // ConstNum > 1
  if (Cond2) {
    break;
  }
  // ... loop body ...
}

Folding these conditional branches can prevent the loop from being unrolled.

@llvmbot added the clang (Clang issues not falling into any other category) and llvm:transforms labels on Dec 4, 2023
llvmbot (Member) commented Dec 4, 2023

@llvm/pr-subscribers-clang

Author: None (xiangzh1)

Changes

A loop header with a constant trip count can usually be optimized by unrolling; folding a branch in such a loop header during SimplifyCFG breaks that unroll optimization.

For example, avoid folding "I < ConstNum" with "Cond2", because loops with a constant iteration count can easily be optimized (e.g. unrolled):

for (int I = 0; I < ConstNum; ++I) { // ConstNum > 1
  if (Cond2) {
    break;
  }
  // ... loop body ...
}

Folding these conditional branches may break such loop optimizations.


Patch is 48.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74268.diff

3 Files Affected:

  • (added) clang/test/CodeGenCUDA/simplify-cfg-unroll.cu (+364)
  • (modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+43)
  • (modified) llvm/test/Transforms/LoopVectorize/if-pred-non-void.ll (+46-45)
diff --git a/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu b/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu
new file mode 100644
index 0000000000000..ecb421f9fc85c
--- /dev/null
+++ b/clang/test/CodeGenCUDA/simplify-cfg-unroll.cu
@@ -0,0 +1,364 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
+// REQUIRES: amdgpu-registered-target
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -O2 "-aux-triple" "x86_64-unknown-linux-gnu" "-triple" "amdgcn-amd-amdhsa" \
+// RUN:    -fcuda-is-device "-aux-target-cpu" "x86-64" -emit-llvm -o - %s | FileCheck %s
+
+#include "Inputs/cuda.h"
+
+// CHECK-LABEL: define dso_local void @_Z4funciPPiiS_(
+// CHECK-SAME: i32 noundef [[IDX:%.*]], ptr nocapture noundef readonly [[ARR:%.*]], i32 noundef [[DIMS:%.*]], ptr nocapture noundef [[OUT:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP1:%.*]] = icmp eq i32 [[DIMS]], 0
+// CHECK-NEXT:    br i1 [[CMP1]], label [[CLEANUP:%.*]], label [[IF_END:%.*]]
+// CHECK:       if.end:
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR]], align 8, !tbaa [[TBAA3:![0-9]+]]
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4, !tbaa [[TBAA7:![0-9]+]]
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14:%.*]] = add nsw i32 [[TMP2]], [[TMP1]]
+// CHECK-NEXT:    store i32 [[ADD14]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 1
+// CHECK-NEXT:    [[TMP3:%.*]] = load i32, ptr [[ARRAYIDX11_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_1:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+// CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1:%.*]] = add nsw i32 [[TMP4]], [[TMP3]]
+// CHECK-NEXT:    store i32 [[ADD14_1]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 2
+// CHECK-NEXT:    [[TMP5:%.*]] = load i32, ptr [[ARRAYIDX11_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_2:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load i32, ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2:%.*]] = add nsw i32 [[TMP6]], [[TMP5]]
+// CHECK-NEXT:    store i32 [[ADD14_2]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 3
+// CHECK-NEXT:    [[TMP7:%.*]] = load i32, ptr [[ARRAYIDX11_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX13_3:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+// CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3:%.*]] = add nsw i32 [[TMP8]], [[TMP7]]
+// CHECK-NEXT:    store i32 [[ADD14_3]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_1:%.*]] = icmp eq i32 [[DIMS]], 1
+// CHECK-NEXT:    br i1 [[CMP1_1]], label [[CLEANUP]], label [[IF_END_1:%.*]]
+// CHECK:       if.end.1:
+// CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 1
+// CHECK-NEXT:    [[TMP9:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP10:%.*]] = load i32, ptr [[TMP9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_129:%.*]] = add nsw i32 [[ADD14]], [[TMP10]]
+// CHECK-NEXT:    store i32 [[ADD14_129]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 1
+// CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[ARRAYIDX11_1_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_1:%.*]] = add nsw i32 [[ADD14_1]], [[TMP11]]
+// CHECK-NEXT:    store i32 [[ADD14_1_1]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 2
+// CHECK-NEXT:    [[TMP12:%.*]] = load i32, ptr [[ARRAYIDX11_2_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_1:%.*]] = add nsw i32 [[ADD14_2]], [[TMP12]]
+// CHECK-NEXT:    store i32 [[ADD14_2_1]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 3
+// CHECK-NEXT:    [[TMP13:%.*]] = load i32, ptr [[ARRAYIDX11_3_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_1:%.*]] = add nsw i32 [[ADD14_3]], [[TMP13]]
+// CHECK-NEXT:    store i32 [[ADD14_3_1]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_2:%.*]] = icmp eq i32 [[DIMS]], 2
+// CHECK-NEXT:    br i1 [[CMP1_2]], label [[CLEANUP]], label [[IF_END_2:%.*]]
+// CHECK:       if.end.2:
+// CHECK-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 2
+// CHECK-NEXT:    [[TMP14:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP15:%.*]] = load i32, ptr [[TMP14]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_230:%.*]] = add nsw i32 [[ADD14_129]], [[TMP15]]
+// CHECK-NEXT:    store i32 [[ADD14_230]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 1
+// CHECK-NEXT:    [[TMP16:%.*]] = load i32, ptr [[ARRAYIDX11_1_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_2:%.*]] = add nsw i32 [[ADD14_1_1]], [[TMP16]]
+// CHECK-NEXT:    store i32 [[ADD14_1_2]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 2
+// CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr [[ARRAYIDX11_2_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_2:%.*]] = add nsw i32 [[ADD14_2_1]], [[TMP17]]
+// CHECK-NEXT:    store i32 [[ADD14_2_2]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_2:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 3
+// CHECK-NEXT:    [[TMP18:%.*]] = load i32, ptr [[ARRAYIDX11_3_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_2:%.*]] = add nsw i32 [[ADD14_3_1]], [[TMP18]]
+// CHECK-NEXT:    store i32 [[ADD14_3_2]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_3:%.*]] = icmp eq i32 [[DIMS]], 3
+// CHECK-NEXT:    br i1 [[CMP1_3]], label [[CLEANUP]], label [[IF_END_3:%.*]]
+// CHECK:       if.end.3:
+// CHECK-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 3
+// CHECK-NEXT:    [[TMP19:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP20:%.*]] = load i32, ptr [[TMP19]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_331:%.*]] = add nsw i32 [[ADD14_230]], [[TMP20]]
+// CHECK-NEXT:    store i32 [[ADD14_331]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 1
+// CHECK-NEXT:    [[TMP21:%.*]] = load i32, ptr [[ARRAYIDX11_1_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_3:%.*]] = add nsw i32 [[ADD14_1_2]], [[TMP21]]
+// CHECK-NEXT:    store i32 [[ADD14_1_3]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 2
+// CHECK-NEXT:    [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX11_2_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_3:%.*]] = add nsw i32 [[ADD14_2_2]], [[TMP22]]
+// CHECK-NEXT:    store i32 [[ADD14_2_3]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_3:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i64 3
+// CHECK-NEXT:    [[TMP23:%.*]] = load i32, ptr [[ARRAYIDX11_3_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_3:%.*]] = add nsw i32 [[ADD14_3_2]], [[TMP23]]
+// CHECK-NEXT:    store i32 [[ADD14_3_3]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_4:%.*]] = icmp eq i32 [[DIMS]], 4
+// CHECK-NEXT:    br i1 [[CMP1_4]], label [[CLEANUP]], label [[IF_END_4:%.*]]
+// CHECK:       if.end.4:
+// CHECK-NEXT:    [[ARRAYIDX_4:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 4
+// CHECK-NEXT:    [[TMP24:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP25:%.*]] = load i32, ptr [[TMP24]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_4:%.*]] = add nsw i32 [[ADD14_331]], [[TMP25]]
+// CHECK-NEXT:    store i32 [[ADD14_4]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 1
+// CHECK-NEXT:    [[TMP26:%.*]] = load i32, ptr [[ARRAYIDX11_1_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_4:%.*]] = add nsw i32 [[ADD14_1_3]], [[TMP26]]
+// CHECK-NEXT:    store i32 [[ADD14_1_4]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 2
+// CHECK-NEXT:    [[TMP27:%.*]] = load i32, ptr [[ARRAYIDX11_2_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_4:%.*]] = add nsw i32 [[ADD14_2_3]], [[TMP27]]
+// CHECK-NEXT:    store i32 [[ADD14_2_4]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_4:%.*]] = getelementptr inbounds i32, ptr [[TMP24]], i64 3
+// CHECK-NEXT:    [[TMP28:%.*]] = load i32, ptr [[ARRAYIDX11_3_4]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_4:%.*]] = add nsw i32 [[ADD14_3_3]], [[TMP28]]
+// CHECK-NEXT:    store i32 [[ADD14_3_4]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_5:%.*]] = icmp eq i32 [[DIMS]], 5
+// CHECK-NEXT:    br i1 [[CMP1_5]], label [[CLEANUP]], label [[IF_END_5:%.*]]
+// CHECK:       if.end.5:
+// CHECK-NEXT:    [[ARRAYIDX_5:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 5
+// CHECK-NEXT:    [[TMP29:%.*]] = load ptr, ptr [[ARRAYIDX_5]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP30:%.*]] = load i32, ptr [[TMP29]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_5:%.*]] = add nsw i32 [[ADD14_4]], [[TMP30]]
+// CHECK-NEXT:    store i32 [[ADD14_5]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 1
+// CHECK-NEXT:    [[TMP31:%.*]] = load i32, ptr [[ARRAYIDX11_1_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_5:%.*]] = add nsw i32 [[ADD14_1_4]], [[TMP31]]
+// CHECK-NEXT:    store i32 [[ADD14_1_5]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 2
+// CHECK-NEXT:    [[TMP32:%.*]] = load i32, ptr [[ARRAYIDX11_2_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_5:%.*]] = add nsw i32 [[ADD14_2_4]], [[TMP32]]
+// CHECK-NEXT:    store i32 [[ADD14_2_5]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 3
+// CHECK-NEXT:    [[TMP33:%.*]] = load i32, ptr [[ARRAYIDX11_3_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_5:%.*]] = add nsw i32 [[ADD14_3_4]], [[TMP33]]
+// CHECK-NEXT:    store i32 [[ADD14_3_5]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_6:%.*]] = icmp eq i32 [[DIMS]], 6
+// CHECK-NEXT:    br i1 [[CMP1_6]], label [[CLEANUP]], label [[IF_END_6:%.*]]
+// CHECK:       if.end.6:
+// CHECK-NEXT:    [[ARRAYIDX_6:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 6
+// CHECK-NEXT:    [[TMP34:%.*]] = load ptr, ptr [[ARRAYIDX_6]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP35:%.*]] = load i32, ptr [[TMP34]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_6:%.*]] = add nsw i32 [[ADD14_5]], [[TMP35]]
+// CHECK-NEXT:    store i32 [[ADD14_6]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 1
+// CHECK-NEXT:    [[TMP36:%.*]] = load i32, ptr [[ARRAYIDX11_1_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_6:%.*]] = add nsw i32 [[ADD14_1_5]], [[TMP36]]
+// CHECK-NEXT:    store i32 [[ADD14_1_6]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 2
+// CHECK-NEXT:    [[TMP37:%.*]] = load i32, ptr [[ARRAYIDX11_2_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_6:%.*]] = add nsw i32 [[ADD14_2_5]], [[TMP37]]
+// CHECK-NEXT:    store i32 [[ADD14_2_6]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 3
+// CHECK-NEXT:    [[TMP38:%.*]] = load i32, ptr [[ARRAYIDX11_3_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_6:%.*]] = add nsw i32 [[ADD14_3_5]], [[TMP38]]
+// CHECK-NEXT:    store i32 [[ADD14_3_6]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_7:%.*]] = icmp eq i32 [[DIMS]], 7
+// CHECK-NEXT:    br i1 [[CMP1_7]], label [[CLEANUP]], label [[IF_END_7:%.*]]
+// CHECK:       if.end.7:
+// CHECK-NEXT:    [[ARRAYIDX_7:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 7
+// CHECK-NEXT:    [[TMP39:%.*]] = load ptr, ptr [[ARRAYIDX_7]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP40:%.*]] = load i32, ptr [[TMP39]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_7:%.*]] = add nsw i32 [[ADD14_6]], [[TMP40]]
+// CHECK-NEXT:    store i32 [[ADD14_7]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 1
+// CHECK-NEXT:    [[TMP41:%.*]] = load i32, ptr [[ARRAYIDX11_1_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_7:%.*]] = add nsw i32 [[ADD14_1_6]], [[TMP41]]
+// CHECK-NEXT:    store i32 [[ADD14_1_7]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 2
+// CHECK-NEXT:    [[TMP42:%.*]] = load i32, ptr [[ARRAYIDX11_2_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_7:%.*]] = add nsw i32 [[ADD14_2_6]], [[TMP42]]
+// CHECK-NEXT:    store i32 [[ADD14_2_7]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 3
+// CHECK-NEXT:    [[TMP43:%.*]] = load i32, ptr [[ARRAYIDX11_3_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_7:%.*]] = add nsw i32 [[ADD14_3_6]], [[TMP43]]
+// CHECK-NEXT:    store i32 [[ADD14_3_7]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_8:%.*]] = icmp eq i32 [[DIMS]], 8
+// CHECK-NEXT:    br i1 [[CMP1_8]], label [[CLEANUP]], label [[IF_END_8:%.*]]
+// CHECK:       if.end.8:
+// CHECK-NEXT:    [[ARRAYIDX_8:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 8
+// CHECK-NEXT:    [[TMP44:%.*]] = load ptr, ptr [[ARRAYIDX_8]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP45:%.*]] = load i32, ptr [[TMP44]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_8:%.*]] = add nsw i32 [[ADD14_7]], [[TMP45]]
+// CHECK-NEXT:    store i32 [[ADD14_8]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 1
+// CHECK-NEXT:    [[TMP46:%.*]] = load i32, ptr [[ARRAYIDX11_1_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_8:%.*]] = add nsw i32 [[ADD14_1_7]], [[TMP46]]
+// CHECK-NEXT:    store i32 [[ADD14_1_8]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 2
+// CHECK-NEXT:    [[TMP47:%.*]] = load i32, ptr [[ARRAYIDX11_2_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_8:%.*]] = add nsw i32 [[ADD14_2_7]], [[TMP47]]
+// CHECK-NEXT:    store i32 [[ADD14_2_8]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 3
+// CHECK-NEXT:    [[TMP48:%.*]] = load i32, ptr [[ARRAYIDX11_3_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_8:%.*]] = add nsw i32 [[ADD14_3_7]], [[TMP48]]
+// CHECK-NEXT:    store i32 [[ADD14_3_8]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_9:%.*]] = icmp eq i32 [[DIMS]], 9
+// CHECK-NEXT:    br i1 [[CMP1_9]], label [[CLEANUP]], label [[IF_END_9:%.*]]
+// CHECK:       if.end.9:
+// CHECK-NEXT:    [[ARRAYIDX_9:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 9
+// CHECK-NEXT:    [[TMP49:%.*]] = load ptr, ptr [[ARRAYIDX_9]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP50:%.*]] = load i32, ptr [[TMP49]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_9:%.*]] = add nsw i32 [[ADD14_8]], [[TMP50]]
+// CHECK-NEXT:    store i32 [[ADD14_9]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 1
+// CHECK-NEXT:    [[TMP51:%.*]] = load i32, ptr [[ARRAYIDX11_1_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_9:%.*]] = add nsw i32 [[ADD14_1_8]], [[TMP51]]
+// CHECK-NEXT:    store i32 [[ADD14_1_9]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 2
+// CHECK-NEXT:    [[TMP52:%.*]] = load i32, ptr [[ARRAYIDX11_2_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_9:%.*]] = add nsw i32 [[ADD14_2_8]], [[TMP52]]
+// CHECK-NEXT:    store i32 [[ADD14_2_9]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 3
+// CHECK-NEXT:    [[TMP53:%.*]] = load i32, ptr [[ARRAYIDX11_3_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_9:%.*]] = add nsw i32 [[ADD14_3_8]], [[TMP53]]
+// CHECK-NEXT:    store i32 [[ADD14_3_9]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_10:%.*]] = icmp eq i32 [[DIMS]], 10
+// CHECK-NEXT:    br i1 [[CMP1_10]], label [[CLEANUP]], label [[IF_END_10:%.*]]
+// CHECK:       if.end.10:
+// CHECK-NEXT:    [[ARRAYIDX_10:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 10
+// CHECK-NEXT:    [[TMP54:%.*]] = load ptr, ptr [[ARRAYIDX_10]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP55:%.*]] = load i32, ptr [[TMP54]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_10:%.*]] = add nsw i32 [[ADD14_9]], [[TMP55]]
+// CHECK-NEXT:    store i32 [[ADD14_10]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 1
+// CHECK-NEXT:    [[TMP56:%.*]] = load i32, ptr [[ARRAYIDX11_1_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_10:%.*]] = add nsw i32 [[ADD14_1_9]], [[TMP56]]
+// CHECK-NEXT:    store i32 [[ADD14_1_10]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 2
+// CHECK-NEXT:    [[TMP57:%.*]] = load i32, ptr [[ARRAYIDX11_2_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_10:%.*]] = add nsw i32 [[ADD14_2_9]], [[TMP57]]
+// CHECK-NEXT:    store i32 [[ADD14_2_10]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 3
+// CHECK-NEXT:    [[TMP58:%.*]] = load i32, ptr [[ARRAYIDX11_3_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_10:%.*]] = add nsw i32 [[ADD14_3_9]], [[TMP58]]
+// CHECK-NEXT:    store i32 [[ADD14_3_10]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_11:%.*]] = icmp eq i32 [[DIMS]], 11
+// CHECK-NEXT:    br i1 [[CMP1_11]], label [[CLEANUP]], label [[IF_END_11:%.*]]
+// CHECK:       if.end.11:
+// CHECK-NEXT:    [[ARRAYIDX_11:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 11
+// CHECK-NEXT:    [[TMP59:%.*]] = load ptr, ptr [[ARRAYIDX_11]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP60:%.*]] = load i32, ptr [[TMP59]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_11:%.*]] = add nsw i32 [[ADD14_10]], [[TMP60]]
+// CHECK-NEXT:    store i32 [[ADD14_11]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_11:%.*]] = getelementptr inbounds i32, ptr [[TMP59]], i64 1
+// CHECK-NEXT:    [[TMP61:%.*]] = load i32, ptr [[ARRAYIDX11_1_11]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_11:%.*]] = add nsw i32 [[ADD14_1_10]], [[TMP61]]
+// CHECK-NEXT:    store i32 [[ADD14_1_11]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_11:%.*]] = getelement...
[truncated]

llvmbot (Member) commented Dec 4, 2023

@llvm/pr-subscribers-llvm-transforms

(Same comment body as above.)
+// CHECK-NEXT:    [[ADD14_2_5:%.*]] = add nsw i32 [[ADD14_2_4]], [[TMP32]]
+// CHECK-NEXT:    store i32 [[ADD14_2_5]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_5:%.*]] = getelementptr inbounds i32, ptr [[TMP29]], i64 3
+// CHECK-NEXT:    [[TMP33:%.*]] = load i32, ptr [[ARRAYIDX11_3_5]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_5:%.*]] = add nsw i32 [[ADD14_3_4]], [[TMP33]]
+// CHECK-NEXT:    store i32 [[ADD14_3_5]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_6:%.*]] = icmp eq i32 [[DIMS]], 6
+// CHECK-NEXT:    br i1 [[CMP1_6]], label [[CLEANUP]], label [[IF_END_6:%.*]]
+// CHECK:       if.end.6:
+// CHECK-NEXT:    [[ARRAYIDX_6:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 6
+// CHECK-NEXT:    [[TMP34:%.*]] = load ptr, ptr [[ARRAYIDX_6]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP35:%.*]] = load i32, ptr [[TMP34]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_6:%.*]] = add nsw i32 [[ADD14_5]], [[TMP35]]
+// CHECK-NEXT:    store i32 [[ADD14_6]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 1
+// CHECK-NEXT:    [[TMP36:%.*]] = load i32, ptr [[ARRAYIDX11_1_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_6:%.*]] = add nsw i32 [[ADD14_1_5]], [[TMP36]]
+// CHECK-NEXT:    store i32 [[ADD14_1_6]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 2
+// CHECK-NEXT:    [[TMP37:%.*]] = load i32, ptr [[ARRAYIDX11_2_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_6:%.*]] = add nsw i32 [[ADD14_2_5]], [[TMP37]]
+// CHECK-NEXT:    store i32 [[ADD14_2_6]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_6:%.*]] = getelementptr inbounds i32, ptr [[TMP34]], i64 3
+// CHECK-NEXT:    [[TMP38:%.*]] = load i32, ptr [[ARRAYIDX11_3_6]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_6:%.*]] = add nsw i32 [[ADD14_3_5]], [[TMP38]]
+// CHECK-NEXT:    store i32 [[ADD14_3_6]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_7:%.*]] = icmp eq i32 [[DIMS]], 7
+// CHECK-NEXT:    br i1 [[CMP1_7]], label [[CLEANUP]], label [[IF_END_7:%.*]]
+// CHECK:       if.end.7:
+// CHECK-NEXT:    [[ARRAYIDX_7:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 7
+// CHECK-NEXT:    [[TMP39:%.*]] = load ptr, ptr [[ARRAYIDX_7]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP40:%.*]] = load i32, ptr [[TMP39]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_7:%.*]] = add nsw i32 [[ADD14_6]], [[TMP40]]
+// CHECK-NEXT:    store i32 [[ADD14_7]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 1
+// CHECK-NEXT:    [[TMP41:%.*]] = load i32, ptr [[ARRAYIDX11_1_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_7:%.*]] = add nsw i32 [[ADD14_1_6]], [[TMP41]]
+// CHECK-NEXT:    store i32 [[ADD14_1_7]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 2
+// CHECK-NEXT:    [[TMP42:%.*]] = load i32, ptr [[ARRAYIDX11_2_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_7:%.*]] = add nsw i32 [[ADD14_2_6]], [[TMP42]]
+// CHECK-NEXT:    store i32 [[ADD14_2_7]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_7:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 3
+// CHECK-NEXT:    [[TMP43:%.*]] = load i32, ptr [[ARRAYIDX11_3_7]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_7:%.*]] = add nsw i32 [[ADD14_3_6]], [[TMP43]]
+// CHECK-NEXT:    store i32 [[ADD14_3_7]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_8:%.*]] = icmp eq i32 [[DIMS]], 8
+// CHECK-NEXT:    br i1 [[CMP1_8]], label [[CLEANUP]], label [[IF_END_8:%.*]]
+// CHECK:       if.end.8:
+// CHECK-NEXT:    [[ARRAYIDX_8:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 8
+// CHECK-NEXT:    [[TMP44:%.*]] = load ptr, ptr [[ARRAYIDX_8]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP45:%.*]] = load i32, ptr [[TMP44]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_8:%.*]] = add nsw i32 [[ADD14_7]], [[TMP45]]
+// CHECK-NEXT:    store i32 [[ADD14_8]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 1
+// CHECK-NEXT:    [[TMP46:%.*]] = load i32, ptr [[ARRAYIDX11_1_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_8:%.*]] = add nsw i32 [[ADD14_1_7]], [[TMP46]]
+// CHECK-NEXT:    store i32 [[ADD14_1_8]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 2
+// CHECK-NEXT:    [[TMP47:%.*]] = load i32, ptr [[ARRAYIDX11_2_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_8:%.*]] = add nsw i32 [[ADD14_2_7]], [[TMP47]]
+// CHECK-NEXT:    store i32 [[ADD14_2_8]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_8:%.*]] = getelementptr inbounds i32, ptr [[TMP44]], i64 3
+// CHECK-NEXT:    [[TMP48:%.*]] = load i32, ptr [[ARRAYIDX11_3_8]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_8:%.*]] = add nsw i32 [[ADD14_3_7]], [[TMP48]]
+// CHECK-NEXT:    store i32 [[ADD14_3_8]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_9:%.*]] = icmp eq i32 [[DIMS]], 9
+// CHECK-NEXT:    br i1 [[CMP1_9]], label [[CLEANUP]], label [[IF_END_9:%.*]]
+// CHECK:       if.end.9:
+// CHECK-NEXT:    [[ARRAYIDX_9:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 9
+// CHECK-NEXT:    [[TMP49:%.*]] = load ptr, ptr [[ARRAYIDX_9]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP50:%.*]] = load i32, ptr [[TMP49]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_9:%.*]] = add nsw i32 [[ADD14_8]], [[TMP50]]
+// CHECK-NEXT:    store i32 [[ADD14_9]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 1
+// CHECK-NEXT:    [[TMP51:%.*]] = load i32, ptr [[ARRAYIDX11_1_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_9:%.*]] = add nsw i32 [[ADD14_1_8]], [[TMP51]]
+// CHECK-NEXT:    store i32 [[ADD14_1_9]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 2
+// CHECK-NEXT:    [[TMP52:%.*]] = load i32, ptr [[ARRAYIDX11_2_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_9:%.*]] = add nsw i32 [[ADD14_2_8]], [[TMP52]]
+// CHECK-NEXT:    store i32 [[ADD14_2_9]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_9:%.*]] = getelementptr inbounds i32, ptr [[TMP49]], i64 3
+// CHECK-NEXT:    [[TMP53:%.*]] = load i32, ptr [[ARRAYIDX11_3_9]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_9:%.*]] = add nsw i32 [[ADD14_3_8]], [[TMP53]]
+// CHECK-NEXT:    store i32 [[ADD14_3_9]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_10:%.*]] = icmp eq i32 [[DIMS]], 10
+// CHECK-NEXT:    br i1 [[CMP1_10]], label [[CLEANUP]], label [[IF_END_10:%.*]]
+// CHECK:       if.end.10:
+// CHECK-NEXT:    [[ARRAYIDX_10:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 10
+// CHECK-NEXT:    [[TMP54:%.*]] = load ptr, ptr [[ARRAYIDX_10]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP55:%.*]] = load i32, ptr [[TMP54]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_10:%.*]] = add nsw i32 [[ADD14_9]], [[TMP55]]
+// CHECK-NEXT:    store i32 [[ADD14_10]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 1
+// CHECK-NEXT:    [[TMP56:%.*]] = load i32, ptr [[ARRAYIDX11_1_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_10:%.*]] = add nsw i32 [[ADD14_1_9]], [[TMP56]]
+// CHECK-NEXT:    store i32 [[ADD14_1_10]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 2
+// CHECK-NEXT:    [[TMP57:%.*]] = load i32, ptr [[ARRAYIDX11_2_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_2_10:%.*]] = add nsw i32 [[ADD14_2_9]], [[TMP57]]
+// CHECK-NEXT:    store i32 [[ADD14_2_10]], ptr [[ARRAYIDX13_2]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_3_10:%.*]] = getelementptr inbounds i32, ptr [[TMP54]], i64 3
+// CHECK-NEXT:    [[TMP58:%.*]] = load i32, ptr [[ARRAYIDX11_3_10]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_3_10:%.*]] = add nsw i32 [[ADD14_3_9]], [[TMP58]]
+// CHECK-NEXT:    store i32 [[ADD14_3_10]], ptr [[ARRAYIDX13_3]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[CMP1_11:%.*]] = icmp eq i32 [[DIMS]], 11
+// CHECK-NEXT:    br i1 [[CMP1_11]], label [[CLEANUP]], label [[IF_END_11:%.*]]
+// CHECK:       if.end.11:
+// CHECK-NEXT:    [[ARRAYIDX_11:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 11
+// CHECK-NEXT:    [[TMP59:%.*]] = load ptr, ptr [[ARRAYIDX_11]], align 8, !tbaa [[TBAA3]]
+// CHECK-NEXT:    [[TMP60:%.*]] = load i32, ptr [[TMP59]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_11:%.*]] = add nsw i32 [[ADD14_10]], [[TMP60]]
+// CHECK-NEXT:    store i32 [[ADD14_11]], ptr [[OUT]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_1_11:%.*]] = getelementptr inbounds i32, ptr [[TMP59]], i64 1
+// CHECK-NEXT:    [[TMP61:%.*]] = load i32, ptr [[ARRAYIDX11_1_11]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ADD14_1_11:%.*]] = add nsw i32 [[ADD14_1_10]], [[TMP61]]
+// CHECK-NEXT:    store i32 [[ADD14_1_11]], ptr [[ARRAYIDX13_1]], align 4, !tbaa [[TBAA7]]
+// CHECK-NEXT:    [[ARRAYIDX11_2_11:%.*]] = getelement...
[truncated]

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

The check failure "dr2xx.cpp Line 1297: conversion from 'T' to 'unsigned long long' is ambiguous" seems unrelated to this change.

@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.

And I think you need to precommit tests first.

@xiangzh1 xiangzh1 requested a review from bcl5980 December 4, 2023 06:07
@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.

And I think you need to precommit tests first.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We should add a TTI check for this condition. I believe not unrolling on X86 is the correct decision; only GPU backends like AMDGPU or NVPTX, with their heavy stack cost, need this.
And I think you need to precommit tests first.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

Yeah, the question is whether we unroll or not. But in this case, generally, I believe we don't want to unroll it.

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from 9f3ff6f to e963223 Compare December 4, 2023 08:34
@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

And I think you need to precommit tests first.

Done.

In fact, there is no direct/strong relation to stack cost; it mostly depends on whether we unroll or not (or on other loop optimizations). Maybe we should check for "unroll" info (e.g. #pragma unroll; any target seeing this hint should try its best to unroll) before deciding whether to do this folding. One complication is that loop info is not yet established at this point.

Yeah, the question is whether we unroll or not. But in this case, generally, I believe we don't want to unroll it.

Updated to handle the unroll hint.
Thanks for reviewing!

@bcl5980 bcl5980 requested a review from nikic December 4, 2023 10:30
@bcl5980
Contributor

bcl5980 commented Dec 4, 2023

We need at least one more IR test.

@bcl5980 bcl5980 requested a review from goldsteinn December 4, 2023 10:33
Contributor

@nikic nikic left a comment


LoopUnroll supports upper bound unrolling. Why is it not working in this case?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

We need at least one more IR test.

Let me try

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

LoopUnroll supports upper bound unrolling. Why is it not working in this case?
for example:

    #pragma unroll
    for (int I = 0; I < LoopCount; ++I) { // LoopCount is a constant > 1
      if (Cond2) {
        break;
      }
      xxx loop body;
    }

After the branch folding, the old loop condition "I < LoopCount" is changed/disappears, so I don't think unroll can still determine the upper bound.

@nikic
Contributor

nikic commented Dec 4, 2023

Can you please share the IR before the unroll pass?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

Can you please share the IR before the unroll pass?
Sure:

with this patch:

; *** IR Dump After LoopDeletionPass on for.body ***

; Preheader:
entry:
  br label %for.body

; Loop:
for.body:                                         ; preds = %entry, %for.body7
  %Dim.027 = phi i32 [ 0, %entry ], [ %inc16, %for.body7 ]
  %cmp1 = icmp eq i32 %Dim.027, %Dims
  br i1 %cmp1, label %cleanup, label %if.end

if.end:                                           ; preds = %for.body
  %idxprom = zext nneg i32 %Dim.027 to i64
  %arrayidx = getelementptr inbounds ptr, ptr %Arr, i64 %idxprom
  br label %for.body7

for.body7:                                        ; preds = %if.end
  %0 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %1 = load i32, ptr %0, align 4, !tbaa !7
  %2 = load i32, ptr %Out, align 4, !tbaa !7
  %add14 = add nsw i32 %2, %1
  store i32 %add14, ptr %Out, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %3 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.1 = getelementptr inbounds i32, ptr %3, i64 1
  %4 = load i32, ptr %arrayidx11.1, align 4, !tbaa !7
  %arrayidx13.1 = getelementptr inbounds i32, ptr %Out, i64 1
  %5 = load i32, ptr %arrayidx13.1, align 4, !tbaa !7
  %add14.1 = add nsw i32 %5, %4
  store i32 %add14.1, ptr %arrayidx13.1, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %6 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.2 = getelementptr inbounds i32, ptr %6, i64 2
  %7 = load i32, ptr %arrayidx11.2, align 4, !tbaa !7
  %arrayidx13.2 = getelementptr inbounds i32, ptr %Out, i64 2
  %8 = load i32, ptr %arrayidx13.2, align 4, !tbaa !7
  %add14.2 = add nsw i32 %8, %7
  store i32 %add14.2, ptr %arrayidx13.2, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %9 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.3 = getelementptr inbounds i32, ptr %9, i64 3
  %10 = load i32, ptr %arrayidx11.3, align 4, !tbaa !7
  %arrayidx13.3 = getelementptr inbounds i32, ptr %Out, i64 3
  %11 = load i32, ptr %arrayidx13.3, align 4, !tbaa !7
  %add14.3 = add nsw i32 %11, %10
  store i32 %add14.3, ptr %arrayidx13.3, align 4, !tbaa !7
  tail call void @_Z3barv() #2
  %inc16 = add nuw nsw i32 %Dim.027, 1
  %exitcond = icmp ne i32 %inc16, 16
  br i1 %exitcond, label %for.body, label %cleanup, !llvm.loop !9

; Exit blocks
cleanup:                                          ; preds = %for.body, %for.body7
  ret void

; *** IR Dump After LoopFullUnrollPass on for.body (invalidated) ***

without this patch:
; *** IR Dump After LoopDeletionPass on if.end ***

; Preheader:
if.end.preheader:                                 ; preds = %entry
  %0 = add i32 %Dims, -1
  %umin = call i32 @llvm.umin.i32(i32 %0, i32 15)
  %1 = add nuw nsw i32 %umin, 1
  %wide.trip.count = zext i32 %1 to i64
  br label %if.end

; Loop:
if.end:                                           ; preds = %if.end.preheader, %for.body7
  %indvars.iv = phi i64 [ 0, %if.end.preheader ], [ %indvars.iv.next, %for.body7 ]
  %arrayidx = getelementptr inbounds ptr, ptr %Arr, i64 %indvars.iv
  br label %for.body7

for.body7:                                        ; preds = %if.end
  %2 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %3 = load i32, ptr %2, align 4, !tbaa !7
  %4 = load i32, ptr %Out, align 4, !tbaa !7
  %add14 = add nsw i32 %4, %3
  store i32 %add14, ptr %Out, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %5 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.1 = getelementptr inbounds i32, ptr %5, i64 1
  %6 = load i32, ptr %arrayidx11.1, align 4, !tbaa !7
  %arrayidx13.1 = getelementptr inbounds i32, ptr %Out, i64 1
  %7 = load i32, ptr %arrayidx13.1, align 4, !tbaa !7
  %add14.1 = add nsw i32 %7, %6
  store i32 %add14.1, ptr %arrayidx13.1, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %8 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.2 = getelementptr inbounds i32, ptr %8, i64 2
  %9 = load i32, ptr %arrayidx11.2, align 4, !tbaa !7
  %arrayidx13.2 = getelementptr inbounds i32, ptr %Out, i64 2
  %10 = load i32, ptr %arrayidx13.2, align 4, !tbaa !7
  %add14.2 = add nsw i32 %10, %9
  store i32 %add14.2, ptr %arrayidx13.2, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %11 = load ptr, ptr %arrayidx, align 8, !tbaa !3
  %arrayidx11.3 = getelementptr inbounds i32, ptr %11, i64 3
  %12 = load i32, ptr %arrayidx11.3, align 4, !tbaa !7
  %arrayidx13.3 = getelementptr inbounds i32, ptr %Out, i64 3
  %13 = load i32, ptr %arrayidx13.3, align 4, !tbaa !7
  %add14.3 = add nsw i32 %13, %12
  store i32 %add14.3, ptr %arrayidx13.3, align 4, !tbaa !7
  tail call void @_Z3barv() #3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %cleanup.loopexit, label %if.end, !llvm.loop !9

; Exit blocks
cleanup.loopexit:                                 ; preds = %for.body7
  br label %cleanup
; *** IR Dump After LoopFullUnrollPass on if.end ***

@xiangzh1
Contributor Author

xiangzh1 commented Dec 4, 2023

I first guessed the trip count might be too small, but I changed the iteration count from 16 to 1600, and without this patch it still does not unroll.

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

AMDGPU cannot unroll this case:

https://godbolt.org/z/4Pq3bnzTT

But the same code on X86 apparently can be unrolled:

https://godbolt.org/z/zr8aTG1KW

We may need to keep debugging this.

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

AMDGPU cannot unroll this case:

https://godbolt.org/z/4Pq3bnzTT

But the same code on X86 apparently can be unrolled:

https://godbolt.org/z/zr8aTG1KW

We may need to keep debugging this.

X86 does very conservative unrolling too; its upper bound is set to 4 (the default is 8). If we do not fold the loop branch, it can fully unroll (16 iterations).

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I think we should follow this principle:
if a loop is expected to be unrolled later, we should not destroy its loop count info.

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

I think we should follow this principle: if a loop is expected to be unrolled later, we should not destroy its loop count info.

The idea is right. But I think what nikic is saying is that loop unroll should handle this case (upper-bound unrolling), yet it doesn't work. We need to find out why loop unroll doesn't work (maybe in UnrollRuntimeLoopRemainder); then we can decide whether this can be done in loop unroll or whether to stop SimplifyCFG's transform.

AMDGPU cannot unroll this case:
https://godbolt.org/z/4Pq3bnzTT
But the same code on X86 apparently can be unrolled:
https://godbolt.org/z/zr8aTG1KW
We may need to keep debugging this.

X86 does very conservative unrolling too; its upper bound is set to 4 (the default is 8). If we do not fold the loop branch, it can fully unroll (16 iterations).

So where is the difference that lets X86 partially unroll while AMDGPU cannot unroll at all?

@bcl5980
Contributor

bcl5980 commented Dec 5, 2023

https://godbolt.org/z/cMeE61bhf
Loop unroll with -unroll-runtime can partially unroll the case.
@nikic It looks like, if we don't avoid the transform, it becomes a runtime unroll. The case before SimplifyCFG is https://godbolt.org/z/5MoYM8rGn.
@xiangzh1's solution looks fine to me if we do not involve LoopInfo in SimplifyCFG. And we still need a minimal IR test for it.

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

So where is the difference that lets X86 partially unroll while AMDGPU cannot unroll at all?
https://godbolt.org/z/cMeE61bhf Loop unroll with -unroll-runtime can partially unroll the case. @nikic It looks like, if we don't avoid the transform, it becomes a runtime unroll. The case before SimplifyCFG is https://godbolt.org/z/5MoYM8rGn. @xiangzh1's solution looks fine to me if we do not involve LoopInfo in SimplifyCFG. And we still need a minimal IR test for it.

1. In fact, I don't much care about the unrolling differences between targets. The loop unroll pass consults the TTI port, so "doing a partial unroll or not", or "partial unroll with a different unroll count", makes sense to me.
What I care more about is a known loop count becoming unknown. That is a big change for unroll (even when it still succeeds). For example, a loop with a small known count can usually be fully unrolled, which greatly simplifies the address (offset) calculations across the old iterations (and then we can apply many other optimizations, e.g. SROA, to those simplified calculations). None of this works for an unknown loop count.

2. I am creating the minimal IR test. (I'll replace the current .cu test with it, since the current test uses -O2.)

Thanks again!

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from e963223 to 6a80c39 Compare December 5, 2023 08:55
@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

Update: added IR test llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll.
(Not sure it is minimal; llvm-reduce doesn't work well on it, so I created it manually.)

A constant-iteration loop with an unroll hint is usually expected to be unrolled by consumers; folding branches in such a loop header in SimplifyCFG will break the unroll optimization.
@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch from 6a80c39 to cbcc7f3 Compare December 5, 2023 09:04
@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

rebase

@nikic
Contributor

nikic commented Dec 5, 2023

I checked, and for your test case, LoopUnroll recognizes the loop as an UpperBound unrolling candidate, but does not perform unrolling due to cost model.

The pragma unroll metadata currently only takes effect if there is an exact trip count, but not if there is an upper bound trip count. Making it work with an upper bound trip count as well should fix your case. See the code in shouldPragmaUnroll().

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I checked, and for your test case, LoopUnroll recognizes the loop as an UpperBound unrolling candidate, but does not perform unrolling due to cost model.

The pragma unroll metadata currently only takes effect if there is an exact trip count, but not if there is an upper bound trip count. Making it work with an upper bound trip count as well should fix your case. See the code in shouldPragmaUnroll().

First, many thanks for checking the test.
Yes, that's another way to unroll too, but branch folding affected the cost model, which did not anticipate the later unroll. In fact, we don't particularly want to unroll with UpperBound. Even when an UpperBound unroll succeeds, it is no better than unrolling with a known loop count, especially for a small known loop count, which can usually be fully unrolled; on GPU this is very helpful for simplifying address (offset) calculations, which matter for the SROA optimization and local memory use.

@nikic
Contributor

nikic commented Dec 5, 2023

I think you are confusing upper bound and runtime unrolling. An upper bound unroll is a type of full unroll. In this case it would unroll it to 16 iterations, which is what you want, no?

@xiangzh1
Contributor Author

xiangzh1 commented Dec 5, 2023

I think you are confusing upper bound and runtime unrolling. An upper bound unroll is a type of full unroll. In this case it would unroll it to 16 iterations, which is what you want, no?

Yes! :) I thought an upper bound meant "partial unroll", and I am surprised it can fully unroll to 16 iterations even with the branch folding. Anyway, let me check shouldPragmaUnroll(). Thank you a lot!

@xiangzh1
Contributor Author

xiangzh1 commented Dec 7, 2023

Hi friends, I created a new PR at #74703, many thanks for reviewing!!

@xiangzh1 xiangzh1 force-pushed the users/xiangzhangllvm/refine-simplify-CFG-for-loop-unroll branch 2 times, most recently from c5043e5 to cbcc7f3 Compare December 8, 2023 01:52
@xiangzh1 xiangzh1 closed this Dec 8, 2023