Skip to content

Commit e408d1d

Browse files
committed
[SLP]Improve graph reordering.
Reworked reordering algorithm. Originally, the compiler just tried to detect the most common order in the reordarable nodes (loads, stores, extractelements,extractvalues) and then fully rebuilding the graph in the best order. This was not effecient, since it required an extra memory and time for building/rebuilding tree, double the use of the scheduling budget, which could lead to missing vectorization due to exausted scheduling resources. Patch provide 2-way approach for graph reodering problem. At first, all reordering is done in-place, it doe not required tree deleting/rebuilding, it just rotates the scalars/orders/reuses masks in the graph node. The first step (top-to bottom) rotates the whole graph, similarly to the previous implementation. Compiler counts the number of the most used orders of the graph nodes with the same vectorization factor and then rotates the subgraph with the given vectorization factor to the most used order, if it is not empty. Then repeats the same procedure for the subgraphs with the smaller vectorization factor. We can do this because we still need to reshuffle smaller subgraph when buildiong operands for the graph nodes with lasrger vectorization factor, we can rotate just subgraph, not the whole graph. The second step (bottom-to-top) scans through the leaves and tries to detect the users of the leaves which can be reordered. If the leaves can be reorder in the best fashion, they are reordered and their user too. It allows to remove double shuffles to the same ordering of the operands in many cases and just reorder the user operations instead. Plus, it moves the final shuffles closer to the top of the graph and in many cases allows to remove extra shuffle because the same procedure is repeated again and we can again merge some reordering masks and reorder user nodes instead of the operands. Also, patch improves cost model for gathering of loads, which improves x264 benchmark in some cases. Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264, +3% for 508.namd, improves most of other benchmarks. The compile and link time are almost the same, though in some cases it should be better (we're not doing an extra instruction scheduling anymore) + we may vectorize more code for the large basic blocks again because of saving scheduling budget. Differential Revision: https://reviews.llvm.org/D105020
1 parent 2183212 commit e408d1d

File tree

11 files changed

+880
-474
lines changed

11 files changed

+880
-474
lines changed

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Lines changed: 716 additions & 291 deletions
Large diffs are not rendered by default.

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

Lines changed: 38 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -6,16 +6,14 @@ target triple = "aarch64--linux-gnu"
66

77
define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
88
; CHECK-LABEL: @build_vec_v2i64(
9-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
10-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
11-
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]
12-
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]
13-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>
14-
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]
15-
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]
16-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>
17-
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
18-
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
9+
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
10+
; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
11+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
12+
; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
13+
; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
14+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
15+
; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
16+
; CHECK-NEXT: ret <2 x i64> [[TMP7]]
1917
;
2018
%v0.0 = extractelement <2 x i64> %v0, i32 0
2119
%v0.1 = extractelement <2 x i64> %v0, i32 1
@@ -74,16 +72,14 @@ define void @store_chain_v2i64(i64* %a, i64* %b, i64* %c) {
7472

7573
define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
7674
; CHECK-LABEL: @build_vec_v4i32(
77-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
78-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
79-
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
80-
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
81-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
82-
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]
83-
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]
84-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
85-
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
86-
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
75+
; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
76+
; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
77+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
78+
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
79+
; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
80+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
81+
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
82+
; CHECK-NEXT: ret <4 x i32> [[TMP7]]
8783
;
8884
%v0.0 = extractelement <4 x i32> %v0, i32 0
8985
%v0.1 = extractelement <4 x i32> %v0, i32 1
@@ -114,16 +110,14 @@ define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
114110

115111
define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
116112
; CHECK-LABEL: @build_vec_v4i32_reuse_0(
117-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
118-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
119-
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]
120-
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]
121-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>
122-
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]
123-
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]
124-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>
125-
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
126-
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
113+
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
114+
; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
115+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
116+
; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
117+
; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
118+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
119+
; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
120+
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
127121
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
128122
;
129123
%v0.0 = extractelement <2 x i32> %v0, i32 0
@@ -229,22 +223,20 @@ define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {
229223

230224
define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
231225
; CHECK-LABEL: @reduction_v4i32(
232-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
233-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
234-
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
235-
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
236-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
237-
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]
238-
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]
239-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
240-
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
241-
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>
242-
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>
243-
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>
244-
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]
245-
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]
246-
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
247-
; CHECK-NEXT: ret i32 [[TMP15]]
226+
; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
227+
; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
228+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
229+
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
230+
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
231+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
232+
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
233+
; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
234+
; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
235+
; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
236+
; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
237+
; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
238+
; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
239+
; CHECK-NEXT: ret i32 [[TMP13]]
248240
;
249241
%v0.0 = extractelement <4 x i32> %v0, i32 0
250242
%v0.1 = extractelement <4 x i32> %v0, i32 1

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

Lines changed: 38 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -6,16 +6,14 @@ target triple = "aarch64--linux-gnu"
66

77
define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
88
; CHECK-LABEL: @build_vec_v2i64(
9-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
10-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
11-
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]
12-
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]
13-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>
14-
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]
15-
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]
16-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>
17-
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
18-
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
9+
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
10+
; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
11+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
12+
; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
13+
; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
14+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
15+
; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
16+
; CHECK-NEXT: ret <2 x i64> [[TMP7]]
1917
;
2018
%v0.0 = extractelement <2 x i64> %v0, i32 0
2119
%v0.1 = extractelement <2 x i64> %v0, i32 1
@@ -74,16 +72,14 @@ define void @store_chain_v2i64(i64* %a, i64* %b, i64* %c) {
7472

7573
define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
7674
; CHECK-LABEL: @build_vec_v4i32(
77-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
78-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
79-
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
80-
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
81-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
82-
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]
83-
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]
84-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
85-
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
86-
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
75+
; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
76+
; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
77+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
78+
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
79+
; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
80+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
81+
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
82+
; CHECK-NEXT: ret <4 x i32> [[TMP7]]
8783
;
8884
%v0.0 = extractelement <4 x i32> %v0, i32 0
8985
%v0.1 = extractelement <4 x i32> %v0, i32 1
@@ -114,16 +110,14 @@ define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
114110

115111
define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
116112
; CHECK-LABEL: @build_vec_v4i32_reuse_0(
117-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
118-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
119-
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]
120-
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]
121-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>
122-
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]
123-
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]
124-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>
125-
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
126-
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
113+
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
114+
; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
115+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
116+
; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
117+
; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
118+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
119+
; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
120+
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
127121
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
128122
;
129123
%v0.0 = extractelement <2 x i32> %v0, i32 0
@@ -229,22 +223,20 @@ define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {
229223

230224
define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
231225
; CHECK-LABEL: @reduction_v4i32(
232-
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
233-
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
234-
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
235-
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
236-
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
237-
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]
238-
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]
239-
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
240-
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
241-
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>
242-
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>
243-
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>
244-
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]
245-
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]
246-
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
247-
; CHECK-NEXT: ret i32 [[TMP15]]
226+
; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
227+
; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
228+
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
229+
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
230+
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
231+
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
232+
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
233+
; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
234+
; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
235+
; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
236+
; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
237+
; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
238+
; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
239+
; CHECK-NEXT: ret i32 [[TMP13]]
248240
;
249241
%v0.0 = extractelement <4 x i32> %v0, i32 0
250242
%v0.1 = extractelement <4 x i32> %v0, i32 1

0 commit comments

Comments
 (0)