Commit d5a7a48

[SLP]Reduce number of alternate instruction, where possible
The patch tries to remove wide alternate operations. Currently, the SLP vectorizer emits something like this:

```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

is transformed to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```

i.e., half of the results of each wide operation are simply unused. This increases register pressure and potentially doubles the number of operations.

The patch introduces a SplitVectorize mode, which splits the operations by opcode and instead produces something like this:

```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```

This improves performance by reducing the number of operations. It also enables some other improvements, such as better graph reordering.

-O3+LTO, AVX512

Metric: size..text

| Program | size..text results | size..text results0 | diff |
|---|---:|---:|---:|
| test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test | 277800.00 | 280536.00 | 1.0% |
| test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test | 81802.00 | 82426.00 | 0.8% |
| test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test | 790552.00 | 790952.00 | 0.1% |
| test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test | 383795.00 | 383987.00 | 0.1% |
| test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test | 2075541.00 | 2076501.00 | 0.0% |
| test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test | 2075541.00 | 2076501.00 | 0.0% |
| test-suite :: MultiSource/Benchmarks/Bullet/bullet.test | 312702.00 | 312766.00 | 0.0% |
| test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test | 12569783.00 | 12569751.00 | -0.0% |
| test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test | 2049374.00 | 2049358.00 | -0.0% |
| test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test | 1091836.00 | 1091772.00 | -0.0% |
| test-suite :: MultiSource/Applications/JM/lencod/lencod.test | 852339.00 | 852211.00 | -0.0% |
| test-suite :: MultiSource/Applications/oggenc/oggenc.test | 190651.00 | 190523.00 | -0.1% |
| test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test | 44203.00 | 44155.00 | -0.1% |
| test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test | 12997.00 | 12981.00 | -0.1% |
| test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test | 668971.00 | 658427.00 | -1.6% |
| test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test | 668971.00 | 658427.00 | -1.6% |

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined
FreeBench/pifft - extra stores `<8 x double>` vectorized, some other extra vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar to x264
CINT2017speed/600.perlbench_s, CINT2017rate/500.perlbench_r - significantly more compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store `<16 x i32>` and other changes similar to x264
Applications/oggenc - extra store `<16 x i8>`, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s, CINT2017rate/525.x264_r - the number of instructions increased, but they appear to be more performant. E.g., for the function x264_pixel_satd_8x8, llvm-mca reports better throughput: 84 for the current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in the pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like an issue with shuffle costs)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

| Program | size..text results | size..text results0 | diff |
|---|---:|---:|---:|
| test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test | 587336.00 | 587668.00 | 0.1% |
| test-suite :: MultiSource/Applications/JM/lencod/lencod.test | 643308.00 | 643614.00 | 0.0% |
| test-suite :: MultiSource/Applications/d/make_dparser.test | 79678.00 | 79710.00 | 0.0% |
| test-suite :: MultiSource/Benchmarks/Bullet/bullet.test | 277322.00 | 277420.00 | 0.0% |
| test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test | 933660.00 | 933682.00 | 0.0% |
| test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test | 9497722.00 | 9497682.00 | -0.0% |
| test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test | 1767806.00 | 1767772.00 | -0.0% |
| test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test | 1767806.00 | 1767772.00 | -0.0% |
| test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test | 148038.00 | 148024.00 | -0.0% |
| test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test | 283036.00 | 283008.00 | -0.0% |
| test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test | 4776.00 | 4772.00 | -0.1% |
| test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test | 540582.00 | 511772.00 | -5.3% |
| test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test | 540582.00 | 511772.00 | -5.3% |

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store `<8 x float>` in the loop, extra store `<8 x i8>` in the loop
CINT2017rate/500.perlbench_r, CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r, CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving registers, similar to X86; extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs)

Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: llvm#123360
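
The lane bookkeeping in the two IR snippets of the commit message can be checked with a small standalone C++ simulation. This is only a sketch under stated assumptions: `shuffleVector`, `scalarReference`, `alternateForm`, and `splitForm` are illustrative names, not SLP vectorizer code, and gathering the even and odd lanes into narrower vectors is a simplification of how operands might reach the split operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Emulates shufflevector: mask indices below A.size() select from A,
// the remaining indices select from B.
Vec shuffleVector(const Vec &A, const Vec &B,
                  const std::vector<std::size_t> &Mask) {
  Vec Result;
  Result.reserve(Mask.size());
  for (std::size_t Idx : Mask) {
    assert(Idx < A.size() + B.size() && "mask index out of range");
    Result.push_back(Idx < A.size() ? A[Idx] : B[Idx - A.size()]);
  }
  return Result;
}

// Scalar semantics of the example: even lanes add, odd lanes subtract.
Vec scalarReference(const Vec &X, const Vec &Y) {
  assert(X.size() == Y.size());
  Vec Result(X.size());
  for (std::size_t I = 0; I < X.size(); ++I)
    Result[I] = (I % 2 == 0) ? X[I] + Y[I] : X[I] - Y[I];
  return Result;
}

// Alternate form: two 8-wide operations (16 lane results), of which the
// shuffle keeps only 8.
Vec alternateForm(const Vec &X, const Vec &Y) {
  assert(X.size() == 8 && Y.size() == 8);
  Vec Add(8), Sub(8);
  for (std::size_t I = 0; I < 8; ++I) {
    Add[I] = X[I] + Y[I];
    Sub[I] = X[I] - Y[I];
  }
  return shuffleVector(Add, Sub, {0, 9, 2, 11, 4, 13, 6, 15});
}

// Split form: one 4-wide add and one 4-wide sub (8 lane results), then an
// interleaving shuffle that restores the original lane order.
Vec splitForm(const Vec &X, const Vec &Y) {
  assert(X.size() == 8 && Y.size() == 8);
  Vec Add, Sub;
  for (std::size_t I = 0; I < 8; ++I) {
    if (I % 2 == 0)
      Add.push_back(X[I] + Y[I]);
    else
      Sub.push_back(X[I] - Y[I]);
  }
  return shuffleVector(Add, Sub, {0, 4, 1, 5, 2, 6, 3, 7});
}
```

In this model both forms agree with the scalar reference for any 8-element inputs; the difference is that the alternate form computes 16 lane results and discards half, while the split form computes exactly the 8 that are used.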
1 parent 5cba1f1 commit d5a7a48

22 files changed, +1326 -854 lines changed

llvm/include/llvm/Analysis/TargetTransformInfo.h

Lines changed: 8 additions & 0 deletions
@@ -1759,6 +1759,10 @@ class TargetTransformInfo {
   /// scalable version of the vectorized loop.
   bool preferFixedOverScalableIfEqualCost() const;
 
+  /// \returns True if target prefers SLP vectorizer with alternate opcode
+  /// vectorization, false - otherwise.
+  bool preferAlternateOpcodeVectorization() const;
+
   /// \returns True if the target prefers reductions in loop.
   bool preferInLoopReduction(unsigned Opcode, Type *Ty,
                              ReductionFlags Flags) const;
@@ -2306,6 +2310,7 @@ class TargetTransformInfo::Concept {
                                         unsigned ChainSizeInBytes,
                                         VectorType *VecTy) const = 0;
   virtual bool preferFixedOverScalableIfEqualCost() const = 0;
+  virtual bool preferAlternateOpcodeVectorization() const = 0;
   virtual bool preferInLoopReduction(unsigned Opcode, Type *Ty,
                                      ReductionFlags) const = 0;
   virtual bool preferPredicatedReductionSelect(unsigned Opcode, Type *Ty,
@@ -3106,6 +3111,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   bool preferFixedOverScalableIfEqualCost() const override {
     return Impl.preferFixedOverScalableIfEqualCost();
   }
+  bool preferAlternateOpcodeVectorization() const override {
+    return Impl.preferAlternateOpcodeVectorization();
+  }
   bool preferInLoopReduction(unsigned Opcode, Type *Ty,
                              ReductionFlags Flags) const override {
     return Impl.preferInLoopReduction(Opcode, Ty, Flags);
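
The three hunks above add the same hook at three layers: the public class, the abstract `Concept`, and the forwarding `Model`. A minimal standalone sketch of that type-erasure layering might look as follows. `MiniTTI`, `ToyDefaultImpl`, and `ToyTargetImpl` are hypothetical names invented for illustration; only the hook name `preferAlternateOpcodeVectorization` comes from the diff.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical implementations; the real ones live in the target backends.
struct ToyDefaultImpl {
  bool preferAlternateOpcodeVectorization() const { return true; }
};
struct ToyTargetImpl {
  bool preferAlternateOpcodeVectorization() const { return false; }
};

// Hypothetical miniature of the public-class / Concept / Model layering.
class MiniTTI {
  // Abstract interface: one pure virtual per hook.
  struct Concept {
    virtual ~Concept() = default;
    virtual bool preferAlternateOpcodeVectorization() const = 0;
  };

  // Adapter that forwards each virtual call to a concrete implementation.
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    bool preferAlternateOpcodeVectorization() const override {
      return Impl.preferAlternateOpcodeVectorization();
    }
  };

  std::unique_ptr<Concept> TTIImpl;

public:
  template <typename T>
  explicit MiniTTI(T Impl)
      : TTIImpl(std::make_unique<Model<T>>(std::move(Impl))) {}

  // Public entry point: clients never see the concrete implementation type.
  bool preferAlternateOpcodeVectorization() const {
    assert(TTIImpl && "implementation must be set");
    return TTIImpl->preferAlternateOpcodeVectorization();
  }
};
```

With this shape, adding a hook means touching each layer once, which matches the pattern of the 8 added lines in this file.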

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Lines changed: 2 additions & 0 deletions
@@ -989,6 +989,8 @@ class TargetTransformInfoImplBase {
 
   bool preferFixedOverScalableIfEqualCost() const { return false; }
 
+  bool preferAlternateOpcodeVectorization() const { return true; }
+
   bool preferInLoopReduction(unsigned Opcode, Type *Ty,
                              TTI::ReductionFlags Flags) const {
     return false;

llvm/lib/Analysis/TargetTransformInfo.cpp

Lines changed: 4 additions & 0 deletions
@@ -1360,6 +1360,10 @@ bool TargetTransformInfo::preferFixedOverScalableIfEqualCost() const {
   return TTIImpl->preferFixedOverScalableIfEqualCost();
 }
 
+bool TargetTransformInfo::preferAlternateOpcodeVectorization() const {
+  return TTIImpl->preferAlternateOpcodeVectorization();
+}
+
 bool TargetTransformInfo::preferInLoopReduction(unsigned Opcode, Type *Ty,
                                                 ReductionFlags Flags) const {
   return TTIImpl->preferInLoopReduction(Opcode, Ty, Flags);

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
 
   unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const;
 
+  bool preferAlternateOpcodeVectorization() const { return false; }
+
   bool preferEpilogueVectorization() const {
     // Epilogue vectorization is usually unprofitable - tail folding or
     // a smaller VF would have been better. This a blunt hammer - we

llvm/lib/Target/X86/X86TargetTransformInfo.h

Lines changed: 1 addition & 0 deletions
@@ -292,6 +292,7 @@ class X86TTIImpl : public BasicTTIImplBase<X86TTIImpl> {
 
   TTI::MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
                                                     bool IsZeroCmp) const;
+  bool preferAlternateOpcodeVectorization() const { return false; }
   bool prefersVectorizedAddressing() const;
   bool supportsEfficientVectorElementLoadStore() const;
   bool enableInterleavedAccessVectorization();
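
The base implementation returns `true` while the RISC-V and X86 implementations return `false`, and none of these methods is `virtual`. A rough standalone sketch of how such a non-virtual default can be overridden through a CRTP base is shown below. All `Toy*` names, `queryHook`, and `chooseMode` are hypothetical, and the mapping from a `false` return to a split mode is an assumption for illustration only; the files shown here do not include the SLP vectorizer code that consumes the hook.

```cpp
#include <cassert>

// Hypothetical stand-in for the base implementation: supplies the default.
struct ToyImplBase {
  bool preferAlternateOpcodeVectorization() const { return true; }
};

// Hypothetical stand-in for a CRTP base: calls go through the derived type,
// so a same-named method in the target hides the default without `virtual`.
template <typename T> struct ToyBasicImpl : ToyImplBase {
  bool queryHook() const {
    return static_cast<const T *>(this)->preferAlternateOpcodeVectorization();
  }
};

// Keeps the inherited default (true).
struct ToyGenericTarget : ToyBasicImpl<ToyGenericTarget> {};

// Opts out, mirroring the one-line overrides in the RISC-V and X86 hunks.
struct ToyOptOutTarget : ToyBasicImpl<ToyOptOutTarget> {
  bool preferAlternateOpcodeVectorization() const { return false; }
};

enum class ToyMode { Alternate, Split };

// Assumption for illustration only: a target that does not prefer alternate
// opcode vectorization is steered toward the split form.
template <typename T> ToyMode chooseMode(const T &Target) {
  return Target.queryHook() ? ToyMode::Alternate : ToyMode::Split;
}

// Usage check: the generic target keeps the default, the other opts out.
void checkToyTargets() {
  assert(chooseMode(ToyGenericTarget{}) == ToyMode::Alternate);
  assert(chooseMode(ToyOptOutTarget{}) == ToyMode::Split);
}
```

The design keeps per-target opt-outs to a single line, as in the two target headers above, while targets that say nothing inherit the default.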
