Skip to content

[SLP]Vectorize gathered loads #107461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alexey-bataev
Copy link
Member

@alexey-bataev alexey-bataev commented Sep 5, 2024

Final gather/buildvector nodes may have scalar loads, which are not
vectorized (since they are part of the gather nodes) but may form full
vector loads, being combined. This patch walks over all gather nodes,
"gathering" and sorting gathered scalar loads and then tries to build
vector loads, which later are reshuffled between the gather nodes.
It allows later to add support for segmented loads (kind of AOS to SOA
load kind for RISC-V RVV) and may help with the removal of the alternat
e opcodes support.
Currently, alternate nodes may depend on each other because of the
consecutive loads between their operands. Because of that we cannot
simply remove alternate vectorization. But this approach may help to
remove most of the stuff for it, since we'll be able to vectorize loads
in between lanes.

Metric: size..text, AVX512

Program size..text
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 238381.00 250669.00 5.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 25753.00 26329.00 2.2%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test 3028.00 3092.00 2.1%
test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test 4243.00 4275.00 0.8%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 649765.00 653877.00 0.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 649765.00 653877.00 0.6%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 4199.00 4222.00 0.5%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12933.00 12997.00 0.5%
test-suite :: SingleSource/Benchmarks/Misc/flops.test 8282.00 8314.00 0.4%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test 10065.00 10097.00 0.3%
test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 5160.00 5176.00 0.3%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00 0.3%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test 6908.00 6924.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 202830.00 203278.00 0.2%
test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 9133.00 9149.00 0.2%
test-suite :: MultiSource/Benchmarks/Olden/power/power.test 6792.00 6803.00 0.2%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1395585.00 1397473.00 0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1395585.00 1397473.00 0.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 97662.00 97758.00 0.1%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 595179.00 595739.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 70603.00 70667.00 0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test 19877.00 19893.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 90231.00 90279.00 0.1%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 33738.00 33754.00 0.0%
test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 13262.00 13268.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1139964.00 1140460.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 849507.00 849875.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1158379.00 1158859.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 38724.00 38740.00 0.0%
test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test 15180.00 15186.00 0.0%
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 15484.00 15490.00 0.0%
test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test 167391.00 167455.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test 137448.00 137496.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2030254.00 2030766.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 302870.00 302934.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 303126.00 303190.00 0.0%
test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 241107.00 241155.00 0.0%
test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test 162974.00 163006.00 0.0%
test-suite :: MultiSource/Applications/siod/siod.test 167168.00 167200.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1048796.00 1048988.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 201623.00 201655.00 0.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 501734.00 501798.00 0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test 580888.00 580952.00 0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 168319.00 168335.00 0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test 226022.00 226038.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test 118011.00 118015.00 0.0%
test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test 550589.00 550605.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3072477.00 3072541.00 0.0%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2385563.00 2385579.00 0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389171.00 389155.00 -0.0%
test-suite :: MultiSource/Applications/lua/lua.test 234764.00 234748.00 -0.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 227694.00 227678.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test 119819.00 119807.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test 117995.00 117983.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test 123610.00 123594.00 -0.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81414.00 81398.00 -0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782040.00 781880.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9597420.00 9595292.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9597420.00 9595292.00 -0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 911832.00 911608.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 192507.00 192459.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 122843.00 122811.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 122292.00 122260.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 777363.00 777155.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test 123265.00 123205.00 -0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 315534.00 315358.00 -0.1%
test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test 128163.00 128083.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 6562.00 6555.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 23428.00 23396.00 -0.1%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22749.00 22717.00 -0.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39549.00 39485.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39546.00 39482.00 -0.2%
test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 57214.00 57118.00 -0.2%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413668.00 412804.00 -0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1044047.00 1041487.00 -0.2%
test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 12414.00 12382.00 -0.3%
test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test 31161.00 30969.00 -0.6%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 224726.00 223254.00 -0.7%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93512.00 92824.00 -0.7%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281151.00 278463.00 -1.0%
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2820.00 2788.00 -1.1%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 156819.00 154739.00 -1.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test 11560.00 11160.00 -3.5%
test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test 6734.00 6382.00 -5.2%
results results0 diff

ASCI_Purple/SMG2000 - extra vector code
VPlanNativePath/outer-loop-vect - extra vectorization, better vector
code
AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code
Rodinia/hotspot - small variations
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - extra vector code, better vectorization
BenchmarkGame/n-body - better vector code.
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations
Misc/flops - extra vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations
Misc-C++/Large - better vector code
CFP2017rate/526.blender_r - extra vector code
Prolangs-C++/city - extra vector code
MiBench/consumer-lame - extra vector code
CoyoteBench/fftbench - extra vector code
Olden/power - better vector code
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - extra vector code
CINT2017rate/531.deepsjeng_r - extra vector code
CFP2006/447.dealII - small variations
DOE-ProxyApps-C/miniAMR - small variations
Prolangs-C/unix-smail - small variations
DOE-ProxyApps-C++/PENNANT - small variations
CINT2006/473.astar - small variations
CFP2006/453.povray - small variations
JM/lencod - extra vector code
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/CoMD - small variations
CFP2006/470.lbm - extra vector code
CFP2017speed/619.lbm_s
CFP2017rate/519.lbm_r - extra vector code
CINT2006/456.hmmer - extra code vectorized
TSVC/ControlFlow-dbl - extra vector code
CFP2017rate/510.parest_r - better vector code
LCALS/SubsetALambdaLoops - extra code vectorized
LCALS/SubsetARawLoops - extra code vectorized
CFP2006/444.namd - extra code vectorized
CFP2006/482.sphinx3 - better vector code
Applications/siod - better vector code
Benchmarks/7zip - better vector code
DOE-ProxyApps-C++/CLAMR - extra code vectorized
Applications/sqlite3 - extra code vectorized
Applications/ClamAV - smaller vector code
MallocBench/gs - small variations
MicroBenchmarks/ImageProcessing - small variations
TSVC/StatementReordering-flt - extra code vectorized
CINT2006/471.omnetpp - small variations
CINT2006/403.gcc - extra code vectorized
CINT2006/483.xalancbmk - extra code vectorized
JM/ldecod - small variations
Applications/lua - extra code vectorized
mafft/pairlocalalign - small variations
TSVC/NodeSplitting-flt - extra code vectorized
TSVC/Recurrences-flt - extra code vectorized
TSVC/InductionVariable-flt - extra code vectorized
FreeBench/pifft - small variations
CINT2006/464.h264ref - extra code vectorized
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined
CINT2006/445.gobmk - small variations
Applications/oggenc - small variations
TSVC/LoopRestructuring-flt - extra code vectorized
TSVC/CrossingThresholds-flt - extra code vectorized
CFP2017rate/508.namd_r - small variations
TSVC/ControlFlow-flt - extra code vectorized
mediabench/g721 - small variations
Prolangs-C/compiler - small variations
FreeBench/fourinarow - better vector code
MiBench/telecomm-gsm - small variation in vector code
mediabench/gsm - same
Prolangs-C/bison - small variations
Adobe-C++/loop_unroll - extra code vectorized
Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vetor
code
McCat/18-imp - variations in vector code
Prolangs-C/gnugo - variations in vector code
MallocBench/espresso - extra code vectorized
DOE-ProxyApps-C++/miniFE - small variations in vector code
Prolangs-C/TimberWolfMC - extra code vectorized, small changes in
previously vectorized code.
Olden/tsp - small changes in vector code
CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores
MiBench/security-blowfish - extra code vectorized
McCat/08-main - better vector code.

Metric: size..text, RISCV, sifive-p670

Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 63580.00 64020.00 0.7%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21388.00 21406.00 0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296992.00 297088.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 968112.00 968208.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test 45160.00 45164.00 0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 49764.00 49762.00 -0.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 449132.00 449108.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 695932.00 695892.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 508820.00 508788.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 508820.00 508788.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 166522.00 166490.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 722252.00 722092.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 27554.00 27546.00 -0.0%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10900.00 10896.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 46754.00 46732.00 -0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 631570.00 631226.00 -0.1%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 850698.00 850218.00 -0.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24816.00 24800.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24814.00 24798.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1%
test-suite :: MultiSource/Applications/hbd/hbd.test 27236.00 27204.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 293848.00 293480.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 20160.00 20048.00 -0.6%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 182088.00 181040.00 -0.6%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4788.00 4748.00 -0.8%

DOE-ProxyApps-C++/miniFE - extra vector code
MiBench/automotive-susan - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/511.povray_r - slightly better vector code
TSVC/StatementReordering-dbl - small variations
CINT2017rate/523.xalancbmk_r
CINT2017speed/623.xalancbmk_s - extra vector code
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
TSVC/CrossingThresholds-flt - small variations
Applications/sqlite3 - extra vector code
JM/lencod - extra vector code, small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - small variations
CFP2017rate/526.blender_r - extra vector code, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vectorizer/VPlanNativePath/outer-loop-vect - small variations
TSVC/CrossingThresholds-dbl - small variations
Benchmarks/tramp3d-v4 - small variations
Benchmarks/7zip - extra vector code
MiBench/telecomm-gsm - small variations
mediabench/gsm/toast - small variations
CFP2017rate/510.parest_r - extra vector code
Applications/hbd - extra vector code
JM/ldecod - better vector code
Prolangs-C/compiler - extra vector code
MallocBench/espresso - extra vector code
mediabench/g721/g721encode - extra vectorization

Created using spr 1.3.5
@llvmbot
Copy link
Member

llvmbot commented Sep 5, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

When building the vectorization graph, the compiler may end up with the
consecutive loads in the different branches, which end up to be
gathered. We can scan these loads and try to load them as final
vectorized load and then reshuffle between the branches to avoid extra
scalar loads in the code.

Metric: size..text, AVX512

Program size..text
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 238381.00 250669.00 5.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 25753.00 26329.00 2.2%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test 3028.00 3092.00 2.1%
test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test 4243.00 4275.00 0.8%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 649765.00 653877.00 0.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 649765.00 653877.00 0.6%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 4199.00 4222.00 0.5%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12933.00 12997.00 0.5%
test-suite :: SingleSource/Benchmarks/Misc/flops.test 8282.00 8314.00 0.4%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test 10065.00 10097.00 0.3%
test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 5160.00 5176.00 0.3%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00 0.3%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test 6908.00 6924.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 202830.00 203278.00 0.2%
test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 9133.00 9149.00 0.2%
test-suite :: MultiSource/Benchmarks/Olden/power/power.test 6792.00 6803.00 0.2%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1395585.00 1397473.00 0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1395585.00 1397473.00 0.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 97662.00 97758.00 0.1%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 595179.00 595739.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 70603.00 70667.00 0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test 19877.00 19893.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 90231.00 90279.00 0.1%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 33738.00 33754.00 0.0%
test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 13262.00 13268.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1139964.00 1140460.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 849507.00 849875.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1158379.00 1158859.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 38724.00 38740.00 0.0%
test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test 15180.00 15186.00 0.0%
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 15484.00 15490.00 0.0%
test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test 167391.00 167455.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test 137448.00 137496.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2030254.00 2030766.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 302870.00 302934.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 303126.00 303190.00 0.0%
test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 241107.00 241155.00 0.0%
test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test 162974.00 163006.00 0.0%
test-suite :: MultiSource/Applications/siod/siod.test 167168.00 167200.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1048796.00 1048988.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 201623.00 201655.00 0.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 501734.00 501798.00 0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test 580888.00 580952.00 0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 168319.00 168335.00 0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test 226022.00 226038.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test 118011.00 118015.00 0.0%
test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test 550589.00 550605.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3072477.00 3072541.00 0.0%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2385563.00 2385579.00 0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389171.00 389155.00 -0.0%
test-suite :: MultiSource/Applications/lua/lua.test 234764.00 234748.00 -0.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 227694.00 227678.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test 119819.00 119807.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test 117995.00 117983.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test 123610.00 123594.00 -0.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81414.00 81398.00 -0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782040.00 781880.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9597420.00 9595292.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9597420.00 9595292.00 -0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 911832.00 911608.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 192507.00 192459.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 122843.00 122811.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 122292.00 122260.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 777363.00 777155.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test 123265.00 123205.00 -0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 315534.00 315358.00 -0.1%
test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test 128163.00 128083.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 6562.00 6555.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 23428.00 23396.00 -0.1%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22749.00 22717.00 -0.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39549.00 39485.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39546.00 39482.00 -0.2%
test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 57214.00 57118.00 -0.2%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413668.00 412804.00 -0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1044047.00 1041487.00 -0.2%
test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 12414.00 12382.00 -0.3%
test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test 31161.00 30969.00 -0.6%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 224726.00 223254.00 -0.7%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93512.00 92824.00 -0.7%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281151.00 278463.00 -1.0%
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2820.00 2788.00 -1.1%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 156819.00 154739.00 -1.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test 11560.00 11160.00 -3.5%
test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test 6734.00 6382.00 -5.2%
results results0 diff

ASCI_Purple/SMG2000 - extra vector code
VPlanNativePath/outer-loop-vect - extra vectorization, better vector
code
AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code
Rodinia/hotspot - small variations
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - extra vector code, better vectorization
BenchmarkGame/n-body - better vector code.
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations
Misc/flops - extra vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations
Misc-C++/Large - better vector code
CFP2017rate/526.blender_r - extra vector code
Prolangs-C++/city - extra vector code
MiBench/consumer-lame - extra vector code
CoyoteBench/fftbench - extra vector code
Olden/power - better vector code
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - extra vector code
CINT2017rate/531.deepsjeng_r - extra vector code
CFP2006/447.dealII - small variations
DOE-ProxyApps-C/miniAMR - small variations
Prolangs-C/unix-smail - small variations
DOE-ProxyApps-C++/PENNANT - small variations
CINT2006/473.astar - small variations
CFP2006/453.povray - small variations
JM/lencod - extra vector code
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/CoMD - small variations
CFP2006/470.lbm - extra vector code
CFP2017speed/619.lbm_s
CFP2017rate/519.lbm_r - extra vector code
CINT2006/456.hmmer - extra code vectorized
TSVC/ControlFlow-dbl - extra vector code
CFP2017rate/510.parest_r - better vector code
LCALS/SubsetALambdaLoops - extra code vectorized
LCALS/SubsetARawLoops - extra code vectorized
CFP2006/444.namd - extra code vectorized
CFP2006/482.sphinx3 - better vector code
Applications/siod - better vector code
Benchmarks/7zip - better vector code
DOE-ProxyApps-C++/CLAMR - extra code vectorized
Applications/sqlite3 - extra code vectorized
Applications/ClamAV - smaller vector code
MallocBench/gs - small variations
MicroBenchmarks/ImageProcessing - small variations
TSVC/StatementReordering-flt - extra code vectorized
CINT2006/471.omnetpp - small variations
CINT2006/403.gcc - extra code vectorized
CINT2006/483.xalancbmk - extra code vectorized
JM/ldecod - small variations
Applications/lua - extra code vectorized
mafft/pairlocalalign - small variations
TSVC/NodeSplitting-flt - extra code vectorized
TSVC/Recurrences-flt - extra code vectorized
TSVC/InductionVariable-flt - extra code vectorized
FreeBench/pifft - small variations
CINT2006/464.h264ref - extra code vectorized
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined
CINT2006/445.gobmk - small variations
Applications/oggenc - small variations
TSVC/LoopRestructuring-flt - extra code vectorized
TSVC/CrossingThresholds-flt - extra code vectorized
CFP2017rate/508.namd_r - small variations
TSVC/ControlFlow-flt - extra code vectorized
mediabench/g721 - small variations
Prolangs-C/compiler - small variations
FreeBench/fourinarow - better vector code
MiBench/telecomm-gsm - small variation in vector code
mediabench/gsm - same
Prolangs-C/bison - small variations
Adobe-C++/loop_unroll - extra code vectorized
Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vetor
code
McCat/18-imp - variations in vector code
Prolangs-C/gnugo - variations in vector code
MallocBench/espresso - extra code vectorized
DOE-ProxyApps-C++/miniFE - small variations in vector code
Prolangs-C/TimberWolfMC - extra code vectorized, small changes in
previously vectorized code.
Olden/tsp - small changes in vector code
CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores
MiBench/security-blowfish - extra code vectorized
McCat/08-main - better vector code.

Metric: size..text, RISCV, sifive-p670

Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 63580.00 64020.00 0.7%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21388.00 21406.00 0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296992.00 297088.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 968112.00 968208.00 0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test 45160.00 45164.00 0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 49764.00 49762.00 -0.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 449132.00 449108.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 695932.00 695892.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 508820.00 508788.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 508820.00 508788.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 166522.00 166490.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 722252.00 722092.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 27554.00 27546.00 -0.0%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10900.00 10896.00 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 46754.00 46732.00 -0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 631570.00 631226.00 -0.1%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 850698.00 850218.00 -0.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24816.00 24800.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24814.00 24798.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1%
test-suite :: MultiSource/Applications/hbd/hbd.test 27236.00 27204.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 293848.00 293480.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 20160.00 20048.00 -0.6%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 182088.00 181040.00 -0.6%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4788.00 4748.00 -0.8%

DOE-ProxyApps-C++/miniFE - extra vector code
MiBench/automotive-susan - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/511.povray_r - slightly better vector code
TSVC/StatementReordering-dbl - small variations
CINT2017rate/523.xalancbmk_r
CINT2017speed/623.xalancbmk_s - extra vector code
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
TSVC/CrossingThresholds-flt - small variations
Applications/sqlite3 - extra vector code
JM/lencod - extra vector code, small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - small variations
CFP2017rate/526.blender_r - extra vector code, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vectorizer/VPlanNativePath/outer-loop-vect - small variations
TSVC/CrossingThresholds-dbl - small variations
Benchmarks/tramp3d-v4 - small variations
Benchmarks/7zip - extra vector code
MiBench/telecomm-gsm - small variations
mediabench/gsm/toast - small variations
CFP2017rate/510.parest_r - extra vector code
Applications/hbd - extra vector code
JM/ldecod - better vector code
Prolangs-C/compiler - extra vector code
MallocBench/espresso - extra vector code
mediabench/g721/g721encode - extra vectorization


Patch is 175.49 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/107461.diff

21 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+548-31)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vec3-calls.ll (+25-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+9-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/crash_dequeue.ll (+4-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+12-15)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll (+7-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll (+7-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll (+11-17)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll (+129-141)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll (+129-141)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/pr48879-sroa.ll (+54-108)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reorder-possible-strided-node.ll (+12-24)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reorder.ll (+7-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/sin-sqrt.ll (+33-66)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll (+8-18)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/supernode.ll (+29-51)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec3-calls.ll (+6-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias-inseltpoison.ll (+20-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias.ll (+20-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias_external_insert_shuffled.ll (+11-11)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 60476398e5ca75..dcf23c3b0e0c71 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1342,6 +1342,7 @@ class BoUpSLP {
     MustGather.clear();
     NonScheduledFirst.clear();
     EntryToLastInstruction.clear();
+    GatheredLoadsEntriesFirst = NoGatheredLoads;
     ExternalUses.clear();
     ExternalUsesAsOriginalScalar.clear();
     for (auto &Iter : BlocksSchedules) {
@@ -1358,7 +1359,11 @@ class BoUpSLP {
     ValueToGatherNodes.clear();
   }
 
-  unsigned getTreeSize() const { return VectorizableTree.size(); }
+  unsigned getTreeSize() const {
+    return GatheredLoadsEntriesFirst == NoGatheredLoads
+               ? VectorizableTree.size()
+               : GatheredLoadsEntriesFirst;
+  }
 
   /// Perform LICM and CSE on the newly generated gather sequences.
   void optimizeGatherSequence();
@@ -1478,11 +1483,14 @@ class BoUpSLP {
   /// \param VL0 main load value.
   /// \param Order returned order of load instructions.
   /// \param PointerOps returned list of pointer operands.
+  /// \param BestVF return best vector factor, if recursive check found better
+  /// vectorization sequences rather than masked gather.
   /// \param TryRecursiveCheck used to check if long masked gather can be
   /// represented as a serie of loads/insert subvector, if profitable.
   LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
                                SmallVectorImpl<unsigned> &Order,
                                SmallVectorImpl<Value *> &PointerOps,
+                               unsigned *BestVF = nullptr,
                                bool TryRecursiveCheck = true) const;
 
   OptimizationRemarkEmitter *getORE() { return ORE; }
@@ -2965,6 +2973,12 @@ class BoUpSLP {
   /// be beneficial even the tree height is tiny.
   bool isFullyVectorizableTinyTree(bool ForReduction) const;
 
+  /// Run through the list of all gathered loads in the graph and try to find
+  /// vector loads/masked gathers instead of regular gathers. Later these loads
+  /// are reshufled to build final gathered nodes.
+  void tryToVectorizeGatheredLoads(
+      ArrayRef<SmallVector<std::pair<LoadInst *, int>>> GatheredLoads);
+
   /// Reorder commutative or alt operands to get better probability of
   /// generating vectorized code.
   static void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
@@ -3037,7 +3051,7 @@ class BoUpSLP {
     }
 
     bool isOperandGatherNode(const EdgeInfo &UserEI) const {
-      return isGather() && (Idx > 0 || !UserTreeIndices.empty()) &&
+      return isGather() && !UserTreeIndices.empty() &&
              UserTreeIndices.front().EdgeIdx == UserEI.EdgeIdx &&
              UserTreeIndices.front().UserTE == UserEI.UserTE;
     }
@@ -3384,6 +3398,12 @@ class BoUpSLP {
     assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
             (Bundle && EntryState != TreeEntry::NeedToGather)) &&
            "Need to vectorize gather entry?");
+    // Gathered loads still gathered? Do not create entry, use the original one.
+    if (GatheredLoadsEntriesFirst != NoGatheredLoads &&
+        EntryState == TreeEntry::NeedToGather &&
+        S.getOpcode() == Instruction::Load && UserTreeIdx.EdgeIdx == UINT_MAX &&
+        !UserTreeIdx.UserTE)
+      return nullptr;
     VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
     TreeEntry *Last = VectorizableTree.back().get();
     Last->Idx = VectorizableTree.size() - 1;
@@ -3528,6 +3548,10 @@ class BoUpSLP {
       DenseMap<Value *, SmallPtrSet<const TreeEntry *, 4>>;
   ValueToGatherNodesMap ValueToGatherNodes;
 
+  /// The index of the first gathered load entry in the VectorizeTree.
+  constexpr static int NoGatheredLoads = -1;
+  int GatheredLoadsEntriesFirst = NoGatheredLoads;
+
   /// This POD struct describes one external user in the vectorized tree.
   struct ExternalUser {
     ExternalUser(Value *S, llvm::User *U, int L)
@@ -4699,15 +4723,19 @@ getShuffleCost(const TargetTransformInfo &TTI, TTI::ShuffleKind Kind,
   return TTI.getShuffleCost(Kind, Tp, Mask, CostKind, Index, SubTp, Args);
 }
 
-BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
-    ArrayRef<Value *> VL, const Value *VL0, SmallVectorImpl<unsigned> &Order,
-    SmallVectorImpl<Value *> &PointerOps, bool TryRecursiveCheck) const {
+BoUpSLP::LoadsState
+BoUpSLP::canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
+                           SmallVectorImpl<unsigned> &Order,
+                           SmallVectorImpl<Value *> &PointerOps,
+                           unsigned *BestVF, bool TryRecursiveCheck) const {
   // Check that a vectorized load would load the same memory as a scalar
   // load. For example, we don't want to vectorize loads that are smaller
   // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
   // treats loading/storing it as an i8 struct. If we vectorize loads/stores
   // from such a struct, we read/write packed bits disagreeing with the
   // unvectorized version.
+  if (BestVF)
+    *BestVF = 0;
   Type *ScalarTy = VL0->getType();
 
   if (DL->getTypeSizeInBits(ScalarTy) != DL->getTypeAllocSizeInBits(ScalarTy))
@@ -4823,7 +4851,10 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   // strided/masked gather loads. Returns true if vectorized + shuffles
   // representation is better than just gather.
   auto CheckForShuffledLoads = [&, &TTI = *TTI](Align CommonAlignment,
+                                                unsigned *BestVF,
                                                 bool ProfitableGatherPointers) {
+    if (BestVF)
+      *BestVF = 0;
     // Compare masked gather cost and loads + insert subvector costs.
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
     auto [ScalarGEPCost, VectorGEPCost] =
@@ -4885,10 +4916,14 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
         SmallVector<unsigned> Order;
         SmallVector<Value *> PointerOps;
         LoadsState LS =
-            canVectorizeLoads(Slice, Slice.front(), Order, PointerOps,
+            canVectorizeLoads(Slice, Slice.front(), Order, PointerOps, BestVF,
                               /*TryRecursiveCheck=*/false);
         // Check that the sorted loads are consecutive.
         if (LS == LoadsState::Gather) {
+          if (BestVF) {
+            DemandedElts.setAllBits();
+            break;
+          }
           DemandedElts.setBits(Cnt, Cnt + VF);
           continue;
         }
@@ -4981,8 +5016,11 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
       // consider it as a gather node. It will be better estimated
       // later.
       if (MaskedGatherCost >= VecLdCost &&
-          VecLdCost - GatherCost < -SLPCostThreshold)
+          VecLdCost - GatherCost < -SLPCostThreshold) {
+        if (BestVF)
+          *BestVF = VF;
         return true;
+      }
     }
     return MaskedGatherCost - GatherCost >= -SLPCostThreshold;
   };
@@ -5006,7 +5044,8 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
       // Check if potential masked gather can be represented as series
       // of loads + insertsubvectors.
       if (TryRecursiveCheck &&
-          CheckForShuffledLoads(CommonAlignment, ProfitableGatherPointers)) {
+          CheckForShuffledLoads(CommonAlignment, BestVF,
+                                ProfitableGatherPointers)) {
         // If masked gather cost is higher - better to vectorize, so
         // consider it as a gather node. It will be better estimated
         // later.
@@ -5427,6 +5466,16 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
     if (TE.Scalars.size() >= 4)
       if (std::optional<OrdersType> Order = findPartiallyOrderedLoads(TE))
         return Order;
+    // Check if can include the order of vectorized loads. For masked gathers do
+    // extra analysis later, so include such nodes into a special list.
+    if (TE.isGather() && TE.getOpcode() == Instruction::Load) {
+      SmallVector<Value *> PointerOps;
+      OrdersType CurrentOrder;
+      LoadsState Res = canVectorizeLoads(TE.Scalars, TE.Scalars.front(),
+                                         CurrentOrder, PointerOps);
+      if (Res == LoadsState::Vectorize || Res == LoadsState::StridedVectorize)
+        return std::move(CurrentOrder);
+    }
     if (std::optional<OrdersType> CurrentOrder = findReusedOrderedScalars(TE))
       return CurrentOrder;
   }
@@ -6325,6 +6374,404 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots) {
   buildTree_rec(Roots, 0, EdgeInfo());
 }
 
+/// Tries to find subvector of loads and builds new vector of only loads if can
+/// be profitable.
+static void gatherPossiblyVectorizableLoads(
+    const BoUpSLP &R, ArrayRef<Value *> VL, const DataLayout &DL,
+    ScalarEvolution &SE, const TargetTransformInfo &TTI,
+    SmallVectorImpl<SmallVector<std::pair<LoadInst *, int>>> &GatheredLoads,
+    bool AddNew = true) {
+  if (VL.empty())
+    return;
+  if (!isValidElementType(VL.front()->getType()))
+    return;
+  Type *ScalarTy = VL.front()->getType();
+  int NumScalars = VL.size();
+  auto *VecTy = getWidenedType(ScalarTy, NumScalars);
+  int NumParts = TTI.getNumberOfParts(VecTy);
+  if (NumParts == 0 || NumParts >= NumScalars)
+    NumParts = 1;
+  unsigned VF = PowerOf2Ceil(NumScalars / NumParts);
+  SmallVector<SmallVector<std::pair<LoadInst *, int>>> ClusteredLoads;
+  for (int I : seq<int>(0, NumParts)) {
+    for (Value *V :
+         VL.slice(I * VF, std::min<unsigned>(VF, VL.size() - I * VF))) {
+      auto *LI = dyn_cast<LoadInst>(V);
+      if (!LI)
+        continue;
+      if (R.isDeleted(LI) || R.isVectorized(LI) || !LI->isSimple())
+        continue;
+      bool IsFound = false;
+      for (auto &Data : ClusteredLoads) {
+        if (LI->getParent() != Data.front().first->getParent())
+          continue;
+        std::optional<int> Dist =
+            getPointersDiff(LI->getType(), LI->getPointerOperand(),
+                            Data.front().first->getType(),
+                            Data.front().first->getPointerOperand(), DL, SE,
+                            /*StrictCheck=*/true);
+        if (Dist && all_of(Data, [&](const std::pair<LoadInst *, int> &Pair) {
+              IsFound |= Pair.first == LI;
+              return IsFound || Pair.second != *Dist;
+            })) {
+          if (!IsFound)
+            Data.emplace_back(LI, *Dist);
+          IsFound = true;
+          break;
+        }
+      }
+      if (!IsFound)
+        ClusteredLoads.emplace_back().emplace_back(LI, 0);
+    }
+  }
+  auto FindMatchingLoads =
+      [&](ArrayRef<std::pair<LoadInst *, int>> Loads,
+          SmallVectorImpl<SmallVector<std::pair<LoadInst *, int>>>
+              &GatheredLoads,
+          SetVector<unsigned> &ToAdd, SetVector<unsigned> &Repeated,
+          int &Offset, unsigned &Start) {
+        SmallVector<std::pair<int, int>> Res;
+        if (Loads.empty())
+          return GatheredLoads.end();
+        LoadInst *LI = Loads.front().first;
+        for (auto [Idx, Data] : enumerate(GatheredLoads)) {
+          if (Idx < Start)
+            continue;
+          ToAdd.clear();
+          if (LI->getParent() != Data.front().first->getParent())
+            continue;
+          std::optional<int> Dist =
+              getPointersDiff(LI->getType(), LI->getPointerOperand(),
+                              Data.front().first->getType(),
+                              Data.front().first->getPointerOperand(), DL, SE,
+                              /*StrictCheck=*/true);
+          if (Dist) {
+            // Found matching gathered loads - check if all loads are unique or
+            // can be effectively vectorized.
+            unsigned NumUniques = 0;
+            for (auto [Cnt, Pair] : enumerate(Loads)) {
+              bool Used = any_of(
+                  Data, [&, &P = Pair](const std::pair<LoadInst *, int> &PD) {
+                    return PD.first == P.first;
+                  });
+              if (none_of(Data,
+                          [&, &P = Pair](const std::pair<LoadInst *, int> &PD) {
+                            return *Dist + P.second == PD.second;
+                          }) &&
+                  !Used) {
+                ++NumUniques;
+                ToAdd.insert(Cnt);
+              }
+              if (Used)
+                Repeated.insert(Cnt);
+            }
+            if (NumUniques > 0 &&
+                (Loads.size() == NumUniques ||
+                 (Loads.size() - NumUniques >= 2 &&
+                  Loads.size() - NumUniques >= Loads.size() / 2 &&
+                  (isPowerOf2_64(Data.size() + NumUniques) ||
+                   PowerOf2Ceil(Data.size()) <
+                       PowerOf2Ceil(Data.size() + NumUniques))))) {
+              Offset = *Dist;
+              Start = Idx + 1;
+              return std::next(GatheredLoads.begin(), Idx);
+            }
+          }
+        }
+        ToAdd.clear();
+        return GatheredLoads.end();
+      };
+  for (ArrayRef<std::pair<LoadInst *, int>> Data : ClusteredLoads) {
+    unsigned Start = 0;
+    SetVector<unsigned> ToAdd, LocalToAdd, Repeated;
+    int Offset = 0;
+    auto *It = FindMatchingLoads(Data, GatheredLoads, LocalToAdd, Repeated,
+                                 Offset, Start);
+    while (It != GatheredLoads.end()) {
+      assert(!LocalToAdd.empty() && "Expected some elements to add.");
+      for (unsigned Idx : LocalToAdd)
+        It->emplace_back(Data[Idx].first, Data[Idx].second + Offset);
+      ToAdd.insert(LocalToAdd.begin(), LocalToAdd.end());
+      It = FindMatchingLoads(Data, GatheredLoads, LocalToAdd, Repeated, Offset,
+                             Start);
+    }
+    if (any_of(seq<unsigned>(Data.size()), [&](unsigned Idx) {
+          return !ToAdd.contains(Idx) && !Repeated.contains(Idx);
+        })) {
+      auto AddNewLoads =
+          [&](SmallVectorImpl<std::pair<LoadInst *, int>> &Loads) {
+            for (unsigned Idx : seq<unsigned>(Data.size())) {
+              if (ToAdd.contains(Idx) || Repeated.contains(Idx))
+                continue;
+              Loads.push_back(Data[Idx]);
+            }
+          };
+      if (!AddNew) {
+        LoadInst *LI = Data.front().first;
+        It = find_if(
+            GatheredLoads, [&](ArrayRef<std::pair<LoadInst *, int>> PD) {
+              return PD.front().first->getParent() == LI->getParent() &&
+                     PD.front().first->getType() == LI->getType();
+            });
+        while (It != GatheredLoads.end()) {
+          AddNewLoads(*It);
+          It = std::find_if(
+              std::next(It), GatheredLoads.end(),
+              [&](ArrayRef<std::pair<LoadInst *, int>> PD) {
+                return PD.front().first->getParent() == LI->getParent() &&
+                       PD.front().first->getType() == LI->getType();
+              });
+        }
+      }
+      GatheredLoads.emplace_back().append(Data.begin(), Data.end());
+      AddNewLoads(GatheredLoads.emplace_back());
+    }
+  }
+}
+
+void BoUpSLP::tryToVectorizeGatheredLoads(
+    ArrayRef<SmallVector<std::pair<LoadInst *, int>>> GatheredLoads) {
+  GatheredLoadsEntriesFirst = VectorizableTree.size();
+
+  // Sort loads by distance.
+  auto LoadSorter = [](const std::pair<LoadInst *, int> &L1,
+                       const std::pair<LoadInst *, int> &L2) {
+    return L1.second > L2.second;
+  };
+
+  auto GetVectorizedRanges = [this](
+                                 ArrayRef<LoadInst *> Loads,
+                                 BoUpSLP::ValueSet &VectorizedLoads,
+                                 SmallVectorImpl<LoadInst *> &NonVectorized) {
+    SmallVector<std::pair<ArrayRef<Value *>, LoadsState>> Results;
+    unsigned StartIdx = 0;
+    SmallVector<int> CandidateVFs;
+    if (VectorizeNonPowerOf2 && isPowerOf2_32(Loads.size() + 1))
+      CandidateVFs.push_back(Loads.size());
+    for (int NumElts = bit_floor(Loads.size()); NumElts > 1; NumElts /= 2) {
+      CandidateVFs.push_back(NumElts);
+      if (VectorizeNonPowerOf2 && NumElts > 2)
+        CandidateVFs.push_back(NumElts - 1);
+    }
+
+    for (int NumElts : CandidateVFs) {
+      SmallVector<unsigned> MaskedGatherVectorized;
+      for (unsigned Cnt = StartIdx, E = Loads.size(); Cnt + NumElts <= E;
+           ++Cnt) {
+        ArrayRef<LoadInst *> Slice = ArrayRef(Loads).slice(Cnt, NumElts);
+        if (VectorizedLoads.count(Slice.front()) ||
+            VectorizedLoads.count(Slice.back()))
+          continue;
+        // Check if it is profitable to try vectorizing gathered loads. It is
+        // profitable if we have more than 3 consecutive loads or if we have
+        // less but all users are vectorized or deleted.
+        bool AllowToVectorize =
+            NumElts >= 3 ||
+            any_of(VectorizableTree, [=](const std::unique_ptr<TreeEntry> &TE) {
+              return TE->isGather() && TE->Scalars.size() == 2 &&
+                     (equal(TE->Scalars, Slice) ||
+                      equal(TE->Scalars, reverse(Slice)));
+            });
+        // Check if it is profitable to vectorize 2-elements loads.
+        if (NumElts == 2) {
+          bool IsLegalBroadcastLoad = TTI->isLegalBroadcastLoad(
+              Slice.front()->getType(), ElementCount::getFixed(NumElts));
+          auto CheckIfAllowed = [=](ArrayRef<LoadInst *> Slice) {
+            for (LoadInst *LI : Slice) {
+              // If single use/user - allow to vectorize.
+              if (LI->hasOneUse())
+                continue;
+              // 1. Check if number of uses equal number of users.
+              // 2. All users are deleted.
+              // 3. The load broadcasts are not allowed or the load is not
+              // broadcasted.
+              if (std::distance(LI->user_begin(), LI->user_end()) !=
+                  LI->getNumUses())
+                return false;
+              for (User *U : LI->users()) {
+                if (auto *UI = dyn_cast<Instruction>(U); UI && isDeleted(UI))
+                  continue;
+                if (const TreeEntry *UTE = getTreeEntry(U)) {
+                  if (!IsLegalBroadcastLoad)
+                    // The broadcast is illegal - vectorize loads.
+                    continue;
+                  for (int I = 0, End = UTE->getNumOperands(); I < End; ++I) {
+                    if (all_of(UTE->getOperand(I),
+                               [LI](Value *V) { return V == LI; }))
+                      // Found legal broadcast - do not vectorize.
+                      return false;
+                  }
+                }
+              }
+            }
+            return true;
+          };
+          AllowToVectorize = CheckIfAllowed(Slice);
+        }
+        if (AllowToVectorize) {
+          SmallVector<Value *> PointerOps;
+          OrdersType CurrentOrder;
+          // Try to build vector load.
+          ArrayRef<Value *> Values(
+              reinterpret_cast<Value *const *>(Slice.begin()), Slice.size());
+          unsigned BestVF = 0;
+          LoadsState LS = canVectorizeLoads(Values, Slice.front(), CurrentOrder,
+                                            PointerOps, &BestVF);
+          if (LS != LoadsState::Gather ||
+              (BestVF > 1 && static_cast<unsigned>(NumElts) == 2 * BestVF)) {
+            if (LS == LoadsState::ScatterVectorize) {
+              if (MaskedGatherVectorized.empty() ||
+                  Cnt >= MaskedGatherVectorized.back() + NumElts)
+                MaskedGatherVectorized.push_back(Cnt);
+              continue;
+            }
+            if (LS != LoadsState::Gather) {
+              Results.emplace_back(Values, LS);
+              VectorizedLoads.insert(Slice.begin(), Slice.end());
+              // If we vectorized initial block, no need to try to vectorize it
+              // again.
+              if (Cnt == StartIdx)
+                StartIdx += NumElts;
+            }
+            // Erase last masked gather candidate, if another candidate within
+            // the range is found to be better.
+            if (!MaskedGatherVectorized.empty() &&
+                Cnt < MaskedGatherVectorized.back() + NumElts)
+              MaskedGatherVectorized.pop_back();
+            Cnt += NumElts - 1;
+            continue;
+          }
+        }
+        // Check if the whole array was vectori...
[truncated]

Created using spr 1.3.5
@alexey-bataev
Copy link
Member Author

This patch should allow support for segmented (interleaved) loads for RISCV and allows us to move away from alternate opcodes in most cases

Copy link

github-actions bot commented Sep 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.5
Created using spr 1.3.5
@alexey-bataev alexey-bataev changed the title [SLP]Vectorize of gathered loads [SLP]Vectorize gathered loads Sep 9, 2024
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minors - but its not an easy patch to review tbh!

return;
if (!isValidElementType(VL.front()->getType()))
return;
Type *ScalarTy = VL.front()->getType();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type *ScalarTy = VL.front()->getType();
if (!isValidElementType(ScalarTy))
    return;

@@ -11804,17 +12282,23 @@ Instruction &BoUpSLP::getLastInstructionInBundle(const TreeEntry *E) {
return *Res;
// Get the basic block this bundle is in. All instructions in the bundle
// should be in this block (except for extractelement-like instructions with
// constant indeces).
// constant indecies or gathered loads).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices

@alexey-bataev
Copy link
Member Author

alexey-bataev commented Sep 10, 2024

A couple of minors - but its not an easy patch to review tbh!

Yep, I understand, actually already split the patch into several pieces. This patch is a single part, cannot split it further.

The idea behind this one is pretty simple. Final gather/buildvector nodes may have scalar loads, which are not vectorized (since they are part of the gather nodes) but may form full vector loads, being combined. This patch walks over all gather nodes, "gathering" and sorting gathered scalar loads and then tries to build vector loads, which later are reshuffled between the gather nodes. As I said in the description, it allows later to add support for segmented loads (kind of AOS to SOA load kind for RISC-V RVV) and may help with the removal of the alternate opcodes support.
Currently, alternate nodes may depend on each other because of the consecutive loads between their operands. Because of that we cannot simply remove alternate vectorization. But this approach may help to remove most of the stuff for it, since we'll be able to vectorize loads in between

Created using spr 1.3.5
@RKSimon
Copy link
Collaborator

RKSimon commented Sep 10, 2024

Thanks - can you make sure that explanation is in the patch summary please?

@alexey-bataev
Copy link
Member Author

The idea behind this one is pretty simple. Final gather/buildvector nodes may have scalar loads, which are not vectorized (since they are part of the gather nodes) but may form full vector loads, being combined. This patch walks over all gather nodes, "gathering" and sorting gathered scalar loads and then tries to build vector loads, which later are reshuffled between the gather nodes. As I said in the description, it allows later to add support for segmented loads (kind of AOS to SOA load kind for RISC-V RVV) and may help with the removal of the alternate opcodes support.
Currently, alternate nodes may depend on each other because of the consecutive loads between their operands. Because of that we cannot simply remove alternate vectorization. But this approach may help to remove most of the stuff for it, since we'll be able to vectorize loads in between

Done

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - cheers

; NON-POWER-OF-2-NEXT: [[CONV82:%.*]] = fptrunc double [[TMP5]] to float
; NON-POWER-OF-2-NEXT: store float [[CONV82]], ptr [[ARRAYIDX80]], align 4
; NON-POWER-OF-2-NEXT: ret void
;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be the same as the CHECK case - maybe use CHECK as common prefix and add a POWER-OF-2 prefix?

Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit de1f5b9 into main Sep 17, 2024
8 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpvectorize-of-gathered-loads branch September 17, 2024 10:57
@nikic
Copy link
Contributor

nikic commented Sep 17, 2024

I've reverted this due to very large compile-time impact, esp. on lencod, see: http://llvm-compile-time-tracker.com/compare.php?from=b1339abb713063363e7804124b8fb3d84143a003&to=de1f5b96adcea52bf7c9670c46123fe1197050d2&stat=instructions:u

For non-LTO cases, me_distortion.c is one of the files with largest impact (~80%).

tmsri pushed a commit to tmsri/llvm-project that referenced this pull request Sep 19, 2024
Final gather/buildvector nodes may have scalar loads, which are not
vectorized (since they are part of the gather nodes) but may form full
vector loads, being combined. This patch walks over all gather nodes,
"gathering" and sorting gathered scalar loads and then tries to build
vector loads, which later are reshuffled between the gather nodes.
It allows later to add support for segmented loads (kind of AOS to SOA
load kind for RISC-V RVV) and may help with the removal of the alternat
e opcodes support.
Currently, alternate nodes may depend on each other because of the
consecutive loads between their operands. Because of that we cannot
simply remove alternate vectorization. But this approach may help to
remove most of the stuff for it, since we'll be able to vectorize loads
in between lanes.

Metric: size..text, AVX512

Program                                                                                                                                                size..text
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   238381.00   250669.00  5.2%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test    25753.00    26329.00  2.2%
                                                                  test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test     3028.00     3092.00  2.1%
                                                                                     test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test     4243.00     4275.00  0.8%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   649765.00   653877.00  0.6%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   649765.00   653877.00  0.6%
                                                                                       test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test     4199.00     4222.00  0.5%
                                                             test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12933.00    12997.00  0.5%
                                                                                                 test-suite :: SingleSource/Benchmarks/Misc/flops.test     8282.00     8314.00  0.4%
                                                            test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test    10065.00    10097.00  0.3%
                                                                                         test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test     5160.00     5176.00  0.3%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00  0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test     6908.00     6924.00  0.2%
                                                                         test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   202830.00   203278.00  0.2%
                                                                                       test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test     9133.00     9149.00  0.2%
                                                                                           test-suite :: MultiSource/Benchmarks/Olden/power/power.test     6792.00     6803.00  0.2%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1395585.00  1397473.00  0.1%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1395585.00  1397473.00  0.1%
                                                                        test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    97662.00    97758.00  0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   595179.00   595739.00  0.1%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    70603.00    70667.00  0.1%
                                                                            test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test    19877.00    19893.00  0.1%
                                                                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test    90231.00    90279.00  0.1%
                                                                                         test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33738.00    33754.00  0.0%
                                                                                     test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test    13262.00    13268.00  0.0%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1139964.00  1140460.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test   849507.00   849875.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1158379.00  1158859.00  0.0%
                                                                                   test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test    38724.00    38740.00  0.0%
                                                                                              test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test    15180.00    15186.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test    15484.00    15490.00  0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   167391.00   167455.00  0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test   137448.00   137496.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2030254.00  2030766.00  0.0%
                                                                              test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   302870.00   302934.00  0.0%
                                                                                    test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test   303126.00   303190.00  0.0%
                                                                                            test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test   241107.00   241155.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test   162974.00   163006.00  0.0%
                                                                                                 test-suite :: MultiSource/Applications/siod/siod.test   167168.00   167200.00  0.0%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1048796.00  1048988.00  0.0%
                                                                               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   201623.00   201655.00  0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   501734.00   501798.00  0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test   580888.00   580952.00  0.0%
                                                                                           test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   168319.00   168335.00  0.0%
                                                                        test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test   226022.00   226038.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test   118011.00   118015.00  0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test   550589.00   550605.00  0.0%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3072477.00  3072541.00  0.0%
                                                                                 test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2385563.00  2385579.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   389171.00   389155.00 -0.0%
                                                                                                   test-suite :: MultiSource/Applications/lua/lua.test   234764.00   234748.00 -0.0%
                                                                                        test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   227694.00   227678.00 -0.0%
                                                                    test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test   119819.00   119807.00 -0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test   117995.00   117983.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test   123610.00   123594.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81414.00    81398.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   782040.00   781880.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  9597420.00  9595292.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  9597420.00  9595292.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   911832.00   911608.00 -0.0%
                                                                                             test-suite :: MultiSource/Applications/oggenc/oggenc.test   192507.00   192459.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test   122843.00   122811.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   122292.00   122260.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   777363.00   777155.00 -0.0%
                                                                            test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test   123265.00   123205.00 -0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   315534.00   315358.00 -0.1%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test   128163.00   128083.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test     6562.00     6555.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test    23428.00    23396.00 -0.1%
                                                                             test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    22749.00    22717.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39549.00    39485.00 -0.2%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39546.00    39482.00 -0.2%
                                                                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test    57214.00    57118.00 -0.2%
                                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   413668.00   412804.00 -0.2%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1044047.00  1041487.00 -0.2%
                                                                                            test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    12414.00    12382.00 -0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test    31161.00    30969.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   224726.00   223254.00 -0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93512.00    92824.00 -0.7%
                                                                        test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   281151.00   278463.00 -1.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2820.00     2788.00 -1.1%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   156819.00   154739.00 -1.3%
                                                                 test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11160.00 -3.5%
                                                                                          test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test     6734.00     6382.00 -5.2%
                                                                                                                                                       results     results0    diff

ASCI_Purple/SMG2000 - extra vector code
VPlanNativePath/outer-loop-vect - extra vectorization, better vector
code
AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code
Rodinia/hotspot - small variations
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - extra vector code, better vectorization
BenchmarkGame/n-body - better vector code.
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations
Misc/flops - extra vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations
Misc-C++/Large - better vector code
CFP2017rate/526.blender_r - extra vector code
Prolangs-C++/city - extra vector code
MiBench/consumer-lame - extra vector code
CoyoteBench/fftbench - extra vector code
Olden/power - better vector code
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - extra vector code
CINT2017rate/531.deepsjeng_r - extra vector code
CFP2006/447.dealII - small variations
DOE-ProxyApps-C/miniAMR - small variations
Prolangs-C/unix-smail - small variations
DOE-ProxyApps-C++/PENNANT - small variations
CINT2006/473.astar - small variations
CFP2006/453.povray - small variations
JM/lencod - extra vector code
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/CoMD - small variations
CFP2006/470.lbm - extra vector code
CFP2017speed/619.lbm_s
CFP2017rate/519.lbm_r - extra vector code
CINT2006/456.hmmer - extra code vectorized
TSVC/ControlFlow-dbl - extra vector code
CFP2017rate/510.parest_r - better vector code
LCALS/SubsetALambdaLoops - extra code vectorized
LCALS/SubsetARawLoops - extra code vectorized
CFP2006/444.namd - extra code vectorized
CFP2006/482.sphinx3 - better vector code
Applications/siod - better vector code
Benchmarks/7zip - better vector code
DOE-ProxyApps-C++/CLAMR - extra code vectorized
Applications/sqlite3 - extra code vectorized
Applications/ClamAV - smaller vector code
MallocBench/gs - small variations
MicroBenchmarks/ImageProcessing - small variations
TSVC/StatementReordering-flt - extra code vectorized
CINT2006/471.omnetpp - small variations
CINT2006/403.gcc - extra code vectorized
CINT2006/483.xalancbmk - extra code vectorized
JM/ldecod - small variations
Applications/lua - extra code vectorized
mafft/pairlocalalign - small variations
TSVC/NodeSplitting-flt - extra code vectorized
TSVC/Recurrences-flt - extra code vectorized
TSVC/InductionVariable-flt - extra code vectorized
FreeBench/pifft - small variations
CINT2006/464.h264ref - extra code vectorized
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined
CINT2006/445.gobmk - small variations
Applications/oggenc - small variations
TSVC/LoopRestructuring-flt - extra code vectorized
TSVC/CrossingThresholds-flt - extra code vectorized
CFP2017rate/508.namd_r - small variations
TSVC/ControlFlow-flt - extra code vectorized
mediabench/g721 - small variations
Prolangs-C/compiler - small variations
FreeBench/fourinarow - better vector code
MiBench/telecomm-gsm - small variation in vector code
mediabench/gsm - same
Prolangs-C/bison - small variations
Adobe-C++/loop_unroll - extra code vectorized
Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vetor
code
McCat/18-imp - variations in vector code
Prolangs-C/gnugo - variations in vector code
MallocBench/espresso - extra code vectorized
DOE-ProxyApps-C++/miniFE - small variations in vector code
Prolangs-C/TimberWolfMC - extra code vectorized, small changes in
previously vectorized code.
Olden/tsp - small changes in vector code
CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores
MiBench/security-blowfish - extra code vectorized
McCat/08-main - better vector code.

Metric: size..text, RISCV, sifive-p670

Program                                                                                                                                                size..text
                                                                                                                                                       results    results0   diff
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   63580.00   64020.00  0.7%
                                                                   test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   21388.00   21406.00  0.1%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  296992.00  297088.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  968112.00  968208.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test   45160.00   45164.00  0.0%
                                                                         test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0%
                                                                        test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   49764.00   49762.00 -0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  449132.00  449108.00 -0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  695932.00  695892.00 -0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  508820.00  508788.00 -0.0%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  508820.00  508788.00 -0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0%
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  166522.00  166490.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  722252.00  722092.00 -0.0%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   27554.00   27546.00 -0.0%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test   10900.00   10896.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test   46754.00   46732.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  631570.00  631226.00 -0.1%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  850698.00  850218.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   24816.00   24800.00 -0.1%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   24814.00   24798.00 -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1%
                                                                                                   test-suite :: MultiSource/Applications/hbd/hbd.test   27236.00   27204.00 -0.1%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  293848.00  293480.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test   20160.00   20048.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test  182088.00  181040.00 -0.6%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4788.00    4748.00 -0.8%

DOE-ProxyApps-C++/miniFE - extra vector code
MiBench/automotive-susan - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/511.povray_r - slightly better vector code
TSVC/StatementReordering-dbl - small variations
CINT2017rate/523.xalancbmk_r
CINT2017speed/623.xalancbmk_s - extra vector code
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
TSVC/CrossingThresholds-flt - small variations
Applications/sqlite3 - extra vector code
JM/lencod - extra vector code, small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - small variations
CFP2017rate/526.blender_r - extra vector code, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vectorizer/VPlanNativePath/outer-loop-vect - small variations
TSVC/CrossingThresholds-dbl - small variations
Benchmarks/tramp3d-v4 - small variations
Benchmarks/7zip - extra vector code
MiBench/telecomm-gsm - small variations
mediabench/gsm/toast - small variations
CFP2017rate/510.parest_r - extra vector code
Applications/hbd - extra vector code
JM/ldecod - better vector code
Prolangs-C/compiler - extra vector code
MallocBench/espresso - extra vector code
mediabench/g721/g721encode - extra vectorization

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#107461
@alexey-bataev
Copy link
Member Author

I've reverted this due to very large compile-time impact, esp. on lencod, see: http://llvm-compile-time-tracker.com/compare.php?from=b1339abb713063363e7804124b8fb3d84143a003&to=de1f5b96adcea52bf7c9670c46123fe1197050d2&stat=instructions:u

For non-LTO cases, me_distortion.c is one of the files with largest impact (~80%).

Reduced the time for lencode. There is just one significant regression in bullet (btSolve2LinearConstraint.cpp, +4%) remains, but there is no way to reduce it without affecting the vectorization result

alexey-bataev added a commit that referenced this pull request Sep 21, 2024
Final gather/buildvector nodes may have scalar loads, which are not
vectorized (since they are part of the gather nodes) but may form full
vector loads, being combined. This patch walks over all gather nodes,
"gathering" and sorting gathered scalar loads and then tries to build
vector loads, which later are reshuffled between the gather nodes.
It allows later to add support for segmented loads (kind of AOS to SOA
load kind for RISC-V RVV) and may help with the removal of the alternat
e opcodes support.
Currently, alternate nodes may depend on each other because of the
consecutive loads between their operands. Because of that we cannot
simply remove alternate vectorization. But this approach may help to
remove most of the stuff for it, since we'll be able to vectorize loads
in between lanes.

Metric: size..text, AVX512

Program                                                                                                                                                size..text
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   238381.00   250669.00  5.2%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test    25753.00    26329.00  2.2%
                                                                  test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test     3028.00     3092.00  2.1%
                                                                                     test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test     4243.00     4275.00  0.8%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   649765.00   653877.00  0.6%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   649765.00   653877.00  0.6%
                                                                                       test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test     4199.00     4222.00  0.5%
                                                             test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12933.00    12997.00  0.5%
                                                                                                 test-suite :: SingleSource/Benchmarks/Misc/flops.test     8282.00     8314.00  0.4%
                                                            test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test    10065.00    10097.00  0.3%
                                                                                         test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test     5160.00     5176.00  0.3%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00  0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test     6908.00     6924.00  0.2%
                                                                         test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   202830.00   203278.00  0.2%
                                                                                       test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test     9133.00     9149.00  0.2%
                                                                                           test-suite :: MultiSource/Benchmarks/Olden/power/power.test     6792.00     6803.00  0.2%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1395585.00  1397473.00  0.1%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1395585.00  1397473.00  0.1%
                                                                        test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    97662.00    97758.00  0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   595179.00   595739.00  0.1%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    70603.00    70667.00  0.1%
                                                                            test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test    19877.00    19893.00  0.1%
                                                                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test    90231.00    90279.00  0.1%
                                                                                         test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33738.00    33754.00  0.0%
                                                                                     test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test    13262.00    13268.00  0.0%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1139964.00  1140460.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test   849507.00   849875.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1158379.00  1158859.00  0.0%
                                                                                   test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test    38724.00    38740.00  0.0%
                                                                                              test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test    15180.00    15186.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test    15484.00    15490.00  0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   167391.00   167455.00  0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test   137448.00   137496.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2030254.00  2030766.00  0.0%
                                                                              test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   302870.00   302934.00  0.0%
                                                                                    test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test   303126.00   303190.00  0.0%
                                                                                            test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test   241107.00   241155.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test   162974.00   163006.00  0.0%
                                                                                                 test-suite :: MultiSource/Applications/siod/siod.test   167168.00   167200.00  0.0%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1048796.00  1048988.00  0.0%
                                                                               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   201623.00   201655.00  0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   501734.00   501798.00  0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test   580888.00   580952.00  0.0%
                                                                                           test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   168319.00   168335.00  0.0%
                                                                        test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test   226022.00   226038.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test   118011.00   118015.00  0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test   550589.00   550605.00  0.0%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3072477.00  3072541.00  0.0%
                                                                                 test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2385563.00  2385579.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   389171.00   389155.00 -0.0%
                                                                                                   test-suite :: MultiSource/Applications/lua/lua.test   234764.00   234748.00 -0.0%
                                                                                        test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   227694.00   227678.00 -0.0%
                                                                    test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test   119819.00   119807.00 -0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test   117995.00   117983.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test   123610.00   123594.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81414.00    81398.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   782040.00   781880.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  9597420.00  9595292.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  9597420.00  9595292.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   911832.00   911608.00 -0.0%
                                                                                             test-suite :: MultiSource/Applications/oggenc/oggenc.test   192507.00   192459.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test   122843.00   122811.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   122292.00   122260.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   777363.00   777155.00 -0.0%
                                                                            test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test   123265.00   123205.00 -0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   315534.00   315358.00 -0.1%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test   128163.00   128083.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test     6562.00     6555.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test    23428.00    23396.00 -0.1%
                                                                             test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    22749.00    22717.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39549.00    39485.00 -0.2%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39546.00    39482.00 -0.2%
                                                                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test    57214.00    57118.00 -0.2%
                                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   413668.00   412804.00 -0.2%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1044047.00  1041487.00 -0.2%
                                                                                            test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    12414.00    12382.00 -0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test    31161.00    30969.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   224726.00   223254.00 -0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93512.00    92824.00 -0.7%
                                                                        test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   281151.00   278463.00 -1.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2820.00     2788.00 -1.1%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   156819.00   154739.00 -1.3%
                                                                 test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11160.00 -3.5%
                                                                                          test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test     6734.00     6382.00 -5.2%
                                                                                                                                                       results     results0    diff

ASCI_Purple/SMG2000 - extra vector code
VPlanNativePath/outer-loop-vect - extra vectorization, better vector
code
AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code
Rodinia/hotspot - small variations
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - extra vector code, better vectorization
BenchmarkGame/n-body - better vector code.
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations
Misc/flops - extra vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations
Misc-C++/Large - better vector code
CFP2017rate/526.blender_r - extra vector code
Prolangs-C++/city - extra vector code
MiBench/consumer-lame - extra vector code
CoyoteBench/fftbench - extra vector code
Olden/power - better vector code
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - extra vector code
CINT2017rate/531.deepsjeng_r - extra vector code
CFP2006/447.dealII - small variations
DOE-ProxyApps-C/miniAMR - small variations
Prolangs-C/unix-smail - small variations
DOE-ProxyApps-C++/PENNANT - small variations
CINT2006/473.astar - small variations
CFP2006/453.povray - small variations
JM/lencod - extra vector code
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/CoMD - small variations
CFP2006/470.lbm - extra vector code
CFP2017speed/619.lbm_s
CFP2017rate/519.lbm_r - extra vector code
CINT2006/456.hmmer - extra code vectorized
TSVC/ControlFlow-dbl - extra vector code
CFP2017rate/510.parest_r - better vector code
LCALS/SubsetALambdaLoops - extra code vectorized
LCALS/SubsetARawLoops - extra code vectorized
CFP2006/444.namd - extra code vectorized
CFP2006/482.sphinx3 - better vector code
Applications/siod - better vector code
Benchmarks/7zip - better vector code
DOE-ProxyApps-C++/CLAMR - extra code vectorized
Applications/sqlite3 - extra code vectorized
Applications/ClamAV - smaller vector code
MallocBench/gs - small variations
MicroBenchmarks/ImageProcessing - small variations
TSVC/StatementReordering-flt - extra code vectorized
CINT2006/471.omnetpp - small variations
CINT2006/403.gcc - extra code vectorized
CINT2006/483.xalancbmk - extra code vectorized
JM/ldecod - small variations
Applications/lua - extra code vectorized
mafft/pairlocalalign - small variations
TSVC/NodeSplitting-flt - extra code vectorized
TSVC/Recurrences-flt - extra code vectorized
TSVC/InductionVariable-flt - extra code vectorized
FreeBench/pifft - small variations
CINT2006/464.h264ref - extra code vectorized
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined
CINT2006/445.gobmk - small variations
Applications/oggenc - small variations
TSVC/LoopRestructuring-flt - extra code vectorized
TSVC/CrossingThresholds-flt - extra code vectorized
CFP2017rate/508.namd_r - small variations
TSVC/ControlFlow-flt - extra code vectorized
mediabench/g721 - small variations
Prolangs-C/compiler - small variations
FreeBench/fourinarow - better vector code
MiBench/telecomm-gsm - small variation in vector code
mediabench/gsm - same
Prolangs-C/bison - small variations
Adobe-C++/loop_unroll - extra code vectorized
Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vetor
code
McCat/18-imp - variations in vector code
Prolangs-C/gnugo - variations in vector code
MallocBench/espresso - extra code vectorized
DOE-ProxyApps-C++/miniFE - small variations in vector code
Prolangs-C/TimberWolfMC - extra code vectorized, small changes in
previously vectorized code.
Olden/tsp - small changes in vector code
CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores
MiBench/security-blowfish - extra code vectorized
McCat/08-main - better vector code.

Metric: size..text, RISCV, sifive-p670

Program                                                                                                                                                size..text
                                                                                                                                                       results    results0   diff
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   63580.00   64020.00  0.7%
                                                                   test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   21388.00   21406.00  0.1%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  296992.00  297088.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  968112.00  968208.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test   45160.00   45164.00  0.0%
                                                                         test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0%
                                                                        test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   49764.00   49762.00 -0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  449132.00  449108.00 -0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  695932.00  695892.00 -0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  508820.00  508788.00 -0.0%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  508820.00  508788.00 -0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0%
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  166522.00  166490.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  722252.00  722092.00 -0.0%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   27554.00   27546.00 -0.0%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test   10900.00   10896.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test   46754.00   46732.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  631570.00  631226.00 -0.1%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  850698.00  850218.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   24816.00   24800.00 -0.1%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   24814.00   24798.00 -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1%
                                                                                                   test-suite :: MultiSource/Applications/hbd/hbd.test   27236.00   27204.00 -0.1%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  293848.00  293480.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test   20160.00   20048.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test  182088.00  181040.00 -0.6%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4788.00    4748.00 -0.8%

DOE-ProxyApps-C++/miniFE - extra vector code
MiBench/automotive-susan - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/511.povray_r - slightly better vector code
TSVC/StatementReordering-dbl - small variations
CINT2017rate/523.xalancbmk_r
CINT2017speed/623.xalancbmk_s - extra vector code
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
TSVC/CrossingThresholds-flt - small variations
Applications/sqlite3 - extra vector code
JM/lencod - extra vector code, small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - small variations
CFP2017rate/526.blender_r - extra vector code, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vectorizer/VPlanNativePath/outer-loop-vect - small variations
TSVC/CrossingThresholds-dbl - small variations
Benchmarks/tramp3d-v4 - small variations
Benchmarks/7zip - extra vector code
MiBench/telecomm-gsm - small variations
mediabench/gsm/toast - small variations
CFP2017rate/510.parest_r - extra vector code
Applications/hbd - extra vector code
JM/ldecod - better vector code
Prolangs-C/compiler - extra vector code
MallocBench/espresso - extra vector code
mediabench/g721/g721encode - extra vectorization

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: #107461
alexey-bataev added a commit that referenced this pull request Sep 21, 2024
Final gather/buildvector nodes may have scalar loads, which are not
vectorized (since they are part of the gather nodes) but may form full
vector loads, being combined. This patch walks over all gather nodes,
"gathering" and sorting gathered scalar loads and then tries to build
vector loads, which later are reshuffled between the gather nodes.
It allows later to add support for segmented loads (kind of AOS to SOA
load kind for RISC-V RVV) and may help with the removal of the alternat
e opcodes support.
Currently, alternate nodes may depend on each other because of the
consecutive loads between their operands. Because of that we cannot
simply remove alternate vectorization. But this approach may help to
remove most of the stuff for it, since we'll be able to vectorize loads
in between lanes.

Metric: size..text, AVX512

Program                                                                                                                                                size..text
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   238381.00   250669.00  5.2%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test    25753.00    26329.00  2.2%
                                                                  test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test     3028.00     3092.00  2.1%
                                                                                     test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test     4243.00     4275.00  0.8%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   649765.00   653877.00  0.6%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   649765.00   653877.00  0.6%
                                                                                       test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test     4199.00     4222.00  0.5%
                                                             test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12933.00    12997.00  0.5%
                                                                                                 test-suite :: SingleSource/Benchmarks/Misc/flops.test     8282.00     8314.00  0.4%
                                                            test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test    10065.00    10097.00  0.3%
                                                                                         test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test     5160.00     5176.00  0.3%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00  0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test     6908.00     6924.00  0.2%
                                                                         test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   202830.00   203278.00  0.2%
                                                                                       test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test     9133.00     9149.00  0.2%
                                                                                           test-suite :: MultiSource/Benchmarks/Olden/power/power.test     6792.00     6803.00  0.2%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1395585.00  1397473.00  0.1%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1395585.00  1397473.00  0.1%
                                                                        test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    97662.00    97758.00  0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   595179.00   595739.00  0.1%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    70603.00    70667.00  0.1%
                                                                            test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test    19877.00    19893.00  0.1%
                                                                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test    90231.00    90279.00  0.1%
                                                                                         test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33738.00    33754.00  0.0%
                                                                                     test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test    13262.00    13268.00  0.0%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1139964.00  1140460.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test   849507.00   849875.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1158379.00  1158859.00  0.0%
                                                                                   test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test    38724.00    38740.00  0.0%
                                                                                              test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test    15180.00    15186.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test    15484.00    15490.00  0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   167391.00   167455.00  0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test   137448.00   137496.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2030254.00  2030766.00  0.0%
                                                                              test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   302870.00   302934.00  0.0%
                                                                                    test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test   303126.00   303190.00  0.0%
                                                                                            test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test   241107.00   241155.00  0.0%
                                                                                      test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test   162974.00   163006.00  0.0%
                                                                                                 test-suite :: MultiSource/Applications/siod/siod.test   167168.00   167200.00  0.0%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1048796.00  1048988.00  0.0%
                                                                               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   201623.00   201655.00  0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   501734.00   501798.00  0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test   580888.00   580952.00  0.0%
                                                                                           test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   168319.00   168335.00  0.0%
                                                                        test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test   226022.00   226038.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test   118011.00   118015.00  0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test   550589.00   550605.00  0.0%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3072477.00  3072541.00  0.0%
                                                                                 test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2385563.00  2385579.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   389171.00   389155.00 -0.0%
                                                                                                   test-suite :: MultiSource/Applications/lua/lua.test   234764.00   234748.00 -0.0%
                                                                                        test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   227694.00   227678.00 -0.0%
                                                                    test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test   119819.00   119807.00 -0.0%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test   117995.00   117983.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test   123610.00   123594.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81414.00    81398.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   782040.00   781880.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  9597420.00  9595292.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  9597420.00  9595292.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   911832.00   911608.00 -0.0%
                                                                                             test-suite :: MultiSource/Applications/oggenc/oggenc.test   192507.00   192459.00 -0.0%
                                                            test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test   122843.00   122811.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   122292.00   122260.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   777363.00   777155.00 -0.0%
                                                                            test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test   123265.00   123205.00 -0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   315534.00   315358.00 -0.1%
                                                                        test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test   128163.00   128083.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test     6562.00     6555.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test    23428.00    23396.00 -0.1%
                                                                             test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    22749.00    22717.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39549.00    39485.00 -0.2%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39546.00    39482.00 -0.2%
                                                                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test    57214.00    57118.00 -0.2%
                                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   413668.00   412804.00 -0.2%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1044047.00  1041487.00 -0.2%
                                                                                            test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    12414.00    12382.00 -0.3%
                                                                                      test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test    31161.00    30969.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   224726.00   223254.00 -0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93512.00    92824.00 -0.7%
                                                                        test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   281151.00   278463.00 -1.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2820.00     2788.00 -1.1%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   156819.00   154739.00 -1.3%
                                                                 test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11160.00 -3.5%
                                                                                          test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test     6734.00     6382.00 -5.2%
                                                                                                                                                       results     results0    diff

ASCI_Purple/SMG2000 - extra vector code
VPlanNativePath/outer-loop-vect - extra vectorization, better vector
code
AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code
Rodinia/hotspot - small variations
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - extra vector code, better vectorization
BenchmarkGame/n-body - better vector code.
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations
Misc/flops - extra vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations
Misc-C++/Large - better vector code
CFP2017rate/526.blender_r - extra vector code
Prolangs-C++/city - extra vector code
MiBench/consumer-lame - extra vector code
CoyoteBench/fftbench - extra vector code
Olden/power - better vector code
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - extra vector code
CINT2017rate/531.deepsjeng_r - extra vector code
CFP2006/447.dealII - small variations
DOE-ProxyApps-C/miniAMR - small variations
Prolangs-C/unix-smail - small variations
DOE-ProxyApps-C++/PENNANT - small variations
CINT2006/473.astar - small variations
CFP2006/453.povray - small variations
JM/lencod - extra vector code
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/CoMD - small variations
CFP2006/470.lbm - extra vector code
CFP2017speed/619.lbm_s
CFP2017rate/519.lbm_r - extra vector code
CINT2006/456.hmmer - extra code vectorized
TSVC/ControlFlow-dbl - extra vector code
CFP2017rate/510.parest_r - better vector code
LCALS/SubsetALambdaLoops - extra code vectorized
LCALS/SubsetARawLoops - extra code vectorized
CFP2006/444.namd - extra code vectorized
CFP2006/482.sphinx3 - better vector code
Applications/siod - better vector code
Benchmarks/7zip - better vector code
DOE-ProxyApps-C++/CLAMR - extra code vectorized
Applications/sqlite3 - extra code vectorized
Applications/ClamAV - smaller vector code
MallocBench/gs - small variations
MicroBenchmarks/ImageProcessing - small variations
TSVC/StatementReordering-flt - extra code vectorized
CINT2006/471.omnetpp - small variations
CINT2006/403.gcc - extra code vectorized
CINT2006/483.xalancbmk - extra code vectorized
JM/ldecod - small variations
Applications/lua - extra code vectorized
mafft/pairlocalalign - small variations
TSVC/NodeSplitting-flt - extra code vectorized
TSVC/Recurrences-flt - extra code vectorized
TSVC/InductionVariable-flt - extra code vectorized
FreeBench/pifft - small variations
CINT2006/464.h264ref - extra code vectorized
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined
CINT2006/445.gobmk - small variations
Applications/oggenc - small variations
TSVC/LoopRestructuring-flt - extra code vectorized
TSVC/CrossingThresholds-flt - extra code vectorized
CFP2017rate/508.namd_r - small variations
TSVC/ControlFlow-flt - extra code vectorized
mediabench/g721 - small variations
Prolangs-C/compiler - small variations
FreeBench/fourinarow - better vector code
MiBench/telecomm-gsm - small variation in vector code
mediabench/gsm - same
Prolangs-C/bison - small variations
Adobe-C++/loop_unroll - extra code vectorized
Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vetor
code
McCat/18-imp - variations in vector code
Prolangs-C/gnugo - variations in vector code
MallocBench/espresso - extra code vectorized
DOE-ProxyApps-C++/miniFE - small variations in vector code
Prolangs-C/TimberWolfMC - extra code vectorized, small changes in
previously vectorized code.
Olden/tsp - small changes in vector code
CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores
MiBench/security-blowfish - extra code vectorized
McCat/08-main - better vector code.

Metric: size..text, RISCV, sifive-p670

Program                                                                                                                                                size..text
                                                                                                                                                       results    results0   diff
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   63580.00   64020.00  0.7%
                                                                   test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   21388.00   21406.00  0.1%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  296992.00  297088.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  968112.00  968208.00  0.0%
                                                        test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test   45160.00   45164.00  0.0%
                                                                         test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0%
                                                                        test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test   49764.00   49762.00 -0.0%
                                                                                           test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  449132.00  449108.00 -0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  695932.00  695892.00 -0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  508820.00  508788.00 -0.0%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  508820.00  508788.00 -0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0%
                                                                                 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  166522.00  166490.00 -0.0%
                                                                                    test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  722252.00  722092.00 -0.0%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   27554.00   27546.00 -0.0%
                                                                  test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test   10900.00   10896.00 -0.0%
                                                          test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test   46754.00   46732.00 -0.0%
                                                                                       test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  631570.00  631226.00 -0.1%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  850698.00  850218.00 -0.1%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   24816.00   24800.00 -0.1%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   24814.00   24798.00 -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1%
                                                                                                   test-suite :: MultiSource/Applications/hbd/hbd.test   27236.00   27204.00 -0.1%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  293848.00  293480.00 -0.1%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test   20160.00   20048.00 -0.6%
                                                                               test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test  182088.00  181040.00 -0.6%
                                                                           test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4788.00    4748.00 -0.8%

DOE-ProxyApps-C++/miniFE - extra vector code
MiBench/automotive-susan - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/511.povray_r - slightly better vector code
TSVC/StatementReordering-dbl - small variations
CINT2017rate/523.xalancbmk_r
CINT2017speed/623.xalancbmk_s - extra vector code
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
TSVC/CrossingThresholds-flt - small variations
Applications/sqlite3 - extra vector code
JM/lencod - extra vector code, small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - small variations
CFP2017rate/526.blender_r - extra vector code, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vectorizer/VPlanNativePath/outer-loop-vect - small variations
TSVC/CrossingThresholds-dbl - small variations
Benchmarks/tramp3d-v4 - small variations
Benchmarks/7zip - extra vector code
MiBench/telecomm-gsm - small variations
mediabench/gsm/toast - small variations
CFP2017rate/510.parest_r - extra vector code
Applications/hbd - extra vector code
JM/ldecod - better vector code
Prolangs-C/compiler - extra vector code
MallocBench/espresso - extra vector code
mediabench/g721/g721encode - extra vectorization

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: #107461
@mikaelholmen
Copy link
Collaborator

Hi @alexey-bataev

The following starts crashing with this patch:
opt -passes=slp-vectorizer bbi-99793.ll -o /dev/null
(I've run llvm-reduce but the input is unfortunately still quite large)

It crashes with

opt: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:7568: void llvm::slpvectorizer::BoUpSLP::buildTree_rec(ArrayRef<llvm::Value *>, unsigned int, const llvm::slpvectorizer::BoUpSLP::EdgeInfo &): Assertion `(allConstant(VL) || allSameType(VL)) && "Invalid types!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: build-all/bin/opt -passes=slp-vectorizer bbi-99793.ll -o /dev/null
1.	Running pass "function(slp-vectorizer)" on module "bbi-99793.ll"
2.	Running pass "slp-vectorizer" on function "test_structcopy_14"
 #0 0x00005562edfca907 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (build-all/bin/opt+0x422b907)
 #1 0x00005562edfc83ee llvm::sys::RunSignalHandlers() (build-all/bin/opt+0x42293ee)
 #2 0x00005562edfcb12f SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f3e3e628cf0 __restore_rt (/lib64/libpthread.so.0+0x12cf0)
 #4 0x00007f3e3c1e1acf raise (/lib64/libc.so.6+0x4eacf)
 #5 0x00007f3e3c1b4ea5 abort (/lib64/libc.so.6+0x21ea5)
 #6 0x00007f3e3c1b4d79 _nl_load_domain.cold.0 (/lib64/libc.so.6+0x21d79)
 #7 0x00007f3e3c1da426 (/lib64/libc.so.6+0x47426)
 #8 0x00005562ef8e84b6 llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&) (build-all/bin/opt+0x5b494b6)
 #9 0x00005562ef8e9fcd llvm::slpvectorizer::BoUpSLP::tryToVectorizeGatheredLoads(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>)::$_41::operator()(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>, bool) const SLPVectorizer.cpp:0:0
#10 0x00005562ef8e8a2b llvm::slpvectorizer::BoUpSLP::tryToVectorizeGatheredLoads(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>) (build-all/bin/opt+0x5b49a2b)
#11 0x00005562ef8f84ca llvm::slpvectorizer::BoUpSLP::transformNodes() (build-all/bin/opt+0x5b594ca)
#12 0x00005562ef96153b (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::DataLayout const&, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo const&) SLPVectorizer.cpp:0:0
#13 0x00005562ef93b1a8 llvm::SLPVectorizerPass::vectorizeHorReduction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&) (build-all/bin/opt+0x5b9c1a8)
#14 0x00005562ef93b58b llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (build-all/bin/opt+0x5b9c58b)
#15 0x00005562ef93025d llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5b9125d)
#16 0x00005562ef92d1b0 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (build-all/bin/opt+0x5b8e1b0)
#17 0x00005562ef92c805 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x5b8d805)
#18 0x00005562ef381e7d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilderPipelines.cpp:0:0
#19 0x00005562ee1d51da llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x44361da)
#20 0x00005562ef37a1ed llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilderPipelines.cpp:0:0
#21 0x00005562ee1d9ca1 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (build-all/bin/opt+0x443aca1)
#22 0x00005562ef373d8d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) PassBuilderPipelines.cpp:0:0
#23 0x00005562ee1d3f1a llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (build-all/bin/opt+0x4434f1a)
#24 0x00005562ef31df5b llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool) (build-all/bin/opt+0x557ef5b)
#25 0x00005562edf91f7d optMain (build-all/bin/opt+0x41f2f7d)
#26 0x00007f3e3c1cdd85 __libc_start_main (/lib64/libc.so.6+0x3ad85)
#27 0x00005562edf8baee _start (build-all/bin/opt+0x41ecaee)
Abort (core dumped)

bbi-99793.ll.gz

alexey-bataev added a commit that referenced this pull request Oct 4, 2024
…ypes

Need to check not only parents, but also types for compatible loads,
when trying to build the vectorizable sequences.

Fixes crash reported in #107461 (comment)
@alexey-bataev
Copy link
Member Author

Hi @alexey-bataev

The following starts crashing with this patch: opt -passes=slp-vectorizer bbi-99793.ll -o /dev/null (I've run llvm-reduce but the input is unfortunately still quite large)

It crashes with

opt: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:7568: void llvm::slpvectorizer::BoUpSLP::buildTree_rec(ArrayRef<llvm::Value *>, unsigned int, const llvm::slpvectorizer::BoUpSLP::EdgeInfo &): Assertion `(allConstant(VL) || allSameType(VL)) && "Invalid types!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: build-all/bin/opt -passes=slp-vectorizer bbi-99793.ll -o /dev/null
1.	Running pass "function(slp-vectorizer)" on module "bbi-99793.ll"
2.	Running pass "slp-vectorizer" on function "test_structcopy_14"
 #0 0x00005562edfca907 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (build-all/bin/opt+0x422b907)
 #1 0x00005562edfc83ee llvm::sys::RunSignalHandlers() (build-all/bin/opt+0x42293ee)
 #2 0x00005562edfcb12f SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f3e3e628cf0 __restore_rt (/lib64/libpthread.so.0+0x12cf0)
 #4 0x00007f3e3c1e1acf raise (/lib64/libc.so.6+0x4eacf)
 #5 0x00007f3e3c1b4ea5 abort (/lib64/libc.so.6+0x21ea5)
 #6 0x00007f3e3c1b4d79 _nl_load_domain.cold.0 (/lib64/libc.so.6+0x21d79)
 #7 0x00007f3e3c1da426 (/lib64/libc.so.6+0x47426)
 #8 0x00005562ef8e84b6 llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&) (build-all/bin/opt+0x5b494b6)
 #9 0x00005562ef8e9fcd llvm::slpvectorizer::BoUpSLP::tryToVectorizeGatheredLoads(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>)::$_41::operator()(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>, bool) const SLPVectorizer.cpp:0:0
#10 0x00005562ef8e8a2b llvm::slpvectorizer::BoUpSLP::tryToVectorizeGatheredLoads(llvm::ArrayRef<llvm::SmallVector<std::pair<llvm::LoadInst*, int>, 3u>>) (build-all/bin/opt+0x5b49a2b)
#11 0x00005562ef8f84ca llvm::slpvectorizer::BoUpSLP::transformNodes() (build-all/bin/opt+0x5b594ca)
#12 0x00005562ef96153b (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::DataLayout const&, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo const&) SLPVectorizer.cpp:0:0
#13 0x00005562ef93b1a8 llvm::SLPVectorizerPass::vectorizeHorReduction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&) (build-all/bin/opt+0x5b9c1a8)
#14 0x00005562ef93b58b llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (build-all/bin/opt+0x5b9c58b)
#15 0x00005562ef93025d llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5b9125d)
#16 0x00005562ef92d1b0 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (build-all/bin/opt+0x5b8e1b0)
#17 0x00005562ef92c805 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x5b8d805)
#18 0x00005562ef381e7d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilderPipelines.cpp:0:0
#19 0x00005562ee1d51da llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x44361da)
#20 0x00005562ef37a1ed llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilderPipelines.cpp:0:0
#21 0x00005562ee1d9ca1 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (build-all/bin/opt+0x443aca1)
#22 0x00005562ef373d8d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) PassBuilderPipelines.cpp:0:0
#23 0x00005562ee1d3f1a llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (build-all/bin/opt+0x4434f1a)
#24 0x00005562ef31df5b llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool) (build-all/bin/opt+0x557ef5b)
#25 0x00005562edf91f7d optMain (build-all/bin/opt+0x41f2f7d)
#26 0x00007f3e3c1cdd85 __libc_start_main (/lib64/libc.so.6+0x3ad85)
#27 0x00005562edf8baee _start (build-all/bin/opt+0x41ecaee)
Abort (core dumped)

bbi-99793.ll.gz

Must be fixed in d991e05

@mikaelholmen
Copy link
Collaborator

For info, I've bisected (what I think is) a miscompile to this patch. I'm trying to extract some kind of reproducer.

@mikaelholmen
Copy link
Collaborator

For info, I've bisected (what I think is) a miscompile to this patch. I'm trying to extract some kind of reproducer.

Reproducer:

clang bbi-100697.c -O2 -std=c23
./a.out

After 1833d41 this results in

Fail!

and before it results in

Pass!

I have no idea what goes wrong, if SLP does anything bad or if it just triggers a bug in some later pass.
The input was originally part of a much larger test so I've reduced it a lot but unfortunately it's still quite large (and ugly).

bbi-100697.c.gz

@alexey-bataev
Copy link
Member Author

For info, I've bisected (what I think is) a miscompile to this patch. I'm trying to extract some kind of reproducer.

Reproducer:

clang bbi-100697.c -O2 -std=c23
./a.out

After 1833d41 this results in

Fail!

and before it results in

Pass!

I have no idea what goes wrong, if SLP does anything bad or if it just triggers a bug in some later pass. The input was originally part of a much larger test so I've reduced it a lot but unfortunately it's still quite large (and ugly).

bbi-100697.c.gz

Thanks, I will check it ASAP

@alexey-bataev
Copy link
Member Author

alexey-bataev commented Oct 30, 2024

For info, I've bisected (what I think is) a miscompile to this patch. I'm trying to extract some kind of reproducer.

Reproducer:

clang bbi-100697.c -O2 -std=c23
./a.out

After 1833d41 this results in

Fail!

and before it results in

Pass!

I have no idea what goes wrong, if SLP does anything bad or if it just triggers a bug in some later pass. The input was originally part of a much larger test so I've reduced it a lot but unfortunately it's still quite large (and ugly).

bbi-100697.c.gz

I checked it, looks like the issue is not in SLP vectorizer, but in codegen.

clang bbi-100697.c -O2 -std=c23 -mllvm -opt-bisect-limit=393 && ./a.out

passes, while

clang bbi-100697.c -O2 -std=c23 -mllvm -opt-bisect-limit=399 && ./a.out

fails.
The passes between these numbers are:

BISECT: running pass (394) X86 DAG->DAG Instruction Selection on function (main)
BISECT: running pass (395) Local Dynamic TLS Access Clean-up on function (main)
BISECT: running pass (396) X86 Domain Reassignment Pass on function (main)
BISECT: running pass (397) Early Tail Duplication on function (main)
BISECT: running pass (398) Optimize machine instruction PHIs on function (main)
BISECT: running pass (399) Merge disjoint stack slots on function (main)

The compiler crashes, if -opt-bisect-limit is set a value between 394-399, so cannot tell precisely what pass causes miscompilation.

@mikaelholmen
Copy link
Collaborator

I wrote an issue about the error: #114360

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants