[X86] SimplifyDemandedVectorEltsForTargetNode - reduce the size of VPERMV/VPERMV3 nodes if the upper elements are not demanded #133923

RKSimon · 2025-04-01T15:19:47Z

With AVX512VL targets, use 128/256-bit VPERMV/VPERMV3 nodes when we only need the lower elements.

This exposed an issue with VPERMV3(X,M,Y) -> VPERMV(M,CONCAT(X,Y)) folds when X==Y, so I had to move that fold after the other VPERMV3 folds/canonicalizations.

I also took the opportunity to improve support for the VPERMV(M,CONCAT(Y,X)) case, but we can revert this if we'd prefer to avoid the extra VSHUFF64X2 node for non-constant shuffle masks (but separate loads) instead.
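To make the fold ordering concern concrete, here is a minimal standalone sketch of the two permute shapes discussed above. It does not use any LLVM API: `permv`, `permv3`, `concat` and `permv3SameOperand` are hypothetical helpers operating on plain `std::vector<int>`, undef mask entries are not modelled, and the widening of the mask/result that the real fold needs is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Model of a single-source variable permute (VPERMV-like):
// out[i] = src[mask[i] mod src.size()]. Mask entries are assumed non-negative.
Vec permv(const Vec &mask, const Vec &src) {
  Vec out;
  out.reserve(mask.size());
  for (int m : mask) {
    assert(m >= 0 && "undef mask entries are not modelled");
    out.push_back(src[static_cast<std::size_t>(m) % src.size()]);
  }
  return out;
}

// Model of a two-source variable permute (VPERMV3-like): indices 0..N-1
// select from x and N..2N-1 select from y, where N = x.size() = y.size().
Vec permv3(const Vec &x, const Vec &mask, const Vec &y) {
  assert(x.size() == y.size());
  const std::size_t n = x.size();
  Vec out;
  out.reserve(mask.size());
  for (int m : mask) {
    assert(m >= 0 && "undef mask entries are not modelled");
    const std::size_t idx = static_cast<std::size_t>(m) % (2 * n);
    out.push_back(idx < n ? x[idx] : y[idx - n]);
  }
  return out;
}

// CONCAT(lo, hi): lo occupies the low elements of the wider vector.
Vec concat(const Vec &lo, const Vec &hi) {
  Vec out(lo);
  out.insert(out.end(), hi.begin(), hi.end());
  return out;
}

// Repeated-operand case: VPERMV3(X, M, X) only needs a same-width
// single-source permute with each index reduced mod N.
Vec permv3SameOperand(const Vec &x, const Vec &mask) {
  Vec narrowMask;
  narrowMask.reserve(mask.size());
  for (int m : mask)
    narrowMask.push_back(m % static_cast<int>(x.size()));
  return permv(narrowMask, x);
}
```

In this model `permv3(x, m, y)` and `permv(m, concat(x, y))` agree for any in-range mask, which is the VPERMV3(X,M,Y) -> VPERMV(M,CONCAT(X,Y)) rewrite. When X==Y the same result is available from `permv3SameOperand` without doubling the source width, which is one plausible reason to run the simpler canonicalizations before the concat fold.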

phoebewang · 2025-04-02T09:23:17Z

llvm/lib/Target/X86/X86ISelLowering.cpp

+      SmallVector<SDValue, 2> Ops;
+      // TODO: Handle 128-bit PERMD/Q -> PSHUFD
+      if (Subtarget.hasVLX() &&
+          (VT.is512BitVector() || VT.getScalarSizeInBits() <= 16) &&


Why VT.getScalarSizeInBits() <= 16? I don't see a test for it.

OK - I'll try to add test coverage.

I've added test coverage and handling for 128-bit shuffles with 32/64-bit elements to correctly use VPERMILPS/D instructions (which can then further simplify to VPSHUFD etc.)

Do you mean 2426ac6? I don't see the test in this patch.

Yes, the new VPERMILP code path is now being hit by the 2426ac6 test - but it still eventually folds to the same VSHUFPS as it did before, just via a different set of combines.

I see the point.

My initial question was about the VT.getScalarSizeInBits() <= 16 condition: I expected to see some changes for vpermi2b/w nodes that are not 512 bits wide, but I didn't find any in the tests.

phoebewang · 2025-04-02T09:27:56Z

llvm/lib/Target/X86/X86ISelLowering.cpp

+        if (all_of(Mask,
+                   [&](int M) { return isUndefOrInRange(M, 0, HalfElts); })) {
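As a rough illustration of what this mask check guards, the sketch below models a lane-crossing permute on plain vectors. It is not the LLVM code: `isUndefOrInRange` here is a local stand-in with the same name as the helper in the diff, `kUndef`, `permute`, `canNarrowToLowHalf` and `permuteLowHalf` are hypothetical, and it assumes only the low half of the mask is inspected because only the low half of the result is wanted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kUndef = -1; // stand-in for an undef shuffle mask entry

// Local stand-in for the helper named in the diff: true when M is undef
// or Low <= M < Hi.
bool isUndefOrInRange(int M, int Low, int Hi) {
  return M == kUndef || (Low <= M && M < Hi);
}

// Reference permute: out[i] = src[mask[i]]. Undef lanes are modelled as 0
// so that two results can be compared directly.
std::vector<int> permute(const std::vector<int> &mask,
                         const std::vector<int> &src) {
  std::vector<int> out;
  out.reserve(mask.size());
  for (int m : mask)
    out.push_back(m == kUndef ? 0 : src[static_cast<std::size_t>(m)]);
  return out;
}

// Can the low half of the result be produced by a half-width permute?
// Only if every low mask entry is undef or indexes the low half of src.
bool canNarrowToLowHalf(const std::vector<int> &mask) {
  const int halfElts = static_cast<int>(mask.size() / 2);
  return std::all_of(mask.begin(), mask.begin() + halfElts, [&](int M) {
    return isUndefOrInRange(M, 0, halfElts);
  });
}

// Half-width permute: the low mask entries applied to the low half of src.
std::vector<int> permuteLowHalf(const std::vector<int> &mask,
                                const std::vector<int> &src) {
  const std::size_t halfElts = mask.size() / 2;
  std::vector<int> lowMask(mask.begin(), mask.begin() + halfElts);
  std::vector<int> lowSrc(src.begin(), src.begin() + halfElts);
  return permute(lowMask, lowSrc);
}
```

When `canNarrowToLowHalf` holds, the first half of `permute(mask, src)` equals `permuteLowHalf(mask, src)`. If any low mask entry reaches into the upper half of the source, the narrow form would read the wrong element, so the full-width node has to stay.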


Where did we check the upper elements are not demanded?

At about line 43714:

    // For 256/512-bit ops that are 128/256-bit ops glued together, if we do not
    // demand any of the high elements, then narrow the op to 128/256-bits: e.g.
    // (op ymm0, ymm1) --> insert undef, (op xmm0, xmm1), 0
    if ((VT.is256BitVector() || VT.is512BitVector()) &&
        DemandedElts.lshr(NumElts / 2) == 0) {
      unsigned SizeInBits = VT.getSizeInBits();
      unsigned ExtSizeInBits = SizeInBits / 2;

      // See if 512-bit ops only use the bottom 128-bits.
      if (VT.is512BitVector() && DemandedElts.lshr(NumElts / 4) == 0)
        ExtSizeInBits = SizeInBits / 4;

These "vector width reduction" folds run after the standard SimplifyDemandedVectorElts simplifications that are handled earlier in SimplifyDemandedVectorEltsForTargetNode.
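The width selection in the quoted snippet can be modelled outside LLVM with a plain bitmask. The sketch below is an approximation under stated assumptions: `narrowedWidthInBits` is a hypothetical function, the demanded-elements set is a `std::uint64_t` with bit i set when element i is demanded (standing in for the `APInt` used in the real code), and a return value of 0 means "do not narrow".

```cpp
#include <cassert>
#include <cstdint>

// Returns the width in bits that a 256/512-bit op could be narrowed to,
// or 0 when some upper-half element is demanded or the type is not
// 256/512 bits wide. Mirrors the shape of the quoted check:
//   DemandedElts.lshr(NumElts / 2) == 0  -> upper half not demanded
//   DemandedElts.lshr(NumElts / 4) == 0  -> only the bottom quarter demanded
unsigned narrowedWidthInBits(unsigned sizeInBits, unsigned numElts,
                             std::uint64_t demandedElts) {
  assert(numElts >= 4 && numElts <= 64 && "model limited to 4..64 elements");
  const bool is256 = sizeInBits == 256;
  const bool is512 = sizeInBits == 512;
  if (!(is256 || is512) || (demandedElts >> (numElts / 2)) != 0)
    return 0;

  unsigned extSizeInBits = sizeInBits / 2;
  // A 512-bit op that only uses its bottom 128 bits can drop to a quarter.
  if (is512 && (demandedElts >> (numElts / 4)) == 0)
    extSizeInBits = sizeInBits / 4;
  return extSizeInBits;
}
```

For a v16i32 op, a demanded mask of `0x000F` gives 128, `0x00F0` gives 256, and `0x0100` gives 0 because an upper-half element is needed. This only decides the candidate width. Whether a particular node such as VPERMV can actually be rebuilt at that width depends on the per-opcode checks in the patch.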

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512vl.ll

phoebewang · 2025-04-02T13:57:50Z

llvm/test/CodeGen/X86/shuffle-vs-trunc-128.ll

-; AVX512VBMI-FAST-NEXT:    vmovdqa {{.*#+}} xmm1 = [0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,79]
-; AVX512VBMI-FAST-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX512VBMI-FAST-NEXT:    vmovdqa {{.*#+}} xmm1 = [64,65,66,67,68,69,24,28,32,36,40,44,48,52,56,79]
+; AVX512VBMI-FAST-NEXT:    vpmovdb %ymm0, %xmm2


This is not expected either, right?

This change allows the VPMOVDB node to be created from another VPERMV3 node, so we no longer have 2 VPERMV3 nodes that we can easily fold together. We're still struggling to combine shuffles across different vector widths - it will be handled eventually after #133947, but that is a much larger WIP patch that is highly dependent on us getting this in first.

phoebewang

LGTM.

… VPERMV3 canonicalization Pulled out of #133923 - this prevents regressions with SimplifyDemandedVectorEltsForTargetNode exposing VPERMV3(X,M,X) repeated operand patterns which were getting concatenated to wider VPERMV nodes before simpler canonicalizations could clean them up.

…ze of VPERMV/VPERMV3 nodes if the upper elements are not demanded" (#134256) Found a typo in the VPERMV3 mask adjustment - I'm going to revert and re-apply the patch with a fix Reverts #133923

…ERMV/VPERMV3 nodes if the upper elements are not demanded (REAPPLIED) (#134263) With AVX512VL targets, use 128/256-bit VPERMV/VPERMV3 nodes when we only need the lower elements. Reapplied version of #133923 with fix for typo in the VPERMV3 mask adjustment

…ERMV v16f32/v16i32 nodes if the upper elements are not demanded (#134890) Missed in #133923 - even without AVX512VL, we can replace VPERMV v16f32/v16i32 nodes with the AVX2 v8f32/v8i32 equivalents.

RKSimon requested review from phoebewang and KanRobert April 1, 2025 15:19

phoebewang reviewed Apr 2, 2025

llvm/lib/Target/X86/X86ISelLowering.cpp

phoebewang reviewed Apr 2, 2025

RKSimon added a commit that referenced this pull request Apr 2, 2025

[X86] Add demanded elts for v8f32 VPERMV node

2426ac6

Based off #133923 - test to ensure the VPERMV node as only the lower 128-bit source elements are demanded.

RKSimon force-pushed the x86-demandedelts-vpermv-vpermv3 branch from 09f9845 to 1070f6b Compare April 2, 2025 11:04

llvmbot added the backend:X86 label Apr 2, 2025

phoebewang reviewed Apr 2, 2025

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512vl.ll (outdated)

phoebewang reviewed Apr 2, 2025

RKSimon added a commit that referenced this pull request Apr 2, 2025

[X86] Add demanded elts test coverage for vXi16 VPERMW nodes

3843dfe

Requested for #133923

phoebewang approved these changes Apr 3, 2025

RKSimon added 3 commits April 3, 2025 10:30

Remove swapped concat handling

311c2ff

[X86] vector-shuffle-combining-avx512bwvl.ll - regenerate checks

6d06e67

RKSimon force-pushed the x86-demandedelts-vpermv-vpermv3 branch from 5ac2a34 to 6d06e67 Compare April 3, 2025 09:31

RKSimon merged commit bf51609 into llvm:main Apr 3, 2025
9 of 11 checks passed

RKSimon deleted the x86-demandedelts-vpermv-vpermv3 branch April 3, 2025 10:01

RKSimon mentioned this pull request Apr 3, 2025

Revert "[X86] SimplifyDemandedVectorEltsForTargetNode - reduce the size of VPERMV/VPERMV3 nodes if the upper elements are not demanded" #134256

Merged

RKSimon mentioned this pull request Apr 3, 2025

[X86] SimplifyDemandedVectorEltsForTargetNode - reduce the size of VPERMV/VPERMV3 nodes if the upper elements are not demanded (REAPPLIED) #134263

Merged

RKSimon mentioned this pull request Apr 8, 2025

[X86] SimplifyDemandedVectorEltsForTargetNode - reduce the size of VPERMV v16f32/v16i32 nodes if the upper elements are not demanded #134890

Merged
