[SLP] Avoid crash in computeExtractCost #93188

bjope · 2024-05-23T13:06:30Z

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) {
%p1 = getelementptr i32, ptr %p0, i16 1
%p2 = getelementptr i32, ptr %p0, i16 2
%p3 = getelementptr i32, ptr %p0, i16 3
%p4 = getelementptr i32, ptr %p0, i16 4
%p5 = getelementptr i32, ptr %p0, i16 5
%p6 = getelementptr i32, ptr %p0, i16 6
%p7 = getelementptr i32, ptr %p0, i16 7
%elt = extractelement <64 x i32> %vec, i32 0
store i32 %elt, ptr %p0
store i32 %elt, ptr %p1
store i32 %elt, ptr %p2
store i32 %elt, ptr %p3
store i32 %elt, ptr %p4
store i32 %elt, ptr %p5
store i32 %elt, ptr %p6
store i32 %elt, ptr %p7
ret void
}

And the scenario was like this:

VL and Mask has 8 elements at entry to computeExtractCost.
NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
NumElts is calculated as 64 (given by extractelement <64 x i32>).
NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

@foo

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost. Input IR looked like this: define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void } And the scenario was like this: - VL and Mask has 8 elements at entry to computeExtractCost. - NumParts is 2 (v8i32 is not legal, but v4i32 is legal). - NumElts is calculated as 64 (given by extractelement <64 x i32>). - NumSrcRegs is calculated and is set to 1 (v64i32 is legal). - EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs). - Assertion failure happens when doing ArrayRef<int> MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0. This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop. Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

llvmbot · 2024-05-23T13:07:04Z

@llvm/pr-subscribers-llvm-transforms

Author: Björn Pettersson (bjope)

Changes

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) {
%p1 = getelementptr i32, ptr %p0, i16 1
%p2 = getelementptr i32, ptr %p0, i16 2
%p3 = getelementptr i32, ptr %p0, i16 3
%p4 = getelementptr i32, ptr %p0, i16 4
%p5 = getelementptr i32, ptr %p0, i16 5
%p6 = getelementptr i32, ptr %p0, i16 6
%p7 = getelementptr i32, ptr %p0, i16 7
%elt = extractelement <64 x i32> %vec, i32 0
store i32 %elt, ptr %p0
store i32 %elt, ptr %p1
store i32 %elt, ptr %p2
store i32 %elt, ptr %p3
store i32 %elt, ptr %p4
store i32 %elt, ptr %p5
store i32 %elt, ptr %p6
store i32 %elt, ptr %p7
ret void
}

And the scenario was like this:

VL and Mask has 8 elements at entry to computeExtractCost.
NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
NumElts is calculated as 64 (given by extractelement <64 x i32>).
NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
Assertion failure happens when doing ArrayRef<int> MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

Full diff: https://github.com/llvm/llvm-project/pull/93188.diff

1 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+7-5)

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 08ecbe304429e..f25ddf6bd8082 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -8318,11 +8318,13 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     for (unsigned Part = 0; Part < NumParts; ++Part) {
       if (!ShuffleKinds[Part])
         continue;
-      ArrayRef<int> MaskSlice =
-          Mask.slice(Part * EltsPerVector,
-                     (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0)
-                         ? Mask.size() % EltsPerVector
-                         : EltsPerVector);
+      unsigned SliceStart = Part * EltsPerVector;
+      if (SliceStart >= Mask.size())
+        break;
+      unsigned SliceSize = (SliceStart + EltsPerVector) > Mask.size()
+                               ? Mask.size() - SliceStart
+                               : EltsPerVector;
+      ArrayRef<int> MaskSlice = Mask.slice(SliceStart, SliceSize);
       SmallVector<int> SubMask(EltsPerVector, PoisonMaskElem);
       copy(MaskSlice, SubMask.begin());
       std::optional<TTI::ShuffleKind> RegShuffleKind =

bjope · 2024-05-23T13:13:41Z

Hi @alexey-bataev , do you think this patch looks OK despite not having any in-tree reproducer?

I could not really find any matching piece of code doing the slicing based on EltsPerVector like that, but I wonder a bit if the cost estimates here need to match some other piece of code used to emit the actual shuffles?

When I looked at this I also noticed some other things that looked suspect to me.

ShuffleCostEstimator::adjustExtracts is calculating SliceSize = VL.size() / NumParts. But afaict this isn't always an exact division. For example test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll will end up in a situation when divide 3 by 2. And then

   for (unsigned Part = 0; Part < NumParts; ++Part) {
      ArrayRef<int> SubMask = Mask.slice(Part * SliceSize, SliceSize);
      for (auto [I, V] : enumerate(VL.slice(Part * SliceSize, SliceSize))) {

won't cover for the full VL/mask. There are a couple of more places with similar code (e.g. BoUpSLP::tryToGatherExtractElements). It is not clear to me if this is a real problem. But the code seems to have some assumptions regarding how NumParts relates to VL.size() and Mask.size() etc. But it is not quite clear what those assumptions are.

alexey-bataev · 2024-05-23T13:22:49Z

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }

And the scenario was like this:

VL and Mask has 8 elements at entry to computeExtractCost.

NumParts is 2 (v8i32 is not legal, but v4i32 is legal).

NumElts is calculated as 64 (given by extractelement <64 x i32>).

NumSrcRegs is calculated and is set to 1 (v64i32 is legal).

EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).

Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

alexey-bataev · 2024-05-23T13:34:05Z

Hi @alexey-bataev , do you think this patch looks OK despite not having any in-tree reproducer?

I could not really find any matching piece of code doing the slicing based on EltsPerVector like that, but I wonder a bit if the cost estimates here need to match some other piece of code used to emit the actual shuffles?

When I looked at this I also noticed some other things that looked suspect to me.

ShuffleCostEstimator::adjustExtracts is calculating SliceSize = VL.size() / NumParts. But afaict this isn't always an exact division. For example test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll will end up in a situation when divide 3 by 2. And then
   for (unsigned Part = 0; Part < NumParts; ++Part) {
      ArrayRef<int> SubMask = Mask.slice(Part * SliceSize, SliceSize);
      for (auto [I, V] : enumerate(VL.slice(Part * SliceSize, SliceSize))) {
won't cover for the full VL/mask. There are a couple of more places with similar code (e.g. BoUpSLP::tryToGatherExtractElements). It is not clear to me if this is a real problem. But the code seems to have some assumptions regarding how NumParts relates to VL.size() and Mask.size() etc. But it is not quite clear what those assumptions are.

adjustExtracts code is correct for power-of-2 number of elements, but not correct for non-power-of-2 number of elements. I will fix it.

bjope · 2024-05-23T13:54:05Z

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.
Input IR looked like this:
define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }
And the scenario was like this:

VL and Mask has 8 elements at entry to computeExtractCost.

NumParts is 2 (v8i32 is not legal, but v4i32 is legal).

NumElts is calculated as 64 (given by extractelement <64 x i32>).

NumSrcRegs is calculated and is set to 1 (v64i32 is legal).

EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).

Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.
Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

I actually find the existing TTI.getNumberOfParts helper a bit weird (at least the default BasicTTIImple version). It is implemented as checking for some type legalization cost. Type legalization might involve scalarization/widening/splitting etc. So the number returned is some sort of cost (how many steps involved in legalizing the type?). But SLPVectorizer is for example treating it as NumSrcRegs.
Our target has lots of general purpose registers, and vectors could be mapped to a sequence of such registers. So v4i32 is for example 4 consecutive general purpose registers. Still v4i32 is a legal type, so getNumberOfParts("v4i32") is one.
The target also has a totally different register pool with vector registers. In that pool there are vector registers, for example making v32i32 legal. So getNumberOfParts("v32i32") is also one. Similarly v64i32 is legal, but it is actually two consecutive vector registers, but no split is needed at legalization so getNumberOfParts("v64i32") is also one.

alexey-bataev · 2024-05-23T14:00:45Z

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.
Input IR looked like this:
define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }
And the scenario was like this:

VL and Mask has 8 elements at entry to computeExtractCost.

NumParts is 2 (v8i32 is not legal, but v4i32 is legal).

NumElts is calculated as 64 (given by extractelement <64 x i32>).

NumSrcRegs is calculated and is set to 1 (v64i32 is legal).

EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).

Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.
Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

I actually find the existing TTI.getNumberOfParts helper a bit weird (at least the default BasicTTIImple version). It is implemented as checking for some type legalization cost. Type legalization might involve scalarization/widening/splitting etc. So the number returned is some sort of cost (how many steps involved in legalizing the type?). But SLPVectorizer is for example treating it as NumSrcRegs. Our target has lots of general purpose registers, and vectors could be mapped to a sequence of such registers. So v4i32 is for example 4 consecutive general purpose registers. Still v4i32 is a legal type, so getNumberOfParts("v4i32") is one. The target also has a totally different register pool with vector registers. In that pool there are vector registers, for example making v32i32 legal. So getNumberOfParts("v32i32") is also one. Similarly v64i32 is legal, but it is actually two consecutive vector registers, but no split is needed at legalization so getNumberOfParts("v64i32") is also one.

TTI.getNumberOfParts is correct in general. I'll try to fix the code to support such non-standard situations.

bjope · 2024-05-27T18:05:56Z

The problem attempted to be solved here was solved by commit 70a54bc (#93213) instead.

bjope requested a review from alexey-bataev May 23, 2024 13:06

llvmbot added vectorizers llvm:transforms labels May 23, 2024

bjope closed this May 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SLP] Avoid crash in computeExtractCost #93188

[SLP] Avoid crash in computeExtractCost #93188

Uh oh!

bjope commented May 23, 2024

Uh oh!

llvmbot commented May 23, 2024

Uh oh!

bjope commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

bjope commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

bjope commented May 27, 2024

Uh oh!

Uh oh!

[SLP] Avoid crash in computeExtractCost #93188

[SLP] Avoid crash in computeExtractCost #93188

Uh oh!

Conversation

bjope commented May 23, 2024

Uh oh!

llvmbot commented May 23, 2024

Uh oh!

bjope commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

bjope commented May 23, 2024

Uh oh!

alexey-bataev commented May 23, 2024

Uh oh!

bjope commented May 27, 2024

Uh oh!

Uh oh!