Skip to content

[SLP] Avoid crash in computeExtractCost #93188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

bjope
Copy link
Collaborator

@bjope bjope commented May 23, 2024

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) {
%p1 = getelementptr i32, ptr %p0, i16 1
%p2 = getelementptr i32, ptr %p0, i16 2
%p3 = getelementptr i32, ptr %p0, i16 3
%p4 = getelementptr i32, ptr %p0, i16 4
%p5 = getelementptr i32, ptr %p0, i16 5
%p6 = getelementptr i32, ptr %p0, i16 6
%p7 = getelementptr i32, ptr %p0, i16 7
%elt = extractelement <64 x i32> %vec, i32 0
store i32 %elt, ptr %p0
store i32 %elt, ptr %p1
store i32 %elt, ptr %p2
store i32 %elt, ptr %p3
store i32 %elt, ptr %p4
store i32 %elt, ptr %p5
store i32 %elt, ptr %p6
store i32 %elt, ptr %p7
ret void
}

And the scenario was like this:

  • VL and Mask has 8 elements at entry to computeExtractCost.
  • NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
  • NumElts is calculated as 64 (given by extractelement <64 x i32>).
  • NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
  • EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
  • Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

For a downstream target we ended up in a situation with assertion
failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) {
  %p1 = getelementptr i32, ptr %p0, i16 1
  %p2 = getelementptr i32, ptr %p0, i16 2
  %p3 = getelementptr i32, ptr %p0, i16 3
  %p4 = getelementptr i32, ptr %p0, i16 4
  %p5 = getelementptr i32, ptr %p0, i16 5
  %p6 = getelementptr i32, ptr %p0, i16 6
  %p7 = getelementptr i32, ptr %p0, i16 7
  %elt = extractelement <64 x i32> %vec, i32 0
  store i32 %elt, ptr %p0
  store i32 %elt, ptr %p1
  store i32 %elt, ptr %p2
  store i32 %elt, ptr %p3
  store i32 %elt, ptr %p4
  store i32 %elt, ptr %p5
  store i32 %elt, ptr %p6
  store i32 %elt, ptr %p7
  ret void
}

And the scenario was like this:
- VL and Mask has 8 elements at entry to computeExtractCost.
- NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
- NumElts is calculated as 64 (given by extractelement <64 x i32>).
- NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
- EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
- Assertion failure happens when doing
      ArrayRef<int> MaskSlice =
          Mask.slice(Part * EltsPerVector,
                     (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0)
                         ? Mask.size() % EltsPerVector
                         : EltsPerVector);
  since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask
in at most EltsPerVector pieces until we have covered the full Mask.
When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target.
So unfortunately there is no regression test included in the patch.
@llvmbot
Copy link
Member

llvmbot commented May 23, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Björn Pettersson (bjope)

Changes

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) {
%p1 = getelementptr i32, ptr %p0, i16 1
%p2 = getelementptr i32, ptr %p0, i16 2
%p3 = getelementptr i32, ptr %p0, i16 3
%p4 = getelementptr i32, ptr %p0, i16 4
%p5 = getelementptr i32, ptr %p0, i16 5
%p6 = getelementptr i32, ptr %p0, i16 6
%p7 = getelementptr i32, ptr %p0, i16 7
%elt = extractelement <64 x i32> %vec, i32 0
store i32 %elt, ptr %p0
store i32 %elt, ptr %p1
store i32 %elt, ptr %p2
store i32 %elt, ptr %p3
store i32 %elt, ptr %p4
store i32 %elt, ptr %p5
store i32 %elt, ptr %p6
store i32 %elt, ptr %p7
ret void
}

And the scenario was like this:

  • VL and Mask has 8 elements at entry to computeExtractCost.
  • NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
  • NumElts is calculated as 64 (given by extractelement <64 x i32>).
  • NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
  • EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
  • Assertion failure happens when doing ArrayRef<int> MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.


Full diff: https://github.com/llvm/llvm-project/pull/93188.diff

1 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+7-5)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 08ecbe304429e..f25ddf6bd8082 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -8318,11 +8318,13 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     for (unsigned Part = 0; Part < NumParts; ++Part) {
       if (!ShuffleKinds[Part])
         continue;
-      ArrayRef<int> MaskSlice =
-          Mask.slice(Part * EltsPerVector,
-                     (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0)
-                         ? Mask.size() % EltsPerVector
-                         : EltsPerVector);
+      unsigned SliceStart = Part * EltsPerVector;
+      if (SliceStart >= Mask.size())
+        break;
+      unsigned SliceSize = (SliceStart + EltsPerVector) > Mask.size()
+                               ? Mask.size() - SliceStart
+                               : EltsPerVector;
+      ArrayRef<int> MaskSlice = Mask.slice(SliceStart, SliceSize);
       SmallVector<int> SubMask(EltsPerVector, PoisonMaskElem);
       copy(MaskSlice, SubMask.begin());
       std::optional<TTI::ShuffleKind> RegShuffleKind =

@bjope
Copy link
Collaborator Author

bjope commented May 23, 2024

Hi @alexey-bataev , do you think this patch looks OK despite not having any in-tree reproducer?

I could not really find any matching piece of code doing the slicing based on EltsPerVector like that, but I wonder a bit if the cost estimates here need to match some other piece of code used to emit the actual shuffles?


When I looked at this I also noticed some other things that looked suspect to me.

ShuffleCostEstimator::adjustExtracts is calculating SliceSize = VL.size() / NumParts. But afaict this isn't always an exact division. For example test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll will end up in a situation when divide 3 by 2. And then

   for (unsigned Part = 0; Part < NumParts; ++Part) {
      ArrayRef<int> SubMask = Mask.slice(Part * SliceSize, SliceSize);
      for (auto [I, V] : enumerate(VL.slice(Part * SliceSize, SliceSize))) {

won't cover for the full VL/mask. There are a couple of more places with similar code (e.g. BoUpSLP::tryToGatherExtractElements). It is not clear to me if this is a real problem. But the code seems to have some assumptions regarding how NumParts relates to VL.size() and Mask.size() etc. But it is not quite clear what those assumptions are.

@alexey-bataev
Copy link
Member

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.

Input IR looked like this:

define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }

And the scenario was like this:

  • VL and Mask has 8 elements at entry to computeExtractCost.
  • NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
  • NumElts is calculated as 64 (given by extractelement <64 x i32>).
  • NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
  • EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
  • Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.

Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

@alexey-bataev
Copy link
Member

Hi @alexey-bataev , do you think this patch looks OK despite not having any in-tree reproducer?

I could not really find any matching piece of code doing the slicing based on EltsPerVector like that, but I wonder a bit if the cost estimates here need to match some other piece of code used to emit the actual shuffles?

When I looked at this I also noticed some other things that looked suspect to me.

ShuffleCostEstimator::adjustExtracts is calculating SliceSize = VL.size() / NumParts. But afaict this isn't always an exact division. For example test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll will end up in a situation when divide 3 by 2. And then

   for (unsigned Part = 0; Part < NumParts; ++Part) {
      ArrayRef<int> SubMask = Mask.slice(Part * SliceSize, SliceSize);
      for (auto [I, V] : enumerate(VL.slice(Part * SliceSize, SliceSize))) {

won't cover for the full VL/mask. There are a couple of more places with similar code (e.g. BoUpSLP::tryToGatherExtractElements). It is not clear to me if this is a real problem. But the code seems to have some assumptions regarding how NumParts relates to VL.size() and Mask.size() etc. But it is not quite clear what those assumptions are.

adjustExtracts code is correct for power-of-2 number of elements, but not correct for non-power-of-2 number of elements. I will fix it.

@bjope
Copy link
Collaborator Author

bjope commented May 23, 2024

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.
Input IR looked like this:
define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }
And the scenario was like this:

  • VL and Mask has 8 elements at entry to computeExtractCost.
  • NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
  • NumElts is calculated as 64 (given by extractelement <64 x i32>).
  • NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
  • EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
  • Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.
Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

I actually find the existing TTI.getNumberOfParts helper a bit weird (at least the default BasicTTIImple version). It is implemented as checking for some type legalization cost. Type legalization might involve scalarization/widening/splitting etc. So the number returned is some sort of cost (how many steps involved in legalizing the type?). But SLPVectorizer is for example treating it as NumSrcRegs.
Our target has lots of general purpose registers, and vectors could be mapped to a sequence of such registers. So v4i32 is for example 4 consecutive general purpose registers. Still v4i32 is a legal type, so getNumberOfParts("v4i32") is one.
The target also has a totally different register pool with vector registers. In that pool there are vector registers, for example making v32i32 legal. So getNumberOfParts("v32i32") is also one. Similarly v64i32 is legal, but it is actually two consecutive vector registers, but no split is needed at legalization so getNumberOfParts("v64i32") is also one.

@alexey-bataev
Copy link
Member

For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.
Input IR looked like this:
define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }
And the scenario was like this:

  • VL and Mask has 8 elements at entry to computeExtractCost.
  • NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
  • NumElts is calculated as 64 (given by extractelement <64 x i32>).
  • NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
  • EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
  • Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.

This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.
Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.

It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

I actually find the existing TTI.getNumberOfParts helper a bit weird (at least the default BasicTTIImple version). It is implemented as checking for some type legalization cost. Type legalization might involve scalarization/widening/splitting etc. So the number returned is some sort of cost (how many steps involved in legalizing the type?). But SLPVectorizer is for example treating it as NumSrcRegs. Our target has lots of general purpose registers, and vectors could be mapped to a sequence of such registers. So v4i32 is for example 4 consecutive general purpose registers. Still v4i32 is a legal type, so getNumberOfParts("v4i32") is one. The target also has a totally different register pool with vector registers. In that pool there are vector registers, for example making v32i32 legal. So getNumberOfParts("v32i32") is also one. Similarly v64i32 is legal, but it is actually two consecutive vector registers, but no split is needed at legalization so getNumberOfParts("v64i32") is also one.

TTI.getNumberOfParts is correct in general. I'll try to fix the code to support such non-standard situations.

@bjope
Copy link
Collaborator Author

bjope commented May 27, 2024

The problem attempted to be solved here was solved by commit 70a54bc (#93213) instead.

@bjope bjope closed this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants