[LegalizeTypes] Handle non byte-sized elt types when splitting INSERT/EXTRACT_VECTOR_ELT #93357

bjope · 2024-05-25T00:00:43Z

DAGTypeLegalizer::SplitVecRes_INSERT_VECTOR_ELT and
DAGTypeLegalizer::SplitVecRes_EXTRACT_VECTOR_ELT did not handle
non byte-sized elements properly. In fact, it only dealt with
elements smaller than 8 bits (as well as byte-sized elements).

This patch generalizes the support for non byte-sized element by
always widening the the vector elements to next "round integer type"
(a power of 2 bit size). This should make sure that we can access a
single element via a simple byte-addressed scalar load/store.

Also removing a suspicious CustomLowerNode call from
SplitVecRes_INSERT_VECTOR_ELT. Considering that it did not reset
the Lo/Hi out arguments before the return I think that
DAGTypeLegalizer::SplitVectorResult could be fooled into registering
the input vector as being the result. This should however not have
caused any problems since DAGTypeLegalizer::SplitVectorResult is
doing the same CustomLowerNode call, making the code removed by
this patch redundant.

llvmbot · 2024-05-25T00:01:13Z

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: Björn Pettersson (bjope)

Changes

DAGTypeLegalizer::SplitVecRes_INSERT_VECTOR_ELT and DAGTypeLegalizer::SplitVecRes_EXTRACT_VECTOR_ELT did not handle non byte-sized elements properly. In fact, it only dealt with elements smaller than 8 bits (as well as byte-sized elements).

This patch generalizes the support for non byte-sized element by always extending the vector elements to match the store size for the element type when legalizing via a stack temporary. This should make sure that we can access a single element via a simple byte-addressed scalar load/store.

Full diff: https://github.com/llvm/llvm-project/pull/93357.diff

3 Files Affected:

(modified) llvm/include/llvm/CodeGen/ValueTypes.h (+9)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp (+6-10)
(added) llvm/test/CodeGen/X86/legalize-ins-ext-vec-elt.ll (+60)

diff --git a/llvm/include/llvm/CodeGen/ValueTypes.h b/llvm/include/llvm/CodeGen/ValueTypes.h
index dab6c421bf6e6..9ad582c8ab7a9 100644
--- a/llvm/include/llvm/CodeGen/ValueTypes.h
+++ b/llvm/include/llvm/CodeGen/ValueTypes.h
@@ -432,6 +432,15 @@ namespace llvm {
       return EVT::getVectorVT(Context, EltVT, getVectorElementCount());
     }
 
+    /// Return a VT for an integer vector type with the size of the elements
+    /// extended to the store size. The typed returned may be an extended
+    /// type.
+    EVT getStoreSizedIntegerVectorElementType(LLVMContext &Context) const {
+      EVT EltVT = getVectorElementType();
+      EltVT = EVT::getIntegerVT(Context, EltVT.getStoreSizeInBits());
+      return EVT::getVectorVT(Context, EltVT, getVectorElementCount());
+    }
+
     // Return a VT for a vector type with the same element type but
     // half the number of elements. The type returned may be an
     // extended type.
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 40e621f0db220..28ecd229bf398 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1832,10 +1832,9 @@ void DAGTypeLegalizer::SplitVecRes_INSERT_VECTOR_ELT(SDNode *N, SDValue &Lo,
   // Make the vector elements byte-addressable if they aren't already.
   EVT VecVT = Vec.getValueType();
   EVT EltVT = VecVT.getVectorElementType();
-  if (VecVT.getScalarSizeInBits() < 8) {
-    EltVT = MVT::i8;
-    VecVT = EVT::getVectorVT(*DAG.getContext(), EltVT,
-                             VecVT.getVectorElementCount());
+  if (!EltVT.isByteSized()) {
+    VecVT = VecVT.getStoreSizedIntegerVectorElementType(*DAG.getContext());
+    EltVT = VecVT.getVectorElementType();
     Vec = DAG.getNode(ISD::ANY_EXTEND, dl, VecVT, Vec);
     // Extend the element type to match if needed.
     if (EltVT.bitsGT(Elt.getValueType()))
@@ -3443,10 +3442,9 @@ SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N) {
   // Make the vector elements byte-addressable if they aren't already.
   SDLoc dl(N);
   EVT EltVT = VecVT.getVectorElementType();
-  if (VecVT.getScalarSizeInBits() < 8) {
-    EltVT = MVT::i8;
-    VecVT = EVT::getVectorVT(*DAG.getContext(), EltVT,
-                             VecVT.getVectorElementCount());
+  if (!EltVT.isByteSized()) {
+    VecVT = VecVT.getStoreSizedIntegerVectorElementType(*DAG.getContext());
+    EltVT = VecVT.getVectorElementType();
     Vec = DAG.getNode(ISD::ANY_EXTEND, dl, VecVT, Vec);
   }
 
@@ -3465,8 +3463,6 @@ SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N) {
   // Load back the required element.
   StackPtr = TLI.getVectorElementPointer(DAG, StackPtr, VecVT, Idx);
 
-  // FIXME: This is to handle i1 vectors with elements promoted to i8.
-  // i1 vector handling needs general improvement.
   if (N->getValueType(0).bitsLT(EltVT)) {
     SDValue Load = DAG.getLoad(EltVT, dl, Store, StackPtr,
       MachinePointerInfo::getUnknownStack(DAG.getMachineFunction()));
diff --git a/llvm/test/CodeGen/X86/legalize-ins-ext-vec-elt.ll b/llvm/test/CodeGen/X86/legalize-ins-ext-vec-elt.ll
new file mode 100644
index 0000000000000..7b517c2ca574f
--- /dev/null
+++ b/llvm/test/CodeGen/X86/legalize-ins-ext-vec-elt.ll
@@ -0,0 +1,60 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=x86_64-- -o - %s| FileCheck %s
+
+; Verify that we support non byte-sized elements, together with variable index.
+
+define void @Legalize_SplitVectorResult_insert_i28(i28 %elt, i16 %idx, ptr %p1, ptr %p2) nounwind {
+; CHECK-LABEL: Legalize_SplitVectorResult_insert_i28:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    # kill: def $esi killed $esi def $rsi
+; CHECK-NEXT:    xorps %xmm0, %xmm0
+; CHECK-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    andl $7, %esi
+; CHECK-NEXT:    movl %edi, -40(%rsp,%rsi,4)
+; CHECK-NEXT:    movaps {{.*#+}} xmm0 = [268435455,268435455,268435455,268435455]
+; CHECK-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm1
+; CHECK-NEXT:    andps %xmm0, %xmm1
+; CHECK-NEXT:    andps -{{[0-9]+}}(%rsp), %xmm0
+; CHECK-NEXT:    movaps %xmm0, 16(%rcx)
+; CHECK-NEXT:    movaps %xmm1, (%rcx)
+; CHECK-NEXT:    retq
+  %vec1 = insertelement <8 x i28> zeroinitializer, i28 %elt, i16 %idx
+  %vec2 = zext <8 x i28> %vec1 to <8 x i32>
+  store <8 x i32> %vec2, ptr %p2
+  ret void
+}
+
+define void @Legalize_SplitVectorResult_extract_i12(i16 %idx, ptr %p1, ptr %p2) nounwind {
+; CHECK-LABEL: Legalize_SplitVectorResult_extract_i12:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    # kill: def $edi killed $edi def $rdi
+; CHECK-NEXT:    movaps (%rsi), %xmm0
+; CHECK-NEXT:    movaps 16(%rsi), %xmm1
+; CHECK-NEXT:    movaps 32(%rsi), %xmm2
+; CHECK-NEXT:    movaps 48(%rsi), %xmm3
+; CHECK-NEXT:    movaps 64(%rsi), %xmm4
+; CHECK-NEXT:    movaps 80(%rsi), %xmm5
+; CHECK-NEXT:    movaps 96(%rsi), %xmm6
+; CHECK-NEXT:    movaps 112(%rsi), %xmm7
+; CHECK-NEXT:    movaps %xmm7, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm6, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm5, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm4, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm3, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm2, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
+; CHECK-NEXT:    andl $63, %edi
+; CHECK-NEXT:    movzwl -128(%rsp,%rdi,2), %eax
+; CHECK-NEXT:    andl $4095, %eax # imm = 0xFFF
+; CHECK-NEXT:    movw %ax, (%rdx)
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    retq
+  %vec = load <64 x i16>, ptr %p1
+  %trunc = trunc <64 x i16> %vec to <64 x i12>
+  %elt = extractelement <64 x i12> %trunc, i16 %idx
+  store i12 %elt, ptr %p2
+  ret void
+}

bjope · 2024-05-27T18:00:23Z

llvm/test/CodeGen/X86/legalize-ins-ext-vec-elt.ll

+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=x86_64-- -o - %s| FileCheck %s
+
+; Verify that we support non byte-sized elements, together with variable index.


Both test case added here triggered assertion failure earlier. So can't really pre-commit anything interesting.

efriedma-quic · 2024-05-27T20:38:24Z

Is the nearest byte really what we want here? We can widen to any type we want, and on most targets, storing an <N x i32> is going to be a lot more efficient than storing an <N x i24>.

Instead of expanding immediately, should we just construct an EXTRACT_VECTOR_ELT with the promoted vector type, and legalize that? That might allow custom legalization to kick in in certain cases.

bjope · 2024-05-28T15:22:00Z

Is the nearest byte really what we want here? We can widen to any type we want, and on most targets, storing an <N x i32> is going to be a lot more efficient than storing an <N x i24>.

It is a bit hard to understand what is best for every target (unless perhaps scanning for some legal type, if such a type exists).
Some reasons for not picking a larger element type would be:

If the vector is <N x i24> there is no padding, and one could just memcpy it to the stack if the source vector is in memory. It seems a tiny bit weird to promote <N x i23> to <N x i32> if we do not promote <N x i24> to <N x i32>. But nothing here says that we would promote elements that are byte-sized, so it is possible to promote to next power-of-two.
If for example i128 isn't legal, then we would end up wasting lots of stack etc if for example extending <256 x i65> to <256 x i128>.
Our downstream target can store things like i24, so using getRoundIntegerType would both introduce an unneeded extend, but also wasting stack space. That might ofcourse be an uncommong thing when it comes to our target, and then we should add custom lowering for our target instead of making that the default. One problem with that is that one can't say that <4 x i23> should be custom legalized since that isn't an existing MVT type. So that complicates things a bit if I would pick a power-of-two type blindly without checking for legal types to load/store.
Picking larger types results in touching more memory, which also might be negative (memory bandwidth, dirtying caches, etc).
These legalization has been broken for years as far as I can tell. If performance (and correctness) is an issue I guess most targets are doing custom lowering already. My goal here was just to do something simple to avoid crashes/miscompiles in case this code for example is triggered by IR fuzzy testing etc.

To summarize. How to optimize might be target specific.
Then the question is if we want to add more logic to the common legalization in this case, or just keeping it simple. Just picking a "rounded integer type" instead of "store sized integer type" is still a simple solution. I'm just not sure if it is better in general or not.

Instead of expanding immediately, should we just construct an EXTRACT_VECTOR_ELT with the promoted vector type, and legalize that? That might allow custom legalization to kick in in certain cases.

I don't really see the point in giving the target a second chance for custom legalization. A target could just promote the input vector elements itself if that is feasible when doing CustomLowerNode the first time. Otherwise we need to speculate about what element size that would be feasible for the current target, to match whatever types the custom lowering handles. And it will be hard to pick a type that is best for all targets (unless using a target hook just to please custom lowering...).

arsenm · 2024-05-29T10:51:39Z

Is the nearest byte really what we want here? We can widen to any type we want, and on most targets, storing an <N x i32> is going to be a lot more efficient than storing an <N x i24>.

It is a bit hard to understand what is best for every target (unless perhaps scanning for some legal type, if such a type exists). Some reasons for not picking a larger element type would be:

Using the next legal element type would be a reasonable default. Maybe you want next legal store memory type

Our downstream target can store things like i24, so using getRoundIntegerType would both introduce an unneeded extend, but also wasting stack space.

As in i24 is a legal type, or truncating store i32->i24 is legal?

bjope · 2024-05-29T12:09:37Z

Is the nearest byte really what we want here? We can widen to any type we want, and on most targets, storing an <N x i32> is going to be a lot more efficient than storing an <N x i24>.

It is a bit hard to understand what is best for every target (unless perhaps scanning for some legal type, if such a type exists). Some reasons for not picking a larger element type would be:

Using the next legal element type would be a reasonable default. Maybe you want next legal store memory type

Our downstream target can store things like i24, so using getRoundIntegerType would both introduce an unneeded extend, but also wasting stack space.

As in i24 is a legal type, or truncating store i32->i24 is legal?

As in i24 being a legal type. And by saying that I realize that it would be an uncommon thing for in-tree targets as such an MVT doens't exist in-tree.

So it seems to make more sense in-tree (given by what ValueTypes.td currently looks like) to make use of getRoundIntegerType here. Or alternatively we should do something slightly more complicated by trying to find a legal type for ISD::STORE. But I don't know if there already exist some helper for that or if we would need to invent it.

I'll try to update this PR to use getRoundIntegerType instead of the getStoreSizedIntegerVectorElementType. And then we will deal with it downstream, if we want special handling for types like i24.

(Maybe getRoundIntegerType should be described as rounding up to next simple value type, rather than rounding to nearest power of two. Then we could let it round to other non-power-of-two sizes depending on what ValueTypes.td looks like. But that is ofcourse a totally different story.)

arsenm · 2024-05-29T15:50:07Z

So it seems to make more sense in-tree (given by what ValueTypes.td currently looks like) to make use of getRoundIntegerType here. Or alternatively we should do something slightly more complicated by trying to find a legal type for ISD::STORE. But I don't know if there already exist some helper for that or if we would need to invent it.

I'll try to update this PR to use getRoundIntegerType instead of the getStoreSizedIntegerVectorElementType. And then we will deal with it downstream, if we want special handling for types like i24.

getTypeToTransformTo?

bjope · 2024-06-05T08:59:47Z

So it seems to make more sense in-tree (given by what ValueTypes.td currently looks like) to make use of getRoundIntegerType here. Or alternatively we should do something slightly more complicated by trying to find a legal type for ISD::STORE. But I don't know if there already exist some helper for that or if we would need to invent it.
I'll try to update this PR to use getRoundIntegerType instead of the getStoreSizedIntegerVectorElementType. And then we will deal with it downstream, if we want special handling for types like i24.

getTypeToTransformTo?

That one could perhaps be used in some way. But I don't think it really guarantees to return a type that isn't using a packed layout when used as a vector element type etc. I think it would complicate the code a bit if using that one instead of getRoundIntegerType.

efriedma-quic · 2024-06-05T17:53:02Z

A target could just promote the input vector elements itself if that is feasible when doing CustomLowerNode the first time.

For non-MVT types, I don't think there's any way for a target to request custom lowering. At least, not using setOperationAction.

bjope · 2024-06-05T19:15:29Z

A target could just promote the input vector elements itself if that is feasible when doing CustomLowerNode the first time.

For non-MVT types, I don't think there's any way for a target to request custom lowering. At least, not using setOperationAction.

Right. CustomLowerNode would just bail out early for such types. I did not consider that, so thanks for enlightening me.
I'll take another look at your proposal related to "construct an EXTRACT_VECTOR_ELT with the promoted vector type".

arsenm · 2024-06-06T16:34:32Z

For non-MVT types, I don't think there's any way for a target to request custom lowering. At least, not using setOperationAction.

MVT::Other is supposed to work for this

arsenm · 2024-06-12T19:03:25Z

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

    Vec = DAG.getNode(ISD::ANY_EXTEND, dl, VecVT, Vec);
+    SDValue NewExtract =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Vec, Idx);
+    return DAG.getZExtOrTrunc(NewExtract, dl, N->getValueType(0));


Why isn't this any ext or trunc?

Nice catch. I've updated it to use getAnyExtOrTrunc now.

efriedma-quic · 2024-06-12T20:03:31Z

Is there some reason we want to create a new EXTRACT_VECTOR_ELT, but not a new INSERT_VECTOR_ELT?

bjope · 2024-06-12T20:36:29Z

Is there some reason we want to create a new EXTRACT_VECTOR_ELT, but not a new INSERT_VECTOR_ELT?

I think that some earlier step should have choosen to widen instead of split the vector if such legalization is preferred for the INSERT_VECTOR_ELT case.
And since SplitVecRes_INSERT_VECTOR_ELT should return a pair of vectors (the lo/hi halfs) it is a bit unclear to me how to do it.. One widened INSERT_VECTOR_ELT with a split+trunc afterwards? And then hope for that the split+trunc are cheap to legalize.

I think optimizing that situation is out-of-scope for this patch anyway. This patch is just supposed to support some scenatios for which we either hit assertion failure of miscompiled in the past. But if suddenly the legalization here is considered as performance critical, then we can follow up with optimizations later. We would need some test cases that show the benefit of doing it some other way, and I don't know about any such test cases.

efriedma-quic

LGTM

…/EXTRACT_VECTOR_ELT (llvm#93357) DAGTypeLegalizer::SplitVecRes_INSERT_VECTOR_ELT and DAGTypeLegalizer::SplitVecRes_EXTRACT_VECTOR_ELT did not handle non byte-sized elements properly. In fact, it only dealt with elements smaller than 8 bits (as well as byte-sized elements). This patch generalizes the support for non byte-sized element by always widening the the vector elements to next "round integer type" (a power of 2 bit size). This should make sure that we can access a single element via a simple byte-addressed scalar load/store. Also removing a suspicious CustomLowerNode call from SplitVecRes_INSERT_VECTOR_ELT. Considering that it did not reset the Lo/Hi out arguments before the return I think that DAGTypeLegalizer::SplitVectorResult could be fooled into registering the input vector as being the result. This should however not have caused any problems since DAGTypeLegalizer::SplitVectorResult is doing the same CustomLowerNode call, making the code removed by this patch redundant.

llvmbot added backend:X86 llvm:SelectionDAG SelectionDAGISel as well labels May 25, 2024

bjope commented May 27, 2024

View reviewed changes

bjope requested review from topperc, arsenm and efriedma-quic May 27, 2024 18:01

llvmbot added backend:AArch64 backend:AMDGPU labels Jun 12, 2024

arsenm reviewed Jun 12, 2024

View reviewed changes

efriedma-quic approved these changes Jun 12, 2024

View reviewed changes

arsenm approved these changes Jun 12, 2024

View reviewed changes

bjope force-pushed the splitinsert branch from 0bad494 to 3732811 Compare June 13, 2024 07:28

bjope merged commit 445973c into llvm:main Jun 13, 2024
5 of 6 checks passed

davemgreen mentioned this pull request Aug 29, 2024

[llvm] Converting bits to bytes lost precision assertion in TargetLowering::getVectorSubVecPointer() #106345

Closed

[LegalizeTypes] Handle non byte-sized elt types when splitting INSERT/EXTRACT_VECTOR_ELT #93357

[LegalizeTypes] Handle non byte-sized elt types when splitting INSERT/EXTRACT_VECTOR_ELT #93357

Conversation

bjope commented May 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented May 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjope May 27, 2024

Choose a reason for hiding this comment

Uh oh!

efriedma-quic commented May 27, 2024

Uh oh!

bjope commented May 28, 2024

Uh oh!

arsenm commented May 29, 2024

Uh oh!

bjope commented May 29, 2024

Uh oh!

arsenm commented May 29, 2024

Uh oh!

bjope commented Jun 5, 2024

Uh oh!

efriedma-quic commented Jun 5, 2024

Uh oh!

bjope commented Jun 5, 2024

Uh oh!

arsenm commented Jun 6, 2024

Uh oh!

arsenm Jun 12, 2024

Choose a reason for hiding this comment

Uh oh!

bjope Jun 12, 2024

Choose a reason for hiding this comment

Uh oh!

efriedma-quic commented Jun 12, 2024

Uh oh!

bjope commented Jun 12, 2024

Uh oh!

efriedma-quic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bjope commented May 25, 2024 •

edited

Loading

llvmbot commented May 25, 2024 •

edited

Loading