
[AArch64] Add @llvm.experimental.vector.match #101974


Merged
merged 8 commits into llvm:main on Nov 14, 2024

Conversation

@rj-jesus (Contributor) commented Aug 5, 2024

This patch introduces an experimental intrinsic for matching the elements of one vector against the elements of another.

For AArch64 targets that support SVE2, the intrinsic lowers to a MATCH instruction for supported fixed-length and scalable vector types. The lowering can be improved a bit for fixed-length vectors, but this can be addressed later.
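
For illustration, a sketch of a scalable-vector call in the form added by this revision (which still carries the segment-size immediate), together with the SVE2 instruction the accompanying test expects it to lower to:

%r = call <vscale x 16 x i1> @llvm.experimental.vector.match(<vscale x 16 x i8> %op1, <vscale x 16 x i8> %op2, <vscale x 16 x i1> %mask, i32 16)
; With +sve2 this selects to: match p0.b, p0/z, z0.b, z1.b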

@llvmbot added the backend:AArch64, llvm:SelectionDAG, llvm:ir, and llvm:analysis labels on Aug 5, 2024
@llvmbot (Member) commented Aug 5, 2024

@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

This patch introduces an experimental intrinsic for matching the elements of one vector against the elements of another.

For AArch64 targets that support SVE2, the intrinsic lowers to a MATCH instruction for supported fixed-length and scalable vector types. The lowering can be improved a bit for fixed-length vectors, but this can be addressed later.


Full diff: https://github.com/llvm/llvm-project/pull/101974.diff

10 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+45)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+10)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+9)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+46)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+12)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+2)
  • (added) llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll (+57)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index b17e3c828ed3d..dd9851d1af078 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -19637,6 +19637,51 @@ are undefined.
     }
 
 
+'``llvm.experimental.vector.match.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic. Support for specific vector types is target
+dependent.
+
+::
+
+    declare <<n> x i1> @llvm.experimental.vector.match(<<n> x <ty>> %op1, <<n> x <ty>> %op2, <<n> x i1> %mask, i32 <segsize>)
+    declare <vscale x <n> x i1> @llvm.experimental.vector.match(<vscale x <n> x <ty>> %op1, <vscale x <n> x <ty>> %op2, <vscale x <n> x i1> %mask, i32 <segsize>)
+
+Overview:
+"""""""""
+
+Find elements of the first argument matching any elements of the second.
+
+Arguments:
+""""""""""
+
+The first argument is the search vector, the second argument is the vector of
+elements we are searching for (i.e. for which we consider a match successful),
+and the third argument is a mask that controls which elements of the first
+argument are active. The fourth argument is an immediate that sets the segment
+size for the search window.
+
+Semantics:
+""""""""""
+
+The '``llvm.experimental.vector.match``' intrinsic compares each element in the
+first argument against potentially several elements of the second, placing
+``1`` in the corresponding element of the output vector if any comparison is
+successful, and ``0`` otherwise. Inactive elements in the mask are set to ``0``
+in the output. The segment size controls the number of elements of the second
+argument that are compared against.
+
+For example, for vectors with 16 elements, if ``segsize = 16`` then each
+element of the first argument is compared against all 16 elements of the second
+argument; but if ``segsize = 4``, then each of the first four elements of the
+first argument is compared against the first four elements of the second
+argument, each of the second four elements of the first argument is compared
+against the second four elements of the second argument, and so forth.
+
 Matrix Intrinsics
 -----------------
 
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 38e8b9da21397..786c13a177ccf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1746,6 +1746,10 @@ class TargetTransformInfo {
   bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
                              Align Alignment) const;
 
+  /// \returns Returns true if the target supports vector match operations for
+  /// the vector type `VT` using a segment size of `SegSize`.
+  bool hasVectorMatch(VectorType *VT, unsigned SegSize) const;
+
   struct VPLegalization {
     enum VPTransform {
       // keep the predicating parameter
@@ -2184,6 +2188,7 @@ class TargetTransformInfo::Concept {
   virtual bool supportsScalableVectors() const = 0;
   virtual bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
                                      Align Alignment) const = 0;
+  virtual bool hasVectorMatch(VectorType *VT, unsigned SegSize) const = 0;
   virtual VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -2952,6 +2957,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.hasActiveVectorLength(Opcode, DataType, Alignment);
   }
 
+  bool hasVectorMatch(VectorType *VT, unsigned SegSize) const override {
+    return Impl.hasVectorMatch(VT, SegSize);
+  }
+
   VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
     return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index d208a710bb27f..36621861ab8c8 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -958,6 +958,8 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  bool hasVectorMatch(VectorType *VT, unsigned SegSize) const { return false; }
+
   TargetTransformInfo::VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const {
     return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index b4e758136b39f..f6d77aa596f60 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1892,6 +1892,16 @@ def int_experimental_vector_histogram_add : DefaultAttrsIntrinsic<[],
                                LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
                              [ IntrArgMemOnly ]>;
 
+// Experimental match
+def int_experimental_vector_match : DefaultAttrsIntrinsic<
+                             [ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty> ],
+                             [ llvm_anyvector_ty,
+                               LLVMMatchType<0>,
+                               LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,  // Mask
+                               llvm_i32_ty ],  // Segment size
+                             [ IntrNoMem, IntrNoSync, IntrWillReturn,
+                               ImmArg<ArgIndex<3>> ]>;
+
 // Operators
 let IntrProperties = [IntrNoMem, IntrNoSync, IntrWillReturn] in {
   // Integer arithmetic
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index dcde78925bfa9..d8314af0537fe 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1352,6 +1352,11 @@ bool TargetTransformInfo::hasActiveVectorLength(unsigned Opcode, Type *DataType,
   return TTIImpl->hasActiveVectorLength(Opcode, DataType, Alignment);
 }
 
+bool TargetTransformInfo::hasVectorMatch(VectorType *VT,
+                                         unsigned SegSize) const {
+  return TTIImpl->hasVectorMatch(VT, SegSize);
+}
+
 TargetTransformInfo::Concept::~Concept() = default;
 
 TargetIRAnalysis::TargetIRAnalysis() : TTICallback(&getDefaultTTI) {}
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d617c7acd13c..9cb7d65975b9f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8096,6 +8096,15 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
              DAG.getNode(ISD::EXTRACT_SUBVECTOR, sdl, ResultVT, Vec, Index));
     return;
   }
+  case Intrinsic::experimental_vector_match: {
+    auto *VT = dyn_cast<VectorType>(I.getOperand(0)->getType());
+    auto SegmentSize = cast<ConstantInt>(I.getOperand(3))->getLimitedValue();
+    const auto &TTI =
+        TLI.getTargetMachine().getTargetTransformInfo(*I.getFunction());
+    assert(VT && TTI.hasVectorMatch(VT, SegmentSize) && "Unsupported type!");
+    visitTargetIntrinsic(I, Intrinsic);
+    return;
+  }
   case Intrinsic::vector_reverse:
     visitVectorReverse(I);
     return;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7704321a0fc3a..050807142fc0a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -6106,6 +6106,51 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
         DAG.getNode(AArch64ISD::CTTZ_ELTS, dl, MVT::i64, CttzOp);
     return DAG.getZExtOrTrunc(NewCttzElts, dl, Op.getValueType());
   }
+  case Intrinsic::experimental_vector_match: {
+    SDValue ID =
+        DAG.getTargetConstant(Intrinsic::aarch64_sve_match, dl, MVT::i64);
+
+    auto Op1 = Op.getOperand(1);
+    auto Op2 = Op.getOperand(2);
+    auto Mask = Op.getOperand(3);
+    auto SegmentSize =
+        cast<ConstantSDNode>(Op.getOperand(4))->getLimitedValue();
+
+    EVT VT = Op.getValueType();
+    auto MinNumElts = VT.getVectorMinNumElements();
+
+    assert(Op1.getValueType() == Op2.getValueType() && "Type mismatch.");
+    assert(Op1.getValueSizeInBits().getKnownMinValue() == 128 &&
+           "Custom lower only works on 128-bit segments.");
+    assert((Op1.getValueType().getVectorElementType() == MVT::i8  ||
+            Op1.getValueType().getVectorElementType() == MVT::i16) &&
+           "Custom lower only supports 8-bit or 16-bit characters.");
+    assert(SegmentSize == MinNumElts && "Custom lower needs segment size to "
+                                        "match minimum number of elements.");
+
+    if (VT.isScalableVector())
+      return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, ID, Mask, Op1, Op2);
+
+    // We can use the SVE2 match instruction to lower this intrinsic by
+    // converting the operands to scalable vectors, doing a match, and then
+    // extracting a fixed-width subvector from the scalable vector.
+
+    EVT OpVT = Op1.getValueType();
+    EVT OpContainerVT = getContainerForFixedLengthVector(DAG, OpVT);
+    EVT MatchVT = OpContainerVT.changeElementType(MVT::i1);
+
+    auto ScalableOp1 = convertToScalableVector(DAG, OpContainerVT, Op1);
+    auto ScalableOp2 = convertToScalableVector(DAG, OpContainerVT, Op2);
+    auto ScalableMask = DAG.getNode(ISD::SIGN_EXTEND, dl, OpVT, Mask);
+    ScalableMask = convertFixedMaskToScalableVector(ScalableMask, DAG);
+
+    SDValue Match = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MatchVT, ID,
+                                ScalableMask, ScalableOp1, ScalableOp2);
+
+    return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, OpVT,
+                       DAG.getNode(ISD::SIGN_EXTEND, dl, OpContainerVT, Match),
+                       DAG.getVectorIdxConstant(0, dl));
+  }
   }
 }
 
@@ -26544,6 +26589,7 @@ void AArch64TargetLowering::ReplaceNodeResults(
       Results.push_back(DAG.getNode(ISD::TRUNCATE, DL, VT, V));
       return;
     }
+    case Intrinsic::experimental_vector_match:
     case Intrinsic::get_active_lane_mask: {
       if (!VT.isFixedLengthVector() || VT.getVectorElementType() != MVT::i1)
         return;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index b8f19fa87e2ab..806dc856c5862 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3835,6 +3835,18 @@ bool AArch64TTIImpl::isLegalToVectorizeReduction(
   }
 }
 
+bool AArch64TTIImpl::hasVectorMatch(VectorType *VT, unsigned SegSize) const {
+  // Check that the target has SVE2 (and SVE is available), that `VT' is a
+  // legal type for MATCH, and that the segment size is 128-bit.
+  if (ST->hasSVE2() && ST->isSVEAvailable() &&
+      VT->getPrimitiveSizeInBits().getKnownMinValue() == 128 &&
+      VT->getElementCount().getKnownMinValue() == SegSize &&
+      (VT->getElementCount().getKnownMinValue() ==  8 ||
+       VT->getElementCount().getKnownMinValue() == 16))
+    return true;
+  return false;
+}
+
 InstructionCost
 AArch64TTIImpl::getMinMaxReductionCost(Intrinsic::ID IID, VectorType *Ty,
                                        FastMathFlags FMF,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index a9189fd53f40b..6ad21a9e0a77a 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -391,6 +391,8 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
     return ST->hasSVE();
   }
 
+  bool hasVectorMatch(VectorType *VT, unsigned SegSize) const;
+
   InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
                                              std::optional<FastMathFlags> FMF,
                                              TTI::TargetCostKind CostKind);
diff --git a/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll b/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll
new file mode 100644
index 0000000000000..0df92dfa80000
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll
@@ -0,0 +1,57 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=aarch64 < %s -o - | FileCheck %s
+
+define <vscale x 16 x i1> @match_nxv16i8(<vscale x 16 x i8> %op1, <vscale x 16 x i8> %op2, <vscale x 16 x i1> %mask) #0 {
+; CHECK-LABEL: match_nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    match p0.b, p0/z, z0.b, z1.b
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 16 x i1> @llvm.experimental.vector.match(<vscale x 16 x i8> %op1, <vscale x 16 x i8> %op2, <vscale x 16 x i1> %mask, i32 16)
+  ret <vscale x 16 x i1> %r
+}
+
+define <vscale x 8 x i1> @match_nxv8i16(<vscale x 8 x i16> %op1, <vscale x 8 x i16> %op2, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: match_nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    match p0.h, p0/z, z0.h, z1.h
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 8 x i1> @llvm.experimental.vector.match(<vscale x 8 x i16> %op1, <vscale x 8 x i16> %op2, <vscale x 8 x i1> %mask, i32 8)
+  ret <vscale x 8 x i1> %r
+}
+
+define <16 x i1> @match_v16i8(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask) #0 {
+; CHECK-LABEL: match_v16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    shl v2.16b, v2.16b, #7
+; CHECK-NEXT:    ptrue p0.b, vl16
+; CHECK-NEXT:    // kill: def $q1 killed $q1 def $z1
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    cmlt v2.16b, v2.16b, #0
+; CHECK-NEXT:    cmpne p0.b, p0/z, z2.b, #0
+; CHECK-NEXT:    match p0.b, p0/z, z0.b, z1.b
+; CHECK-NEXT:    mov z0.b, p0/z, #-1 // =0xffffffffffffffff
+; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
+; CHECK-NEXT:    ret
+  %r = tail call <16 x i1> @llvm.experimental.vector.match(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask, i32 16)
+  ret <16 x i1> %r
+}
+
+define <8 x i1> @match_v8i16(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask) #0 {
+; CHECK-LABEL: match_v8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ushll v2.8h, v2.8b, #0
+; CHECK-NEXT:    ptrue p0.h, vl8
+; CHECK-NEXT:    // kill: def $q1 killed $q1 def $z1
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    shl v2.8h, v2.8h, #15
+; CHECK-NEXT:    cmlt v2.8h, v2.8h, #0
+; CHECK-NEXT:    cmpne p0.h, p0/z, z2.h, #0
+; CHECK-NEXT:    match p0.h, p0/z, z0.h, z1.h
+; CHECK-NEXT:    mov z0.h, p0/z, #-1 // =0xffffffffffffffff
+; CHECK-NEXT:    xtn v0.8b, v0.8h
+; CHECK-NEXT:    ret
+  %r = tail call <8 x i1> @llvm.experimental.vector.match(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask, i32 8)
+  ret <8 x i1> %r
+}
+
+attributes #0 = { "target-features"="+sve2" }
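
To illustrate the segment-size semantics spelled out in the LangRef change above, a sketch of a fixed-length call using a smaller segment (operand names are illustrative):

%r = call <16 x i1> @llvm.experimental.vector.match(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask, i32 4)
; segsize = 4: %op1[0..3] are compared against %op2[0..3], %op1[4..7] against
; %op2[4..7], and so on; lanes whose mask bit is 0 yield 0 in %r.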

github-actions bot commented Aug 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@david-arm (Contributor) left a comment

Thanks for this! I've had a quick look and have a few thoughts so far.

@rj-jesus (Contributor, Author):

Thanks for this! I've had a quick look and have a few thoughts so far.

Thank you, please let me know if you have any other comments!

@rj-jesus (Contributor, Author):

Are there any updates on this?

@david-arm If you think that's better we can start by implementing the intrinsic only for fixed vector types and extend it to scalable vectors later. This should still work for #101976, though it may add a bit of extra work. What do you think?

@david-arm (Contributor):

Are there any updates on this?

@david-arm If you think that's better we can start by implementing the intrinsic only for fixed vector types and extend it to scalable vectors later. This should still work for #101976, though it may add a bit of extra work. What do you think?

I'm so sorry I haven't had much time to look at this. I was away for 3+ weeks in August and since then have been struggling to find time due to other commitments. I promise to take a look at this in the next few days and get back to you with more useful comments!

@rj-jesus (Contributor, Author):

I'm so sorry I haven't had much time to look at this. I was away for 3+ weeks in August and since then have been struggling to find time due to other commitments. I promise to take a look at this in the next few days and get back to you with more useful comments!

No worries at all, thank you very much for the update!

@david-arm (Contributor) left a comment

My biggest concern about this patch is that the intrinsic looks too SVE-specific to me to be a general intrinsic. For the actual problem you want to solve using SVE's match instruction, what's the bare minimum you need to do? For example, do you really need multiple search segments? Consider a loop like this:

for (unsigned i = 0; i < n; i++) {
  Found[i] = false;
  for (unsigned j = 0; j < 4; j++) {
    if (vec[i] == search[j]) {
      Found[i] = true;
      break;
    }
  }
}

In the example above, every element of vec is always compared against the same 4 elements of search, so effectively there is only one segment to search. That would mean the generic intrinsic only needs to talk about searching the first four elements of the second vector, which significantly reduces the complexity and makes it easier to provide generic lowering and to match to other targets.

@rj-jesus (Contributor, Author):

Hi @david-arm, thank you for your feedback; any help to make the intrinsic more general and to simplify it is very welcome! Your example is very close to what I had in mind: in general the bounds of the search array don't have to be known at compile time, but other than that, that's the idea.

For such cases, indeed we don't need multiple search segments. They are currently there to facilitate the efficient lowering to MATCH, but if we wanted to remove them we could:

  1. Only implement the intrinsic for fixed-length vectors.
  2. Restrict the second argument (the search array) to fixed-length, but let the other vector arguments be fixed or scalable.

The comparisons would be performed in an "all-to-all" fashion in either case.

The reason I didn't implement this to begin with is that this can make lowering slightly less efficient. For example, for the second option, we'd need to do a broadcast before doing a MATCH, but we can work around this separately if needed.

Is either of these options what you had in mind?

@david-arm (Contributor):

if we wanted to remove them we could:

  1. Only implement the intrinsic for fixed-length vectors.
  2. Restrict the second argument (the search array) to fixed-length, but let the other vector arguments be fixed or scalable.

Yeah, I think the second option sounds good to me. If you do it this way you can remove the segsize argument too, because the size of the segment is defined by the number of elements in the second vector argument. I agree that puts more pressure on the backend to lower this efficiently, but even with the previous version of the intrinsic we still have to do work to splat the segments across the whole vector at the IR level. I'd hope that the MachineLICM pass would hoist out any loop invariants!
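
For concreteness, a sketch of the declaration this suggestion leads to, with a fixed-length needle and no segsize operand (the overload naming here is illustrative; it mirrors the call used later in this thread):

declare <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %op1, <16 x i8> %op2, <vscale x 16 x i1> %mask)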

I do think it's more friendly to other targets and makes it easier to implement general lowering code for all targets too, i.e. something like

%search.lane.0 = extractelement <4 x i32> %search.vec, i32 0
%search.dup.0 = ... broadcast %search.lane.0 ...
%res.0 = icmp eq <vscale x 4 x i32> %vec, %search.dup.0
...
%res.1 = icmp eq <vscale x 4 x i32> %vec, %search.dup.1
...
%res.pt1 = or <vscale x 4 x i1> %res.0, %res.1
%res.pt2 = or <vscale x 4 x i1> %res.2, %res.3
%res = or <vscale x 4 x i1> %res.pt1, %res.pt2

For SVE I suppose you'd have to be clever and use a combination of 64-bit DUPs and ZIPs or something like that to generate the repeating 128-bit pattern!

@rj-jesus (Contributor, Author):

Yeah, I think the second option sounds good to me. If you do it this way you can remove the segsize argument too, because the size of the segment is defined by the number of elements in the second vector argument.

Yep, I think this is a good compromise. Thanks for the feedback, I'll implement this approach!

@rj-jesus force-pushed the rjj/aarch64-experimental-vector-match branch from 13bb990 to 9df5ff0 on October 7, 2024 14:12
@rj-jesus (Contributor, Author) commented Oct 7, 2024

Hi @david-arm, in the latest revision of the patch I've provided general lowering for the intrinsic and removed the segsize parameter. Would you mind having a look when you have the chance, please?

@david-arm (Contributor) left a comment

Thanks a lot for making these changes - it looks a lot nicer now! I haven't reviewed the whole patch yet, but I've left comments I have so far.

@david-arm (Contributor) left a comment

Hi, I've finished reviewing now, and added a few more comments!

This patch introduces an experimental intrinsic for matching the
elements of one vector against the elements of another.

For AArch64 targets that support SVE2, it lowers to a MATCH instruction
for supported fixed and scalable types. Otherwise, the intrinsic has
generic lowering in SelectionDAGBuilder.
Also address other review comments.
@rj-jesus force-pushed the rjj/aarch64-experimental-vector-match branch from 9df5ff0 to 3f9398d on October 23, 2024 08:36
@rj-jesus (Contributor, Author) left a comment

Thanks very much for the feedback @david-arm!
I've addressed most of your comments, except for the TTI vs TLI hook and cost model. As for the cost model, I thought the decision to use the intrinsic would effectively be binary - unless the hardware supports MATCH (or similar), I doubt we'd want to use the generic lowering. I'm happy to add this though if you think it would be useful. What do you think?

@david-arm (Contributor) left a comment

The new tests you've added look good - thanks! I just had a few more minor comments ...

@rj-jesus (Contributor, Author):

Thanks very much for the feedback, @david-arm. I've updated the patch with your suggestions - please let me know if you have any other comments!

@paulwalker-arm (Collaborator):

Sorry I'm late to the conversation, but please can you elaborate on the need for a dedicated intrinsic? The default expansion for the ISD node shows LLVM IR can already represent the logic using a handful of existing operations. I concede "handful" starts to break down as the needle vector grows, but are you anticipating needle vectors large enough to make the IR matching hard?

A better path forward is to add the LLVM transformation you're interested in first using stock LLVM IR and let the existing optimisers at it, which will provide the necessary context for any next steps. For example, the proposed intrinsic requires a fixed-length needle vector, and my immediate thought was whether making the needle a scalar would be sufficient to simplify the IR so that it can be matched to the desired instruction (assuming the IR from step 1 isn't already simple enough to match).

My worry here is that it looks like we're starting with the desired instruction and working backwards in a way that, to my eye, still looks target specific. So whilst I'm not against having a dedicated intrinsic, I'd like to ensure it is absolutely necessary and is as relaxed/simple as possible.

@rj-jesus (Contributor, Author):

Hi @paulwalker-arm, I think we shouldn't necessarily assume the needle vectors will be small. I was planning to use the intrinsic to help with vectorising std::find_first_of. Since MATCH supports up to 16 simultaneous comparisons per segment, I can see us wanting to support needle vectors with up to 16 elements. For these cases, with stock LLVM IR, we'd need around 16 splats, 16 CMEQs, and 15 ORRs. While it's feasible to match this later and lower it to a MATCH, I think it might be a bit cumbersome to do so (more so if we consider the multiple ways in which the ORR reduction chains may be implemented). I'm not opposed to this, but I think it might not be that straightforward. What do you think? Am I missing anything?

@paulwalker-arm (Collaborator):

I think we're on the same page and you may be correct in concluding the stock IR route will be cumbersome, but I think we should get to the point of knowing it to be true before picking a solution because otherwise we cannot be sure it's the correct solution.

The splats are trivial to match (especially if they're constant). Likewise, an OR tree with matching roots doesn't seem that hard. It's the predication that most worries me because that will likely mean instruction ordering matters, which can make matching awkward.

@rj-jesus (Contributor, Author):

I think we're on the same page and you may be correct in concluding the stock IR route will be cumbersome, but I think we should get to the point of knowing it to be true before picking a solution because otherwise we cannot be sure it's the correct solution.

Sure, but just to be clear, what would you like to see? Would it help to see the IR we'd be generating, both with the intrinsic and with stock IR, for something like this (https://godbolt.org/z/1bhvonGvj):

char* find_first_byte(char *first, char *last, char *s_first, char *s_last)
{
  for (; first != last; ++first)
    for (char *it = s_first; it != s_last; ++it)
      if (*first == *it)
        return first;
  return last;
}


Also, if you think this would help make the intrinsic slightly more general, one thing we could do is remove the mask parameter. The mask only affects the first operand (the "haystack"), so we could proxy its functionality with an AND after the intrinsic call (as we currently do at the end of the generic lowering), roughly as sketched below. I'm not sure this would have a big impact overall, though.

Please let me know what you think. I'm happy to get the IR if you think that will help move this forward.
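
A minimal sketch of the hypothetical mask-less variant described above (not what the patch currently implements):

%m = call <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %op1, <16 x i8> %op2)
%r = and <vscale x 16 x i1> %m, %mask   ; apply the haystack mask afterwards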

@paulwalker-arm (Collaborator):

Sure, but just to be clear, what would you like to see?

I'm assuming the new intrinsic requires a matching transformation to generate it, so preferably I'd like the transformation to already exist in LLVM. Given that's only worth doing if there's a performance benefit, I ask: is enabling vectorisation of the find loop enough to see improved performance, or does that only happen when an operation equivalent to SVE's match instruction is available?

If only the latter is true then yes, seeing representative IR will be very useful. However, if the transformation is generically useful then why not implement that first?

@rj-jesus (Contributor, Author):

If only the latter is true then yes seeing representative IR will be very useful. However, if the transformation is generically useful then why not implement that first?

Sorry, maybe I wasn't very clear, but the transformation is in #101976. It does need rebasing to the latest version of the intrinsic, but I was letting the intrinsic settle to hopefully avoid too much back and forth. I'll get back to it soon.

I ask "Is enabling vectorisation of the find loop enough to see improved performance? or does that only happen when an operation equivalent to SVE's match instruction is available?

I haven't benchmarked with stock IR, but yes, I expect the latter. With SVE2 MATCH we can get very neat speedups (200x+) in some cases, simply because you can do up to 16x16 byte compares in a single instruction with relatively low latency (at least on Neoverse V2).

I can get the stock IR though if you'd still like to have a look at it?

@paulwalker-arm (Collaborator):

I can get the stock IR though if you'd still like to have a look at it?

Yes please.

200x+/16 is still more than one, which suggests it could be generically useful? Are you able to name/provide the benchmark(s) used?

@rj-jesus (Contributor, Author) commented Oct 30, 2024

Sounds good, I'll get the stock IR.

200x+/16 is still more than one, which suggests it could be generically useful? Are you able to name/provide the benchmark(s) used?

Yep, that's true, but in that case we can still use the intrinsic and fall back to generic lowering, right? I'll check on the benchmarks.

@paulwalker-arm (Collaborator) left a comment

To help things along I've reviewed the PR based on the assumption this intrinsic will be the way to go. In general it looks pretty nice but I do think you'll want to tighten the intrinsic definition a little more.


// Wrap the operands.
Op1 = convertToScalableVector(DAG, OpContainerVT, Op1);
Mask = DAG.getNode(ISD::ANY_EXTEND, dl, Op1VT, Mask);
Collaborator:

You cannot any_extend a mask because convertFixedMaskToScalableVector() expects a boolean vector result (see getBooleanContents()). The any_extend makes the top bits undefined, and thus the compare performed by convertFixedMaskToScalableVector() is also undefined. For AArch64, getBooleanContents() is either all zeros or all ones, which ISD::SIGN_EXTEND will maintain.

Contributor Author:

That makes sense, thank you for catching this. I was trying to avoid the ugly codegen here, which is also why I thought maybe we could remove the mask parameter altogether and avoid problems like these.

I think we use an any_extend here, or is this a different case?

Collaborator:

Ouch, yep that code is broken as well. I'll see about getting it fixed.

Contributor Author:

Thanks for double-checking and fixing it!

@rj-jesus (Contributor, Author) commented Oct 30, 2024

Thanks very much for the review, I'll go through it tomorrow!

In the meantime, I've also got the IR for the C snippet I sent through earlier (based on this test case):

define ptr @first_byte_of(ptr %0, ptr %1, ptr %2, ptr %3) {
  %5 = icmp eq ptr %0, %1
  %6 = icmp eq ptr %2, %3
  %7 = or i1 %5, %6
  br i1 %7, label %.loopexit1, label %.preheader

.preheader:                                       ; preds = %4, %77
  %pa = phi ptr [ %8, %77 ], [ %0, %4 ]
  %8 = getelementptr i8, ptr %pa, i64 16
  %9 = icmp ult ptr %8, %1
  %10 = select i1 %9, ptr %8, ptr %1
  %11 = ptrtoint ptr %pa to i64
  %12 = ptrtoint ptr %10 to i64
  %13 = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %11, i64 %12)
  %14 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %pa, i32 1, <vscale x 16 x i1> %13, <vscale x 16 x i8> zeroinitializer)
  br label %15

15:                                               ; preds = %76, %.preheader
  %pb = phi ptr [ %2, %.preheader ], [ %16, %76 ]
  %16 = getelementptr i8, ptr %pb, i64 16
  %17 = icmp ult ptr %16, %3
  %18 = select i1 %17, ptr %16, ptr %3
  %19 = ptrtoint ptr %pb to i64
  %20 = ptrtoint ptr %18 to i64
  %21 = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %19, i64 %20)
  %22 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %pb, i32 1, <vscale x 16 x i1> %21, <vscale x 16 x i8> zeroinitializer)
  %23 = extractelement <vscale x 16 x i8> %22, i64 0
  %.splatinsert = insertelement <vscale x 16 x i8> poison, i8 %23, i64 0
  %.splat = shufflevector <vscale x 16 x i8> %.splatinsert, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
  %24 = select <vscale x 16 x i1> %21, <vscale x 16 x i8> %22, <vscale x 16 x i8> %.splat

  ; <<<<< Stock IR

  %25 = extractelement <vscale x 16 x i8> %24, i64 0
  %.splatinsert3 = insertelement <vscale x 16 x i8> poison, i8 %25, i64 0
  %.splat4 = shufflevector <vscale x 16 x i8> %.splatinsert3, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
  %26 = icmp eq <vscale x 16 x i8> %14, %.splat4

  %27 = extractelement <vscale x 16 x i8> %24, i64 1
  %.splatinsert5 = insertelement <vscale x 16 x i8> poison, i8 %27, i64 0
  %.splat6 = shufflevector <vscale x 16 x i8> %.splatinsert5, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
  %28 = icmp eq <vscale x 16 x i8> %14, %.splat6
  %29 = or <vscale x 16 x i1> %26, %28

  ; the above 13 more times

  %69 = extractelement <vscale x 16 x i8> %24, i64 15
  %.splatinsert33 = insertelement <vscale x 16 x i8> poison, i8 %69, i64 0
  %.splat34 = shufflevector <vscale x 16 x i8> %.splatinsert33, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
  %70 = icmp eq <vscale x 16 x i8> %14, %.splat34
  %71 = or <vscale x 16 x i1> %68, %70

  %72 = and <vscale x 16 x i1> %71, %13

  ; ===== Or, with @llvm.experimental.vector.match

  %25 = tail call <16 x i8> @llvm.vector.extract(<vscale x 16 x i8> %24, i64 0)
  %72 = tail call <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %14, <16 x i8> %25, <vscale x 16 x i1> %13)

  ; >>>>>

  %73 = tail call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %72)
  br i1 %73, label %.loopexit, label %76

.loopexit:                                        ; preds = %15
  %74 = tail call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> %72, i1 true)
  %75 = getelementptr i8, ptr %pa, i64 %74
  br label %.loopexit1

76:                                               ; preds = %15
  br i1 %17, label %15, label %77

77:                                               ; preds = %76
  br i1 %9, label %.preheader, label %.loopexit1

.loopexit1:                                       ; preds = %77, %.loopexit, %4
  %78 = phi ptr [ %1, %4 ], [ %75, %.loopexit ], [ %1, %77 ]
  ret ptr %78
}

The matching part of the stock IR would lower to:

	mov	z2.b, b1
	cmpeq	p2.b, p0/z, z0.b, z2.b
	mov	z2.b, z1.b[1]
	cmpeq	p3.b, p0/z, z0.b, z2.b
	mov	z2.b, z1.b[2]
	sel	p2.b, p2, p2.b, p3.b
	cmpeq	p3.b, p0/z, z0.b, z2.b
	mov	z2.b, z1.b[3]
	sel	p2.b, p2, p2.b, p3.b
        ; ... about 48 instructions in total

Meanwhile, for the intrinsic version, we'd get the single MATCH instruction.

I can commit these two versions to #101976 if you'd like to give them a go yourself? Just let me know if that would be useful, otherwise I'll do it in one go after going through your feedback.

Thanks again!

@rj-jesus (Contributor, Author):

Hi @paulwalker-arm, I believe I've addressed your comments from yesterday and left some of my own. Thanks very much again for the feedback! Please let me know if you have any other questions.

@paulwalker-arm (Collaborator):

Thanks @rj-jesus. Just wanted to let you know that it'll likely be Tuesday next week before I get a chance to revisit this PR.

@rj-jesus (Contributor, Author) commented Nov 1, 2024

Thanks @paulwalker-arm, no worries at all! If you have any other questions in the meantime just let me know.

@paulwalker-arm (Collaborator):

Thanks for the IR @rj-jesus and thanks for your patience. From a pure "can we reliably match this IR" point of view I'd still answer yes. However, I see your use-case requires loading a needle vector, which adds additional complexity to track where the needles are coming from in order to reuse the original vector. Given this niggle and the significant performance loss that would result from failing to match the sequence, I'm happy to agree there's enough value in having a dedicated intrinsic. Given the potential size of the IR sequence replaced I guess there may also be compile time benefits and it'll certainly be easier to cost model.

Not related to this, but looking at the IR I notice you're effectively doing a fixed-length load into a scalable vector for the needle. I just wondered whether you have considered having the main loop load directly into a fixed-length vector, with a tail loop for the part that requires a masked load. Doing this would allow you to emit ld1rq instructions whose result can be used directly by the match instruction.
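
A sketch of that suggestion, reusing names from the IR above (assumed shape, not taken from the patch):

; Main loop: load the needle directly as a fixed-length vector (eligible for
; ld1rq) and feed it straight into the match; keep the masked load for a tail
; loop that handles the final partial needle.
%needle = load <16 x i8>, ptr %pb, align 1
%matches = tail call <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %14, <16 x i8> %needle, <vscale x 16 x i1> %13)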

@rj-jesus (Contributor, Author) commented Nov 7, 2024

Hi @paulwalker-arm, no worries at all, and thank you very much for having a look! Once #115189 lands I'll rebase this patch and simplify LowerVectorMatch accordingly. In the meantime, if there's anything you'd like to see done here, please let me know.

You're right. I believe that when I tried with separate main and tail loops, I wasn't fully convinced that the performance difference was worth the extra code size and complexity, but I'll have a look again since I can't really recall the numbers.

@paulwalker-arm (Collaborator) left a comment

A few more comments but in general this is looking pretty good to my eye.

@rj-jesus (Contributor, Author) commented Nov 8, 2024

Thank you very much for the feedback @paulwalker-arm, I'll address your comments and update the PR early next week. Cheers!

Decide to broadcast Op2 based on the result of `Op2VT.is128BitVector()`
rather than by comparing its number of elements with Op1. This fixes a
bug with search and needle vectors of `v8i8`, and enables us to match
search vectors of `v8i8` with needle vectors of `v16i8`.

Also address other review comments.
@rj-jesus (Contributor, Author):

Many thanks for your feedback @paulwalker-arm, I've made the changes you suggested and modified the logic for broadcasting Op2 a bit.
Please let me know if you have any other comments.

@paulwalker-arm (Collaborator) left a comment

Minor request but otherwise this looks good to me.

@rj-jesus (Contributor, Author):

Thanks very much for your feedback @paulwalker-arm and @david-arm!
I'll commit this early tomorrow if no one else has any other comments.

@rj-jesus merged commit e52238b into llvm:main on Nov 14, 2024 (9 checks passed).
@rj-jesus deleted the rjj/aarch64-experimental-vector-match branch on November 14, 2024 11:09.