[AArch64] Add @llvm.experimental.vector.match #101974
Conversation
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

This patch introduces an experimental intrinsic for matching the elements of one vector against the elements of another. For AArch64 targets that support SVE2, the intrinsic lowers to a MATCH instruction for supported fixed and scalable vector types. The lowering can be improved a bit for fixed-length vectors, but this can be addressed later.

Full diff: https://github.com/llvm/llvm-project/pull/101974.diff

10 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index b17e3c828ed3d..dd9851d1af078 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -19637,6 +19637,51 @@ are undefined.
}
+'``llvm.experimental.vector.match.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic. Support for specific vector types is target
+dependent.
+
+::
+
+ declare <<n> x i1> @llvm.experimental.vector.match(<<n> x <ty>> %op1, <<n> x <ty>> %op2, <<n> x i1> %mask, i32 <segsize>)
+ declare <vscale x <n> x i1> @llvm.experimental.vector.match(<vscale x <n> x <ty>> %op1, <vscale x <n> x <ty>> %op2, <vscale x <n> x i1> %mask, i32 <segsize>)
+
+Overview:
+"""""""""
+
+Find elements of the first argument matching any elements of the second.
+
+Arguments:
+""""""""""
+
+The first argument is the search vector, the second argument is the vector of
+elements we are searching for (i.e. for which we consider a match successful),
+and the third argument is a mask that controls which elements of the first
+argument are active. The fourth argument is an immediate that sets the segment
+size for the search window.
+
+Semantics:
+""""""""""
+
+The '``llvm.experimental.vector.match``' intrinsic compares each element in the
+first argument against potentially several elements of the second, placing
+``1`` in the corresponding element of the output vector if any comparison is
+successful, and ``0`` otherwise. Inactive elements in the mask are set to ``0``
+in the output. The segment size controls the number of elements of the second
+argument that are compared against.
+
+For example, for vectors with 16 elements, if ``segsize = 16`` then each
+element of the first argument is compared against all 16 elements of the second
+argument; but if ``segsize = 4``, then each of the first four elements of the
+first argument is compared against the first four elements of the second
+argument, each of the second four elements of the first argument is compared
+against the second four elements of the second argument, and so forth.
+
Matrix Intrinsics
-----------------
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 38e8b9da21397..786c13a177ccf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1746,6 +1746,10 @@ class TargetTransformInfo {
bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
Align Alignment) const;
+ /// \returns Returns true if the target supports vector match operations for
+ /// the vector type `VT` using a segment size of `SegSize`.
+ bool hasVectorMatch(VectorType *VT, unsigned SegSize) const;
+
struct VPLegalization {
enum VPTransform {
// keep the predicating parameter
@@ -2184,6 +2188,7 @@ class TargetTransformInfo::Concept {
virtual bool supportsScalableVectors() const = 0;
virtual bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
Align Alignment) const = 0;
+ virtual bool hasVectorMatch(VectorType *VT, unsigned SegSize) const = 0;
virtual VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -2952,6 +2957,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
return Impl.hasActiveVectorLength(Opcode, DataType, Alignment);
}
+ bool hasVectorMatch(VectorType *VT, unsigned SegSize) const override {
+ return Impl.hasVectorMatch(VT, SegSize);
+ }
+
VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index d208a710bb27f..36621861ab8c8 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -958,6 +958,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ bool hasVectorMatch(VectorType *VT, unsigned SegSize) const { return false; }
+
TargetTransformInfo::VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const {
return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index b4e758136b39f..f6d77aa596f60 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1892,6 +1892,16 @@ def int_experimental_vector_histogram_add : DefaultAttrsIntrinsic<[],
LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
[ IntrArgMemOnly ]>;
+// Experimental match
+def int_experimental_vector_match : DefaultAttrsIntrinsic<
+ [ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty> ],
+ [ llvm_anyvector_ty,
+ LLVMMatchType<0>,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, // Mask
+ llvm_i32_ty ], // Segment size
+ [ IntrNoMem, IntrNoSync, IntrWillReturn,
+ ImmArg<ArgIndex<3>> ]>;
+
// Operators
let IntrProperties = [IntrNoMem, IntrNoSync, IntrWillReturn] in {
// Integer arithmetic
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index dcde78925bfa9..d8314af0537fe 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1352,6 +1352,11 @@ bool TargetTransformInfo::hasActiveVectorLength(unsigned Opcode, Type *DataType,
return TTIImpl->hasActiveVectorLength(Opcode, DataType, Alignment);
}
+bool TargetTransformInfo::hasVectorMatch(VectorType *VT,
+ unsigned SegSize) const {
+ return TTIImpl->hasVectorMatch(VT, SegSize);
+}
+
TargetTransformInfo::Concept::~Concept() = default;
TargetIRAnalysis::TargetIRAnalysis() : TTICallback(&getDefaultTTI) {}
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d617c7acd13c..9cb7d65975b9f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8096,6 +8096,15 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
DAG.getNode(ISD::EXTRACT_SUBVECTOR, sdl, ResultVT, Vec, Index));
return;
}
+ case Intrinsic::experimental_vector_match: {
+ auto *VT = dyn_cast<VectorType>(I.getOperand(0)->getType());
+ auto SegmentSize = cast<ConstantInt>(I.getOperand(3))->getLimitedValue();
+ const auto &TTI =
+ TLI.getTargetMachine().getTargetTransformInfo(*I.getFunction());
+ assert(VT && TTI.hasVectorMatch(VT, SegmentSize) && "Unsupported type!");
+ visitTargetIntrinsic(I, Intrinsic);
+ return;
+ }
case Intrinsic::vector_reverse:
visitVectorReverse(I);
return;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7704321a0fc3a..050807142fc0a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -6106,6 +6106,51 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
DAG.getNode(AArch64ISD::CTTZ_ELTS, dl, MVT::i64, CttzOp);
return DAG.getZExtOrTrunc(NewCttzElts, dl, Op.getValueType());
}
+ case Intrinsic::experimental_vector_match: {
+ SDValue ID =
+ DAG.getTargetConstant(Intrinsic::aarch64_sve_match, dl, MVT::i64);
+
+ auto Op1 = Op.getOperand(1);
+ auto Op2 = Op.getOperand(2);
+ auto Mask = Op.getOperand(3);
+ auto SegmentSize =
+ cast<ConstantSDNode>(Op.getOperand(4))->getLimitedValue();
+
+ EVT VT = Op.getValueType();
+ auto MinNumElts = VT.getVectorMinNumElements();
+
+ assert(Op1.getValueType() == Op2.getValueType() && "Type mismatch.");
+ assert(Op1.getValueSizeInBits().getKnownMinValue() == 128 &&
+ "Custom lower only works on 128-bit segments.");
+ assert((Op1.getValueType().getVectorElementType() == MVT::i8 ||
+ Op1.getValueType().getVectorElementType() == MVT::i16) &&
+ "Custom lower only supports 8-bit or 16-bit characters.");
+ assert(SegmentSize == MinNumElts && "Custom lower needs segment size to "
+ "match minimum number of elements.");
+
+ if (VT.isScalableVector())
+ return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, ID, Mask, Op1, Op2);
+
+ // We can use the SVE2 match instruction to lower this intrinsic by
+ // converting the operands to scalable vectors, doing a match, and then
+ // extracting a fixed-width subvector from the scalable vector.
+
+ EVT OpVT = Op1.getValueType();
+ EVT OpContainerVT = getContainerForFixedLengthVector(DAG, OpVT);
+ EVT MatchVT = OpContainerVT.changeElementType(MVT::i1);
+
+ auto ScalableOp1 = convertToScalableVector(DAG, OpContainerVT, Op1);
+ auto ScalableOp2 = convertToScalableVector(DAG, OpContainerVT, Op2);
+ auto ScalableMask = DAG.getNode(ISD::SIGN_EXTEND, dl, OpVT, Mask);
+ ScalableMask = convertFixedMaskToScalableVector(ScalableMask, DAG);
+
+ SDValue Match = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MatchVT, ID,
+ ScalableMask, ScalableOp1, ScalableOp2);
+
+ return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, OpVT,
+ DAG.getNode(ISD::SIGN_EXTEND, dl, OpContainerVT, Match),
+ DAG.getVectorIdxConstant(0, dl));
+ }
}
}
@@ -26544,6 +26589,7 @@ void AArch64TargetLowering::ReplaceNodeResults(
Results.push_back(DAG.getNode(ISD::TRUNCATE, DL, VT, V));
return;
}
+ case Intrinsic::experimental_vector_match:
case Intrinsic::get_active_lane_mask: {
if (!VT.isFixedLengthVector() || VT.getVectorElementType() != MVT::i1)
return;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index b8f19fa87e2ab..806dc856c5862 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3835,6 +3835,18 @@ bool AArch64TTIImpl::isLegalToVectorizeReduction(
}
}
+bool AArch64TTIImpl::hasVectorMatch(VectorType *VT, unsigned SegSize) const {
+ // Check that the target has SVE2 (and SVE is available), that `VT' is a
+ // legal type for MATCH, and that the segment size is 128-bit.
+ if (ST->hasSVE2() && ST->isSVEAvailable() &&
+ VT->getPrimitiveSizeInBits().getKnownMinValue() == 128 &&
+ VT->getElementCount().getKnownMinValue() == SegSize &&
+ (VT->getElementCount().getKnownMinValue() == 8 ||
+ VT->getElementCount().getKnownMinValue() == 16))
+ return true;
+ return false;
+}
+
InstructionCost
AArch64TTIImpl::getMinMaxReductionCost(Intrinsic::ID IID, VectorType *Ty,
FastMathFlags FMF,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index a9189fd53f40b..6ad21a9e0a77a 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -391,6 +391,8 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
return ST->hasSVE();
}
+ bool hasVectorMatch(VectorType *VT, unsigned SegSize) const;
+
InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
std::optional<FastMathFlags> FMF,
TTI::TargetCostKind CostKind);
diff --git a/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll b/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll
new file mode 100644
index 0000000000000..0df92dfa80000
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/intrinsic-vector-match-sve2.ll
@@ -0,0 +1,57 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=aarch64 < %s -o - | FileCheck %s
+
+define <vscale x 16 x i1> @match_nxv16i8(<vscale x 16 x i8> %op1, <vscale x 16 x i8> %op2, <vscale x 16 x i1> %mask) #0 {
+; CHECK-LABEL: match_nxv16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: match p0.b, p0/z, z0.b, z1.b
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 16 x i1> @llvm.experimental.vector.match(<vscale x 16 x i8> %op1, <vscale x 16 x i8> %op2, <vscale x 16 x i1> %mask, i32 16)
+ ret <vscale x 16 x i1> %r
+}
+
+define <vscale x 8 x i1> @match_nxv8i16(<vscale x 8 x i16> %op1, <vscale x 8 x i16> %op2, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: match_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: match p0.h, p0/z, z0.h, z1.h
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 8 x i1> @llvm.experimental.vector.match(<vscale x 8 x i16> %op1, <vscale x 8 x i16> %op2, <vscale x 8 x i1> %mask, i32 8)
+ ret <vscale x 8 x i1> %r
+}
+
+define <16 x i1> @match_v16i8(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask) #0 {
+; CHECK-LABEL: match_v16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: shl v2.16b, v2.16b, #7
+; CHECK-NEXT: ptrue p0.b, vl16
+; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
+; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT: cmlt v2.16b, v2.16b, #0
+; CHECK-NEXT: cmpne p0.b, p0/z, z2.b, #0
+; CHECK-NEXT: match p0.b, p0/z, z0.b, z1.b
+; CHECK-NEXT: mov z0.b, p0/z, #-1 // =0xffffffffffffffff
+; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
+; CHECK-NEXT: ret
+ %r = tail call <16 x i1> @llvm.experimental.vector.match(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask, i32 16)
+ ret <16 x i1> %r
+}
+
+define <8 x i1> @match_v8i16(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask) #0 {
+; CHECK-LABEL: match_v8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ushll v2.8h, v2.8b, #0
+; CHECK-NEXT: ptrue p0.h, vl8
+; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
+; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT: shl v2.8h, v2.8h, #15
+; CHECK-NEXT: cmlt v2.8h, v2.8h, #0
+; CHECK-NEXT: cmpne p0.h, p0/z, z2.h, #0
+; CHECK-NEXT: match p0.h, p0/z, z0.h, z1.h
+; CHECK-NEXT: mov z0.h, p0/z, #-1 // =0xffffffffffffffff
+; CHECK-NEXT: xtn v0.8b, v0.8h
+; CHECK-NEXT: ret
+ %r = tail call <8 x i1> @llvm.experimental.vector.match(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask, i32 8)
+ ret <8 x i1> %r
+}
+
+attributes #0 = { "target-features"="+sve2" }
✅ With the latest revision this PR passed the C/C++ code formatter.
Thanks for this! I've had a quick look and have a few thoughts so far.
Thank you, please let me know if you have any other comments!
Are there any updates on this?

@david-arm If you think that's better, we can start by implementing the intrinsic only for fixed vector types and extend it to scalable vectors later. This should still work for #101976, though it may add a bit of extra work. What do you think?

I'm so sorry I haven't had much time to look at this. I was away for 3+ weeks in August and since then have been struggling to find time due to other commitments. I promise to take a look at this in the next few days and get back to you with more useful comments!

No worries at all, thank you very much for the update!
My biggest concern about this patch is that the intrinsic looks too SVE-specific to me to be a general intrinsic. For the actual problem you want to solve using SVE's match instruction, what's the bare minimum you need to do? For example, do you really need multiple search segments? Consider a loop like this:
for (unsigned i = 0; i < n; i++) {
Found[i] = false;
for (unsigned j = 0; j < 4; j++) {
if (vec[i] == search[j]) {
Found[i] = true;
break;
}
}
}
In the example above, every element of vec is always being compared against the same 4 elements of search, so effectively there is only one segment to search. That would mean the generic intrinsic only needs to talk about searching the first four elements of the second vector, which significantly reduces the complexity and makes it easier to provide generic lowering and match to other targets.
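For illustration only (the value names and the exact overload here are hypothetical, and the signature anticipates the reduced form of the intrinsic discussed below), one vector's worth of the loop above could then be expressed as:

%haystack = load <16 x i8>, ptr %vec.ptr, align 1      ; 16 elements of vec
%needle   = load <4 x i8>, ptr %search.ptr, align 1    ; the 4 search elements
%found    = call <16 x i1> @llvm.experimental.vector.match(<16 x i8> %haystack, <4 x i8> %needle, <16 x i1> %active)

Here %found[i] is set if vec[i] equals any of the four search elements, for the lanes enabled by %active.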
Hi @david-arm, thank you for your feedback, any help to make the intrinsic more general and to simplify it is very welcome! Your example is very close to what I had in mind, although in general the bounds of the inner loop may not be known at compile time. For such cases, indeed we don't need multiple search segments. They are currently there to facilitate the efficient lowering to MATCH, but if we wanted to remove them we could either (1) keep the second operand at full width and splat the segment across it at the IR level, or (2) reduce the second operand to a single segment's worth of elements and have the backend broadcast it.

The comparisons would be performed in an "all-to-all" fashion in either case. The reason I didn't implement this to begin with is that this can make lowering slightly less efficient. For example, for the second option, we'd need to do a broadcast before doing a MATCH, but we can work around this separately if needed. Is either of these options what you had in mind?
Yeah, I think the second option sounds good to me. If you do it this way you can remove the segsize argument too, because the size of the segment is defined by the number of elements in the second vector argument. I agree that puts more pressure on the backend to lower this efficiently, but even with the previous version of the intrinsic we still have to do work to splat the segments across the whole vector at the IR level. I'd hope that the MachineLICM pass would hoist out any loop invariants! I do think it's more friendly to other targets and makes it easier to implement general lowering code for all targets too, i.e. something like
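Roughly, as an illustrative sketch (a 4-element needle and a 16-element search vector are assumed; value names are hypothetical):

%n0 = extractelement <4 x i8> %needle, i64 0
%s0 = insertelement <16 x i8> poison, i8 %n0, i64 0
%b0 = shufflevector <16 x i8> %s0, <16 x i8> poison, <16 x i32> zeroinitializer
%c0 = icmp eq <16 x i8> %haystack, %b0
; ... likewise for needle elements 1..3, giving %c1..%c3 ...
%or0 = or <16 x i1> %c0, %c1
%or1 = or <16 x i1> %or0, %c2
%or2 = or <16 x i1> %or1, %c3
%res = and <16 x i1> %or2, %mask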
For SVE I suppose you'd have to be clever and use a combination of 64-bit DUPs and ZIPs or something like that to generate the repeating 128-bit pattern!
Yep, I think this is a good compromise. Thanks for the feedback, I'll implement this approach!
Force-pushed 13bb990 to 9df5ff0.
Hi @david-arm, in the latest revision of the patch I've provided general lowering for the intrinsic and removed the segment size argument.
Thanks a lot for making these changes - it looks a lot nicer now! I haven't reviewed the whole patch yet, but I've left the comments I have so far.
Hi, I've finished reviewing now, and added a few more comments!
This patch introduces an experimental intrinsic for matching the elements of one vector against the elements of another. For AArch64 targets that support SVE2, it lowers to a MATCH instruction for supported fixed and scalable types. Otherwise, the intrinsic has generic lowering in SelectionDAGBuilder.
Also address other review comments.
Force-pushed 9df5ff0 to 3f9398d.
Thanks very much for the feedback @david-arm!
I've addressed most of your comments, except for the TTI vs TLI hook and cost model. As for the cost model, I thought the decision to use the intrinsic would effectively be binary - unless the hardware supports MATCH (or similar), I doubt we'd want to use the generic lowering. I'm happy to add this though if you think it would be useful. What do you think?
The new tests you've added look good - thanks! I just had a few more minor comments ...
Thanks very much for the feedback, @david-arm. I've updated the patch with your suggestions - please let me know if you have any other comments!
Sorry I'm late to the conversation, but please can you elaborate on the need for a dedicated intrinsic? The default expansion for the ISD node shows LLVM IR can already represent the logic using a handful of existing operations. I concede "handful" starts to break down as the needle vector grows, but are you anticipating needle vectors large enough to make the IR matching hard? A better path forward is to add the LLVM transformation you're interested in first using stock LLVM IR and let the existing optimisers at it, which will provide the necessary context for any next steps. For example, the proposed intrinsic requires a fixed-length needle vector, and my immediate thought was whether making the needle a scalar would be sufficient to simplify the IR so that it can be matched to the desired instruction (assuming the IR from step 1 isn't already simple enough to match). My worry here is that it looks like we're starting with the desired instruction and working backwards in a way that, to my eye, still looks target specific. So whilst I'm not against having a dedicated intrinsic, I'd like to ensure it is absolutely necessary and is as relaxed/simple as possible.
Hi @paulwalker-arm, I think we shouldn't necessarily assume the needle vectors will be small. I was planning to use the intrinsic to help with vectorising the search loops targeted in #101976.
I think we're on the same page and you may be correct in concluding the stock IR route will be cumbersome, but I think we should get to the point of knowing it to be true before picking a solution, because otherwise we cannot be sure it's the correct solution. The splats are trivial to match (especially if they're constant). Likewise, an OR tree with matching roots doesn't seem that hard. It's the predication that most worries me, because that will likely mean instruction ordering matters, which can make matching awkward.
Sure, but just to be clear, what would you like to see? Would the IR we'd be generating for something like this (https://godbolt.org/z/1bhvonGvj):

char* find_first_byte(char *first, char *last, char *s_first, char *s_last)
{
for (; first != last; ++first)
for (char *it = s_first; it != s_last; ++it)
if (*first == *it)
return first;
return last;
}

with the intrinsic and stock IR help?

Also, if you think this would help make the intrinsic slightly more general, one thing we could do is remove the mask parameter. The mask only affects the first operand (the "haystack"), so we can proxy its functionality with an AND after the intrinsic call (as we currently do at the end of the generic lowering). I'm not sure this would have a big impact overall though. Please let me know what you think. I'm happy to get the IR if you think that will help move this forward.
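As a sketch (the mask-less signature below is purely hypothetical and only illustrates the point):

%m = call <vscale x 16 x i1> @llvm.experimental.vector.match(<vscale x 16 x i8> %haystack, <16 x i8> %needle)
%r = and <vscale x 16 x i1> %m, %mask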
I'm assuming the new intrinsic requires a matching transformation to generate it, so preferably I'd like the transformation to already exist in LLVM. Given that's only worth doing if there's a performance benefit, I ask "Is enabling vectorisation of the find loop enough to see improved performance, or does that only happen when an operation equivalent to SVE's MATCH is available?" If only the latter is true then yes, seeing representative IR will be very useful. However, if the transformation is generically useful then why not implement that first?
Sorry, maybe I wasn't very clear, but the transformation is in #101976. It does need rebasing to the latest version of the intrinsic (I was letting the intrinsic settle to hopefully avoid too much back and forth), but I'll get back to it soon.
I haven't benchmarked with stock IR, but yes, I expect the latter. With SVE2 MATCH we can get very neat speedups (200x+) in some cases, simply because you can do up to 16x16 byte compares in a single instruction with relatively low latency (at least on Neoverse V2). I can get the stock IR though if you'd still like to have a look at it?

Yes please. 200x+/16 is still more than one, which suggests it could be generically useful? Are you able to name/provide the benchmark(s) used?
Sounds good, I'll get the stock IR.
Yep, that's true, but in that case we can still use the intrinsic and fall back to generic lowering, right? I'll check on the benchmarks.
To help things along I've reviewed the PR based on the assumption this intrinsic will be the way to go. In general it looks pretty nice but I do think you'll want to tighten the intrinsic definition a little more.
// Wrap the operands.
Op1 = convertToScalableVector(DAG, OpContainerVT, Op1);
Mask = DAG.getNode(ISD::ANY_EXTEND, dl, Op1VT, Mask);
You cannot any_extend a mask because convertFixedMaskToScalableVector() expects a boolean vector result (see getBooleanContents()). The any_extend makes the top bits undefined and thus the compare performed by convertFixedMaskToScalableVector() is also undefined. For AArch64, getBooleanContents() is either all zero or all ones, which ISD::SIGN_EXTEND will maintain.
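In IR terms the difference is roughly the following (an illustrative analogy only; the actual code operates on SelectionDAG nodes):

; sign-extending an i1 mask gives all-ones/all-zeros lanes, so comparing
; against zero recovers the original mask exactly:
%ext  = sext <16 x i1> %mask to <16 x i8>
%back = icmp ne <16 x i8> %ext, zeroinitializer   ; equals %mask
; with an "any" extension the high bits of each lane are undefined, so the
; same compare may return true for lanes where %mask was 0.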
Ouch, yep that code is broken as well. I'll see about getting it fixed.
Thanks for double-checking and fixing it!
Thanks very much for the review, I'll go through it tomorrow! In the meantime, I've also got the IR for the C snippet I sent through earlier (based on this test case):

define ptr @first_byte_of(ptr %0, ptr %1, ptr %2, ptr %3) {
%5 = icmp eq ptr %0, %1
%6 = icmp eq ptr %2, %3
%7 = or i1 %5, %6
br i1 %7, label %.loopexit1, label %.preheader
.preheader: ; preds = %4, %77
%pa = phi ptr [ %8, %77 ], [ %0, %4 ]
%8 = getelementptr i8, ptr %pa, i64 16
%9 = icmp ult ptr %8, %1
%10 = select i1 %9, ptr %8, ptr %1
%11 = ptrtoint ptr %pa to i64
%12 = ptrtoint ptr %10 to i64
%13 = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %11, i64 %12)
%14 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %pa, i32 1, <vscale x 16 x i1> %13, <vscale x 16 x i8> zeroinitializer)
br label %15
15: ; preds = %76, %.preheader
%pb = phi ptr [ %2, %.preheader ], [ %16, %76 ]
%16 = getelementptr i8, ptr %pb, i64 16
%17 = icmp ult ptr %16, %3
%18 = select i1 %17, ptr %16, ptr %3
%19 = ptrtoint ptr %pb to i64
%20 = ptrtoint ptr %18 to i64
%21 = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %19, i64 %20)
%22 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %pb, i32 1, <vscale x 16 x i1> %21, <vscale x 16 x i8> zeroinitializer)
%23 = extractelement <vscale x 16 x i8> %22, i64 0
%.splatinsert = insertelement <vscale x 16 x i8> poison, i8 %23, i64 0
%.splat = shufflevector <vscale x 16 x i8> %.splatinsert, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
%24 = select <vscale x 16 x i1> %21, <vscale x 16 x i8> %22, <vscale x 16 x i8> %.splat
; <<<<< Stock IR
%25 = extractelement <vscale x 16 x i8> %24, i64 0
%.splatinsert3 = insertelement <vscale x 16 x i8> poison, i8 %25, i64 0
%.splat4 = shufflevector <vscale x 16 x i8> %.splatinsert3, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
%26 = icmp eq <vscale x 16 x i8> %14, %.splat4
%27 = extractelement <vscale x 16 x i8> %24, i64 1
%.splatinsert5 = insertelement <vscale x 16 x i8> poison, i8 %27, i64 0
%.splat6 = shufflevector <vscale x 16 x i8> %.splatinsert5, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
%28 = icmp eq <vscale x 16 x i8> %14, %.splat6
%29 = or <vscale x 16 x i1> %26, %28
; the above 13 more times
%69 = extractelement <vscale x 16 x i8> %24, i64 15
%.splatinsert33 = insertelement <vscale x 16 x i8> poison, i8 %69, i64 0
%.splat34 = shufflevector <vscale x 16 x i8> %.splatinsert33, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
%70 = icmp eq <vscale x 16 x i8> %14, %.splat34
%71 = or <vscale x 16 x i1> %68, %70
%72 = and <vscale x 16 x i1> %71, %13
; ===== Or, with @llvm.experimental.vector.match
%25 = tail call <16 x i8> @llvm.vector.extract(<vscale x 16 x i8> %24, i64 0)
%72 = tail call <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %14, <16 x i8> %25, <vscale x 16 x i1> %13)
; >>>>>
%73 = tail call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %72)
br i1 %73, label %.loopexit, label %76
.loopexit: ; preds = %15
%74 = tail call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> %72, i1 true)
%75 = getelementptr i8, ptr %pa, i64 %74
br label %.loopexit1
76: ; preds = %15
br i1 %17, label %15, label %77
77: ; preds = %76
br i1 %9, label %.preheader, label %.loopexit1
.loopexit1: ; preds = %77, %.loopexit, %4
%78 = phi ptr [ %1, %4 ], [ %75, %.loopexit ], [ %1, %77 ]
ret ptr %78
}

The matching part of the stock IR would lower to:

mov z2.b, b1
cmpeq p2.b, p0/z, z0.b, z2.b
mov z2.b, z1.b[1]
cmpeq p3.b, p0/z, z0.b, z2.b
mov z2.b, z1.b[2]
sel p2.b, p2, p2.b, p3.b
cmpeq p3.b, p0/z, z0.b, z2.b
mov z2.b, z1.b[3]
sel p2.b, p2, p2.b, p3.b
; ... about 48 instructions in total

Meanwhile, for the intrinsic version, we'd get the single MATCH instruction. I can commit these two versions to #101976 if you'd like to give them a go yourself? Just let me know if that would be useful, otherwise I'll do it in one go after going through your feedback. Thanks again!
(And address other minor feedback.)
Hi @paulwalker-arm, I believe I've addressed your comments from yesterday and left some of my own - thanks very much again for the feedback! Please let me know if you have any other questions.

Thanks @rj-jesus. Just wanted to let you know that it'll likely be Tuesday next week before I get a chance to revisit this PR.

Thanks @paulwalker-arm, no worries at all! If you have any other questions in the meantime just let me know.
Thanks for the IR @rj-jesus, and thanks for your patience. From a pure "can we reliably match this IR" point of view I'd still answer yes. However, I see your use-case requires loading a needle vector, which adds additional complexity to track where the needles are coming from in order to reuse the original vector. Given this niggle and the significant performance loss that would result from failing to match the sequence, I'm happy to agree there's enough value in having a dedicated intrinsic. Given the potential size of the IR sequence replaced, I guess there may also be compile-time benefits, and it'll certainly be easier to cost model. Not related to this, but looking at the IR I notice you're effectively doing a fixed-length load into a scalable vector for the needle. I just wondered if you have considered having the main loop load directly into a fixed-length vector and then have a tail loop that requires a masked load. Doing this would allow you to emit a plain fixed-length load for the needle in the main loop.
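As a sketch of that suggestion (illustrative only, reusing the value names from the IR above):

; main loop: the whole 16-byte needle block is known to be in bounds, so no
; masked load is needed
%needle = load <16 x i8>, ptr %pb, align 1
%found  = call <vscale x 16 x i1> @llvm.experimental.vector.match.nxv16i8(<vscale x 16 x i8> %14, <16 x i8> %needle, <vscale x 16 x i1> %13)
; tail loop: keep the lane mask and the masked load as before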
Hi @paulwalker-arm, no worries at all, thank you very much for having a look! Once #115189 lands I'll rebase this patch and simplify the mask handling here. You're right - I believe that when I tried with separate main and tail loops I wasn't fully convinced that the performance difference was worth the extra code size and complexity, but I'll have a look again since I can't really recall the numbers.
A few more comments but in general this is looking pretty good to my eye.
Thank you very much for the feedback @paulwalker-arm, I'll address your comments and update the PR early next week. Cheers!
Decide to broadcast Op2 based on the result of `Op2VT.is128BitVector()` rather than by comparing its number of elements with Op1. This fixes a bug with search and needle vectors of `v8i8`, and enables us to match search vectors of `v8i8` with needle vectors of `v16i8`. Also address other review comments.
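For example (illustrative only), this allows calls that mix the two widths, where an 8-element i8 search vector is matched against a full 16-element needle:

%r = call <8 x i1> @llvm.experimental.vector.match(<8 x i8> %op1, <16 x i8> %op2, <8 x i1> %mask)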
Many thanks for your feedback @paulwalker-arm, I've made the changes you suggested and modified the logic for broadcasting Op2.
Minor request but otherwise this looks good to me.
Thanks very much for your feedback @paulwalker-arm and @david-arm!
This patch introduces an experimental intrinsic for matching the elements of one vector against the elements of another.
For AArch64 targets that support SVE2, the intrinsic lowers to a MATCH instruction for supported fixed and scalable vector types. The lowering can be improved a bit for fixed-length vectors, but this can be addressed later.