-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[AArch64] Remove EXT instr before UZP when extracting elements from vector #91328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
978454b
8a23ca6
7db8e29
6cfb764
d077a36
27059a9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21448,6 +21448,29 @@ static SDValue performUzpCombine(SDNode *N, SelectionDAG &DAG, | |
SDValue Op1 = N->getOperand(1); | ||
EVT ResVT = N->getValueType(0); | ||
|
||
// uzp(extract_lo(x), extract_hi(x)) -> extract_lo(uzp x, x) | ||
if (Op0.getOpcode() == ISD::EXTRACT_SUBVECTOR && | ||
Op1.getOpcode() == ISD::EXTRACT_SUBVECTOR && | ||
Op0.getOperand(0) == Op1.getOperand(0)) { | ||
|
||
SDValue SourceVec = Op0.getOperand(0); | ||
uint64_t ExtIdx0 = Op0.getConstantOperandVal(1); | ||
uint64_t ExtIdx1 = Op1.getConstantOperandVal(1); | ||
uint64_t NumElements = SourceVec.getValueType().getVectorMinNumElements(); | ||
if (ExtIdx0 == 0 && ExtIdx1 == NumElements / 2) { | ||
EVT OpVT = Op0.getOperand(1).getValueType(); | ||
EVT WidenedResVT = ResVT.getDoubleNumVectorElementsVT(*DAG.getContext()); | ||
SDValue Uzp = DAG.getNode(N->getOpcode(), DL, WidenedResVT, SourceVec, | ||
DAG.getUNDEF(WidenedResVT)); | ||
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ResVT, Uzp, | ||
DAG.getConstant(0, DL, OpVT)); | ||
} | ||
} | ||
|
||
// Following optimizations only work with uzp1. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. -> The following optimizations only work with uzp1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The coding standard prefers Capitalization and full sentences. https://llvm.org/docs/CodingStandards.html#commenting |
||
if (N->getOpcode() == AArch64ISD::UZP2) | ||
return SDValue(); | ||
|
||
// uzp1(x, undef) -> concat(truncate(x), undef) | ||
if (Op1.getOpcode() == ISD::UNDEF) { | ||
EVT BCVT = MVT::Other, HalfVT = MVT::Other; | ||
|
@@ -24665,6 +24688,7 @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N, | |
case AArch64ISD::UUNPKHI: | ||
return performUnpackCombine(N, DAG, Subtarget); | ||
case AArch64ISD::UZP1: | ||
case AArch64ISD::UZP2: | ||
return performUzpCombine(N, DAG, Subtarget); | ||
case AArch64ISD::SETCC_MERGE_ZERO: | ||
return performSetccMergeZeroCombine(N, DCI); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uzp2 -> Uzp2 (and maybe without the 2? UZP perhaps?)
Can the second SourceVec be Undef if we are only using the bottom half? Or does that produce worse code in places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It certainly produces different code. But based on what I have seen from current testcases, it doesn't seem to degrade performance. The only thing is that sometimes compiler now prefers using XTN instruction instead of UZP1 for extracting even elements sometimes, but based on software optimizations guides for V3 and N3 core, both instructions have same latency and throughput. So unless smth changes in the future this should be fine.
I changed code to use UNDEF now. Thanks for the suggestions !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks that does sound good. The Cortex-A55 issues Neon operations in 64bit chunks, and I believe an xtn counts as a 64-bit operations so can to two per cycle instead of 1 for 128bit operations. That's getting older now, but it might be a tiny bit nicer in places.
One of the problems it can cause is that because the undef is replaced by any register, it can create false-dependencies at times. Hopefully this won't be too much of a problem anywhere though, we can adjust it if needed.
Variable names should be Capitalized though!