Skip to content

[AArch64][SVE] Use SVE for scalar FP converts in streaming[-compatible] functions #112564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
82 changes: 82 additions & 0 deletions llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -18961,13 +18961,92 @@ static SDValue performVectorCompareAndMaskUnaryOpCombine(SDNode *N,
return SDValue();
}

/// Tries to replace scalar FP <-> INT conversions with SVE in streaming
/// functions, this can help to reduce the number of fmovs to/from GPRs.
static SDValue
tryToReplaceScalarFPConversionWithSVE(SDNode *N, SelectionDAG &DAG,
const AArch64Subtarget *Subtarget) {
if (N->isStrictFPOpcode())
return SDValue();

if (!Subtarget->isSVEorStreamingSVEAvailable() ||
(!Subtarget->isStreaming() && !Subtarget->isStreamingCompatible()))
return SDValue();

auto isSupportedType = [](EVT VT) {
if (!VT.isSimple())
return false;
// There are SVE instructions that can convert to/from all pairs of these
// int and float types. Note: We don't bother with i8 or i16 as those are
// illegal types for scalars.
return is_contained({MVT::i32, MVT::i64, MVT::f16, MVT::f32, MVT::f64},
VT.getSimpleVT().SimpleTy);
};

if (!isSupportedType(N->getValueType(0)) ||
!isSupportedType(N->getOperand(0).getValueType()))
return SDValue();

unsigned Opc = N->getOpcode();
bool IsSigned = Opc == ISD::SINT_TO_FP || Opc == ISD::FP_TO_SINT;

SDValue SrcVal = N->getOperand(0);
EVT SrcTy = SrcVal.getValueType();
EVT DestTy = N->getValueType(0);

EVT SrcVecTy;
EVT DestVecTy;
if (DestTy.bitsGT(SrcTy)) {
DestVecTy = getPackedSVEVectorVT(DestTy);
SrcVecTy = SrcTy == MVT::i32 ? getPackedSVEVectorVT(SrcTy)
: DestVecTy.changeVectorElementType(SrcTy);
} else {
SrcVecTy = getPackedSVEVectorVT(SrcTy);
DestVecTy = DestTy == MVT::i32 ? getPackedSVEVectorVT(DestTy)
: SrcVecTy.changeVectorElementType(DestTy);
}

SDLoc DL(N);
SDValue ZeroIdx = DAG.getVectorIdxConstant(0, DL);
SDValue Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, SrcVecTy,
DAG.getUNDEF(SrcVecTy), SrcVal, ZeroIdx);

// Conversions between f64 and i32 are a special case as nxv2i32 is an illegal
// type (unlike the equivalent nxv2f32 for floating-point types). So the only
// way to lower to these variants is via the intrinsics. Note: We could
// sign/zero extend to the i64 variant, but that may result in extra extends
// or fmovs in the final assembly.
bool IsI32ToF64 = SrcTy == MVT::i32 && DestTy == MVT::f64;
bool isF64ToI32 = SrcTy == MVT::f64 && DestTy == MVT::i32;
if (IsI32ToF64 || isF64ToI32) {
unsigned IntrinsicOpc;
if (IsI32ToF64)
IntrinsicOpc = IsSigned ? Intrinsic::aarch64_sve_scvtf_f64i32
: Intrinsic::aarch64_sve_ucvtf_f64i32;
else
IntrinsicOpc = IsSigned ? Intrinsic::aarch64_sve_fcvtzs_i32f64
: Intrinsic::aarch64_sve_fcvtzu_i32f64;
SDValue PTrue = getPredicateForVector(DAG, DL, MVT::nxv2f64);
Vec = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, DestVecTy,
{DAG.getConstant(IntrinsicOpc, DL, MVT::i32),
Comment on lines +19024 to +19031
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not much better than the original isel route. Code generation is rarely about implementing a single function that emits exactly what you need, but rather a sequence of smaller cuts that each move the DAG to its finial destination.

You mention using 64-bit converts did not result in the best code generation. What are the resulting deficiencies? Is it possible for them to be resolved using isolated combines or perhaps better isel? If the deficiencies are integer scalar instructions then doesn't that mean the GPR<->FPR transitions have not been removed and thus the rational for the combine has not been achieved anyway?

Copy link
Member Author

@MacDue MacDue Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not much better than the original isel route. Code generation is rarely about implementing a single function that emits exactly what you need, but rather a sequence of smaller cuts that each move the DAG to its final destination.

I get that, but this is not entirely the standard case. This is emulating Neon/FP, which has f64 <-> i32 conversions, SVE also has these conversions, but they're currently unused (outside of intrinsics), as nxv2i32 is an illegal type that is promoted to nxv2i64. So you can't rely on the standard lowerings/patterns being in place.

You mention using 64-bit converts did not result in the best code generation. What are the resulting deficiencies?

I've shown a few such cases in my comments above. The issues were:

  • Loads not being promoted to FP registers (likely due to the sign/zero extensions in the way)
  • Pointless fmovs between GPRs + sign/zero extends for values that should stay within FP/vector registers
  • Additional unavoidable sign/zero extends in some cases
    • If you can only do i64 -> f64 conversions, and you have an i32, you'll need to extend it

You could probably write a bunch of folds to eventually tidy this up, but given that this will never be used for the general SVE path (since these converts are illegal there), the direct approach made more sense to me.

The inspiration for using the intrinsics directly partly came from the existing performFpToIntCombine(), which sometimes uses the Neon intrinsics directly, so it didn't seem too unreasonable.

DAG.getUNDEF(DestTy), PTrue, Vec});
} else {
Vec = DAG.getNode(Opc, DL, DestVecTy, Vec);
}

return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, DestTy, Vec, ZeroIdx);
}

static SDValue performIntToFpCombine(SDNode *N, SelectionDAG &DAG,
const AArch64Subtarget *Subtarget) {
// First try to optimize away the conversion when it's conditionally from
// a constant. Vectors only.
if (SDValue Res = performVectorCompareAndMaskUnaryOpCombine(N, DAG))
return Res;

if (SDValue Res = tryToReplaceScalarFPConversionWithSVE(N, DAG, Subtarget))
return Res;

EVT VT = N->getValueType(0);
if (VT != MVT::f32 && VT != MVT::f64)
return SDValue();
Expand Down Expand Up @@ -19006,6 +19085,9 @@ static SDValue performIntToFpCombine(SDNode *N, SelectionDAG &DAG,
static SDValue performFpToIntCombine(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const AArch64Subtarget *Subtarget) {
if (SDValue Res = tryToReplaceScalarFPConversionWithSVE(N, DAG, Subtarget))
return Res;

if (!Subtarget->isNeonAvailable())
return SDValue();

Expand Down
4 changes: 2 additions & 2 deletions llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Original file line number Diff line number Diff line change
Expand Up @@ -2328,8 +2328,8 @@ let Predicates = [HasSVEorSME] in {
defm FCVT_ZPmZ_HtoD : sve_fp_2op_p_zd< 0b1101001, "fcvt", ZPR16, ZPR64, int_aarch64_sve_fcvt_f64f16, AArch64fcvte_mt, nxv2f64, nxv2i1, nxv2f16, ElementSizeD>;
defm FCVT_ZPmZ_DtoS : sve_fp_2op_p_zdr<0b1101010, "fcvt", ZPR64, ZPR32, int_aarch64_sve_fcvt_f32f64, AArch64fcvtr_mt, nxv2f32, nxv2i1, nxv2f64, ElementSizeD>;
defm FCVT_ZPmZ_StoD : sve_fp_2op_p_zd< 0b1101011, "fcvt", ZPR32, ZPR64, int_aarch64_sve_fcvt_f64f32, AArch64fcvte_mt, nxv2f64, nxv2i1, nxv2f32, ElementSizeD>;
defm SCVTF_ZPmZ_StoD : sve_fp_2op_p_zd< 0b1110000, "scvtf", ZPR32, ZPR64, int_aarch64_sve_scvtf_f64i32, AArch64scvtf_mt, nxv2f64, nxv2i1, nxv4i32, ElementSizeD>;
defm UCVTF_ZPmZ_StoD : sve_fp_2op_p_zd< 0b1110001, "ucvtf", ZPR32, ZPR64, int_aarch64_sve_ucvtf_f64i32, AArch64ucvtf_mt, nxv2f64, nxv2i1, nxv4i32, ElementSizeD>;
defm SCVTF_ZPmZ_StoD : sve_fp_2op_p_zd< 0b1110000, "scvtf", ZPR32, ZPR64, int_aarch64_sve_scvtf_f64i32, null_frag, nxv2f64, nxv2i1, nxv4i32, ElementSizeD>;
defm UCVTF_ZPmZ_StoD : sve_fp_2op_p_zd< 0b1110001, "ucvtf", ZPR32, ZPR64, int_aarch64_sve_ucvtf_f64i32, null_frag, nxv2f64, nxv2i1, nxv4i32, ElementSizeD>;
defm UCVTF_ZPmZ_StoH : sve_fp_2op_p_zd< 0b0110101, "ucvtf", ZPR32, ZPR16, int_aarch64_sve_ucvtf_f16i32, AArch64ucvtf_mt, nxv4f16, nxv4i1, nxv4i32, ElementSizeS>;
defm SCVTF_ZPmZ_DtoS : sve_fp_2op_p_zd< 0b1110100, "scvtf", ZPR64, ZPR32, int_aarch64_sve_scvtf_f32i64, AArch64scvtf_mt, nxv2f32, nxv2i1, nxv2i64, ElementSizeD>;
defm SCVTF_ZPmZ_StoH : sve_fp_2op_p_zd< 0b0110100, "scvtf", ZPR32, ZPR16, int_aarch64_sve_scvtf_f16i32, AArch64scvtf_mt, nxv4f16, nxv4i1, nxv4i32, ElementSizeS>;
Expand Down
94 changes: 75 additions & 19 deletions llvm/test/CodeGen/AArch64/sve-streaming-mode-cvt-fp-int-fp.ll
Original file line number Diff line number Diff line change
@@ -1,22 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -force-streaming-compatible < %s | FileCheck %s
; RUN: llc -force-streaming-compatible -mattr=+sme2p2 < %s | FileCheck %s --check-prefix=USE-NEON-NO-GPRS
; RUN: llc < %s | FileCheck %s --check-prefix=USE-NEON-NO-GPRS
; RUN: llc -mattr=+sve -force-streaming-compatible < %s | FileCheck %s
; RUN: llc -mattr=+sme -force-streaming < %s | FileCheck %s
; RUN: llc -mattr=+sme2p2 -force-streaming-compatible < %s | FileCheck %s --check-prefix=USE-NEON-NO-GPRS
; RUN: llc -mattr=+neon < %s | FileCheck %s --check-prefix=USE-NEON-NO-GPRS
; RUN: llc -force-streaming-compatible < %s | FileCheck %s --check-prefix=NONEON-NOSVE

target triple = "aarch64-unknown-linux-gnu"

define double @t1(double %x) {
; CHECK-LABEL: t1:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvtzs x8, d0
; CHECK-NEXT: scvtf d0, x8
; CHECK-NEXT: ptrue p0.d
; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
; CHECK-NEXT: fcvtzs z0.d, p0/m, z0.d
; CHECK-NEXT: scvtf z0.d, p0/m, z0.d
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t1:
; USE-NEON-NO-GPRS: // %bb.0: // %entry
; USE-NEON-NO-GPRS-NEXT: fcvtzs d0, d0
; USE-NEON-NO-GPRS-NEXT: scvtf d0, d0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t1:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvtzs x8, d0
; NONEON-NOSVE-NEXT: scvtf d0, x8
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptosi double %x to i64
%conv1 = sitofp i64 %conv to double
Expand All @@ -26,15 +37,24 @@ entry:
define float @t2(float %x) {
; CHECK-LABEL: t2:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvtzs w8, s0
; CHECK-NEXT: scvtf s0, w8
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
; CHECK-NEXT: fcvtzs z0.s, p0/m, z0.s
; CHECK-NEXT: scvtf z0.s, p0/m, z0.s
; CHECK-NEXT: // kill: def $s0 killed $s0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t2:
; USE-NEON-NO-GPRS: // %bb.0: // %entry
; USE-NEON-NO-GPRS-NEXT: fcvtzs s0, s0
; USE-NEON-NO-GPRS-NEXT: scvtf s0, s0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t2:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvtzs w8, s0
; NONEON-NOSVE-NEXT: scvtf s0, w8
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptosi float %x to i32
%conv1 = sitofp i32 %conv to float
Expand All @@ -44,10 +64,11 @@ entry:
define half @t3(half %x) {
; CHECK-LABEL: t3:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvt s0, h0
; CHECK-NEXT: fcvtzs w8, s0
; CHECK-NEXT: scvtf s0, w8
; CHECK-NEXT: fcvt h0, s0
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: // kill: def $h0 killed $h0 def $z0
; CHECK-NEXT: fcvtzs z0.s, p0/m, z0.h
; CHECK-NEXT: scvtf z0.h, p0/m, z0.s
; CHECK-NEXT: // kill: def $h0 killed $h0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t3:
Expand All @@ -57,6 +78,14 @@ define half @t3(half %x) {
; USE-NEON-NO-GPRS-NEXT: scvtf s0, s0
; USE-NEON-NO-GPRS-NEXT: fcvt h0, s0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t3:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvt s0, h0
; NONEON-NOSVE-NEXT: fcvtzs w8, s0
; NONEON-NOSVE-NEXT: scvtf s0, w8
; NONEON-NOSVE-NEXT: fcvt h0, s0
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptosi half %x to i32
%conv1 = sitofp i32 %conv to half
Expand All @@ -66,15 +95,24 @@ entry:
define double @t4(double %x) {
; CHECK-LABEL: t4:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvtzu x8, d0
; CHECK-NEXT: ucvtf d0, x8
; CHECK-NEXT: ptrue p0.d
; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
; CHECK-NEXT: fcvtzu z0.d, p0/m, z0.d
; CHECK-NEXT: ucvtf z0.d, p0/m, z0.d
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t4:
; USE-NEON-NO-GPRS: // %bb.0: // %entry
; USE-NEON-NO-GPRS-NEXT: fcvtzu d0, d0
; USE-NEON-NO-GPRS-NEXT: ucvtf d0, d0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t4:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvtzu x8, d0
; NONEON-NOSVE-NEXT: ucvtf d0, x8
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptoui double %x to i64
%conv1 = uitofp i64 %conv to double
Expand All @@ -84,15 +122,24 @@ entry:
define float @t5(float %x) {
; CHECK-LABEL: t5:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvtzu w8, s0
; CHECK-NEXT: ucvtf s0, w8
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
; CHECK-NEXT: fcvtzu z0.s, p0/m, z0.s
; CHECK-NEXT: ucvtf z0.s, p0/m, z0.s
; CHECK-NEXT: // kill: def $s0 killed $s0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t5:
; USE-NEON-NO-GPRS: // %bb.0: // %entry
; USE-NEON-NO-GPRS-NEXT: fcvtzu s0, s0
; USE-NEON-NO-GPRS-NEXT: ucvtf s0, s0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t5:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvtzu w8, s0
; NONEON-NOSVE-NEXT: ucvtf s0, w8
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptoui float %x to i32
%conv1 = uitofp i32 %conv to float
Expand All @@ -102,10 +149,11 @@ entry:
define half @t6(half %x) {
; CHECK-LABEL: t6:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: fcvt s0, h0
; CHECK-NEXT: fcvtzu w8, s0
; CHECK-NEXT: ucvtf s0, w8
; CHECK-NEXT: fcvt h0, s0
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: // kill: def $h0 killed $h0 def $z0
; CHECK-NEXT: fcvtzu z0.s, p0/m, z0.h
; CHECK-NEXT: ucvtf z0.h, p0/m, z0.s
; CHECK-NEXT: // kill: def $h0 killed $h0 killed $z0
; CHECK-NEXT: ret
;
; USE-NEON-NO-GPRS-LABEL: t6:
Expand All @@ -115,6 +163,14 @@ define half @t6(half %x) {
; USE-NEON-NO-GPRS-NEXT: ucvtf s0, s0
; USE-NEON-NO-GPRS-NEXT: fcvt h0, s0
; USE-NEON-NO-GPRS-NEXT: ret
;
; NONEON-NOSVE-LABEL: t6:
; NONEON-NOSVE: // %bb.0: // %entry
; NONEON-NOSVE-NEXT: fcvt s0, h0
; NONEON-NOSVE-NEXT: fcvtzu w8, s0
; NONEON-NOSVE-NEXT: ucvtf s0, w8
; NONEON-NOSVE-NEXT: fcvt h0, s0
; NONEON-NOSVE-NEXT: ret
entry:
%conv = fptoui half %x to i32
%conv1 = uitofp i32 %conv to half
Expand Down
Loading
Loading