[AArch64] Generalize integer FPR lane stores for all types #134117

Merged
MacDue merged 12 commits into llvm:main from attempt_using_truncstore on Apr 17, 2025

Conversation


@MacDue MacDue commented Apr 2, 2025

This rewrites the fold from #129756 to apply to all types, including stores of i8s. This required adding a new aarch64mfp8 MVT to represent FPR8 types on AArch64, which can be used to extract and store 8-bit values using b sub-registers.

Follow on from: #129756
Closes: #131793
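
As a rough illustration (a hypothetical function mirroring the updated aarch64-sve-ldst-one.ll checks in this patch), an extracted byte can now be stored straight from the FPR via its b sub-register instead of bouncing through a GPR:

; Sketch only -- the expected lowering after this patch is shown as CHECK
; comments, based on the updated tests below; previously this needed an
; fmov/umov to a GPR followed by strb.
define void @store_lane0_i8(ptr %a, <vscale x 16 x i8> %b) {
; CHECK-LABEL: store_lane0_i8:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    str b0, [x0]
; CHECK-NEXT:    ret
entry:
  %elt = extractelement <vscale x 16 x i8> %b, i32 0
  store i8 %elt, ptr %a, align 1
  ret void
}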

@MacDue MacDue force-pushed the attempt_using_truncstore branch from 3a0e00d to 2906095 on April 2, 2025 18:42
@MacDue MacDue force-pushed the attempt_using_truncstore branch from 2906095 to ef964aa on April 2, 2025 19:16
@MacDue MacDue changed the title from "(WIP) [AArch64] Make use of byte FPR stores for bytes extracted from vectors" to "[AArch64] Make use of byte FPR stores for bytes extracted from vectors" on Apr 3, 2025
@MacDue MacDue marked this pull request as ready for review April 3, 2025 09:28

llvmbot commented Apr 3, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes

This helps avoid some pointless fmovs (and in some cases improves addressing). The central fold is a DAG combine that lowers byte stores of lane extracts to v1i64 -> v1i8 truncstores, which can be lowered to bsub stores in ISEL. This currently requires a new MVT vi8 to create bsub sub-register extracts in ISEL.

Follow on from: #129756
Alternate implementation of: #131793
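
As an example of the central fold (condensed from the new test_str_trunc_lane_s32_to_s8 test added in the diff below; non-streaming NEON codegen shown as CHECK comments), a truncating store of a non-zero lane becomes a lane move into lane 0 followed by a bsub store, rather than umov + strb:

; Condensed from a new test in this patch.
define void @test_str_trunc_lane_s32_to_s8(ptr %a, <vscale x 4 x i32> %b) {
; CHECK:         mov v0.s[0], v0.s[3]
; CHECK-NEXT:    str b0, [x0]
; CHECK-NEXT:    ret
entry:
  %0 = extractelement <vscale x 4 x i32> %b, i32 3
  %trunc = trunc i32 %0 to i8
  store i8 %trunc, ptr %a, align 1
  ret void
}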


Patch is 74.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134117.diff

36 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ValueTypes.td (+2)
  • (modified) llvm/lib/CodeGen/ValueTypes.cpp (+2)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+57-1)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+16-4)
  • (modified) llvm/lib/Target/AArch64/AArch64RegisterInfo.td (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-sve-ldst-one.ll (+131-16)
  • (modified) llvm/test/CodeGen/AArch64/add.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/andorxor.ll (+21-24)
  • (modified) llvm/test/CodeGen/AArch64/arm64-collect-loh.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/arm64-st1.ll (+50-22)
  • (modified) llvm/test/CodeGen/AArch64/bitcast-v2i8.ll (+12-6)
  • (modified) llvm/test/CodeGen/AArch64/ctlz.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/ctpop.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/cttz.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/extract-vector-cmp.ll (+3-4)
  • (modified) llvm/test/CodeGen/AArch64/mul.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/neon-truncstore.ll (+6-8)
  • (modified) llvm/test/CodeGen/AArch64/nontemporal-load.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/pr-cf624b2.ll (+23-37)
  • (modified) llvm/test/CodeGen/AArch64/sadd_sat_vec.ll (+4-5)
  • (modified) llvm/test/CodeGen/AArch64/setcc-type-mismatch.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/ssub_sat_vec.ll (+4-5)
  • (modified) llvm/test/CodeGen/AArch64/store.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/sub.ll (+7-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll (+4-5)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-gather-scatter.ll (+5-7)
  • (modified) llvm/test/CodeGen/AArch64/trunc-to-tbl.ll (+15-12)
  • (modified) llvm/test/CodeGen/AArch64/uadd_sat_vec.ll (+4-5)
  • (modified) llvm/test/CodeGen/AArch64/usub_sat_vec.ll (+4-5)
  • (modified) llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll (+8-15)
  • (modified) llvm/test/CodeGen/AArch64/vec3-loads-ext-trunc-stores.ll (+84-86)
  • (modified) llvm/test/CodeGen/AArch64/vec_uaddo.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+3-4)
  • (modified) llvm/test/CodeGen/AArch64/vector-compress.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+15-14)
diff --git a/llvm/include/llvm/CodeGen/ValueTypes.td b/llvm/include/llvm/CodeGen/ValueTypes.td
index fc1a95e33380b..42c4830e94220 100644
--- a/llvm/include/llvm/CodeGen/ValueTypes.td
+++ b/llvm/include/llvm/CodeGen/ValueTypes.td
@@ -338,6 +338,8 @@ def amdgpuBufferFatPointer : ValueType<160, 234>;
 // FIXME: Remove this and the getPointerType() override if MVT::i82 is added.
 def amdgpuBufferStridedPointer : ValueType<192, 235>;
 
+def vi8       : ValueType<8,  236>;  // 8-bit integer in FPR (AArch64)
+
 let isNormalValueType = false in {
 def token      : ValueType<0, 504>;  // TokenTy
 def MetadataVT : ValueType<0, 505> { // Metadata
diff --git a/llvm/lib/CodeGen/ValueTypes.cpp b/llvm/lib/CodeGen/ValueTypes.cpp
index 0554b6387c5e6..c769568253b12 100644
--- a/llvm/lib/CodeGen/ValueTypes.cpp
+++ b/llvm/lib/CodeGen/ValueTypes.cpp
@@ -198,6 +198,8 @@ std::string EVT::getEVTString() const {
     return "amdgpuBufferFatPointer";
   case MVT::amdgpuBufferStridedPointer:
     return "amdgpuBufferStridedPointer";
+  case MVT::vi8:
+    return "vi8";
   }
 }
 
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 1c8e3afdfd718..bdbae15719fea 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -401,6 +401,7 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
   }
 
   if (Subtarget->hasFPARMv8()) {
+    addRegisterClass(MVT::vi8, &AArch64::FPR8RegClass);
     addRegisterClass(MVT::f16, &AArch64::FPR16RegClass);
     addRegisterClass(MVT::bf16, &AArch64::FPR16RegClass);
     addRegisterClass(MVT::f32, &AArch64::FPR32RegClass);
@@ -1393,6 +1394,8 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
       }
     }
 
+    setTruncStoreAction(MVT::v1i64, MVT::v1i8, Legal);
+
     for (auto Op :
          {ISD::FFLOOR, ISD::FNEARBYINT, ISD::FCEIL, ISD::FRINT, ISD::FTRUNC,
           ISD::FROUND, ISD::FROUNDEVEN, ISD::FMAXNUM_IEEE, ISD::FMINNUM_IEEE,
@@ -23988,6 +23991,22 @@ static unsigned getFPSubregForVT(EVT VT) {
   }
 }
 
+static EVT get64BitVector(EVT ElVT) {
+  assert(ElVT.isSimple() && "Expected simple VT");
+  switch (ElVT.getSimpleVT().SimpleTy) {
+  case MVT::i8:
+    return MVT::v8i8;
+  case MVT::i16:
+    return MVT::v4i16;
+  case MVT::i32:
+    return MVT::v2i32;
+  case MVT::i64:
+    return MVT::v1i64;
+  default:
+    llvm_unreachable("Unexpected VT!");
+  }
+}
+
 static SDValue performSTORECombine(SDNode *N,
                                    TargetLowering::DAGCombinerInfo &DCI,
                                    SelectionDAG &DAG,
@@ -24066,11 +24085,44 @@ static SDValue performSTORECombine(SDNode *N,
     SDValue ExtIdx = Value.getOperand(1);
     EVT VectorVT = Vector.getValueType();
     EVT ElemVT = VectorVT.getVectorElementType();
-    if (!ValueVT.isInteger() || ElemVT == MVT::i8 || MemVT == MVT::i8)
+    if (!ValueVT.isInteger())
       return SDValue();
     if (ValueVT != MemVT && !ST->isTruncatingStore())
       return SDValue();
 
+    if (MemVT == MVT::i8) {
+      auto *ExtCst = dyn_cast<ConstantSDNode>(ExtIdx);
+      if (Subtarget->isNeonAvailable() &&
+          (VectorVT == MVT::v8i8 || VectorVT == MVT::v16i8) && ExtCst &&
+          !ExtCst->isZero() && ST->getBasePtr().getOpcode() != ISD::ADD) {
+        // These can lower to st1.b, which is preferable if we're unlikely to
+        // fold the addressing into the store.
+        return SDValue();
+      }
+
+      // Lower as truncstore of v1i64 -> v1i8 (which can lower to a bsub store).
+      SDValue Zero = DAG.getConstant(0, DL, MVT::i64);
+      SDValue ExtVector;
+      EVT VecVT64 = get64BitVector(ElemVT);
+      if (ExtCst && ExtCst->isZero()) {
+        ExtVector =
+            DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VecVT64, Vector, Zero);
+      } else {
+        SDValue Ext = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL,
+                                  Value.getValueType(), Vector, ExtIdx);
+        ExtVector = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VecVT64,
+                                DAG.getUNDEF(VecVT64), Ext, Zero);
+      }
+
+      SDValue Cast = DAG.getNode(AArch64ISD::NVCAST, DL, MVT::v1i64, ExtVector);
+      return DAG.getTruncStore(ST->getChain(), DL, Cast, ST->getBasePtr(),
+                               MVT::v1i8, ST->getMemOperand());
+    }
+
+    // TODO: Handle storing i8s to wider types.
+    if (ElemVT == MVT::i8)
+      return SDValue();
+
     // Heuristic: If there are other users of integer scalars extracted from
     // this vector that won't fold into the store -- abandon folding. Applying
     // this fold may extend the vector lifetime and disrupt paired stores.
@@ -28825,6 +28877,10 @@ SDValue AArch64TargetLowering::LowerFixedLengthVectorStoreToSVE(
   auto Pg = getPredicateForFixedLengthVector(DAG, DL, VT);
   auto NewValue = convertToScalableVector(DAG, ContainerVT, Store->getValue());
 
+  // Can be lowered to a bsub store in ISEL.
+  if (VT == MVT::v1i64 && MemVT == MVT::v1i8)
+    return SDValue();
+
   if (VT.isFloatingPoint() && Store->isTruncatingStore()) {
     EVT TruncVT = ContainerVT.changeVectorElementType(
         Store->getMemoryVT().getVectorElementType());
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 6c61e3a613f6f..349391d17b95b 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -3575,7 +3575,7 @@ defm LDRW : LoadUI<0b10, 0, 0b01, GPR32z, uimm12s4, "ldr",
                          (load (am_indexed32 GPR64sp:$Rn, uimm12s4:$offset)))]>;
 let Predicates = [HasFPARMv8] in {
 defm LDRB : LoadUI<0b00, 1, 0b01, FPR8Op, uimm12s1, "ldr",
-                   [(set FPR8Op:$Rt,
+                   [(set (i8 FPR8Op:$Rt),
                          (load (am_indexed8 GPR64sp:$Rn, uimm12s1:$offset)))]>;
 defm LDRH : LoadUI<0b01, 1, 0b01, FPR16Op, uimm12s2, "ldr",
                    [(set (f16 FPR16Op:$Rt),
@@ -3763,7 +3763,7 @@ defm LDURW : LoadUnscaled<0b10, 0, 0b01, GPR32z, "ldur",
                           (load (am_unscaled32 GPR64sp:$Rn, simm9:$offset)))]>;
 let Predicates = [HasFPARMv8] in {
 defm LDURB : LoadUnscaled<0b00, 1, 0b01, FPR8Op, "ldur",
-                    [(set FPR8Op:$Rt,
+                    [(set (i8 FPR8Op:$Rt),
                           (load (am_unscaled8 GPR64sp:$Rn, simm9:$offset)))]>;
 defm LDURH : LoadUnscaled<0b01, 1, 0b01, FPR16Op, "ldur",
                     [(set (f16 FPR16Op:$Rt),
@@ -4333,7 +4333,7 @@ defm STRW : StoreUIz<0b10, 0, 0b00, GPR32z, uimm12s4, "str",
                             (am_indexed32 GPR64sp:$Rn, uimm12s4:$offset))]>;
 let Predicates = [HasFPARMv8] in {
 defm STRB : StoreUI<0b00, 1, 0b00, FPR8Op, uimm12s1, "str",
-                    [(store FPR8Op:$Rt,
+                    [(store (i8 FPR8Op:$Rt),
                             (am_indexed8 GPR64sp:$Rn, uimm12s1:$offset))]>;
 defm STRH : StoreUI<0b01, 1, 0b00, FPR16Op, uimm12s2, "str",
                     [(store (f16 FPR16Op:$Rt),
@@ -4469,7 +4469,7 @@ defm STURW : StoreUnscaled<0b10, 0, 0b00, GPR32z, "stur",
                                  (am_unscaled32 GPR64sp:$Rn, simm9:$offset))]>;
 let Predicates = [HasFPARMv8] in {
 defm STURB : StoreUnscaled<0b00, 1, 0b00, FPR8Op, "stur",
-                         [(store FPR8Op:$Rt,
+                         [(store (i8 FPR8Op:$Rt),
                                  (am_unscaled8 GPR64sp:$Rn, simm9:$offset))]>;
 defm STURH : StoreUnscaled<0b01, 1, 0b00, FPR16Op, "stur",
                          [(store (f16 FPR16Op:$Rt),
@@ -4589,6 +4589,18 @@ def : Pat<(truncstorei16 GPR64:$Rt, (am_unscaled16 GPR64sp:$Rn, simm9:$offset)),
 def : Pat<(truncstorei8 GPR64:$Rt, (am_unscaled8 GPR64sp:$Rn, simm9:$offset)),
   (STURBBi (EXTRACT_SUBREG GPR64:$Rt, sub_32), GPR64sp:$Rn, simm9:$offset)>;
 
+// v1i64 -> bsub truncating stores
+// Supporting pattern lower f32/64 -> v8i8
+def : Pat<(v8i8 (vector_insert (v8i8 (undef)), (i32 FPR32:$src), 0)),
+          (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)), FPR32:$src, ssub)>;
+def : Pat<(v8i8 (vector_insert (v8i8 (undef)), (i64 FPR64:$src), 0)),
+          (v8i8 (EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)), FPR64:$src, dsub), dsub))>;
+// Lower v1i64 -> v1i8 truncstore to bsub store
+def : Pat<(truncstorevi8 v1i64:$VT, (am_unscaled8 GPR64sp:$Rn, simm9:$offset)),
+          (STURBi (vi8 (EXTRACT_SUBREG v1i64:$VT, bsub)), GPR64sp:$Rn, simm9:$offset)>;
+def : Pat<(truncstorevi8 v1i64:$VT, (am_indexed8 GPR64sp:$Rn, uimm12s4:$offset)),
+          (STRBui (vi8 (EXTRACT_SUBREG v1i64:$VT, bsub)), GPR64sp:$Rn, uimm12s4:$offset)>;
+
 // Match stores from lane 0 to the appropriate subreg's store.
 multiclass VecStoreULane0Pat<SDPatternOperator StoreOp,
                              ValueType VTy, ValueType STy,
diff --git a/llvm/lib/Target/AArch64/AArch64RegisterInfo.td b/llvm/lib/Target/AArch64/AArch64RegisterInfo.td
index fed9b7b173e9c..42ba1451650ed 100644
--- a/llvm/lib/Target/AArch64/AArch64RegisterInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64RegisterInfo.td
@@ -497,7 +497,7 @@ def Q30   : AArch64Reg<30, "q30", [D30, D30_HI], ["v30", ""]>, DwarfRegAlias<B30
 def Q31   : AArch64Reg<31, "q31", [D31, D31_HI], ["v31", ""]>, DwarfRegAlias<B31>;
 }
 
-def FPR8  : RegisterClass<"AArch64", [i8], 8, (sequence "B%u", 0, 31)> {
+def FPR8  : RegisterClass<"AArch64", [i8, vi8], 8, (sequence "B%u", 0, 31)> {
   let Size = 8;
   let DecoderMethod = "DecodeSimpleRegisterClass<AArch64::FPR8RegClassID, 0, 32>";
 }
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 3ee71c14c6bd4..8179b253a86de 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -3208,6 +3208,8 @@ let Predicates = [HasSVE_or_SME] in {
   // Insert scalar into undef[0]
   def : Pat<(nxv16i8 (vector_insert (nxv16i8 (undef)), (i32 FPR32:$src), 0)),
             (INSERT_SUBREG (nxv16i8 (IMPLICIT_DEF)), FPR32:$src, ssub)>;
+  def : Pat<(nxv16i8 (vector_insert (nxv16i8 (undef)), (i64 FPR64:$src), 0)),
+            (INSERT_SUBREG (nxv16i8 (IMPLICIT_DEF)), FPR64:$src, dsub)>;
   def : Pat<(nxv8i16 (vector_insert (nxv8i16 (undef)), (i32 FPR32:$src), 0)),
             (INSERT_SUBREG (nxv8i16 (IMPLICIT_DEF)), FPR32:$src, ssub)>;
   def : Pat<(nxv4i32 (vector_insert (nxv4i32 (undef)), (i32 FPR32:$src), 0)),
diff --git a/llvm/test/CodeGen/AArch64/aarch64-sve-ldst-one.ll b/llvm/test/CodeGen/AArch64/aarch64-sve-ldst-one.ll
index d39c9bf760621..713ddd9aefe01 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-sve-ldst-one.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-sve-ldst-one.ll
@@ -108,17 +108,15 @@ entry:
 define void @test_str_lane_s8(ptr %a, <vscale x 16 x i8> %b) {
 ; CHECK-NONSTREAMING-LABEL: test_str_lane_s8:
 ; CHECK-NONSTREAMING:       // %bb.0: // %entry
-; CHECK-NONSTREAMING-NEXT:    umov w8, v0.b[7]
-; CHECK-NONSTREAMING-NEXT:    strb w8, [x0]
+; CHECK-NONSTREAMING-NEXT:    mov v0.b[0], v0.b[7]
+; CHECK-NONSTREAMING-NEXT:    str b0, [x0]
 ; CHECK-NONSTREAMING-NEXT:    ret
 ;
 ; STREAMING-COMPAT-LABEL: test_str_lane_s8:
 ; STREAMING-COMPAT:       // %bb.0: // %entry
 ; STREAMING-COMPAT-NEXT:    mov z0.b, z0.b[7]
-; STREAMING-COMPAT-NEXT:    fmov w8, s0
-; STREAMING-COMPAT-NEXT:    strb w8, [x0]
+; STREAMING-COMPAT-NEXT:    str b0, [x0]
 ; STREAMING-COMPAT-NEXT:    ret
-
 entry:
   %0 = extractelement <vscale x 16 x i8> %b, i32 7
   store i8 %0, ptr %a, align 1
@@ -128,10 +126,8 @@ entry:
 define void @test_str_lane0_s8(ptr %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: test_str_lane0_s8:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    strb w8, [x0]
+; CHECK-NEXT:    str b0, [x0]
 ; CHECK-NEXT:    ret
-
 entry:
   %0 = extractelement <vscale x 16 x i8> %b, i32 0
   store i8 %0, ptr %a, align 1
@@ -201,6 +197,19 @@ define void @test_str_reduction_i32_to_i16(ptr %ptr, <vscale x 4 x i1> %p0, <vsc
   ret void
 }
 
+define void @test_str_reduction_i32_to_i8(ptr %ptr, <vscale x 4 x i1> %p0, <vscale x 4 x i32> %v) {
+; CHECK-LABEL: test_str_reduction_i32_to_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uaddv d0, p0, z0.s
+; CHECK-NEXT:    str b0, [x0]
+; CHECK-NEXT:    ret
+
+  %reduce = tail call i64 @llvm.aarch64.sve.uaddv.nxv4i32(<vscale x 4 x i1> %p0, <vscale x 4 x i32> %v)
+  %trunc = trunc i64 %reduce to i8
+  store i8 %trunc, ptr %ptr, align 1
+  ret void
+}
+
 define void @test_str_reduction_i32_to_i32_negative_offset(ptr %ptr, <vscale x 4 x i1> %p0, <vscale x 4 x i32> %v) {
 ; CHECK-LABEL: test_str_reduction_i32_to_i32_negative_offset:
 ; CHECK:       // %bb.0:
@@ -242,6 +251,20 @@ define void @test_str_reduction_i32_to_i16_negative_offset(ptr %ptr, <vscale x 4
   ret void
 }
 
+define void @test_str_reduction_i32_to_i8_negative_offset(ptr %ptr, <vscale x 4 x i1> %p0, <vscale x 4 x i32> %v) {
+; CHECK-LABEL: test_str_reduction_i32_to_i8_negative_offset:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uaddv d0, p0, z0.s
+; CHECK-NEXT:    stur b0, [x0, #-8]
+; CHECK-NEXT:    ret
+
+  %reduce = tail call i64 @llvm.aarch64.sve.uaddv.nxv4i32(<vscale x 4 x i1> %p0, <vscale x 4 x i32> %v)
+  %trunc = trunc i64 %reduce to i8
+  %out_ptr = getelementptr inbounds i8, ptr %ptr, i64 -8
+  store i8 %trunc, ptr %out_ptr, align 1
+  ret void
+}
+
 define void @test_str_lane_s32_negative_offset(ptr %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: test_str_lane_s32_negative_offset:
 ; CHECK:       // %bb.0: // %entry
@@ -299,17 +322,15 @@ entry:
 define void @test_str_lane_s8_negative_offset(ptr %a, <vscale x 16 x i8> %b) {
 ; CHECK-NONSTREAMING-LABEL: test_str_lane_s8_negative_offset:
 ; CHECK-NONSTREAMING:       // %bb.0: // %entry
-; CHECK-NONSTREAMING-NEXT:    umov w8, v0.b[7]
-; CHECK-NONSTREAMING-NEXT:    sturb w8, [x0, #-8]
+; CHECK-NONSTREAMING-NEXT:    mov v0.b[0], v0.b[7]
+; CHECK-NONSTREAMING-NEXT:    stur b0, [x0, #-8]
 ; CHECK-NONSTREAMING-NEXT:    ret
 ;
 ; STREAMING-COMPAT-LABEL: test_str_lane_s8_negative_offset:
 ; STREAMING-COMPAT:       // %bb.0: // %entry
 ; STREAMING-COMPAT-NEXT:    mov z0.b, z0.b[7]
-; STREAMING-COMPAT-NEXT:    fmov w8, s0
-; STREAMING-COMPAT-NEXT:    sturb w8, [x0, #-8]
+; STREAMING-COMPAT-NEXT:    stur b0, [x0, #-8]
 ; STREAMING-COMPAT-NEXT:    ret
-
 entry:
   %0 = extractelement <vscale x 16 x i8> %b, i32 7
   %out_ptr = getelementptr inbounds i8, ptr %a, i64 -8
@@ -320,10 +341,8 @@ entry:
 define void @test_str_lane0_s8_negative_offset(ptr %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: test_str_lane0_s8_negative_offset:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    sturb w8, [x0, #-8]
+; CHECK-NEXT:    stur b0, [x0, #-8]
 ; CHECK-NEXT:    ret
-
 entry:
   %0 = extractelement <vscale x 16 x i8> %b, i32 0
   %out_ptr = getelementptr inbounds i8, ptr %a, i64 -8
@@ -385,6 +404,53 @@ entry:
   ret void
 }
 
+
+define void @test_str_trunc_lane_s32_to_s8(ptr %a, <vscale x 4 x i32> %b) {
+; CHECK-NONSTREAMING-LABEL: test_str_trunc_lane_s32_to_s8:
+; CHECK-NONSTREAMING:       // %bb.0: // %entry
+; CHECK-NONSTREAMING-NEXT:    mov v0.s[0], v0.s[3]
+; CHECK-NONSTREAMING-NEXT:    str b0, [x0]
+; CHECK-NONSTREAMING-NEXT:    ret
+;
+; STREAMING-COMPAT-LABEL: test_str_trunc_lane_s32_to_s8:
+; STREAMING-COMPAT:       // %bb.0: // %entry
+; STREAMING-COMPAT-NEXT:    mov z0.s, z0.s[3]
+; STREAMING-COMPAT-NEXT:    str b0, [x0]
+; STREAMING-COMPAT-NEXT:    ret
+entry:
+  %0 = extractelement <vscale x 4 x i32> %b, i32 3
+  %trunc = trunc i32 %0 to i8
+  store i8 %trunc, ptr %a, align 1
+  ret void
+}
+
+define void @test_str_trunc_lane0_s32_to_s8(ptr %a, <vscale x 4 x i32> %b) {
+; CHECK-LABEL: test_str_trunc_lane0_s32_to_s8:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str b0, [x0]
+; CHECK-NEXT:    ret
+
+entry:
+  %0 = extractelement <vscale x 4 x i32> %b, i32 0
+  %trunc = trunc i32 %0 to i8
+  store i8 %trunc, ptr %a, align 1
+  ret void
+}
+
+define void @test_str_trunc_lane_s64_to_s8(ptr %a, <vscale x 2 x i64> %b) {
+; CHECK-LABEL: test_str_trunc_lane_s64_to_s8:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    mov z0.d, z0.d[3]
+; CHECK-NEXT:    str b0, [x0]
+; CHECK-NEXT:    ret
+
+entry:
+  %0 = extractelement <vscale x 2 x i64> %b, i32 3
+  %trunc = trunc i64 %0 to i8
+  store i8 %trunc, ptr %a, align 1
+  ret void
+}
+
 define void @test_str_trunc_lane_s32_to_s16_negative_offset(ptr %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: test_str_trunc_lane_s32_to_s16_negative_offset:
 ; CHECK:       // %bb.0: // %entry
@@ -413,3 +479,52 @@ entry:
   store i16 %trunc, ptr %out_ptr, align 2
   ret void
 }
+
+define void @test_str_trunc_lane_s32_to_s8_negative_offset(ptr %a, <vscale x 4 x i32> %b) {
+; CHECK-NONSTREAMING-LABEL: test_str_trunc_lane_s32_to_s8_negative_offset:
+; CHECK-NONSTREAMING:       // %bb.0: // %entry
+; CHECK-NONSTREAMING-NEXT:    mov v0.s[0], v0.s[3]
+; CHECK-NONSTREAMING-NEXT:    stur b0, [x0, #-8]
+; CHECK-NONSTREAMING-NEXT:    ret
+;
+; STREAMING-COMPAT-LABEL: test_str_trunc_lane_s32_to_s8_negative_offset:
+; STREAMING-COMPAT:       // %bb.0: // %entry
+; STREAMING-COMPAT-NEXT:    mov z0.s, z0.s[3]
+; STREAMING-COMPAT-NEXT:    stur b0, [x0, #-8]
+; STREAMING-COMPAT-NEXT:    ret
+entry:
+  %0 = extractelement <vscale x 4 x i32> %b, i32 3
+  %trunc = trunc i32 %0 to i8
+  %out_ptr = getelementptr inbounds i8, ptr %a, i64 -8
+  store i8 %trunc, ptr %out_ptr, align 1
+  ret void
+}
+
+define void @test_str_trunc_lane0_s32_to_s8_negative_offset(ptr %a, <vscale x 4 x i32> %b) {
+; CHECK-LABEL: test_str_trunc_lane0_s32_to_s8_negative_offset:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    stur b0, [x0, #-8]
+; CHECK-NEXT:    ret
+
+entry:
+  %0 = extractelement <vscale x 4 x i32> %b, i32 0
+  %trunc = trunc i32 %0 to i8
+  %out_ptr = getelementptr inbounds i8, ptr %a, i64 -8
+  store i8 %trunc, ptr %out_ptr, align 1
+  ret void
+}
+
+define void @test_str_trunc_lane_s64_to_s8_negative_offset(ptr %a, <vscale x 2 x i64> %b) {
+; CHECK-LABEL: test_str_trunc_lane_s64_to_s8_negative_offset:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    mov z0.d, z0.d[3]
+; CHECK-NEXT:    stur b0, [x0, #-8]
+; CHECK-NEXT:    ret
+
+entry:
+  %0 = extractelement <vscale x 2 x i64> %b, i32 3
+  %trunc = trunc i64 %0 to i8
+  %out_ptr = getelementptr inbounds i8, ptr %a, i64 -8
+  store i8 %trunc, ptr %out_ptr, align 1
+  ret void
+}
diff --git a/llvm/test/CodeGen/AArch64/add.ll b/llvm/test/CodeGen/AArch64/add.ll
index fc0ba336b21cc..ea5dbc03ca174 100644
--- a/llvm/test/CodeGen/AArch64/add.ll
+++ b/llvm/test/CodeGen/AArch64/add.ll
@@ -63,10 +63,9 @@ define void @v2i8(ptr %p1, ptr %p2) {
 ; CHECK-SD-NEXT:    ld1 { v0.b }[4], [x8]
 ; CHECK-SD-NEXT:    ld1 { v1.b }[4], [x9]
 ; CHECK-SD-NEXT:    add v0.2s, v0.2s, v1.2s
-; CHECK-SD-NEXT:    mov w8, v0.s[1]
-; CHECK-SD-NEXT:    fmov w9, s0
-; CHECK-SD-NEXT:    strb w9, [x0]
-; CHECK-SD-NEXT:    strb w8, [x0, #1]
+; CHECK-SD-NEXT:    mov v1.s[0], v0.s[1]
+; CHECK-SD-NEXT:    str b0, [x0]
+; CHECK-SD-NEXT:    stur b1, [x0, #1]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: v2i8:
@@ -101,11 +100,11 @@ define void @v3i8(ptr %p1, ptr %p2) {
 ; CHECK-SD-NEXT:    zip1 v1.8b, v1.8b, v0.8b
 ; CHECK-SD-NEXT:    add v0.4h, v0.4h, v1.4h
 ; CHECK-SD-NEXT:    uzp1 v1.8b, v0.8b, v0.8b
-; CHECK-SD-NEXT:    umov w8, v0.h[2]
+; CHECK-SD-NEXT:    mov v0.h[0], v0.h[2]
 ; CHECK-SD-NEXT:    str s1, [sp, #12]
-; CHECK-SD-NEXT:    ldrh w9, [sp, #12]
-; CHECK-SD-NEXT:    strb w8, [x0, #2]
-; CHECK-SD-NEXT:    strh w9, [x0]
+; CHECK-SD-NEXT:    ldrh w8, [sp, #12]
+; CHECK-SD-NEXT:    stur b0, [x0, #2]
+; CHECK-SD-NEXT:    strh w8, [x0]
 ; CHECK-SD-NEXT:    add sp, sp, #16
 ; CHECK-SD-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/andorxor.ll b/llvm/test/CodeGen/AArch64/andorxor.ll
index 24f2549cce785..adfcf26f85ba4 100644
--- a/llvm/test/CodeGen/AArch64/andorxor.ll
+++ b/llvm/test/CodeGen/AArch64/andorxor.ll
@@ -183,10 +183,9 @@ define void @and_v2i8(ptr %p1, ptr %p2) {
 ; CHECK-SD-NEXT:    ld1 { v0.b }[4], [x8]
 ; CHECK-SD-NEXT:    ld1 { v1.b }[4], [x9]
 ; CHECK-SD-NEXT:    and v0.8b, v0.8b, v1.8b
-; CHECK-SD-NEXT:    mov w8, v0.s[1]
-; CHECK-SD-NEXT:    fmov w9, s0
-; CHECK-SD-NEXT:    strb w9, [x0]
-; CHECK-SD-NEXT:    strb w8, [x0, #1]
+; CHECK-SD-NEXT:    mov v1.s[0], v0.s[1]
+; CHECK-SD-NEXT:    str b0, [x0]
+; CHECK-SD-NEXT:    stur b1, [x0, #1]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: and_v2i8:
@@ -220,1...
[truncated]

@paulwalker-arm (Collaborator)

I believe the decision to have a dedicated MVT for mfp8 means we can simplify things further: we don't need the truncating store (that was my failed suggestion to remove the need for a dedicated MVT) and can instead have ISel store aarch64mfp8 directly. In doing this, I believe we can unify the i8 support with the previous support for i16-i64, because at the end of the day the logic is essentially:

1. Move lane_to_store to lane_0.
2. Bitcast to a vector we can extract a subreg from.
3. Extract the subreg.
4. Store the subreg.

Whilst the new MVT makes this slightly more awkward, I think we're only really talking about having logic like:

EVT FPElemVT = ElemVT == MVT::i8 ? MVT::i8 : EVT::getFloatingPointVT(***);
EVT FPMemVT = MemVT == MVT::i8 ? MVT::aarch64mfp8 : EVT::getFloatingPointVT(***);

but beyond the VTs in play the resulting DAG will hopefully be the same across all the types.
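
To make that step list concrete, here is a minimal sketch for an i16 lane (names are illustrative, and the exact instructions depend on the addressing mode and on streaming mode; compare the review snippets further down): the wanted lane is moved to lane 0 and the h sub-register is then stored directly.

; Illustrative sketch only: after the unified fold the expectation is roughly
;   mov v0.h[0], v0.h[2]     // move lane_to_store to lane_0
;   str h0, [x0, #4]         // store the h sub-register
; rather than umov w8, v0.h[2] + strh w8, [x0, #4].
define void @store_lane_i16(ptr %a, <4 x i16> %v) {
  %ptr = getelementptr inbounds i16, ptr %a, i64 2
  %elt = extractelement <4 x i16> %v, i32 2
  store i16 %elt, ptr %ptr, align 2
  ret void
}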

@MacDue MacDue force-pushed the attempt_using_truncstore branch from da2fbee to e14277e on April 8, 2025 19:48

MacDue commented Apr 8, 2025

It doesn't quite simplify things (the fold is a little more complex in general), but I've been able to rewrite this (and the previous fold) into a unified fold that works the same way for all types. This has added quite a bit more test churn, though it mostly seems innocuous.

I've split out one patch to regenerate the st1 tests (previously they were manually written tests, which are annoying to update). Roughly the same checks can be generated using the --filter option of update_test_checks; patch here: #134919 (comment)


MacDue commented Apr 9, 2025

It might be easier if the generalization of the fold was a separate patch (and this PR lands the previous byte store combine). That way, the test changes due to changing the non-i8 fold would be in their own patch. WDYT?

@paulwalker-arm (Collaborator)

It might be easier if the generalization of the fold was a separate patch (and this PR lands the previous byte store combine). That way, the test changes due to changing the non-i8 fold would be in their own patch. WDYT?

It's not worth it. If we'd come to this design earlier we would have just landed support for all types anyway.

@MacDue MacDue changed the title from "[AArch64] Make use of byte FPR stores for bytes extracted from vectors" to "[AArch64] Generalize integer FPR lane stores for all types" on Apr 10, 2025
MacDue added 6 commits April 10, 2025 12:46
This helps avoid some pointless `fmov`s. Currently, this is done in ISEL, as FPR bytes are
problematic in SDAG (neither GPR nor FPR bytes are a legal type).

@paulwalker-arm paulwalker-arm left a comment

I think we're pretty much there now with the bones of the patch looking good. That said, there are a couple of idiom switches (most importantly the "int add" -> "vector mov") that I would like us to avoid if possible.

Comment on lines 266 to 268
; CHECK-SD-NEXT:    mov v1.h[0], v0.h[2]
; CHECK-SD-NEXT:    str s0, [x0]
; CHECK-SD-NEXT:    str h1, [x0, #4]
paulwalker-arm (Collaborator):

This looks like a negative change because the original integer add is likely to be cheaper than the vector mov.

@MacDue MacDue (Member Author) Apr 10, 2025

I believe it could be an improvement as st1 is basically str + mov. Looking at the Neoverse V1 optimization guide, I get:

add => latency of 1
st1 (b/h/s) => latency of 4

vs

ins => latency of 2
str (b/h/s/d) => latency of 2

paulwalker-arm (Collaborator):

I see what you mean.

@MacDue MacDue force-pushed the attempt_using_truncstore branch from e14277e to 9ce811e on April 11, 2025 12:31

@paulwalker-arm paulwalker-arm left a comment

One last comment.

Comment on lines 24028 to 24033
    // Propagate zero constants (applying this fold may miss optimizations).
    if (ISD::isConstantSplatVectorAllZeros(Vector.getNode())) {
      SDValue ZeroElt = DAG.getConstant(0, DL, ValueVT);
      DAG.ReplaceAllUsesWith(Value, ZeroElt);
      return SDValue();
    }
paulwalker-arm (Collaborator):

I'd say we're missing an obvious combine and looking at DAGCombiner::visitBITCAST I see

TODO: Support FP bitcasts after legalize types

It's not worth looking for trouble so I'll take the current fix and we can circle back later if necessary. That said, DAG.getConstant is specific to integers so the code needs moving after the !ValueVT.isInteger() bailout.

MacDue (Member Author):

👍 I added this fold:

+  // extract_vector_elt of undef index -> UNDEF
+  if (Index.isUndef())
+    return DAG.getUNDEF(ScalarVT);
+
+  // extract_vector_elt of zero splat -> zero
+  if (ISD::isConstantSplatVectorAllZeros(VecOp.getNode()))
+    return ScalarVT.isFloatingPoint() ? DAG.getConstantFP(0.0, DL, ScalarVT)
+                                      : DAG.getConstant(0, DL, ScalarVT);
+

to visitEXTRACT_VECTOR_ELT, which also seemed to resolve this (and a few more cases). But I can maybe post that in a later patch (since it results in changes across a few targets).
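
As a rough illustration of the zero-splat part of that fold (a hypothetical example; at the IR level this would normally be constant-folded much earlier, so the DAG combine matters when the zero splat only appears during lowering):

; Sketch: extract_vector_elt of an all-zeros splat folds to the scalar zero,
; so the store below can become a plain store of 0 (e.g. str wzr on AArch64)
; instead of materialising a vector lane extract.
define void @store_elt_of_zero_splat(ptr %p) {
  %elt = extractelement <4 x i32> zeroinitializer, i32 1
  store i32 %elt, ptr %p, align 4
  ret void
}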

paulwalker-arm (Collaborator):

Yep, definitely future work. It's worth looking at visitBITCAST first, though, because I think it only affects floating-point build_vectors that are bitcast to integer. When the "redundant" bitcasting is removed, I believe the existing combines will then take over.

@MacDue MacDue merged commit 1588aab into llvm:main Apr 17, 2025
11 checks passed
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
This rewrites the fold from llvm#129756 to apply to all types, including
stores of i8s. This required adding a new `aarch64mfp8` MVT to represent
FPR8 types on AArch64, which can be used to extract and store 8-bit
values using b sub-registers.

Follow on from: llvm#129756
Closes: llvm#131793
@MacDue MacDue deleted the attempt_using_truncstore branch April 22, 2025 09:01