
[AArch64][GlobalISel] Legalize G_VECREDUCE_{MIN/MAX} #69461


Merged
merged 3 commits into llvm:main on Nov 9, 2023

Conversation

chuongg3 (Contributor)

Legalizes G_VECREDUCE_{MIN/MAX} and selects instructions for vecreduce_{min/max}

@llvmbot (Member)

llvmbot commented Oct 18, 2023

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-globalisel

Author: None (chuongg3)

Changes

Legalizes G_VECREDUCE_{MIN/MAX} and selects instructions for vecreduce_{min/max}
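For illustration, here is a minimal IR example, adapted from the updated aarch64-minmaxv.ll test below, of the kind of reduction GlobalISel can now legalize and select directly (the function name is ours):

; With this patch, GlobalISel selects a single across-lanes reduction
; (uminv s0, v0.4s) for this call instead of falling back to SelectionDAG.
declare i32 @llvm.vector.reduce.umin.v4i32(<4 x i32>)

define i32 @umin_example(<4 x i32> %v) {
  %r = call i32 @llvm.vector.reduce.umin.v4i32(<4 x i32> %v)
  ret i32 %r
}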


Patch is 70.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69461.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrGISel.td (+5)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+36)
  • (modified) llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp (+15)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/legalizer-info-validation.mir (+11-8)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-minmaxv.ll (+1787-82)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrGISel.td b/llvm/lib/Target/AArch64/AArch64InstrGISel.td
index 27338bd24393325..a3e8b1fff32eee9 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrGISel.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrGISel.td
@@ -274,6 +274,11 @@ def : GINodeEquiv<G_EXTRACT_VECTOR_ELT, vector_extract>;
 
 def : GINodeEquiv<G_PREFETCH, AArch64Prefetch>;
 
+def : GINodeEquiv<G_VECREDUCE_UMIN, vecreduce_umin>;
+def : GINodeEquiv<G_VECREDUCE_UMAX, vecreduce_umax>;
+def : GINodeEquiv<G_VECREDUCE_SMIN, vecreduce_smin>;
+def : GINodeEquiv<G_VECREDUCE_SMAX, vecreduce_smax>;
+
 // These are patterns that we only use for GlobalISel via the importer.
 def : Pat<(f32 (fadd (vector_extract (v2f32 FPR64:$Rn), (i64 0)),
                      (vector_extract (v2f32 FPR64:$Rn), (i64 1)))),
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index df59dc4ad27fadb..d1b23da7dbfac6c 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -6642,6 +6642,42 @@ defm : SIMDAcrossLanesUnsignedIntrinsic<"UMINV", AArch64uminv>;
 def : Pat<(v2i32 (AArch64uminv (v2i32 V64:$Rn))),
           (UMINPv2i32 V64:$Rn, V64:$Rn)>;
 
+// For vecreduce_{opc}
+multiclass SIMDAcrossLanesVecReductionIntrinsic<string baseOpc,
+                                            SDPatternOperator opNode> {
+def : Pat<(i8 (opNode (v8i8 FPR64:$Rn))),
+          (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) FPR64:$Rn)>;
+
+def : Pat<(i8 (opNode (v16i8 FPR128:$Rn))),
+          (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) FPR128:$Rn)>;
+
+def : Pat<(i16 (opNode (v4i16 FPR64:$Rn))),
+          (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) FPR64:$Rn)>;
+
+def : Pat<(i16 (opNode (v8i16 FPR128:$Rn))),
+          (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) FPR128:$Rn)>;
+
+def : Pat<(i32 (opNode (v4i32 V128:$Rn))), 
+          (!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn)>;
+
+}
+
+defm : SIMDAcrossLanesVecReductionIntrinsic<"UMINV", vecreduce_umin>;
+def : Pat<(i32 (vecreduce_umin (v2i32 V64:$Rn))), 
+          (i32 (EXTRACT_SUBREG (UMINPv2i32 V64:$Rn, V64:$Rn), ssub))>;
+
+defm : SIMDAcrossLanesVecReductionIntrinsic<"UMAXV", vecreduce_umax>;
+def : Pat<(i32 (vecreduce_umax (v2i32 V64:$Rn))), 
+          (i32 (EXTRACT_SUBREG (UMAXPv2i32 V64:$Rn, V64:$Rn), ssub))>;
+
+defm : SIMDAcrossLanesVecReductionIntrinsic<"SMINV", vecreduce_smin>;
+def : Pat<(i32 (vecreduce_smin (v2i32 V64:$Rn))), 
+          (i32 (EXTRACT_SUBREG (SMINPv2i32 V64:$Rn, V64:$Rn), ssub))>;
+
+defm : SIMDAcrossLanesVecReductionIntrinsic<"SMAXV", vecreduce_smax>;
+def : Pat<(i32 (vecreduce_smax (v2i32 V64:$Rn))), 
+          (i32 (EXTRACT_SUBREG (SMAXPv2i32 V64:$Rn, V64:$Rn), ssub))>;
+
 multiclass SIMDAcrossLanesSignedLongIntrinsic<string baseOpc, Intrinsic intOp> {
   def : Pat<(i32 (intOp (v8i8 V64:$Rn))),
         (i32 (SMOVvi16to32
diff --git a/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp b/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
index ddc27bebb767693..3c0d61f3ba4213b 100644
--- a/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
+++ b/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
@@ -902,6 +902,21 @@ AArch64LegalizerInfo::AArch64LegalizerInfo(const AArch64Subtarget &ST)
       .scalarize(1)
       .lower();
 
+  getActionDefinitionsBuilder(
+      {G_VECREDUCE_SMIN, G_VECREDUCE_SMAX, G_VECREDUCE_UMIN, G_VECREDUCE_UMAX})
+      .legalFor({{s8, v8s8},
+                 {s8, v16s8},
+                 {s16, v4s16},
+                 {s16, v8s16},
+                 {s32, v2s32},
+                 {s32, v4s32}})
+      .clampMaxNumElements(1, s64, 2)
+      .clampMaxNumElements(1, s32, 4)
+      .clampMaxNumElements(1, s16, 8)
+      .clampMaxNumElements(1, s8, 16)
+      .scalarize(1)
+      .lower();
+
   getActionDefinitionsBuilder(
       {G_VECREDUCE_OR, G_VECREDUCE_AND, G_VECREDUCE_XOR})
       // Try to break down into smaller vectors as long as they're at least 64
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/legalizer-info-validation.mir b/llvm/test/CodeGen/AArch64/GlobalISel/legalizer-info-validation.mir
index 549f36b2afd066f..745ba70140d42d4 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/legalizer-info-validation.mir
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/legalizer-info-validation.mir
@@ -768,17 +768,20 @@
 # DEBUG-NEXT: .. type index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: .. imm index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: G_VECREDUCE_SMAX (opcode {{[0-9]+}}): 2 type indices, 0 imm indices
-# DEBUG-NEXT: .. type index coverage check SKIPPED: no rules defined
-# DEBUG-NEXT: .. imm index coverage check SKIPPED: no rules defined
+# DEBUG-NEXT: .. opcode {{[0-9]+}} is aliased to {{[0-9]+}}
+# DEBUG-NEXT: .. type index coverage check SKIPPED: user-defined predicate detected
+# DEBUG-NEXT: .. imm index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: G_VECREDUCE_SMIN (opcode {{[0-9]+}}): 2 type indices, 0 imm indices
-# DEBUG-NEXT: .. type index coverage check SKIPPED: no rules defined
-# DEBUG-NEXT: .. imm index coverage check SKIPPED: no rules defined
+# DEBUG-NEXT: .. type index coverage check SKIPPED: user-defined predicate detected
+# DEBUG-NEXT: .. imm index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: G_VECREDUCE_UMAX (opcode {{[0-9]+}}): 2 type indices, 0 imm indices
-# DEBUG-NEXT: .. type index coverage check SKIPPED: no rules defined
-# DEBUG-NEXT: .. imm index coverage check SKIPPED: no rules defined
+# DEBUG-NEXT: .. opcode {{[0-9]+}} is aliased to {{[0-9]+}}
+# DEBUG-NEXT: .. type index coverage check SKIPPED: user-defined predicate detected
+# DEBUG-NEXT: .. imm index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: G_VECREDUCE_UMIN (opcode {{[0-9]+}}): 2 type indices, 0 imm indices
-# DEBUG-NEXT: .. type index coverage check SKIPPED: no rules defined
-# DEBUG-NEXT: .. imm index coverage check SKIPPED: no rules defined
+# DEBUG-NEXT: .. opcode {{[0-9]+}} is aliased to {{[0-9]+}}
+# DEBUG-NEXT: .. type index coverage check SKIPPED: user-defined predicate detected
+# DEBUG-NEXT: .. imm index coverage check SKIPPED: user-defined predicate detected
 # DEBUG-NEXT: G_SBFX (opcode {{[0-9]+}}): 2 type indices, 0 imm indices
 # DEBUG-NEXT: .. the first uncovered type index: 2, OK
 # DEBUG-NEXT: .. the first uncovered imm index: 0, OK
diff --git a/llvm/test/CodeGen/AArch64/aarch64-minmaxv.ll b/llvm/test/CodeGen/AArch64/aarch64-minmaxv.ll
index f5d7d330b45c449..df35b4ecb3d6623 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-minmaxv.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-minmaxv.ll
@@ -1,228 +1,1933 @@
-; RUN: llc < %s -mtriple=aarch64-linux--gnu -aarch64-neon-syntax=generic | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=aarch64 -verify-machineinstrs %s -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc -mtriple=aarch64 -global-isel -global-isel-abort=2 -verify-machineinstrs %s -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
 
 target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
 
-declare i8 @llvm.vector.reduce.smax.v16i8(<16 x i8>)
-declare i16 @llvm.vector.reduce.smax.v8i16(<8 x i16>)
-declare i32 @llvm.vector.reduce.smax.v4i32(<4 x i32>)
-declare i8 @llvm.vector.reduce.umax.v16i8(<16 x i8>)
-declare i16 @llvm.vector.reduce.umax.v8i16(<8 x i16>)
-declare i32 @llvm.vector.reduce.umax.v4i32(<4 x i32>)
+; CHECK-GI:         warning: Instruction selection used fallback path for sminv_v3i64
+; CHECK-GI-NEXT:    warning: Instruction selection used fallback path for smaxv_v3i64
+; CHECK-GI-NEXT:    warning: Instruction selection used fallback path for uminv_v3i64
+; CHECK-GI-NEXT:    warning: Instruction selection used fallback path for umaxv_v3i64
 
+declare i8 @llvm.vector.reduce.smin.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.smin.v3i8(<3 x i8>)
+declare i8 @llvm.vector.reduce.smin.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.smin.v8i8(<8 x i8>)
 declare i8 @llvm.vector.reduce.smin.v16i8(<16 x i8>)
+declare i8 @llvm.vector.reduce.smin.v32i8(<32 x i8>)
+declare i16 @llvm.vector.reduce.smin.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.smin.v3i16(<3 x i16>)
+declare i16 @llvm.vector.reduce.smin.v4i16(<4 x i16>)
 declare i16 @llvm.vector.reduce.smin.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.smin.v16i16(<16 x i16>)
+declare i32 @llvm.vector.reduce.smin.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.smin.v3i32(<3 x i32>)
 declare i32 @llvm.vector.reduce.smin.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.smin.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.smin.v16i32(<16 x i32>)
+declare i64 @llvm.vector.reduce.smin.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.smin.v3i64(<3 x i64>)
+declare i64 @llvm.vector.reduce.smin.v4i64(<4 x i64>)
+declare i128 @llvm.vector.reduce.smin.v2i128(<2 x i128>)
+declare i8 @llvm.vector.reduce.smax.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.smax.v3i8(<3 x i8>)
+declare i8 @llvm.vector.reduce.smax.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.smax.v8i8(<8 x i8>)
+declare i8 @llvm.vector.reduce.smax.v16i8(<16 x i8>)
+declare i8 @llvm.vector.reduce.smax.v32i8(<32 x i8>)
+declare i16 @llvm.vector.reduce.smax.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.smax.v3i16(<3 x i16>)
+declare i16 @llvm.vector.reduce.smax.v4i16(<4 x i16>)
+declare i16 @llvm.vector.reduce.smax.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.smax.v16i16(<16 x i16>)
+declare i32 @llvm.vector.reduce.smax.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.smax.v3i32(<3 x i32>)
+declare i32 @llvm.vector.reduce.smax.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.smax.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.smax.v16i32(<16 x i32>)
+declare i64 @llvm.vector.reduce.smax.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.smax.v3i64(<3 x i64>)
+declare i64 @llvm.vector.reduce.smax.v4i64(<4 x i64>)
+declare i128 @llvm.vector.reduce.smax.v2i128(<2 x i128>)
+declare i8 @llvm.vector.reduce.umin.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.umin.v3i8(<3 x i8>)
+declare i8 @llvm.vector.reduce.umin.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.umin.v8i8(<8 x i8>)
 declare i8 @llvm.vector.reduce.umin.v16i8(<16 x i8>)
+declare i8 @llvm.vector.reduce.umin.v32i8(<32 x i8>)
+declare i16 @llvm.vector.reduce.umin.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.umin.v3i16(<3 x i16>)
+declare i16 @llvm.vector.reduce.umin.v4i16(<4 x i16>)
 declare i16 @llvm.vector.reduce.umin.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.umin.v16i16(<16 x i16>)
+declare i32 @llvm.vector.reduce.umin.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.umin.v3i32(<3 x i32>)
 declare i32 @llvm.vector.reduce.umin.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.umin.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.umin.v16i32(<16 x i32>)
+declare i64 @llvm.vector.reduce.umin.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.umin.v3i64(<3 x i64>)
+declare i64 @llvm.vector.reduce.umin.v4i64(<4 x i64>)
+declare i128 @llvm.vector.reduce.umin.v2i128(<2 x i128>)
+declare i8 @llvm.vector.reduce.umax.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.umax.v3i8(<3 x i8>)
+declare i8 @llvm.vector.reduce.umax.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.umax.v8i8(<8 x i8>)
+declare i8 @llvm.vector.reduce.umax.v16i8(<16 x i8>)
+declare i8 @llvm.vector.reduce.umax.v32i8(<32 x i8>)
+declare i16 @llvm.vector.reduce.umax.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.umax.v3i16(<3 x i16>)
+declare i16 @llvm.vector.reduce.umax.v4i16(<4 x i16>)
+declare i16 @llvm.vector.reduce.umax.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.umax.v16i16(<16 x i16>)
+declare i32 @llvm.vector.reduce.umax.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.umax.v3i32(<3 x i32>)
+declare i32 @llvm.vector.reduce.umax.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.umax.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.umax.v16i32(<16 x i32>)
+declare i64 @llvm.vector.reduce.umax.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.umax.v3i64(<3 x i64>)
+declare i64 @llvm.vector.reduce.umax.v4i64(<4 x i64>)
+declare i128 @llvm.vector.reduce.umax.v2i128(<2 x i128>)
 
 declare float @llvm.vector.reduce.fmax.v4f32(<4 x float>)
 declare float @llvm.vector.reduce.fmin.v4f32(<4 x float>)
 
-; CHECK-LABEL: smax_B
-; CHECK: smaxv {{b[0-9]+}}, {{v[0-9]+}}.16b
 define i8 @smax_B(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: smax_B:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    smaxv b0, v0.16b
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <16 x i8>, ptr %arr
   %r = call i8 @llvm.vector.reduce.smax.v16i8(<16 x i8> %arr.load)
   ret i8 %r
 }
 
-; CHECK-LABEL: smax_H
-; CHECK: smaxv {{h[0-9]+}}, {{v[0-9]+}}.8h
 define i16 @smax_H(ptr nocapture readonly %arr) {
+; CHECK-LABEL: smax_H:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    smaxv h0, v0.8h
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <8 x i16>, ptr %arr
   %r = call i16 @llvm.vector.reduce.smax.v8i16(<8 x i16> %arr.load)
   ret i16 %r
 }
 
-; CHECK-LABEL: smax_S
-; CHECK: smaxv {{s[0-9]+}}, {{v[0-9]+}}.4s
 define i32 @smax_S(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: smax_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    smaxv s0, v0.4s
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <4 x i32>, ptr %arr
   %r = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> %arr.load)
   ret i32 %r
 }
 
-; CHECK-LABEL: umax_B
-; CHECK: umaxv {{b[0-9]+}}, {{v[0-9]+}}.16b
 define i8 @umax_B(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: umax_B:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    umaxv b0, v0.16b
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <16 x i8>, ptr %arr
   %r = call i8 @llvm.vector.reduce.umax.v16i8(<16 x i8> %arr.load)
   ret i8 %r
 }
 
-; CHECK-LABEL: umax_H
-; CHECK: umaxv {{h[0-9]+}}, {{v[0-9]+}}.8h
 define i16 @umax_H(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: umax_H:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    umaxv h0, v0.8h
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <8 x i16>, ptr %arr
   %r = call i16 @llvm.vector.reduce.umax.v8i16(<8 x i16> %arr.load)
   ret i16 %r
 }
 
-; CHECK-LABEL: umax_S
-; CHECK: umaxv {{s[0-9]+}}, {{v[0-9]+}}.4s
 define i32 @umax_S(ptr nocapture readonly %arr) {
+; CHECK-LABEL: umax_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    umaxv s0, v0.4s
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <4 x i32>, ptr %arr
   %r = call i32 @llvm.vector.reduce.umax.v4i32(<4 x i32> %arr.load)
   ret i32 %r
 }
 
-; CHECK-LABEL: smin_B
-; CHECK: sminv {{b[0-9]+}}, {{v[0-9]+}}.16b
 define i8 @smin_B(ptr nocapture readonly %arr) {
+; CHECK-LABEL: smin_B:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    sminv b0, v0.16b
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <16 x i8>, ptr %arr
   %r = call i8 @llvm.vector.reduce.smin.v16i8(<16 x i8> %arr.load)
   ret i8 %r
 }
 
-; CHECK-LABEL: smin_H
-; CHECK: sminv {{h[0-9]+}}, {{v[0-9]+}}.8h
 define i16 @smin_H(ptr nocapture readonly %arr) {
+; CHECK-LABEL: smin_H:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    sminv h0, v0.8h
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <8 x i16>, ptr %arr
   %r = call i16 @llvm.vector.reduce.smin.v8i16(<8 x i16> %arr.load)
   ret i16 %r
 }
 
-; CHECK-LABEL: smin_S
-; CHECK: sminv {{s[0-9]+}}, {{v[0-9]+}}.4s
 define i32 @smin_S(ptr nocapture readonly %arr) {
+; CHECK-LABEL: smin_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    sminv s0, v0.4s
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <4 x i32>, ptr %arr
   %r = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> %arr.load)
   ret i32 %r
 }
 
-; CHECK-LABEL: umin_B
-; CHECK: uminv {{b[0-9]+}}, {{v[0-9]+}}.16b
 define i8 @umin_B(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: umin_B:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    uminv b0, v0.16b
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <16 x i8>, ptr %arr
   %r = call i8 @llvm.vector.reduce.umin.v16i8(<16 x i8> %arr.load)
   ret i8 %r
 }
 
-; CHECK-LABEL: umin_H
-; CHECK: uminv {{h[0-9]+}}, {{v[0-9]+}}.8h
 define i16 @umin_H(ptr nocapture readonly %arr)  {
+; CHECK-LABEL: umin_H:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    uminv h0, v0.8h
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <8 x i16>, ptr %arr
   %r = call i16 @llvm.vector.reduce.umin.v8i16(<8 x i16> %arr.load)
   ret i16 %r
 }
 
-; CHECK-LABEL: umin_S
-; CHECK: uminv {{s[0-9]+}}, {{v[0-9]+}}.4s
 define i32 @umin_S(ptr nocapture readonly %arr) {
+; CHECK-LABEL: umin_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    uminv s0, v0.4s
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %arr.load = load <4 x i32>, ptr %arr
   %r = call i32 @llvm.vector.reduce.umin.v4i32(<4 x i32> %arr.load)
   ret i32 %r
 }
 
-; CHECK-LABEL: fmaxnm_S
-; CHECK: fmaxnmv
 define float @fmaxnm_S(ptr nocapture readonly %arr) {
+; CHECK-LABEL: fmaxnm_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    fmaxnmv s0, v0.4s
+; CHECK-NEXT:    ret
   %arr.load  = load <4 x float>, ptr %arr
   %r = call nnan float @llvm.vector.reduce.fmax.v4f32(<4 x float> %arr.load)
   ret float %r
 }
 
-; CHECK-LABEL: fminnm_S
-; CHECK: fminnmv
 define float @fminnm_S(ptr nocapture readonly %arr) {
+; CHECK-LABEL: fminnm_S:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr q0, [x0]
+; CHECK-NEXT:    fminnmv s0, v0.4s
+; CHECK-NEXT:    ret
   %arr.load  = load <4 x float>, ptr %arr
   %r = call nnan float @llvm.vector.reduce.fmin.v4f32(<4 x float> %arr.load)
   ret float %r
 }
 
-declare i16 @llvm.vector.reduce.umax.v16i16(<16 x i16>)
-
 define i16 @oversized_umax_256(ptr nocapture readonly %arr)  {
-; CHECK-LABEL: oversized_umax_256
-; CHECK: umax [[V0:v[0-9]+]].8h, {{v[0-9]+}}.8h, {{v[0-9]+}}.8h
-; CHECK: umaxv {{h[0-9]+}}, [[V0]]
+; CHECK-SD-LABEL: oversized_umax_256:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    ldp q1, q0, [x0]
+; CHECK-SD-NEXT:    umax v0.8h, v1.8h, v0.8h
+; CHECK-SD-NEXT:    umaxv h0, v0.8h
+; CHECK-SD-NEXT:    fmov w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: oversized_umax_256:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    ldp q0, q1, [x0]
+; CHECK-GI-NEXT:    umax v0.8h, v0.8h, v1.8h
+; CHECK-GI-NEXT:    umaxv h0, v0.8h
+; CHECK-GI-NEXT:    fmov w0, s0
+; CHECK-GI-NEXT:    ret
   %arr.load = load <16 x i16>, ptr %arr
   %r = call i16 @llvm.vector.reduce.umax.v16i16(<16 x i16> %arr.load)
   ret i16 %r
 }
 
-declare i32 @llvm.vector.reduce.umax.v16i32(<16 x i32>)
-
 define i32 @oversized_umax_512(ptr nocapture readonly %arr)  {
-; CHECK-LABEL: oversized_umax_512
-; CHECK: umax v
-; CHECK-NEXT: umax v
-; CHECK-NEXT: umax [[V0:v[0-9]+]].4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.4s
-; CHECK-NEXT: umaxv {{s[0-9]+}}, [[V0]]
+; CHECK-SD-LABEL: oversized_umax_512:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    ldp q0, q1, [x0, #32]
+; CHECK-SD-NEXT:    ldp q2, q3, [x0]
+; CHECK-SD-NEXT:    umax v1.4s, v3.4s, v1.4s
+; CHECK-SD-NEXT:    umax v0.4s, v2.4s, v0.4s
+; CHECK-SD-NEXT:    umax v0.4s, v0.4s, v1.4s
+; CHECK-SD-NEXT:    umaxv s0, v0.4s
+; CHECK-SD-NEXT:    fmov w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: oversized_umax_512:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    ldp q0, q1, [x0]
+; CHECK-GI-NEXT:    ldp q2, q3, [x0, #32]
+; CHECK-GI-NEXT:    umax v0.4s, v0.4s, v1.4s
+; CHECK-GI-NEXT:    umax v1.4s, v2.4s, v3.4s
+; CHECK-GI-NEXT:    umax v0.4s, v0.4s, v1.4s
+; CHECK-GI-NEXT:    umaxv s0, v0.4s
+; CHECK-GI-NEXT:    fmov w0, s0
+; CHECK-GI-NEXT:    ret
   %arr.load = load <16 x i32>, ptr %arr
   %r = call i32 @llvm.vector.reduce.umax.v16i32(<16 x i32> %arr.load)
   ret i32 %r
 }
 
-declare i16 @llvm.vector.reduce.umin.v16i16(<16 x i16>)
-
 define i16 @oversized_umin_256(ptr nocapture readonly %arr)  {
-; CHECK-LABEL: oversized_umin_256
-; CHECK: umin [[V0:v[0-9]+]].8h, {{v[...
[truncated]

Comment on lines +1859 to +1880
define i64 @umaxv_v3i64(<3 x i64> %a) {
; CHECK-LABEL: umaxv_v3i64:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: // kill: def $d2 killed $d2 def $q2
; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
; CHECK-NEXT: mov v3.16b, v0.16b
; CHECK-NEXT: mov v4.16b, v2.16b
; CHECK-NEXT: // kill: def $d1 killed $d1 def $q1
; CHECK-NEXT: mov v3.d[1], v1.d[0]
; CHECK-NEXT: mov v4.d[1], xzr
; CHECK-NEXT: cmhi v3.2d, v3.2d, v4.2d
; CHECK-NEXT: ext v4.16b, v3.16b, v3.16b, #8
; CHECK-NEXT: bif v0.8b, v2.8b, v3.8b
; CHECK-NEXT: and v1.8b, v1.8b, v4.8b
; CHECK-NEXT: cmhi d2, d0, d1
; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
; CHECK-NEXT: fmov x0, d0
; CHECK-NEXT: ret
entry:
%arg1 = call i64 @llvm.vector.reduce.umax.v3i64(<3 x i64> %a)
ret i64 %arg1
}
Contributor

Is this case falling back? I guess we have a gap in our legalization for non-power-of-2 types?

Contributor Author

The FewerElements action for vector reductions returns UnableToLegalize when it encounters non-power-of-2 types.
Non-power-of-2 reductions are not well supported yet and are something we can improve upon in future patches.
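
To make the gap concrete, here is a rough sketch in generic MIR of how the FewerElements path splits an oversized power-of-two reduction (our illustration of the strategy, not literal legalizer output); a <3 x i64> input has no such even split, hence the fallback:

  ; G_VECREDUCE_UMAX on <8 x s32> is halved into a pairwise G_UMAX
  ; followed by a legal <4 x s32> reduction:
  %lo:_(<4 x s32>), %hi:_(<4 x s32>) = G_UNMERGE_VALUES %v:_(<8 x s32>)
  %m:_(<4 x s32>) = G_UMAX %lo, %hi
  %r:_(s32) = G_VECREDUCE_UMAX %m:_(<4 x s32>)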

Collaborator

I believe this is true for float reductions too. We don't yet handle adding identity elements to pad reductions out to legal lengths.
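
To make that concrete, here is a hypothetical IR-level padding (an assumption about a possible future approach, not something this patch implements): widen the <3 x i64> input with the reduction's identity element (0 for umax) in the extra lane, then reduce the power-of-two vector instead:

  ; Hypothetical: pad lane 3 with the umax identity element (zero), then
  ; let the usual <4 x i64> legalization take over.
  %pad = shufflevector <3 x i64> %a, <3 x i64> zeroinitializer, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %r = call i64 @llvm.vector.reduce.umax.v4i64(<4 x i64> %pad)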

@dzhidzhoev (Member)

Can we run llvm/test/CodeGen/AArch64/vecreduce-umax-legalization.ll on GlobalISel with this commit?
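
For reference, a RUN line of the form this patch already uses in aarch64-minmaxv.ll (see above) would exercise that file under GlobalISel:

  ; RUN: llc -mtriple=aarch64 -global-isel -global-isel-abort=2 -verify-machineinstrs %s -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI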

@chuongg3 force-pushed the GlobalISel_VECREDUCE_MINMAX branch from 514a160 to 98d4c39 on October 30, 2023 at 10:51
@davemgreen (Collaborator) left a comment

Thanks. I think this LGTM

; CHECK-GI-NEXT: fmov x8, d0
; CHECK-GI-NEXT: fmov x9, d1
; CHECK-GI-NEXT: cmp x8, x9
; CHECK-GI-NEXT: fcsel d0, d0, d1, lt
Collaborator

This would be better as a csel if the operands are already in GPRs. Something to improve in regbankselect, perhaps. And we may want to generate code similar to SDAG in places, by using further steps in vector regs, but I think this looks good for now.
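
For comparison, a sketch of the GPR-only sequence being suggested (illustrative; this is not what the patch currently emits):

  // Keep the compare-and-select in GPRs instead of moving back for fcsel.
  fmov x8, d0
  fmov x9, d1
  cmp x8, x9
  csel x0, x8, x9, lt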

@davemgreen (Collaborator) left a comment

Thanks. LGTM

@chuongg3 force-pushed the GlobalISel_VECREDUCE_MINMAX branch from 69fd4dc to 05cf492 on November 9, 2023 at 16:10
@chuongg3 merged commit 451bc3e into llvm:main on Nov 9, 2023
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
Legalizes G_VECREDUCE_{MIN/MAX} and selects instructions for
vecreduce_{min/max}