[AArch64][SME] Enable subreg liveness tracking when SME is available #92142
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-aarch64

Author: Kerry McLaughlin (kmclaughlin-arm)

Changes

The SME dot instructions in these tests operate on contiguous register tuples which use one subregister from each of the loads. When using the strided register form for all loads, enabling subreg liveness tracking will allow us to recognise that there is no overlap between the register tuples used by each of the dot instructions.

This is the first in a series of patches to improve the allocation of strided and contiguous registers for SME.

Patch is 93.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92142.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 5d185fcaefc4d..bf268ffb56b3f 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -584,6 +584,10 @@ AArch64Subtarget::getAuthenticatedLRCheckMethod() const {
return AArch64PAuth::AuthCheckMethod::None;
}
+bool AArch64Subtarget::enableSubRegLiveness() const {
+ return hasSME() && isStreaming();
+}
+
bool AArch64Subtarget::enableMachinePipeliner() const {
return getSchedModel().hasInstrSchedModel();
}
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index 3f3eefc4f6807..4be32f00b1312 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -150,6 +150,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
const Triple &getTargetTriple() const { return TargetTriple; }
bool enableMachineScheduler() const override { return true; }
bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
+ bool enableSubRegLiveness() const override;
bool enableMachinePipeliner() const override;
bool useDFAforSMS() const override { return false; }
diff --git a/llvm/test/CodeGen/AArch64/sme2-intrinsics-int-dots.ll b/llvm/test/CodeGen/AArch64/sme2-intrinsics-int-dots.ll
index e154a4df86efe..3ce77cd8e0321 100644
--- a/llvm/test/CodeGen/AArch64/sme2-intrinsics-int-dots.ll
+++ b/llvm/test/CodeGen/AArch64/sme2-intrinsics-int-dots.ll
@@ -26,18 +26,18 @@ define void @udot_multi_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @udot_multi_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
; CHECK-LABEL: udot_multi_za32_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1h { z27.h }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
-; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
+; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
; CHECK-NEXT: ret
<vscale x 8 x i16> %zn4, <vscale x 8 x i16> %zn5, <vscale x 8 x i16> %zn6, <vscale x 8 x i16> %zn7) #0 {
call void @llvm.aarch64.sme.udot.za32.vg1x4.nxv8i16(i32 %slice, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
@@ -68,18 +68,18 @@ define void @udot_multi_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <v
define void @udot_multi_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
; CHECK-LABEL: udot_multi_za32_u8_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1b { z27.b }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
-; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
+; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
; CHECK-NEXT: ret
<vscale x 16 x i8> %zn4, <vscale x 16 x i8> %zn5, <vscale x 16 x i8> %zn6, <vscale x 16 x i8> %zn7) #0 {
call void @llvm.aarch64.sme.udot.za32.vg1x4.nxv16i8(i32 %slice, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
@@ -110,18 +110,18 @@ define void @udot_multi_za64_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @udot_multi_za64_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
; CHECK-LABEL: udot_multi_za64_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1h { z27.h }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: udot za.d[w8, 0, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
-; CHECK-NEXT: udot za.d[w8, 7, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: udot za.d[w8, 0, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
+; CHECK-NEXT: udot za.d[w8, 7, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
; CHECK-NEXT: ret
<vscale x 8 x i16> %zn4, <vscale x 8 x i16> %zn5, <vscale x 8 x i16> %zn6, <vscale x 8 x i16> %zn7) #1 {
call void @llvm.aarch64.sme.udot.za64.vg1x4.nxv8i16(i32 %slice, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
@@ -152,18 +152,18 @@ define void @usdot_multi_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @usdot_multi_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
; CHECK-LABEL: usdot_multi_za32_u8_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1b { z27.b }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: usdot za.s[w8, 0, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
-; CHECK-NEXT: usdot za.s[w8, 7, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: usdot za.s[w8, 0, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
+; CHECK-NEXT: usdot za.s[w8, 7, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
; CHECK-NEXT: ret
<vscale x 16 x i8> %zn4, <vscale x 16 x i8> %zn5, <vscale x 16 x i8> %zn6, <vscale x 16 x i8> %zn7) #0 {
call void @llvm.aarch64.sme.usdot.za32.vg1x4.nxv16i8(i32 %slice, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
@@ -197,18 +197,18 @@ define void @sdot_multi_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @sdot_multi_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
; CHECK-LABEL: sdot_multi_za32_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1h { z27.h }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: sdot za.s[w8, 0, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
-; CHECK-NEXT: sdot za.s[w8, 7, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: sdot za.s[w8, 0, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
+; CHECK-NEXT: sdot za.s[w8, 7, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
; CHECK-NEXT: ret
<vscale x 8 x i16> %zn4, <vscale x 8 x i16> %zn5, <vscale x 8 x i16> %zn6, <vscale x 8 x i16> %zn7) #0 {
call void @llvm.aarch64.sme.sdot.za32.vg1x4.nxv8i16(i32 %slice, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
@@ -239,18 +239,18 @@ define void @sdot_multi_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <v
define void @sdot_multi_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
; CHECK-LABEL: sdot_multi_za32_u8_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1b { z27.b }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: sdot za.s[w8, 0, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
-; CHECK-NEXT: sdot za.s[w8, 7, vgx4], { z28.b - z31.b }, { z24.b - z27.b }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: sdot za.s[w8, 0, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
+; CHECK-NEXT: sdot za.s[w8, 7, vgx4], { z4.b - z7.b }, { z24.b - z27.b }
; CHECK-NEXT: ret
<vscale x 16 x i8> %zn4, <vscale x 16 x i8> %zn5, <vscale x 16 x i8> %zn6, <vscale x 16 x i8> %zn7) #0 {
call void @llvm.aarch64.sme.sdot.za32.vg1x4.nxv16i8(i32 %slice, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3,
@@ -281,18 +281,18 @@ define void @sdot_multi_za64_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @sdot_multi_za64_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
; CHECK-LABEL: sdot_multi_za64_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov z26.d, z7.d
-; CHECK-NEXT: mov z31.d, z4.d
-; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov z26.d, z7.d
; CHECK-NEXT: mov z25.d, z6.d
-; CHECK-NEXT: mov z30.d, z3.d
+; CHECK-NEXT: mov z7.d, z4.d
+; CHECK-NEXT: mov w8, w0
; CHECK-NEXT: mov z24.d, z5.d
-; CHECK-NEXT: mov z29.d, z2.d
; CHECK-NEXT: ld1h { z27.h }, p0/z, [x1]
-; CHECK-NEXT: mov z28.d, z1.d
-; CHECK-NEXT: sdot za.d[w8, 0, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
-; CHECK-NEXT: sdot za.d[w8, 7, vgx4], { z28.h - z31.h }, { z24.h - z27.h }
+; CHECK-NEXT: mov z6.d, z3.d
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z4.d, z1.d
+; CHECK-NEXT: sdot za.d[w8, 0, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
+; CHECK-NEXT: sdot za.d[w8, 7, vgx4], { z4.h - z7.h }, { z24.h - z27.h }
; CHECK-NEXT: ret
<vscale x 8 x i16> %zn4, <vscale x 8 x i16> %zn5, <vscale x 8 x i16> %zn6, <vscale x 8 x i16> %zn7) #1 {
call void @llvm.aarch64.sme.sdot.za64.vg1x4.nxv8i16(i32 %slice, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3,
@@ -309,9 +309,7 @@ define void @sdot_multi_za64_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <
define void @udot_single_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2) #0 {
; CHECK-LABEL: udot_single_za32_u16_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2 def $z1_z2
; CHECK-NEXT: udot za.s[w8, 0, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: udot za.s[w8, 7, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: ret
@@ -324,11 +322,7 @@ define void @udot_single_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused,
define void @udot_single_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3, <vscale x 8 x i16> %zn4) #0 {
; CHECK-LABEL: udot_single_za32_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z4 killed $z4 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z3 killed $z3 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: ret
@@ -341,9 +335,7 @@ define void @udot_single_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused,
define void @udot_single_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2) #0 {
; CHECK-LABEL: udot_single_za32_u8_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2 def $z1_z2
; CHECK-NEXT: udot za.s[w8, 0, vgx2], { z1.b, z2.b }, z3.b
; CHECK-NEXT: udot za.s[w8, 7, vgx2], { z1.b, z2.b }, z3.b
; CHECK-NEXT: ret
@@ -356,11 +348,7 @@ define void @udot_single_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <
define void @udot_single_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3, <vscale x 16 x i8> %zn4) #0 {
; CHECK-LABEL: udot_single_za32_u8_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z4 killed $z4 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z3 killed $z3 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: udot za.s[w8, 0, vgx4], { z1.b - z4.b }, z5.b
; CHECK-NEXT: udot za.s[w8, 7, vgx4], { z1.b - z4.b }, z5.b
; CHECK-NEXT: ret
@@ -373,9 +361,7 @@ define void @udot_single_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <
define void @udot_single_za64_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2) #1 {
; CHECK-LABEL: udot_single_za64_u16_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2 def $z1_z2
; CHECK-NEXT: udot za.d[w8, 0, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: udot za.d[w8, 7, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: ret
@@ -388,11 +374,7 @@ define void @udot_single_za64_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused,
define void @udot_single_za64_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3, <vscale x 8 x i16> %zn4) #1 {
; CHECK-LABEL: udot_single_za64_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z4 killed $z4 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z3 killed $z3 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: udot za.d[w8, 0, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: udot za.d[w8, 7, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: ret
@@ -405,9 +387,7 @@ define void @udot_single_za64_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused,
define void @usdot_single_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2) #0 {
; CHECK-LABEL: usdot_single_za32_u8_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2 def $z1_z2
; CHECK-NEXT: usdot za.s[w8, 0, vgx2], { z1.b, z2.b }, z3.b
; CHECK-NEXT: usdot za.s[w8, 7, vgx2], { z1.b, z2.b }, z3.b
; CHECK-NEXT: ret
@@ -420,11 +400,7 @@ define void @usdot_single_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused,
define void @usdot_single_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2, <vscale x 16 x i8> %zn3, <vscale x 16 x i8> %zn4) #0 {
; CHECK-LABEL: usdot_single_za32_u8_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z4 killed $z4 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z3 killed $z3 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: usdot za.s[w8, 0, vgx4], { z1.b - z4.b }, z5.b
; CHECK-NEXT: usdot za.s[w8, 7, vgx4], { z1.b - z4.b }, z5.b
; CHECK-NEXT: ret
@@ -440,9 +416,7 @@ define void @usdot_single_za32_u8_vg1x4(i32 %slice, <vscale x 16 x i8> %unused,
define void @sdot_single_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2) #0 {
; CHECK-LABEL: sdot_single_za32_u16_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2 def $z1_z2
; CHECK-NEXT: sdot za.s[w8, 0, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: sdot za.s[w8, 7, vgx2], { z1.h, z2.h }, z3.h
; CHECK-NEXT: ret
@@ -455,11 +429,7 @@ define void @sdot_single_za32_u16_vg1x2(i32 %slice, <vscale x 16 x i8> %unused,
define void @sdot_single_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 8 x i16> %zn0, <vscale x 8 x i16> %zn1, <vscale x 8 x i16> %zn2, <vscale x 8 x i16> %zn3, <vscale x 8 x i16> %zn4) #0 {
; CHECK-LABEL: sdot_single_za32_u16_vg1x4:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z4 killed $z4 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z3 killed $z3 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
-; CHECK-NEXT: // kill: def $z1 killed $z1 killed $z1_z2_z3_z4 def $z1_z2_z3_z4
; CHECK-NEXT: sdot za.s[w8, 0, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: sdot za.s[w8, 7, vgx4], { z1.h - z4.h }, z5.h
; CHECK-NEXT: ret
@@ -472,9 +442,7 @@ define void @sdot_single_za32_u16_vg1x4(i32 %slice, <vscale x 16 x i8> %unused,
define void @sdot_single_za32_u8_vg1x2(i32 %slice, <vscale x 16 x i8> %unused, <vscale x 16 x i8> %zn0, <vscale x 16 x i8> %zn1, <vscale x 16 x i8> %zn2) #0 {
; CHECK-LABEL: sdot_single_za32_u8_vg1x2:
; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $z2 killed $z2 killed $z1_z2 def $z1_z2
; CHECK-NEXT: mov w8, w0
-; CHECK-NEXT: // kill: def $z1 kill...
[truncated]
This looks very useful to me.
This sounds like a nice idea - I hadn't realized we did not have this enabled already. Whilst I can see how it would be especially beneficial for SME's multi-vector instructions, I feel it should either be useful in general (for tbl's and ld2/3/4 type instructions), or be wrong because something doesn't handle the subregister liveness correctly. In either case, what do you think about enabling this in general for AArch64 so we can be sure it performs correctly?
I suspect this may not be enabled by default for compile-time reasons. It's worth doing some measurements on what the impact is and deciding based on that whether we want to enable it for more cases.
@@ -584,6 +584,10 @@ AArch64Subtarget::getAuthenticatedLRCheckMethod() const {
   return AArch64PAuth::AuthCheckMethod::None;
 }

 bool AArch64Subtarget::enableSubRegLiveness() const {
   return hasSME() && isStreaming();
nit: a function must have SME in order to be streaming, so hasSME()
is implied and can therefore be removed.
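For illustration, a minimal sketch of the simplification being suggested here (not the form that eventually landed, which returns true unconditionally per the later discussion):

bool AArch64Subtarget::enableSubRegLiveness() const {
  // A streaming function necessarily has SME, so hasSME() is redundant.
  return isStreaming();
}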
> I suspect this may not be enabled by default for compile-time reasons. It's worth doing some measurements on what the impact is and deciding based on that whether we want to enable it for more cases.

Do you have any evidence for this? My understanding was that it was not enabled because it can cause mis-compiles (for example in https://reviews.llvm.org/D129646). I would strongly advise against making this SME-only. The testing it will get will not be enough to be sure it doesn't cause mis-compiles elsewhere.
@@ -584,6 +584,10 @@ AArch64Subtarget::getAuthenticatedLRCheckMethod() const {
   return AArch64PAuth::AuthCheckMethod::None;
 }

 bool AArch64Subtarget::enableSubRegLiveness() const {
It is probably worth adding an option like in the Arm/RISCV backends too, so it can be disabled in case of problems.
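A sketch of what such a backend flag could look like, loosely modeled on the Arm backend's -arm-enable-subreg-liveness option; the AArch64 flag name below matches the one mentioned later in this thread, which was ultimately dropped from the patch:

static cl::opt<bool> EnableSubregLiveness(
    "aarch64-enable-subreg-liveness", cl::init(true), cl::Hidden,
    cl::desc("Enable subreg liveness tracking for AArch64"));

bool AArch64Subtarget::enableSubRegLiveness() const {
  // Let the backend-specific flag turn tracking off if problems are found.
  return EnableSubregLiveness;
}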
Or we could make the -enable-subreg-liveness option actually work, rather than requiring every target to do this.
I don't mind looking into this, if someone else does not already. It would be good to get this patch in though, in case we do find any issues before the clang-19 release.
The problem is that -enable-subreg-liveness only does anything for targets that return true for enableSubRegLiveness, rather than treating the target value as the default.
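A sketch of one way the generic option could treat the target's answer as the default and allow overriding in both directions (illustrative only; it assumes the option is defined next to MachineRegisterInfo and is not taken from an actual patch):

static cl::opt<cl::boolOrDefault>
    EnableSubRegLiveness("enable-subreg-liveness", cl::Hidden,
                         cl::desc("Enable subregister liveness tracking."));

bool MachineRegisterInfo::subRegLivenessEnabled() const {
  // If the flag is given explicitly, it overrides the target hook in either
  // direction; otherwise the target's enableSubRegLiveness() is the default.
  if (EnableSubRegLiveness.getNumOccurrences())
    return EnableSubRegLiveness == cl::BOU_TRUE;
  return MF->getSubtarget().enableSubRegLiveness();
}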
I've removed the -aarch64-enable-subreg-liveness flag from this patch, as the existing flag can be used to disable liveness tracking when enableSubRegLiveness just returns true.
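For example, tracking could then be disabled with something like the following hypothetical llc invocation (the test file name is illustrative):

llc -mtriple=aarch64-linux-gnu -mattr=+sme2 -enable-subreg-liveness=false sme2-intrinsics-int-dots.ll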
Thanks. I put up #95437 for trying to fix the option.
I wasn't aware of any functional issues from enabling this. From reading the comments on that patch and seeing what other targets do, that issue could be due to the way they've modelled register aliasing for different values of 'LMUL'. I can't really see how the instruction they've pointed to as an example would do the expected thing for the way they have described it. The instruction doesn't use the

FWIW, I'd be happy to enable subreg liveness tracking always if there is no downside to compile time. Perhaps this is not an issue in practice, but if it is then I'd still recommend enabling it for SME only. I agree it would be helpful to add a flag to manually disable subreg liveness tracking, if needed.
It was apparently fixed by https://reviews.llvm.org/D129735. The entire backend needs to work with subregister liveness; it can be easy for one part of it not to, and for that to go unnoticed whilst the feature is not enabled. There were some fixes we needed before I enabled this for Arm, and I remember I was running at least 4 different fuzzers on MVE code at the time.
I feel like that may be best so long as the compile-time impact is low(ish). We should see a perf improvement for Neon and SVE as well as SME. You could say either this is correct and the compile time is acceptable, in which case let's do it, or it is not acceptable anyway.
✅ With the latest revision this PR passed the C/C++ code formatter.
Using the LNT test suite, I took some measurements to understand the impact on compile time, and found that enabling this generally for AArch64 results in about a 0.1% increase based on instruction count. Many of the AArch64 test files require updating, but I don't think there are any functional issues exposed by these changes.
That was a lot more tests than I had expected. This seems to lead to a decent performance improvement, so LGTM. Thanks for the update.
target triple="aarch64-linux-gnu"
It is better to keep the triples and attributes as part of the run line, as it makes adding multiple run lines much cleaner.
From a quick scan of the tests I couldn't see anything unexpected.
LGTM.
Could this have caused these errors: https://lab.llvm.org/buildbot/#/builders/55/builds/19

I don't really see how, but by exclusion I don't see any of the others in the blamelist as plausible, so my guess is "random miscompilation in the compiler".
I verified that rolling back this CL fixes the buildbot, so I reverted it.
The SME dot instructions in these tests operate on contiguous register
tuples which use one subregister from each of the loads. When using the
strided register form for all loads, enabling subreg liveness tracking
will allow us to recognise that there is no overlap between the register
tuples used by each of the dot instructions.
This is the first in a series of patches to improve the allocation of
strided and contiguous registers for SME.