
[AArch64] Optimized generated assembly for bool to svbool_t conversions #83001


Merged
5 commits merged into llvm:main from whilelo_opt on Feb 28, 2024

Conversation

Lukacma
Contributor

@Lukacma Lukacma commented Feb 26, 2024

The original lowering generated an `AND(WHILELO, SPLAT 1)` pattern when `WHILELO` alone was sufficient.

This change relies on the fact that the current lowering of the SPLAT_VECTOR node uses `whilelo`, which zeroes the unused lanes.
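Why the AND is redundant can be sketched with a small model of the predicate bit layout (a hypothetical C simulation, not LLVM code; `whilelo_h` and `splat1_h` model a single 128-bit granule with .h elements):

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit SVE granule carries a 16-bit predicate.  For .h elements,
 * lane i owns bit 2*i; the odd "unused" bits are always written as 0 by
 * WHILELO, so they never need an explicit clearing AND afterwards. */
static uint16_t whilelo_h(uint64_t base, uint64_t limit) {
    uint16_t p = 0;
    for (int lane = 0; lane < 8; lane++)
        if (base + (uint64_t)lane < limit)
            p |= (uint16_t)(1u << (2 * lane)); /* active bit; unused bits stay 0 */
    return p;
}

/* SPLAT 1 lowered for .h lanes: every lane's bit set, unused bits zero. */
static uint16_t splat1_h(void) {
    uint16_t p = 0;
    for (int lane = 0; lane < 8; lane++)
        p |= (uint16_t)(1u << (2 * lane));
    return p;
}
```

In this model, ANDing a `whilelo` result with the splat-of-1 mask returns the `whilelo` result unchanged for every possible lane count, which is the redundancy the patch eliminates.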

The original assembly was generating `AND(WHILELO, SPLAT 1)` pattern when only `WHILELO` was necessary
@llvmbot
Member

llvmbot commented Feb 26, 2024

@llvm/pr-subscribers-backend-aarch64

Author: None (Lukacma)

Changes

The original assembly was generating an `AND(WHILELO, SPLAT 1)` pattern when only `WHILELO` was necessary.


Full diff: https://github.com/llvm/llvm-project/pull/83001.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+1)
  • (modified) llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll (+41-1)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index a3b7e3128ac1a4..dba3a787734721 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -276,6 +276,7 @@ static bool isZeroingInactiveLanes(SDValue Op) {
     if (ISD::isConstantSplatVectorAllOnes(Op.getNode()))
       return true;
     return false;
+  case ISD::SPLAT_VECTOR:
   case AArch64ISD::PTRUE:
   case AArch64ISD::SETCC_MERGE_ZERO:
     return true;
diff --git a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
index 82bf756f822898..c7c102f5d567d9 100644
--- a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
+++ b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
@@ -1,4 +1,4 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
 ; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
 ; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sme < %s | FileCheck %s
 
@@ -150,6 +150,46 @@ define <vscale x 16 x i1> @chained_reinterpret() {
   ret <vscale x 16 x i1> %out
 }
 
+define <vscale x 16 x i1> @reinterpret_scalar_bool_h(i1 %x){
+; CHECK-LABEL: reinterpret_scalar_bool_h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
+; CHECK-NEXT:    sbfx x8, x0, #0, #1
+; CHECK-NEXT:    whilelo p0.h, xzr, x8
+; CHECK-NEXT:    ret
+  %.splatinsert = insertelement <vscale x 8 x i1> poison, i1 %x, i64 0
+  %.splat = shufflevector <vscale x 8 x i1> %.splatinsert, <vscale x 8 x i1> poison, <vscale x 8 x i32> zeroinitializer
+  %out = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %.splat)
+  ret <vscale x 16 x i1> %out
+}
+
+define <vscale x 16 x i1> @reinterpret_scalar_bool_s(i1 %x){
+; CHECK-LABEL: reinterpret_scalar_bool_s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
+; CHECK-NEXT:    sbfx x8, x0, #0, #1
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    ret
+  %.splatinsert = insertelement <vscale x 4 x i1> poison, i1 %x, i64 0
+  %.splat = shufflevector <vscale x 4 x i1> %.splatinsert, <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+  %out = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %.splat)
+  ret <vscale x 16 x i1> %out
+}
+
+define <vscale x 16 x i1> @reinterpret_scalar_bool_q(i1 %x){
+; CHECK-LABEL: reinterpret_scalar_bool_q:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
+; CHECK-NEXT:    sbfx x8, x0, #0, #1
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    ret
+  %.splatinsert = insertelement <vscale x 2 x i1> poison, i1 %x, i64 0
+  %.splat = shufflevector <vscale x 2 x i1> %.splatinsert, <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+  %out = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1> %.splat)
+  ret <vscale x 16 x i1> %out
+}
+
+
 declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 immarg)
 declare <vscale x 8 x i1> @llvm.aarch64.sve.cmpgt.nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>, <vscale x 8 x i16>)

@Lukacma
Contributor Author

Lukacma commented Feb 26, 2024

The main concern I have about this commit is that it relies on splat_vector always being implemented using a zeroing instruction. Do you think it is okay to embed this reliance into the code?

@momchil-velikov
Collaborator

The main concern I have about this commit is that it relies on splat_vector always being implemented using a zeroing instruction.
Do you think it is okay to embed this reliance into the code?

The real question is what the canonical representation is, in the AArch64 backend, of types like <vscale x N x i1> where N < 16.
Taking <vscale x 8 x i1> as an example, we know it has to be represented with a bit pattern that is vscale * 16 bits wide. Naturally, the "active" bits are the even-numbered bits and the "inactive" bits are the odd-numbered bits. The question is about the inactive bits: we can say they must be zero, or we can say they may be undefined. We go through a lot of effort to zero the "inactive" bits (that is the whole point of converting to/from svbool), which makes me think the canonical representation has zeros in the "inactive" bits. Hence splat_vector must be lowered to instructions which yield zero in the inactive bits; other possible lowerings would be semantically incorrect.
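That reading of the contract can be sketched in a few lines (a hypothetical C model of one 16-bit predicate granule; `to_svbool_strict` and `to_svbool_zeroing` are illustrative names, not LLVM APIs):

```c
#include <assert.h>
#include <stdint.h>

/* Lane i of a <vscale x 8 x i1> lives in even bit 2*i of the same storage an
 * svbool_t uses; the odd bits are the "inactive" (invisible) ones. */
#define NXV8I1_ACTIVE_MASK 0x5555u

/* Strict conversion to svbool: explicitly zero the inactive bits. */
static uint16_t to_svbool_strict(uint16_t p) { return p & NXV8I1_ACTIVE_MASK; }

/* If the producer already guarantees zeroed inactive bits, the conversion is
 * a plain bit copy and the masking AND can be dropped. */
static uint16_t to_svbool_zeroing(uint16_t p) { return p; }
```

The two conversions agree exactly when the producer zeroes the inactive bits; a value with dirty odd bits is only handled correctly by the strict form, which is why dropping the AND requires the producer's guarantee.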

@paulwalker-arm
Collaborator

The main concern I have about this commit is that it relies on splat_vector always being implemented using a zeroing instruction. Do you think it is okay to embed this reliance into the code?

… the canonical representation has zeros in the "inactive" bits. Hence splat_vector must be lowered to instructions which yield zero in the inactive bits; other possible lowerings would be semantically incorrect.

By design the "invisible bits" within a predicate type are undefined. When changing the visibility of bits via the to/from_svbool intrinsics we didn't want to leave newly visible bits hanging, and so to_svbool defines them as zero. isZeroingInactiveLanes exists to reduce the need to explicitly zero these lanes by entering certain operations into a contract guaranteeing that all isel related to them will zero their invisible lanes.

I suspect the reason that non-constant SPLAT_VECTORs were omitted is that they don't typically come from ACLE code, where to_svbool is prevalent. Out of interest, in what context are you seeing them today?
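The contract check itself amounts to a dispatch over node kinds, as the diff above shows. A simplified, hypothetical rendering (the enum names are illustrative stand-ins, not the real `ISD`/`AArch64ISD` opcodes):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the relevant SelectionDAG node kinds. */
enum NodeKind { OP_PTRUE, OP_SETCC_MERGE_ZERO, OP_SPLAT_VECTOR, OP_OTHER };

/* Nodes in this list promise that their isel output has zeros in the
 * invisible predicate lanes, so convert.to.svbool can skip the explicit AND. */
static bool zeroes_inactive_lanes(enum NodeKind k) {
    switch (k) {
    case OP_PTRUE:
    case OP_SETCC_MERGE_ZERO:
    case OP_SPLAT_VECTOR: /* the case this patch adds */
        return true;
    default:
        return false;
    }
}
```

Adding a node kind to the list is only sound if every lowering of that node really does zero the invisible lanes, which is exactly the reliance the author asks about above.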

@Lukacma
Contributor Author

Lukacma commented Feb 27, 2024

I suspect the reason that non-constant SPLAT_VECTORs were omitted is that they don't typically come from ACLE code, where to_svbool is prevalent. Out of interest, in what context are you seeing them today?

The test includes an example where splat_vectors were previously used to go from a scalar to a vector type; because they were not part of the zeroing-lanes contract, this resulted in extra instructions:

define <vscale x 16 x i1> @reinterpret_scalar_bool_h(i1 %x){
; CHECK-LABEL: reinterpret_scalar_bool_h:
; CHECK:       // %bb.0:
; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
; CHECK-NEXT:    sbfx x8, x0, #0, #1
; CHECK-NEXT:    whilelo p0.h, xzr, x8
; CHECK-NEXT:    ret
  %.splatinsert = insertelement <vscale x 8 x i1> poison, i1 %x, i64 0
  %.splat = shufflevector <vscale x 8 x i1> %.splatinsert, <vscale x 8 x i1> poison, <vscale x 8 x i32> zeroinitializer
  %out = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %.splat)
  ret <vscale x 16 x i1> %out
}

@paulwalker-arm
Collaborator

Sure, I can see the tests but I meant from a C/C++ point of view.

@Lukacma
Contributor Author

Lukacma commented Feb 27, 2024

Sure, I can see the tests but I meant from a C/C++ point of view.

It was "simple" casting from bool to svbool_t:

svbool_t dup_u16(_Bool x) {
    return svdup_n_b16(x);
}

@paulwalker-arm
Collaborator

I see, thanks.

Collaborator

@paulwalker-arm paulwalker-arm left a comment


Looks good to me. I'm happy for the default case to be cleaned up as a separate patch but if you'd rather do it here then I'll just re-review the updates.

@Lukacma
Contributor Author

Lukacma commented Feb 28, 2024

I cleaned it up in this patch and added the capability to see through bitcasts. Not sure if that is necessary in all cases, but it might happen in certain situations.


github-actions bot commented Feb 28, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@Lukacma Lukacma merged commit 2640277 into llvm:main Feb 28, 2024
@Lukacma Lukacma deleted the whilelo_opt branch February 28, 2024 16:45
mylai-mtk pushed a commit to mylai-mtk/llvm-project that referenced this pull request Jul 12, 2024
…ns (llvm#83001)

In certain cases the legalizer was generating an `AND(WHILELO, SPLAT 1)` instruction pattern when `WHILELO` alone would be sufficient.