DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

arsenm · 2025-01-21T08:36:24Z

Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.

We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.

arsenm · 2025-01-21T08:36:32Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-01-21T08:37:05Z

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.

We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.

Patch is 393.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123712.diff

10 Files Affected:

(modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+5)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll (+114-166)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+129-143)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v4f32.ll (+343-466)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v2i32.ll (+114-166)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+129-143)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v4i32.ll (+343-466)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v2p3.ll (+114-166)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+129-143)
(modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v4p3.ll (+343-466)

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 671a14e6250fc6..eaa08b0ef32af2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -23172,6 +23172,11 @@ SDValue DAGCombiner::visitEXTRACT_VECTOR_ELT(SDNode *N) {
     }
 
     if (SVInVec.getOpcode() == ISD::BUILD_VECTOR) {
+      // TODO: Check if shuffle mask is legal?
+      if (LegalOperations && TLI.isOperationLegal(ISD::VECTOR_SHUFFLE, VecVT) &&
+          !VecOp.hasOneUse())
+        return SDValue();
+
       SDValue InOp = SVInVec.getOperand(OrigElt);
       if (InOp.getValueType() != ScalarVT) {
         assert(InOp.getValueType().isInteger() && ScalarVT.isInteger());
diff --git a/llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll b/llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll
index fafdac7ffac62f..54c5390ca76c62 100644
--- a/llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll
@@ -176,8 +176,7 @@ define void @v_shuffle_v4f32_v2f32__3_0_u_u(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ;;#ASMSTART
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
@@ -192,8 +191,8 @@ define void @v_shuffle_v4f32_v2f32__3_0_u_u(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ;;#ASMSTART
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v0, v1
-; GFX940-NEXT:    v_mov_b32_e32 v1, v2
+; GFX940-NEXT:    s_nop 0
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
@@ -566,16 +565,15 @@ define void @v_shuffle_v4f32_v2f32__3_3_3_0(ptr addrspace(1) inreg %ptr) {
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[0:1]
+; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v2, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v3, v4
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -583,16 +581,15 @@ define void @v_shuffle_v4f32_v2f32__3_3_3_0(ptr addrspace(1) inreg %ptr) {
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[0:1]
+; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[2:3], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v0, v1
-; GFX940-NEXT:    v_mov_b32_e32 v2, v1
-; GFX940-NEXT:    v_mov_b32_e32 v3, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -675,31 +672,26 @@ define void @v_shuffle_v4f32_v2f32__3_3_3_2(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-LABEL: v_shuffle_v4f32_v2f32__3_3_3_2:
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v2, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v3, v4
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
+; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX940-LABEL: v_shuffle_v4f32_v2f32__3_3_3_2:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    s_nop 0
-; GFX940-NEXT:    v_mov_b32_e32 v0, v5
-; GFX940-NEXT:    v_mov_b32_e32 v1, v5
-; GFX940-NEXT:    v_mov_b32_e32 v2, v5
-; GFX940-NEXT:    v_mov_b32_e32 v3, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
+; GFX940-NEXT:    v_mov_b32_e32 v0, v1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -873,8 +865,7 @@ define void @v_shuffle_v4f32_v2f32__1_0_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v3
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
@@ -887,8 +878,7 @@ define void @v_shuffle_v4f32_v2f32__1_0_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
 ; GFX940-NEXT:    v_mov_b32_e32 v4, 0
-; GFX940-NEXT:    v_mov_b32_e32 v0, v3
-; GFX940-NEXT:    v_mov_b32_e32 v1, v2
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
@@ -972,8 +962,7 @@ define void @v_shuffle_v4f32_v2f32__3_0_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
@@ -989,8 +978,7 @@ define void @v_shuffle_v4f32_v2f32__3_0_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
 ; GFX940-NEXT:    v_mov_b32_e32 v4, 0
-; GFX940-NEXT:    v_mov_b32_e32 v0, v1
-; GFX940-NEXT:    v_mov_b32_e32 v1, v2
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
@@ -1138,14 +1126,13 @@ define void @v_shuffle_v4f32_v2f32__3_2_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ;;#ASMSTART
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -1155,14 +1142,13 @@ define void @v_shuffle_v4f32_v2f32__3_2_0_0(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ;;#ASMSTART
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v3, v2
-; GFX940-NEXT:    v_mov_b32_e32 v0, v5
-; GFX940-NEXT:    v_mov_b32_e32 v1, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -1303,16 +1289,15 @@ define void @v_shuffle_v4f32_v2f32__3_3_1_0(ptr addrspace(1) inreg %ptr) {
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[0:1]
+; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v2, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v3, v4
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -1320,16 +1305,15 @@ define void @v_shuffle_v4f32_v2f32__3_3_1_0(ptr addrspace(1) inreg %ptr) {
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[0:1]
+; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[2:3], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v0, v1
-; GFX940-NEXT:    v_mov_b32_e32 v2, v5
-; GFX940-NEXT:    v_mov_b32_e32 v3, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -1723,8 +1707,7 @@ define void @v_shuffle_v4f32_v2f32__3_0_1_1(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, v3
 ; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
@@ -1740,8 +1723,7 @@ define void @v_shuffle_v4f32_v2f32__3_0_1_1(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
 ; GFX940-NEXT:    v_mov_b32_e32 v4, 0
-; GFX940-NEXT:    v_mov_b32_e32 v0, v1
-; GFX940-NEXT:    v_mov_b32_e32 v1, v2
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v2, v3
 ; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
@@ -1776,14 +1758,13 @@ define void @v_shuffle_v4f32_v2f32__3_2_1_1(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ;;#ASMSTART
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, v3
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -1793,14 +1774,13 @@ define void @v_shuffle_v4f32_v2f32__3_2_1_1(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ;;#ASMSTART
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[0:1], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v2, v3
-; GFX940-NEXT:    v_mov_b32_e32 v0, v5
-; GFX940-NEXT:    v_mov_b32_e32 v1, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -2152,8 +2132,7 @@ define void @v_shuffle_v4f32_v2f32__3_2_2_2(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v3
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
@@ -2166,8 +2145,7 @@ define void @v_shuffle_v4f32_v2f32__3_2_2_2(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
 ; GFX940-NEXT:    v_mov_b32_e32 v4, 0
-; GFX940-NEXT:    v_mov_b32_e32 v0, v3
-; GFX940-NEXT:    v_mov_b32_e32 v1, v2
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v3, v2
 ; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
@@ -2246,16 +2224,15 @@ define void @v_shuffle_v4f32_v2f32__3_0_2_2(ptr addrspace(1) inreg %ptr) {
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[2:3]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v3
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -2263,16 +2240,15 @@ define void @v_shuffle_v4f32_v2f32__3_0_2_2(ptr addrspace(1) inreg %ptr) {
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[2:3]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v0, v3
-; GFX940-NEXT:    v_mov_b32_e32 v1, v4
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v3, v2
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -2516,15 +2492,13 @@ define void @v_shuffle_v4f32_v2f32__3_3_1_2(ptr addrspace(1) inreg %ptr) {
 ; GFX90A-NEXT:    ;;#ASMSTART
 ; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v5
-; GFX90A-NEXT:    v_mov_b32_e32 v2, v3
-; GFX90A-NEXT:    v_mov_b32_e32 v3, v4
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
+; GFX90A-NEXT:    v_mov_b32_e32 v0, v1
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -2534,15 +2508,13 @@ define void @v_shuffle_v4f32_v2f32__3_3_1_2(ptr addrspace(1) inreg %ptr) {
 ; GFX940-NEXT:    ;;#ASMSTART
 ; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v2, v3
-; GFX940-NEXT:    v_mov_b32_e32 v0, v5
-; GFX940-NEXT:    v_mov_b32_e32 v1, v5
-; GFX940-NEXT:    v_mov_b32_e32 v3, v4
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[2:3], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
+; GFX940-NEXT:    v_mov_b32_e32 v0, v1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1] sc0 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
   %vec0 = call <2 x float> asm "; def $0", "=v"()
@@ -2826,16 +2798,15 @@ define void @v_shuffle_v4f32_v2f32__3_0_3_3(ptr addrspace(1) inreg %ptr) {
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[2:3]
+; GFX90A-NEXT:    ; def v[0:1]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    ;;#ASMSTART
-; GFX90A-NEXT:    ; def v[4:5]
+; GFX90A-NEXT:    ; def v[2:3]
 ; GFX90A-NEXT:    ;;#ASMEND
-; GFX90A-NEXT:    v_mov_b32_e32 v0, v3
-; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
+; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, v3
-; GFX90A-NEXT:    global_store_dwordx4 v6, v[0:3], s[16:17]
+; GFX90A-NEXT:    global_store_dwordx4 v4, v[0:3], s[16:17]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -2843,16 +2814,15 @@ define void @v_shuffle_v4f32_v2f32__3_0_3_3(ptr addrspace(1) inreg %ptr) {
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[2:3]
+; GFX940-NEXT:    ; def v[0:1]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX940-NEXT:    ;;#ASMSTART
-; GFX940-NEXT:    ; def v[4:5]
+; GFX940-NEXT:    ; def v[2:3]
 ; GFX940-NEXT:    ;;#ASMEND
-; GFX940-NEXT:    v_mov_b32_e32 v0, v3
-; GFX940-NEXT:    v_mov_b32_e32 v1, v4
+; GFX940-NEXT:    v_mov_b32_e32 v4, 0
+; GFX940-NEXT:    v_pk_mov_b32 v[0:1], v[2:3], v[0:1] op_sel:[1,0] op_sel_hi:[0,0]
 ; GFX940-NEXT:    v_mov_b32_e32 v2, v3
-; GFX940-NEXT:    global_store_dwordx4 v6, v[0:3], s[0:1] sc0 sc1
+; GFX940-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1...
[truncated]

RKSimon · 2025-01-21T11:19:00Z

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

@@ -23172,6 +23172,11 @@ SDValue DAGCombiner::visitEXTRACT_VECTOR_ELT(SDNode *N) {
    }

    if (SVInVec.getOpcode() == ISD::BUILD_VECTOR) {
+      // TODO: Check if shuffle mask is legal?
+      if (LegalOperations && TLI.isOperationLegal(ISD::VECTOR_SHUFFLE, VecVT) &&


Have you investigate using isShuffleMaskLegal instead?

I started to, but the usage in the combiner seems inconsistent. Implementing it accurately results in regressions independently of this

preames

LGTM

I suspect we'll want to refine the profitability here over the time, but this seems reasonable as a stepping stone.

Previously this combine would undo AMDGPU's new custom legalization of wide vector shuffles into 2 element pieces. The comment also states that this combine is only done before legalization, but the case with a build_vector source was unconditional. We probably don't want to do this if the multiple uses are full scalarization of the vector, but this seems to work well enough. Scalarizing extracts should have folded out pre-legalize.

arsenm · 2025-01-30T03:54:09Z

Merge activity

Jan 29, 10:54 PM EST: A user started a stack merge that includes this pull request via Graphite.
Jan 29, 10:55 PM EST: A user merged this pull request with Graphite.

This was referenced Jan 21, 2025

DAG: Fix vector_shuffle -> splat fold defining undef lanes #123596

Merged

AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 #123684

Merged

AMDGPU: Custom lower 32-bit element shuffles #123711

Merged

arsenm added the backend:AMDGPU label Jan 21, 2025 — with Graphite App

arsenm requested review from jayfoad, jrbyrnes, kerbowa, rampitec, shiltian, preames, RKSimon and topperc January 21, 2025 08:36

arsenm marked this pull request as ready for review January 21, 2025 08:36

llvmbot added the llvm:SelectionDAG SelectionDAGISel as well label Jan 21, 2025

RKSimon reviewed Jan 21, 2025

View reviewed changes

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 1c0f60a to 6434af5 Compare January 21, 2025 14:59

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from 3529cc4 to 5db2a44 Compare January 21, 2025 14:59

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 6434af5 to 4e1bdb4 Compare January 21, 2025 17:00

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from 5db2a44 to 625ee25 Compare January 21, 2025 17:01

preames approved these changes Jan 21, 2025

View reviewed changes

RKSimon approved these changes Jan 21, 2025

View reviewed changes

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 4e1bdb4 to 0a3791a Compare January 22, 2025 03:11

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from 625ee25 to efabfc6 Compare January 22, 2025 03:11

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 0a3791a to 1c68965 Compare January 23, 2025 14:00

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from efabfc6 to 3f44d3b Compare January 23, 2025 14:01

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 1c68965 to 6851cfb Compare January 28, 2025 01:34

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from 3f44d3b to 30ea534 Compare January 28, 2025 01:34

Base automatically changed from users/arsenm/custom-lower-vector-shuffle-i32-elements to main January 28, 2025 04:17

arsenm added 2 commits January 28, 2025 04:19

Update tests for base commit changes

912a187

arsenm force-pushed the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch from 30ea534 to 912a187 Compare January 28, 2025 04:19

arsenm merged commit 97a1f49 into main Jan 30, 2025
8 checks passed

arsenm deleted the users/arsenm/dag/avoid-breaking-multi-use-legal-shuffle-after-legalize branch January 30, 2025 03:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

Uh oh!

arsenm commented Jan 21, 2025

Uh oh!

arsenm commented Jan 21, 2025 •

edited

Loading

Uh oh!

llvmbot commented Jan 21, 2025 •

edited

Loading

Uh oh!

RKSimon Jan 21, 2025

Uh oh!

arsenm Jan 21, 2025

Uh oh!

preames left a comment

Uh oh!

arsenm commented Jan 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

Uh oh!

Conversation

arsenm commented Jan 21, 2025

Uh oh!

arsenm commented Jan 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jan 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RKSimon Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

arsenm Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

preames left a comment

Choose a reason for hiding this comment

Uh oh!

arsenm commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

Uh oh!

Uh oh!

arsenm commented Jan 21, 2025 •

edited

Loading

llvmbot commented Jan 21, 2025 •

edited

Loading

arsenm commented Jan 30, 2025 •

edited

Loading