[AMDGPU][True16][CodeGen] sext i16 inreg in true16 mode #144024

Merged
merged 2 commits into from
Jun 18, 2025
19 changes: 19 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2623,6 +2623,8 @@ def : GCNPat<
(i32 (DivergentSextInreg<i1> i32:$src)),
(V_BFE_I32_e64 i32:$src, (i32 0), (i32 1))>;

foreach p = [NotHasTrue16BitInsts, UseFakeTrue16Insts] in
let True16Predicate = p in {
def : GCNPat <
(i16 (DivergentSextInreg<i1> i16:$src)),
(V_BFE_I32_e64 $src, (i32 0), (i32 1))
@@ -2632,6 +2634,23 @@ def : GCNPat <
(i16 (DivergentSextInreg<i8> i16:$src)),
(V_BFE_I32_e64 $src, (i32 0), (i32 8))
>;
}

let True16Predicate = UseRealTrue16Insts in {
def : GCNPat <
(i16 (DivergentSextInreg<i1> i16:$src)),
(V_BFE_I32_e64
(REG_SEQUENCE VGPR_32, VGPR_16:$src, lo16, (i16 (IMPLICIT_DEF)), hi16),
(i32 0), (i32 1))
>;
Contributor

This looks like it should work, but it wastes the upper half of the register. Is there an instruction with a 16-bit result suitable for doing a sext from n to 16 bits? I did not find one. @jayfoad

Contributor

Actually for 1 bit, you could probably generate cndmask_b16. For sext 8 to 16 bits I don't know which instruction can do it optimally.

Contributor

I don't like this. AFAIK there is no single instruction that can do this, so really instead of this ugly pattern we should just say that i16 sext_inreg is not legal (when real true16 is enabled). Then it is the legalizer's job to legalize it, e.g. by promoting to i32, which is what this ugly pattern is doing anyway.
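
To make the "promote to i32" point concrete, here is a minimal Python sketch of the arithmetic only. It assumes the usual meaning of a signed bitfield extract (take `width` bits starting at `offset`, then sign-extend to 32 bits); the helper names `bfe_i32` and `sext_inreg_i16` are hypothetical and are not LLVM or ISA definitions. The `undefined_hi16` parameter stands in for the IMPLICIT_DEF upper half used in the pattern, and the loop checks that it never affects the low 16 bits of the result.

```python
def bfe_i32(value: int, offset: int, width: int) -> int:
    """Model a signed bitfield extract on a 32-bit value (assumed semantics)."""
    field = (value >> offset) & ((1 << width) - 1)
    sign_bit = 1 << (width - 1)
    # Two's complement sign extension of the extracted field.
    signed = (field ^ sign_bit) - sign_bit
    return signed & 0xFFFFFFFF


def sext_inreg_i16(src16: int, from_bits: int, undefined_hi16: int = 0) -> int:
    """Promote to 32 bits, sign-extend in place, truncate back to 16 bits."""
    # Low half holds the source; the high half is arbitrary (hypothetical
    # stand-in for the IMPLICIT_DEF half of the 32-bit register).
    widened = ((undefined_hi16 & 0xFFFF) << 16) | (src16 & 0xFFFF)
    return bfe_i32(widened, 0, from_bits) & 0xFFFF


# The upper half never influences the result when from_bits <= 16.
for hi in (0x0000, 0xFFFF, 0xBEEF):
    assert sext_inreg_i16(0x0080, 8, hi) == 0xFF80
    assert sext_inreg_i16(0x127F, 8, hi) == 0x007F
    assert sext_inreg_i16(0x0001, 1, hi) == 0xFFFF
```

The sketch says nothing about register allocation or coalescing, which is where the objection to the pattern lies; it only shows that the promoted computation yields the same 16-bit value.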

Contributor Author

@broxigarchen Jun 16, 2025

Hi Joe. The instruction works, but it messes up isel/combine/coalescer. Thanks, Jay; let me try disabling this in true16.

Contributor Author

@broxigarchen Jun 16, 2025

Hi, I looked into the legalizeDAG code and tested it a bit. It seems the legalizer might not work here.

We don't have an in-place sign_extend_inreg/sign_extend promote case in codegen, and most of the promote code seems to be implemented with sign_extend/zero_extend/fp_extend... I think the promote code comes back to these sext_inreg patterns in the end, unless there is another ISD opcode that can be used to do sign extension?

Contributor Author

@broxigarchen Jun 17, 2025

Ping! This patch is required to unblock a downstream repo, so I might need some input on this urgently. Thanks!

Contributor

You can separately legalize the 64-bit SEXT_INREG operation to split it into a 32-bit sext_inreg + a sext. The main problem would be how much code bothers checking whether SEXT_INREG is legal before introducing it in post-legalize combines.
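
The split described in the comment above can be checked arithmetically. The following is a small Python sketch, not LLVM code: the function names are hypothetical, and it assumes the extension starts from at most 32 bits, so that all the source bits live in the low half of the 64-bit value.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend the low from_bits of value to a to_bits-wide integer."""
    field = value & ((1 << from_bits) - 1)
    sign_bit = 1 << (from_bits - 1)
    return ((field ^ sign_bit) - sign_bit) & ((1 << to_bits) - 1)


def sext_inreg_i64_direct(x: int, from_bits: int) -> int:
    """Reference result: a single 64-bit sign-extend-in-register."""
    return sign_extend(x & MASK64, from_bits, 64)


def sext_inreg_i64_split(x: int, from_bits: int) -> int:
    """32-bit sext_inreg on the low half, followed by a plain sext to 64 bits."""
    lo = sign_extend(x & MASK32, from_bits, 32)
    return sign_extend(lo, 32, 64)


# Both forms agree whenever from_bits <= 32.
for x in (0x0, 0x80, 0x7F, 0x1234567800000080, 0xFFFFFFFF00008000, MASK64):
    for bits in (1, 8, 16, 32):
        assert sext_inreg_i64_direct(x, bits) == sext_inreg_i64_split(x, bits)
```

This only demonstrates that the two forms compute the same value; whether post-legalize combines would reintroduce an illegal SEXT_INREG is the separate concern raised in the comment.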


def : GCNPat <
(i16 (DivergentSextInreg<i8> i16:$src)),
(V_BFE_I32_e64
(REG_SEQUENCE VGPR_32, VGPR_16:$src, lo16, (i16 (IMPLICIT_DEF)), hi16),
(i32 0), (i32 8))
>;
}

def : GCNPat<
(i32 (DivergentSextInreg<i8> i32:$src)),
24 changes: 24 additions & 0 deletions llvm/lib/Target/AMDGPU/VOP3Instructions.td
@@ -319,11 +319,21 @@ let SchedRW = [Write64Bit] in {
} // End SchedRW = [Write64Bit]
} // End isReMaterializable = 1

foreach p = [NotHasTrue16BitInsts, UseFakeTrue16Insts] in
let True16Predicate = p in
def : GCNPat<
(i32 (DivergentUnaryFrag<sext> i16:$src)),
(i32 (V_BFE_I32_e64 i16:$src, (i32 0), (i32 0x10)))
>;

let True16Predicate = UseRealTrue16Insts in
def : GCNPat<
(i32 (DivergentUnaryFrag<sext> i16:$src)),
(i32 (V_BFE_I32_e64
(REG_SEQUENCE VGPR_32, VGPR_16:$src, lo16, (i16 (IMPLICIT_DEF)), hi16),
(i32 0), (i32 0x10)))
>;

let isReMaterializable = 1 in {
let SubtargetPredicate = isGFX6GFX7GFX10Plus in {
defm V_MULLIT_F32 : VOP3Inst <"v_mullit_f32", VOP3_Profile<VOP_F32_F32_F32_F32>>;
@@ -423,6 +433,8 @@ def V_INTERP_P1LV_F16 : VOP3Interp <"v_interp_p1lv_f16", VOP3_INTERP16<[f32, f32

} // End SubtargetPredicate = Has16BitInsts, isCommutable = 1

foreach p = [NotHasTrue16BitInsts, UseFakeTrue16Insts] in
let True16Predicate = p in
def : GCNPat<
(i64 (DivergentUnaryFrag<sext> i16:$src)),
(REG_SEQUENCE VReg_64,
@@ -432,6 +444,18 @@ def : GCNPat<
), VGPR_32)), sub1)
>;

let True16Predicate = UseRealTrue16Insts in
def : GCNPat<
(i64 (DivergentUnaryFrag<sext> i16:$src)),
(REG_SEQUENCE VReg_64,
(i32 (V_BFE_I32_e64
(REG_SEQUENCE VGPR_32, VGPR_16:$src, lo16, (i16 (IMPLICIT_DEF)), hi16),
(S_MOV_B32 (i32 0)), (S_MOV_B32 (i32 0x10)))), sub0,
(i32 (COPY_TO_REGCLASS
(V_ASHRREV_I32_e32 (S_MOV_B32 (i32 0x1f)), (i32 (V_BFE_I32_e64 $src, (S_MOV_B32 (i32 0)), (S_MOV_B32 (i32 0x10))))
), VGPR_32)), sub1)
>;

let SubtargetPredicate = isGFX8Plus, Uses = [MODE, M0, EXEC], OtherPredicates = [isNotGFX90APlus] in {
def V_INTERP_P1_F32_e64 : VOP3Interp <"v_interp_p1_f32", VOP3_INTERP>;
def V_INTERP_P2_F32_e64 : VOP3Interp <"v_interp_p2_f32", VOP3_INTERP>;
69 changes: 31 additions & 38 deletions llvm/test/CodeGen/AMDGPU/idot4s.ll
@@ -1165,35 +1165,32 @@ define amdgpu_kernel void @idot4_acc16_vecMul(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX11-DL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-DL-TRUE16-NEXT: s_clause 0x1
; GFX11-DL-TRUE16-NEXT: global_load_b32 v1, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v2, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v1, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v2, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_d16_b16 v0, v3, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v6, v1, 0, 8
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v6.h, 8, v2.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.h
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v8.h, 8, v1.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v4, 0, 8
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v5, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v9.l, v1.h
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v2.h, 8, v2.h
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v2, 0, 8
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v4.h, 8, v1.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v1.h
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v7.h, 8, v2.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v2.h
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v5.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v4.l, v6.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v6, v1, 0, 8
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v1.h, 8, v1.h
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v8.l, v4.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v9, 0, 8
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v7, 0, 8
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_pk_mul_lo_u16 v6, v6, v8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v4.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v2, 0, 8
; GFX11-DL-TRUE16-NEXT: v_ashrrev_i16 v2.h, 8, v2.h
; GFX11-DL-TRUE16-NEXT: v_pk_mul_lo_u16 v4, v4, v7
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v6.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v6.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v4.l, v0.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_pk_mul_lo_u16 v1, v2, v1
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v6.h
; GFX11-DL-TRUE16-NEXT: v_pk_mul_lo_u16 v1, v1, v2
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v4.h
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.h
@@ -3435,35 +3432,31 @@ define amdgpu_kernel void @idot4_nonstandard_signed(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: global_load_b32 v2, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v3, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v4, 8, v2
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v1, v2, 0, 8
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 8, v3
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.h
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 24, v2
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v0, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v6.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v5, 8, v3
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.h
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v4, 0, 8
; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.h
; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v5.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v6, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v4.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v5, 0, 8
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v7, 0, 8
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v4, 24, v2
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v3, 24, v3
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v4.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v4, v4, 0, 8
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v1.h, v2.l, v0.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v4.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v1.l, v3.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_mov_b32_e32 v1, 0
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v0, v0, 0, 16
; GFX11-DL-TRUE16-NEXT: global_store_b32 v1, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
79 changes: 37 additions & 42 deletions llvm/test/CodeGen/AMDGPU/idot4u.ll
@@ -1669,40 +1669,38 @@ define amdgpu_kernel void @notdot4_mixedtypes(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-LABEL: notdot4_mixedtypes:
; GFX11-DL-TRUE16: ; %bb.0: ; %entry
; GFX11-DL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX11-DL-TRUE16-NEXT: v_dual_mov_b32 v5, 0 :: v_dual_and_b32 v0, 0x3ff, v0
; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v0, 0x3ff, v0
; GFX11-DL-TRUE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mov_b32_e32 v6, 0
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX11-DL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-DL-TRUE16-NEXT: s_clause 0x1
; GFX11-DL-TRUE16-NEXT: global_load_b32 v3, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v4, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_d16_b16 v0, v5, s[4:5]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v4, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v5, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_d16_b16 v0, v6, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v4
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 8, v4
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 8, v5
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v3, v4, 0, 8
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v7, v5, 0, 8
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v1.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v2, v6, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v6, v7, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v3.l, v7.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v6.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v1.l, v2.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v1, v4, v4, 0xc0c0302
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v2, v3, v3, 0xc0c0302
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v1, v5, v5, 0xc0c0302
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v2.l, v3.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v2, v4, v4, 0xc0c0302
; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_dot4_u32_u8 v0, v2, v1, v0
; GFX11-DL-TRUE16-NEXT: global_store_b16 v5, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: global_store_b16 v6, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
;
; GFX11-DL-FAKE16-LABEL: notdot4_mixedtypes:
@@ -1964,44 +1962,41 @@ define amdgpu_kernel void @notdot4_mixedtypes2(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-LABEL: notdot4_mixedtypes2:
; GFX11-DL-TRUE16: ; %bb.0: ; %entry
; GFX11-DL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v0, 0x3ff, v0
; GFX11-DL-TRUE16-NEXT: v_dual_mov_b32 v5, 0 :: v_dual_and_b32 v0, 0x3ff, v0
; GFX11-DL-TRUE16-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
; GFX11-DL-TRUE16-NEXT: v_mov_b32_e32 v4, 0
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX11-DL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-DL-TRUE16-NEXT: s_clause 0x1
; GFX11-DL-TRUE16-NEXT: global_load_b32 v2, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v3, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_d16_b16 v0, v4, s[4:5]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v3, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_b32 v4, v0, s[0:1]
; GFX11-DL-TRUE16-NEXT: global_load_d16_b16 v0, v5, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 8, v3
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v3.h
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v3, 24, v3
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 8, v4
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v6, v4, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.h
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v1, v1, 0, 8
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v5, v5, 0, 8
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v6.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 24, v2
; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v7, v7, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v2.l, v6.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 24, v3
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v5.l, v6.l
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.h
; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v3.h
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v7.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v3, 24, v4
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v2.l, v1.h, v0.l
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v2, v5, 0, 8
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_bfe_i32 v2, v6, 0, 8
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v1.l, v0.h, v0.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v3.l, v1.l, v0.l
; GFX11-DL-TRUE16-NEXT: global_store_b16 v4, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: global_store_b16 v5, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
;
; GFX11-DL-FAKE16-LABEL: notdot4_mixedtypes2: