[AMDGPU] Promote uniform ops to i32 in GISel #106557

Closed · wants to merge 1 commit
31 changes: 30 additions & 1 deletion llvm/lib/Target/AMDGPU/AMDGPUCombine.td
@@ -145,6 +145,34 @@ def expand_promoted_fmed3 : GICombineRule<

} // End Predicates = [NotHasMed3_16]

def promote_i16_uniform_binops_frag : GICombinePatFrag<
(outs root:$dst), (ins),
!foreach(op, [G_SMIN, G_UMIN, G_SMAX, G_UMAX, G_ADD,
G_SUB, G_SHL, G_ASHR, G_LSHR, G_AND,
G_XOR, G_OR, G_MUL],
(pattern (op $dst, $lhs, $rhs)))>;

def promote_i16_uniform_ternary_frag : GICombinePatFrag<
(outs root:$dst), (ins),
!foreach(op, [G_ICMP, G_SELECT],
(pattern (op $dst, $first, $lhs, $rhs)))>;

let Predicates = [Has16BitInsts] in {
def promote_i16_uniform_binops : GICombineRule<
(defs root:$dst),
(match (promote_i16_uniform_binops_frag $dst):$mi,
[{ return matchPromote16to32(*${mi}); }]),
(apply [{ applyPromote16to32(*${mi}); }])
>;

def promote_i16_uniform_ternary : GICombineRule<
(defs root:$dst),
(match (promote_i16_uniform_ternary_frag $dst):$mi,
[{ return matchPromote16to32(*${mi}); }]),
(apply [{ applyPromote16to32(*${mi}); }])
>;
}

// Combines which should only apply on SI/CI
def gfx6gfx7_combines : GICombineGroup<[fcmp_select_to_fmin_fmax_legacy]>;

@@ -169,5 +197,6 @@ def AMDGPURegBankCombiner : GICombiner<
"AMDGPURegBankCombinerImpl",
[unmerge_merge, unmerge_cst, unmerge_undef,
zext_trunc_fold, int_minmax_to_med3, ptr_add_immed_chain,
fp_minmax_to_clamp, fp_minmax_to_med3, fmed3_intrinsic_to_clamp]> {
fp_minmax_to_clamp, fp_minmax_to_med3, fmed3_intrinsic_to_clamp,
promote_i16_uniform_binops, promote_i16_uniform_ternary]> {
}
109 changes: 109 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
@@ -89,6 +89,9 @@ class AMDGPURegBankCombinerImpl : public Combiner {
void applyMed3(MachineInstr &MI, Med3MatchInfo &MatchInfo) const;
void applyClamp(MachineInstr &MI, Register &Reg) const;

bool matchPromote16to32(MachineInstr &MI) const;
void applyPromote16to32(MachineInstr &MI) const;

private:
SIModeRegisterDefaults getMode() const;
bool getIEEE() const;
@@ -348,6 +351,112 @@ bool AMDGPURegBankCombinerImpl::matchFPMed3ToClamp(MachineInstr &MI,
return false;
}

bool AMDGPURegBankCombinerImpl::matchPromote16to32(MachineInstr &MI) const {
Register Dst = MI.getOperand(0).getReg();
LLT DstTy = MRI.getType(Dst);
const auto *RB = MRI.getRegBankOrNull(Dst);
Contributor: Ideally would use RBI.getRegBank to handle the case where the register has a concrete register class. Failing that, this needs to return false on null.
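A minimal sketch of what that suggestion could look like (an assumption about the exact form; RBI, MRI and TRI here stand for the combiner's existing RegisterBankInfo, MachineRegisterInfo and TargetRegisterInfo members):

// Sketch only: resolve the bank through RegisterBankInfo so a vreg that
// already carries a concrete register class still maps to a bank, and give
// up when no bank information is available yet.
const RegisterBank *RB = RBI.getRegBank(Dst, MRI, TRI);
if (!RB)
  return false;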


// Only promote between 2 and 16 bits.
// For ICMP use the LHS of the comparison to get the type.
unsigned TyOpIdx = (MI.getOpcode() == AMDGPU::G_ICMP) ? 2 : 0;
LLT OpTy = MRI.getType(MI.getOperand(TyOpIdx).getReg());
if (OpTy.getScalarSizeInBits() < 2 || OpTy.getScalarSizeInBits() > 16)
return false;

// Only promote uniform instructions.
if (RB->getID() != AMDGPU::SGPRRegBankID)
return false;
Contributor (on lines +366 to +368): Check this first, up where RB is assigned. Also, this is a case where what we really care about is scalar, not uniformity.
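A sketch of that reordering (same assumptions as above; not the PR's code):

// Sketch only: do the SGPR (scalar) check right where the bank is obtained,
// before any type-based filtering, and treat a missing bank as "don't combine".
const RegisterBank *RB = MRI.getRegBankOrNull(Dst);
if (!RB || RB->getID() != AMDGPU::SGPRRegBankID)
  return false;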


// TODO: Support vectors. Vectors will create illegal ops, such as
// 2x32 exts, that we'd need to legalize.
// We could just scalarize all vectors but then we don't respect
// the legalizer's rules. Ideally we should be able to call
// the legalizer here, or this should move into the legalizer
// if it can tell between uniform and non-uniform values at
// some point.
if (DstTy.isVector())
return false;

// Promote only if:
// - We have 16 bit insts (not true 16 bit insts).
Contributor: I don't understand the rationale here. Why would you only promote uniform 16-bit ops if you do have 16-bit VALU instructions? (I guess if you do not have 16-bit VALU instructions then these 16-bit ops would already have been legalized to 32-bit, but that does not really explain why it makes sense to check has16BitInsts again here.)

Contributor (author): I borrowed this from AMDGPUCodeGenPrepare; @arsenm wrote it, I think? I also don't think all of the operations are legalized to 32 bits, at least not in the DAG part. I need to look a bit longer at the GlobalISel case (my focus has been on the DAG and I want to land that first).

Contributor: Regarding "I guess if you do not have 16-bit VALU instructions then these 16-bit ops would already have been legalized to 32-bit": yes, but that should already be obvious from what types are legal at this point.

// - This is already checked by the predicate on the combine rule.
// - We don't have packed instructions (for vector types only).
// TODO: For vector types, the set of packed operations is more limited, so
// may want to promote some anyway.
assert(STI.has16BitInsts());
return (DstTy.isVector() ? !STI.hasVOP3PInsts() : true);
Contributor: This is an and and doesn't need parentheses. Shouldn't need the assert; it's implied by legality.
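One possible reading of that remark (an assumption; the exact rewrite is not spelled out in the review):

// Sketch only: the same condition as a single negated and, with no ternary,
// no extra parentheses and no assert; legality already implies 16-bit insts.
return !(DstTy.isVector() && STI.hasVOP3PInsts());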

}

static unsigned getExtOpcodeForPromotedOp(MachineInstr &MI) {
switch (MI.getOpcode()) {
case AMDGPU::G_ASHR:
case AMDGPU::G_SMIN:
case AMDGPU::G_SMAX:
return AMDGPU::G_SEXT;
case AMDGPU::G_LSHR:
case AMDGPU::G_UMIN:
case AMDGPU::G_UMAX:
return AMDGPU::G_ZEXT;
case AMDGPU::G_ADD:
case AMDGPU::G_SUB:
case AMDGPU::G_AND:
case AMDGPU::G_OR:
case AMDGPU::G_XOR:
case AMDGPU::G_SHL:
Contributor: Strictly speaking, for all shifts the RHS should use ZEXT.

Contributor: They should be able to use G_ANYEXT; only the bottom 5 bits are read (but I suppose the generic shifts don't know that).

case AMDGPU::G_SELECT:
case AMDGPU::G_MUL:
// operation result won't be influenced by garbage high bits.
return AMDGPU::G_ANYEXT;
case AMDGPU::G_ICMP: {
return CmpInst::isSigned(cast<GICmp>(MI).getCond()) ? AMDGPU::G_SEXT
: AMDGPU::G_ZEXT;
}
default:
llvm_unreachable("unexpected opcode!");
}
}

void AMDGPURegBankCombinerImpl::applyPromote16to32(MachineInstr &MI) const {
const unsigned Opc = MI.getOpcode();
const unsigned ExtOpc = getExtOpcodeForPromotedOp(MI);

Register Dst = MI.getOperand(0).getReg();

const bool IsSelectOrCmp = (Opc == AMDGPU::G_SELECT || Opc == AMDGPU::G_ICMP);
const bool IsShift =
(Opc == AMDGPU::G_ASHR || Opc == AMDGPU::G_LSHR || Opc == AMDGPU::G_SHL);
Register LHS = MI.getOperand(IsSelectOrCmp + 1).getReg();
Register RHS = MI.getOperand(IsSelectOrCmp + 2).getReg();

assert(MRI.getRegBankOrNull(Dst)->getID() == AMDGPU::SGPRRegBankID &&
MRI.getRegBankOrNull(LHS)->getID() == AMDGPU::SGPRRegBankID &&
MRI.getRegBankOrNull(RHS)->getID() == AMDGPU::SGPRRegBankID);

const RegisterBank &RB = *MRI.getRegBankOrNull(Dst);
LLT S32 = MRI.getType(LHS).changeElementSize(32);

B.setInstrAndDebugLoc(MI);
LHS = B.buildInstr(ExtOpc, {S32}, {LHS}).getReg(0);
RHS = B.buildInstr(IsShift ? AMDGPU::G_ZEXT : ExtOpc, {S32}, {RHS}).getReg(0);

MRI.setRegBank(LHS, RB);
MRI.setRegBank(RHS, RB);

MachineInstr *NewInst;
if (IsSelectOrCmp)
NewInst = B.buildInstr(Opc, {Dst}, {MI.getOperand(1), LHS, RHS});
else
NewInst = B.buildInstr(Opc, {S32}, {LHS, RHS});

if (Opc != AMDGPU::G_ICMP) {
Register Dst32 = NewInst->getOperand(0).getReg();
MRI.setRegBank(Dst32, RB);
B.buildTrunc(Dst, Dst32);
}

MI.eraseFromParent();
}

void AMDGPURegBankCombinerImpl::applyClamp(MachineInstr &MI,
Register &Reg) const {
B.buildInstr(AMDGPU::G_AMDGPU_CLAMP, {MI.getOperand(0)}, {Reg},
Expand Down
114 changes: 60 additions & 54 deletions llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
@@ -349,63 +349,67 @@ define amdgpu_ps <2 x i32> @s_andn2_v2i32_commute(<2 x i32> inreg %src0, <2 x i3
}

define amdgpu_ps i16 @s_andn2_i16(i16 inreg %src0, i16 inreg %src1) {
; GCN-LABEL: s_andn2_i16:
; GCN: ; %bb.0:
; GCN-NEXT: s_andn2_b32 s0, s2, s3
; GCN-NEXT: ; return to shader part epilog
; GFX6-LABEL: s_andn2_i16:
; GFX6: ; %bb.0:
; GFX6-NEXT: s_andn2_b32 s0, s2, s3
; GFX6-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_andn2_i16:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_andn2_b32 s0, s2, s3
; GFX10-NEXT: ; return to shader part epilog
; GFX9-LABEL: s_andn2_i16:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_xor_b32 s0, s3, -1
; GFX9-NEXT: s_and_b32 s0, s2, s0
; GFX9-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: s_andn2_i16:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_and_not1_b32 s0, s2, s3
; GFX11-NEXT: ; return to shader part epilog
; GFX10PLUS-LABEL: s_andn2_i16:
; GFX10PLUS: ; %bb.0:
; GFX10PLUS-NEXT: s_xor_b32 s0, s3, -1
; GFX10PLUS-NEXT: s_and_b32 s0, s2, s0
; GFX10PLUS-NEXT: ; return to shader part epilog
%not.src1 = xor i16 %src1, -1
%and = and i16 %src0, %not.src1
ret i16 %and
}

define amdgpu_ps i16 @s_andn2_i16_commute(i16 inreg %src0, i16 inreg %src1) {
; GCN-LABEL: s_andn2_i16_commute:
; GCN: ; %bb.0:
; GCN-NEXT: s_andn2_b32 s0, s2, s3
; GCN-NEXT: ; return to shader part epilog
; GFX6-LABEL: s_andn2_i16_commute:
; GFX6: ; %bb.0:
; GFX6-NEXT: s_andn2_b32 s0, s2, s3
; GFX6-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_andn2_i16_commute:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_andn2_b32 s0, s2, s3
; GFX10-NEXT: ; return to shader part epilog
; GFX9-LABEL: s_andn2_i16_commute:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_xor_b32 s0, s3, -1
; GFX9-NEXT: s_and_b32 s0, s0, s2
; GFX9-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: s_andn2_i16_commute:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_and_not1_b32 s0, s2, s3
; GFX11-NEXT: ; return to shader part epilog
; GFX10PLUS-LABEL: s_andn2_i16_commute:
; GFX10PLUS: ; %bb.0:
; GFX10PLUS-NEXT: s_xor_b32 s0, s3, -1
; GFX10PLUS-NEXT: s_and_b32 s0, s0, s2
; GFX10PLUS-NEXT: ; return to shader part epilog
%not.src1 = xor i16 %src1, -1
%and = and i16 %not.src1, %src0
ret i16 %and
}

define amdgpu_ps { i16, i16 } @s_andn2_i16_multi_use(i16 inreg %src0, i16 inreg %src1) {
; GCN-LABEL: s_andn2_i16_multi_use:
; GCN: ; %bb.0:
; GCN-NEXT: s_xor_b32 s1, s3, -1
; GCN-NEXT: s_andn2_b32 s0, s2, s3
; GCN-NEXT: ; return to shader part epilog
; GFX6-LABEL: s_andn2_i16_multi_use:
; GFX6: ; %bb.0:
; GFX6-NEXT: s_xor_b32 s1, s3, -1
; GFX6-NEXT: s_andn2_b32 s0, s2, s3
; GFX6-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_andn2_i16_multi_use:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_andn2_b32 s0, s2, s3
; GFX10-NEXT: s_xor_b32 s1, s3, -1
; GFX10-NEXT: ; return to shader part epilog
; GFX9-LABEL: s_andn2_i16_multi_use:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_xor_b32 s1, s3, -1
; GFX9-NEXT: s_and_b32 s0, s2, s1
; GFX9-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: s_andn2_i16_multi_use:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_and_not1_b32 s0, s2, s3
; GFX11-NEXT: s_xor_b32 s1, s3, -1
; GFX11-NEXT: ; return to shader part epilog
; GFX10PLUS-LABEL: s_andn2_i16_multi_use:
; GFX10PLUS: ; %bb.0:
; GFX10PLUS-NEXT: s_xor_b32 s1, s3, -1
; GFX10PLUS-NEXT: s_and_b32 s0, s2, s1
; GFX10PLUS-NEXT: ; return to shader part epilog
%not.src1 = xor i16 %src1, -1
%and = and i16 %src0, %not.src1
%insert.0 = insertvalue { i16, i16 } undef, i16 %and, 0
@@ -414,23 +418,25 @@ define amdgpu_ps { i16, i16 } @s_andn2_i16_multi_use(i16 inreg %src0, i16 inreg
}

define amdgpu_ps { i16, i16 } @s_andn2_i16_multi_foldable_use(i16 inreg %src0, i16 inreg %src1, i16 inreg %src2) {
; GCN-LABEL: s_andn2_i16_multi_foldable_use:
; GCN: ; %bb.0:
; GCN-NEXT: s_andn2_b32 s0, s2, s4
; GCN-NEXT: s_andn2_b32 s1, s3, s4
; GCN-NEXT: ; return to shader part epilog
; GFX6-LABEL: s_andn2_i16_multi_foldable_use:
; GFX6: ; %bb.0:
; GFX6-NEXT: s_andn2_b32 s0, s2, s4
; GFX6-NEXT: s_andn2_b32 s1, s3, s4
; GFX6-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_andn2_i16_multi_foldable_use:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_andn2_b32 s0, s2, s4
; GFX10-NEXT: s_andn2_b32 s1, s3, s4
; GFX10-NEXT: ; return to shader part epilog
; GFX9-LABEL: s_andn2_i16_multi_foldable_use:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_xor_b32 s1, s4, -1
; GFX9-NEXT: s_and_b32 s0, s2, s1
; GFX9-NEXT: s_and_b32 s1, s3, s1
; GFX9-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: s_andn2_i16_multi_foldable_use:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_and_not1_b32 s0, s2, s4
; GFX11-NEXT: s_and_not1_b32 s1, s3, s4
; GFX11-NEXT: ; return to shader part epilog
; GFX10PLUS-LABEL: s_andn2_i16_multi_foldable_use:
; GFX10PLUS: ; %bb.0:
; GFX10PLUS-NEXT: s_xor_b32 s1, s4, -1
; GFX10PLUS-NEXT: s_and_b32 s0, s2, s1
; GFX10PLUS-NEXT: s_and_b32 s1, s3, s1
; GFX10PLUS-NEXT: ; return to shader part epilog
%not.src2 = xor i16 %src2, -1
%and0 = and i16 %src0, %not.src2
%and1 = and i16 %src1, %not.src2