[LegalizeTypes][X86][PowerPC] Use shift by 1 instead of adding a value to itself to double. #86857

Open · wants to merge 1 commit into main
3 changes: 2 additions & 1 deletion llvm/lib/CodeGen/SelectionDAG/LegalizeTypesGeneric.cpp
@@ -428,7 +428,8 @@ SDValue DAGTypeLegalizer::ExpandOp_INSERT_VECTOR_ELT(SDNode *N) {
std::swap(Lo, Hi);

SDValue Idx = N->getOperand(2);
-  Idx = DAG.getNode(ISD::ADD, dl, Idx.getValueType(), Idx, Idx);
+  Idx = DAG.getNode(ISD::SHL, dl, Idx.getValueType(), Idx,
+                    DAG.getShiftAmountConstant(1, Idx.getValueType(), dl));
NewVec = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, NewVecVT, NewVec, Lo, Idx);
Idx = DAG.getNode(ISD::ADD, dl,
Idx.getValueType(), Idx,
13 changes: 7 additions & 6 deletions llvm/test/CodeGen/PowerPC/aix-vec_insert_elt.ll
@@ -165,12 +165,12 @@ define <2 x i64> @testDoubleword(<2 x i64> %a, i64 %b, i64 %idx) {
;
; CHECK-32-LABEL: testDoubleword:
; CHECK-32: # %bb.0: # %entry
-; CHECK-32-NEXT: add 5, 6, 6
+; CHECK-32-NEXT: slwi 5, 6, 1
; CHECK-32-NEXT: addi 7, 1, -32
Collaborator commented:

This shows the advantage of changing the add to a shift on PPC:

  • the shift can be combined with other shifts, like rlwinm
  • the lowest bit of the shifted value is 0, so ori can be used instead of addi (although this may not be an improvement; ori and addi have the same latency)

-; CHECK-32-NEXT: rlwinm 6, 5, 2, 28, 29
+; CHECK-32-NEXT: rlwinm 6, 6, 3, 28, 28
; CHECK-32-NEXT: stxv 34, -32(1)
; CHECK-32-NEXT: stwx 3, 7, 6
-; CHECK-32-NEXT: addi 3, 5, 1
+; CHECK-32-NEXT: ori 3, 5, 1
; CHECK-32-NEXT: addi 5, 1, -16
; CHECK-32-NEXT: lxv 0, -32(1)
; CHECK-32-NEXT: rlwinm 3, 3, 2, 28, 29
@@ -187,10 +187,11 @@ define <2 x i64> @testDoubleword(<2 x i64> %a, i64 %b, i64 %idx) {
;
; CHECK-32-P10-LABEL: testDoubleword:
; CHECK-32-P10: # %bb.0: # %entry
-; CHECK-32-P10-NEXT: add 5, 6, 6
-; CHECK-32-P10-NEXT: slwi 6, 5, 2
+; CHECK-32-P10-NEXT: slwi 5, 6, 1
+; CHECK-32-P10-NEXT: slwi 6, 6, 3
; CHECK-32-P10-NEXT: vinswlx 2, 6, 3
-; CHECK-32-P10-NEXT: addi 3, 5, 1
+; CHECK-32-P10-NEXT: li 3, 1
+; CHECK-32-P10-NEXT: rlwimi 3, 5, 0, 0, 30
@chenzheng1030 (Collaborator) commented on Apr 3, 2024:

This exposes a combine opportunity on PPC.

li 3, 1
rlwimi 3, 5, 0, 0, 30

should be just:

addi 3, 5, 1   // or: ori 3, 5, 1

We will handle this later.

Collaborator commented:

li 3, 1 should be a zero-cycle operation on P10.

Contributor commented:

We could do this more generically:

slwi 5, 6, n
li 3, Val               (if Val != 0, 1, -1, li is not a zero-cycle operation on P10)
rlwimi 3, 5, 0, 0, K    (K = 31-n ... 31)

If 2^(31-n) > Val, we can convert this to:

addi 3, 5, Val

or

ori 3, 5, Val

; CHECK-32-P10-NEXT: slwi 3, 3, 2
; CHECK-32-P10-NEXT: vinswlx 2, 3, 4
; CHECK-32-P10-NEXT: blr
8 changes: 4 additions & 4 deletions llvm/test/CodeGen/PowerPC/vec_insert_elt.ll
@@ -241,14 +241,14 @@ define <2 x i64> @testDoubleword(<2 x i64> %a, i64 %b, i64 %idx) {
;
; AIX-P8-32-LABEL: testDoubleword:
; AIX-P8-32: # %bb.0: # %entry
-; AIX-P8-32-NEXT: add r6, r6, r6
; AIX-P8-32-NEXT: addi r5, r1, -32
-; AIX-P8-32-NEXT: rlwinm r7, r6, 2, 28, 29
+; AIX-P8-32-NEXT: slwi r7, r6, 1
+; AIX-P8-32-NEXT: rlwinm r6, r6, 3, 28, 28
; AIX-P8-32-NEXT: stxvw4x v2, 0, r5
-; AIX-P8-32-NEXT: stwx r3, r5, r7
+; AIX-P8-32-NEXT: stwx r3, r5, r6
; AIX-P8-32-NEXT: addi r3, r1, -16
; AIX-P8-32-NEXT: lxvw4x vs0, 0, r5
-; AIX-P8-32-NEXT: addi r5, r6, 1
+; AIX-P8-32-NEXT: ori r5, r7, 1
; AIX-P8-32-NEXT: rlwinm r5, r5, 2, 28, 29
; AIX-P8-32-NEXT: stxvw4x vs0, 0, r3
; AIX-P8-32-NEXT: stwx r4, r3, r5
8 changes: 4 additions & 4 deletions llvm/test/CodeGen/X86/insertelement-var-index.ll
@@ -1019,7 +1019,7 @@ define <2 x i64> @arg_i64_v2i64(<2 x i64> %v, i64 %x, i32 %y) nounwind {
; X86AVX2-NEXT: movl %edx, (%esp,%esi,4)
; X86AVX2-NEXT: vmovaps (%esp), %xmm0
; X86AVX2-NEXT: vmovaps %xmm0, {{[0-9]+}}(%esp)
-; X86AVX2-NEXT: incl %ecx
+; X86AVX2-NEXT: orl $1, %ecx
Contributor commented:

Why is OR better than INC? I see they have the same throughput/ports, but OR is one byte longer, and its latency seems longer in some cases too.

Collaborator commented:

I guess we're missing addlike matching in the inc patterns?

Collaborator commented:

CC @goldsteinn, who was working on something similar for #83691.

Contributor commented:

NB: that patch doesn't fix the issue.

@topperc (Collaborator, Author) commented on Apr 4, 2024:

Sadly, my or_is_add patch did not work either. This is the final DAG:

SelectionDAG has 36 nodes:                                                       
  t0: ch,glue = EntryToken                                                       
                t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0                 
              t52: ch = store<(store (s128) into %stack.1)> t0, t2, FrameIndex:i32<1>, undef:i32
              t17: i32,ch = load<(load (s32) from %fixed-stack.2)> t0, FrameIndex:i32<-1>, undef:i32
                  t43: i32 = and t42, Constant:i32<3>                            
                t51: i32 = shl t43, Constant:i8<2>                               
              t48: i32 = or disjoint FrameIndex:i32<1>, t51                      
            t46: ch = store<(store (s32))> t52, t17, t48, undef:i32              
          t47: v4i32,ch = load<(load (s128) from %stack.1)> t46, FrameIndex:i32<1>, undef:i32
        t32: ch = store<(store (s128) into %stack.0)> t0, t47, FrameIndex:i32<0>, undef:i32
          t19: i32 = add FrameIndex:i32<-1>, Constant:i32<4>                     
        t20: i32,ch = load<(load (s32) from %fixed-stack.2 + 4)> t0, t19, undef:i32
              t30: i32 = or t42, Constant:i32<1>                                 
            t35: i32 = and t30, Constant:i32<3>                                  
          t55: i32 = shl t35, Constant:i8<2>                                     
        t53: i32 = or disjoint FrameIndex:i32<0>, t55                            
      t38: ch = store<(store (s32))> t32, t20, t53, undef:i32                    
    t56: v2i64,ch = load<(load (s128) from %stack.0)> t38, FrameIndex:i32<0>, undef:i32
  t14: ch,glue = CopyToReg t0, Register:v2i64 $xmm0, t56                         
      t9: i32,ch = load<(load (s32) from %fixed-stack.0)> t0, FrameIndex:i32<-3>, undef:i32
    t24: i32 = shl t9, Constant:i8<1>                                            
  t42: i32 = freeze t24                                                          
  t15: ch = X86ISD::RET_GLUE t14, TargetConstant:i32<0>, Register:v2i64 $xmm0, t14:1

t30 is the or. It uses t42, the freeze of t24. We should have been able to hoist the freeze above t24, since the freeze is its only user.

You can see there are 2 uses of t42. Both are nodes that a freeze was hoisted above. After the first freeze was hoisted, the shl was used by the hoisted freeze and one of the other nodes, so the freeze wasn't the only user of the shl. Then we did the second hoist, and the second hoisted freeze was CSEd with the previously hoisted freeze, decreasing the use count of the shl and making the freeze its only user. We did not revisit the freeze after that.

; X86AVX2-NEXT: andl $3, %ecx
; X86AVX2-NEXT: movl %eax, 16(%esp,%ecx,4)
; X86AVX2-NEXT: vmovaps {{[0-9]+}}(%esp), %xmm0
@@ -1369,7 +1369,7 @@ define <2 x i64> @load_i64_v2i64(<2 x i64> %v, ptr %p, i32 %y) nounwind {
; X86AVX2-NEXT: movl %edx, (%esp,%esi,4)
; X86AVX2-NEXT: vmovaps (%esp), %xmm0
; X86AVX2-NEXT: vmovaps %xmm0, {{[0-9]+}}(%esp)
-; X86AVX2-NEXT: incl %eax
+; X86AVX2-NEXT: orl $1, %eax
; X86AVX2-NEXT: andl $3, %eax
; X86AVX2-NEXT: movl %ecx, 16(%esp,%eax,4)
; X86AVX2-NEXT: vmovaps {{[0-9]+}}(%esp), %xmm0
@@ -1754,7 +1754,7 @@ define <4 x i64> @arg_i64_v4i64(<4 x i64> %v, i64 %x, i32 %y) nounwind {
; X86AVX2-NEXT: movl %edx, (%esp,%esi,4)
; X86AVX2-NEXT: vmovaps (%esp), %ymm0
; X86AVX2-NEXT: vmovaps %ymm0, {{[0-9]+}}(%esp)
-; X86AVX2-NEXT: incl %ecx
+; X86AVX2-NEXT: orl $1, %ecx
; X86AVX2-NEXT: andl $7, %ecx
; X86AVX2-NEXT: movl %eax, 32(%esp,%ecx,4)
; X86AVX2-NEXT: vmovaps {{[0-9]+}}(%esp), %ymm0
@@ -2137,7 +2137,7 @@ define <4 x i64> @load_i64_v4i64(<4 x i64> %v, ptr %p, i32 %y) nounwind {
; X86AVX2-NEXT: movl %edx, (%esp,%esi,4)
; X86AVX2-NEXT: vmovaps (%esp), %ymm0
; X86AVX2-NEXT: vmovaps %ymm0, {{[0-9]+}}(%esp)
-; X86AVX2-NEXT: incl %eax
+; X86AVX2-NEXT: orl $1, %eax
; X86AVX2-NEXT: andl $7, %eax
; X86AVX2-NEXT: movl %ecx, 32(%esp,%eax,4)
; X86AVX2-NEXT: vmovaps {{[0-9]+}}(%esp), %ymm0
Expand Down