
[MachinePipeliner] Fix incorrect handlings of unpipelineable insts #126057


Merged: 2 commits into llvm:main from pipeliner-unpipeline-insts, Mar 21, 2025

Conversation

@kasuga-fj (Contributor) commented Feb 6, 2025

There was a case where normalizeNonPipelinedInstructions didn't schedule unpipelineable instructions correctly, which could generate illegal code. This patch fixes this issue by rejecting the schedule if we fail to insert the unpipelineable instructions at stage 0.

Here is a part of the debug output for sms-unpipeline-insts3.mir before applying this patch, where SU(14) and SU(15) end up scheduled in stage 1.

SU(0):   %27:gpr32 = PHI %21:gpr32all, %bb.3, %28:gpr32all, %bb.4
  Successors:
    SU(14): Data Latency=0 Reg=%27
    SU(15): Anti Latency=1

...

SU(14):   %41:gpr32 = ADDWrr %27:gpr32, %12:gpr32common
  Predecessors:
    SU(0): Data Latency=0 Reg=%27
    SU(16): Ord  Latency=0 Artificial
  Successors:
    SU(15): Data Latency=1 Reg=%41
SU(15):   %28:gpr32all = COPY %41:gpr32
  Predecessors:
    SU(14): Data Latency=1 Reg=%41
    SU(0): Anti Latency=1
SU(16):   %30:ppr = WHILELO_PWW_S %27:gpr32, %15:gpr32, implicit-def $nzcv
  Predecessors:
    SU(0): Data Latency=0 Reg=%27
  Successors:
    SU(14): Ord  Latency=0 Artificial

...

Do not pipeline SU(16)
Do not pipeline SU(1)
Do not pipeline SU(0)
Do not pipeline SU(15)
Do not pipeline SU(14)
SU(0) is not pipelined; moving from cycle 19 to 0 Instr: ...
SU(1) is not pipelined; moving from cycle 10 to 0 Instr: ...
SU(15) is not pipelined; moving from cycle 28 to 19 Instr: ...
SU(16) is not pipelined; moving from cycle 19 to 0 Instr: ...
Schedule Found? 1 (II=10)

...

cycle 9 (1) (14) %41:gpr32 = ADDWrr %27:gpr32, %12:gpr32common

cycle 9 (1) (15) %28:gpr32all = COPY %41:gpr32

The SUs are traversed in the order of the original basic block, so in this case the new cycle of each instruction is determined in the order SU(0), SU(1), SU(14), SU(15), SU(16). Since there is an artificial dependence from SU(16) to SU(14), which contradicts the original SU order, the new cycle of SU(14) must be greater than or equal to the cycle of SU(16) at that time. This makes it impossible to schedule SU(14) at stage 0. For now, we reject the schedule in such cases.
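
As a minimal numeric illustration (plain C++, not LLVM types; the II and cycle values are the ones from the log above), stage 0 covers the cycle window [FirstCycle, FirstCycle + II), and SU(14)'s forced cycle of 19 falls outside it:

```cpp
#include <cstdio>

int main() {
  const int II = 10;        // initiation interval from the log above
  const int FirstCycle = 0; // earliest cycle of the normalized schedule
  const int NewCycle = 19;  // SU(14) cannot start before SU(16)'s cycle 19

  int Stage = (NewCycle - FirstCycle) / II;    // -> 1, not stage 0
  bool Reject = (FirstCycle + II <= NewCycle); // the patch's guard
  std::printf("stage = %d, reject = %d\n", Stage, Reject ? 1 : 0);
  return 0;
}
```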

@llvmbot (Member) commented Feb 6, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Ryotaro Kasuga (kasuga-fj)

Full diff: https://github.com/llvm/llvm-project/pull/126057.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+6)
  • (added) llvm/test/CodeGen/AArch64/sms-unpipeline-insts3.mir (+226)
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index dbf320f88fd65c7..df55a013f75ca80 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -3178,6 +3178,12 @@ bool SMSchedule::normalizeNonPipelinedInstructions(
                         << ") is not pipelined; moving from cycle " << OldCycle
                         << " to " << NewCycle << " Instr:" << *SU.getInstr());
     }
+
+    // There is a case where the `NewCycle` is too large to be scheduled in
+    // Stage 0. In this case, we reject the schedule.
+    if (FirstCycle + InitiationInterval <= NewCycle)
+      return false;
+
     NewLastCycle = std::max(NewLastCycle, NewCycle);
   }
   LastCycle = NewLastCycle;
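
To illustrate what this guard does, here is a simplified standalone model of the normalization loop (plain C++, not the real SMSchedule types; dependence latencies are ignored and every SU is treated as unpipelineable). Each SU keeps its original cycle until it is visited in block order, so a not-yet-normalized predecessor still contributes its old cycle, which is exactly how the earliest legal cycle can escape stage 0:

```cpp
#include <algorithm>
#include <vector>

struct SU {
  std::vector<int> Preds; // indices of predecessors in the SU vector
  int Cycle;              // original cycle; overwritten during normalization
};

// Returns false when some SU cannot be placed in stage 0, mirroring the
// guard added by the patch.
bool normalize(std::vector<SU> &SUs, int FirstCycle, int InitiationInterval) {
  int NewLastCycle = FirstCycle;
  for (SU &S : SUs) { // visited in block order, like the real pass
    int NewCycle = FirstCycle;
    for (int P : S.Preds) // a later, not-yet-normalized pred still reports
      NewCycle = std::max(NewCycle, SUs[P].Cycle); // its *old* cycle here
    S.Cycle = NewCycle;
    if (FirstCycle + InitiationInterval <= NewCycle) // escaped stage 0
      return false;                                  // reject the schedule
    NewLastCycle = std::max(NewLastCycle, NewCycle);
  }
  return true;
}
```
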
diff --git a/llvm/test/CodeGen/AArch64/sms-unpipeline-insts3.mir b/llvm/test/CodeGen/AArch64/sms-unpipeline-insts3.mir
new file mode 100644
index 000000000000000..01938f37e2ac5df
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-unpipeline-insts3.mir
@@ -0,0 +1,226 @@
+# RUN: llc --verify-machineinstrs -run-pass=pipeliner -o - %s -aarch64-enable-pipeliner -pipeliner-enable-copytophi=1
+
+--- |
+  target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
+  
+  @glb = internal unnamed_addr global { [256 x i32], [256 x i32], [256 x i32] } zeroinitializer
+  
+  ; Function Attrs: nounwind vscale_range(1,16)
+  define internal void @f(i32 %0, i32 %1) #0 {
+  entry:
+    %reass.sub = sub i32 %1, %0
+    %invariant.op = add i32 %0, 1
+    %invariant.op3 = add i32 %0, 2
+    %omp_loop.cmp5.not = icmp eq i32 %reass.sub, -1
+    br i1 %omp_loop.cmp5.not, label %exit, label %preheader
+  
+  preheader:                                        ; preds = %entry
+    %2 = add i32 %1, 1
+    %3 = icmp slt i32 %2, %invariant.op
+    br i1 %3, label %body.preheader, label %vector.ph
+  
+  body.preheader:                                   ; preds = %preheader
+    %4 = add i32 %1, 1
+    %5 = sub i32 %4, %0
+    br label %body
+  
+  vector.ph:                                        ; preds = %preheader
+    %6 = add i32 %1, 1
+    %7 = sub i32 %6, %0
+    %8 = tail call i32 @llvm.vscale.i32()
+    %9 = shl nuw nsw i32 %8, 2
+    %10 = tail call i32 @llvm.vscale.i32()
+    %11 = shl nuw nsw i32 %10, 2
+    %12 = call i32 @llvm.usub.sat.i32(i32 %7, i32 %11)
+    %active.lane.mask.entry = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 %7)
+    %13 = tail call <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
+    %.splatinsert = insertelement <vscale x 4 x i32> poison, i32 %9, i64 0
+    %.splat = shufflevector <vscale x 4 x i32> %.splatinsert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+    %broadcast.splatinsert = insertelement <vscale x 4 x i32> poison, i32 %invariant.op, i64 0
+    %broadcast.splat = shufflevector <vscale x 4 x i32> %broadcast.splatinsert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+    %broadcast.splatinsert7 = insertelement <vscale x 4 x i32> poison, i32 %invariant.op3, i64 0
+    %broadcast.splat8 = shufflevector <vscale x 4 x i32> %broadcast.splatinsert7, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+    br label %vector.body
+  
+  vector.body:                                      ; preds = %vector.body, %vector.ph
+    %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+    %active.lane.mask = phi <vscale x 4 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
+    %vec.ind = phi <vscale x 4 x i32> [ %13, %vector.ph ], [ %vec.ind.next, %vector.body ]
+    %14 = add <vscale x 4 x i32> %vec.ind, %broadcast.splat
+    %15 = extractelement <vscale x 4 x i32> %14, i64 0
+    %16 = sext i32 %15 to i64
+    %17 = add nsw i64 %16, -1
+    %18 = getelementptr i32, ptr @glb, i64 %17
+    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> %14, ptr %18, i32 4, <vscale x 4 x i1> %active.lane.mask)
+    %19 = add <vscale x 4 x i32> %vec.ind, %broadcast.splat8
+    %20 = mul <vscale x 4 x i32> %14, %19
+    %21 = sdiv <vscale x 4 x i32> %20, splat (i32 2)
+    %22 = getelementptr i32, ptr getelementptr inbounds nuw (i8, ptr @glb, i64 1024), i64 %17
+    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> %21, ptr %22, i32 4, <vscale x 4 x i1> %active.lane.mask)
+    %23 = getelementptr i32, ptr getelementptr inbounds nuw (i8, ptr @glb, i64 2048), i64 %17
+    %wide.masked.load = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr %23, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x i32> poison)
+    %24 = add <vscale x 4 x i32> %wide.masked.load, %21
+    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> %24, ptr %23, i32 4, <vscale x 4 x i1> %active.lane.mask)
+    %25 = tail call i32 @llvm.vscale.i32()
+    %26 = shl nuw nsw i32 %25, 2
+    %index.next = add i32 %index, %26
+    %active.lane.mask.next = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 %index, i32 %12)
+    %vec.ind.next = add <vscale x 4 x i32> %vec.ind, %.splat
+    %27 = extractelement <vscale x 4 x i1> %active.lane.mask.next, i64 0
+    br i1 %27, label %vector.body, label %exit
+  
+  exit:                                             ; preds = %vector.body, %body, %entry
+    ret void
+  
+  body:                                             ; preds = %body.preheader, %body
+    %lsr.iv2 = phi i32 [ %invariant.op3, %body.preheader ], [ %lsr.iv.next3, %body ]
+    %lsr.iv = phi i32 [ %5, %body.preheader ], [ %lsr.iv.next, %body ]
+    %28 = add i32 %lsr.iv2, -1
+    %29 = sext i32 %28 to i64
+    %30 = add nsw i64 %29, -1
+    %31 = getelementptr i32, ptr @glb, i64 %30
+    store i32 %28, ptr %31, align 4
+    %32 = mul i32 %28, %lsr.iv2
+    %33 = sdiv i32 %32, 2
+    %34 = getelementptr i32, ptr getelementptr inbounds nuw (i8, ptr @glb, i64 1024), i64 %30
+    store i32 %33, ptr %34, align 4
+    %35 = getelementptr i32, ptr getelementptr inbounds nuw (i8, ptr @glb, i64 2048), i64 %30
+    %36 = load i32, ptr %35, align 4
+    %37 = add i32 %36, %33
+    store i32 %37, ptr %35, align 4
+    %lsr.iv.next = add i32 %lsr.iv, -1
+    %lsr.iv.next3 = add i32 %lsr.iv2, 1
+    %exitcond.not = icmp eq i32 %lsr.iv.next, 0
+    br i1 %exitcond.not, label %exit, label %body
+  }
+  
+  ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
+  declare <vscale x 4 x i32> @llvm.stepvector.nxv4i32() #1
+  
+  ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
+  declare i32 @llvm.vscale.i32() #1
+  
+  ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
+  declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32, i32) #1
+  
+  ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: write)
+  declare void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32>, ptr captures(none), i32 immarg, <vscale x 4 x i1>) #2
+  
+  ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: read)
+  declare <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr captures(none), i32 immarg, <vscale x 4 x i1>, <vscale x 4 x i32>) #3
+  
+  ; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
+  declare i32 @llvm.usub.sat.i32(i32, i32) #4
+  
+  attributes #0 = { nounwind vscale_range(1,16) "frame-pointer"="non-leaf" "target-cpu"="neoverse-v1" "target-features"="+sve" }
+  attributes #1 = { nocallback nofree nosync nounwind willreturn memory(none) }
+  attributes #2 = { nocallback nofree nosync nounwind willreturn memory(argmem: write) }
+  attributes #3 = { nocallback nofree nosync nounwind willreturn memory(argmem: read) }
+  attributes #4 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+
+...
+---
+name:            f
+tracksRegLiveness: true
+body:             |
+  bb.0.entry:
+    successors: %bb.5(0x30000000), %bb.1(0x50000000)
+    liveins: $w0, $w1
+  
+    %20:gpr32common = COPY $w1
+    %19:gpr32common = COPY $w0
+    %21:gpr32common = SUBWrr %20, %19
+    dead $wzr = ADDSWri %21, 1, 0, implicit-def $nzcv
+    Bcc 0, %bb.5, implicit $nzcv
+    B %bb.1
+  
+  bb.1.preheader:
+    successors: %bb.2(0x40000000), %bb.3(0x40000000)
+  
+    %22:gpr32common = ADDWri %19, 1, 0
+    %23:gpr32sp = ADDWri %19, 2, 0
+    %25:gpr32common = ADDWri %20, 1, 0
+    dead $wzr = SUBSWrr killed %25, %22, implicit-def $nzcv
+    Bcc 10, %bb.3, implicit $nzcv
+    B %bb.2
+  
+  bb.2.body.preheader:
+    successors: %bb.6(0x80000000)
+  
+    %1:gpr32sp = COPY %23
+    %55:gpr32sp = ADDWri %21, 1, 0
+    %2:gpr32all = COPY %55
+    %57:gpr64common = MOVaddr target-flags(aarch64-page) @glb, target-flags(aarch64-pageoff, aarch64-nc) @glb
+    B %bb.6
+  
+  bb.3.vector.ph:
+    successors: %bb.4(0x80000000)
+  
+    %29:gpr32common = ADDWri %21, 1, 0
+    %30:gpr64 = CNTW_XPiI 31, 1, implicit $vg
+    %31:gpr32common = COPY %30.sub_32
+    %32:gpr32 = SUBSWrr %29, %31, implicit-def $nzcv
+    %33:gpr32 = COPY $wzr
+    %34:gpr32 = CSELWr %33, killed %32, 3, implicit $nzcv
+    %4:ppr = WHILELO_PWW_S %33, %29, implicit-def dead $nzcv
+    %5:zpr = INDEX_II_S 0, 1, implicit $vg
+    %6:zpr = DUP_ZR_S %31
+    %7:zpr = DUP_ZR_S %22
+    %8:zpr = DUP_ZR_S %23
+    %27:gpr32all = COPY %33
+    %37:gpr64common = MOVaddr target-flags(aarch64-page) @glb, target-flags(aarch64-pageoff, aarch64-nc) @glb
+    %39:gpr64common = MOVi64imm -1
+    %41:ppr_3b = PTRUE_S 31, implicit $vg
+    %44:gpr64common = MOVi64imm 255
+    %45:gpr64common = MOVi64imm 511
+  
+  bb.4.vector.body:
+    successors: %bb.4(0x7c000000), %bb.5(0x04000000)
+  
+    %9:gpr32 = PHI %27, %bb.3, %12, %bb.4
+    %10:ppr_3b = PHI %4, %bb.3, %13, %bb.4
+    %11:zpr = PHI %5, %bb.3, %14, %bb.4
+    %35:zpr = ADD_ZZZ_S %11, %7
+    %36:gpr32 = COPY %35.ssub
+    %38:gpr64sp = ADDXrx %37, killed %36, 50
+    ST1W %35, %10, %38, %39 :: (store unknown-size into %ir.18, align 4)
+    %40:zpr = ADD_ZZZ_S %11, %8
+    %42:zpr = MUL_ZPZZ_S_UNDEF %41, %35, killed %40
+    %43:zpr = ASRD_ZPmI_S %41, %42, 1
+    ST1W %43, %10, %38, %44 :: (store unknown-size into %ir.22, align 4)
+    %46:zpr = LD1W %10, %38, %45 :: (load unknown-size from %ir.23, align 4)
+    %47:zpr = ADD_ZZZ_S killed %46, %43
+    ST1W killed %47, %10, %38, %45 :: (store unknown-size into %ir.23, align 4)
+    %50:gpr32 = ADDWrr %9, %31
+    %12:gpr32all = COPY %50
+    %13:ppr = WHILELO_PWW_S %9, %34, implicit-def $nzcv
+    %14:zpr = ADD_ZZZ_S %11, %6
+    Bcc 4, %bb.4, implicit $nzcv
+    B %bb.5
+  
+  bb.5.exit:
+    RET_ReallyLR
+  
+  bb.6.body:
+    successors: %bb.5(0x04000000), %bb.6(0x7c000000)
+  
+    %15:gpr32common = PHI %1, %bb.2, %18, %bb.6
+    %16:gpr32sp = PHI %2, %bb.2, %17, %bb.6
+    %56:gpr32common = SUBWri %15, 1, 0
+    %58:gpr64sp = ADDXrx %57, %56, 50
+    STURWi %56, %58, -4 :: (store (s32) into %ir.31)
+    %59:gpr32 = MADDWrrr %56, %15, $wzr
+    %60:gpr32 = ADDWrs %59, %59, 95
+    %61:gpr32 = SBFMWri killed %60, 1, 31
+    STRWui %61, %58, 255 :: (store (s32) into %ir.34)
+    %62:gpr32 = LDRWui %58, 511 :: (load (s32) from %ir.35)
+    %63:gpr32 = ADDWrr killed %62, %61
+    STRWui killed %63, %58, 511 :: (store (s32) into %ir.35)
+    %64:gpr32 = SUBSWri %16, 1, 0, implicit-def $nzcv
+    %17:gpr32all = COPY %64
+    %65:gpr32sp = ADDWri %15, 1, 0
+    %18:gpr32all = COPY %65
+    Bcc 0, %bb.5, implicit $nzcv
+    B %bb.6
+
+...

@kasuga-fj requested review from dpenry and ytmukai on February 6, 2025
@kasuga-fj force-pushed the pipeliner-unpipeline-insts branch from 3226021 to 09159bd on February 7, 2025
@kasuga-fj (Contributor, Author)

Gentle ping

@dpenry (Contributor) commented Feb 18, 2025

The description of the problem doesn't say enough about what was going wrong to form any opinion of the correctness of this patch. Nor does the debug message give enough information to even confirm that there was a problem.

@kasuga-fj (Contributor, Author) commented Feb 19, 2025

Below is a portion of the dependency graph for this case.

SU(0):   %27:gpr32 = PHI %21:gpr32all, %bb.3, %28:gpr32all, %bb.4
  # preds left       : 0
  # succs left       : 3
  # rdefs left       : 0
  Latency            : 0
  Depth              : 0
  Height             : 1
  Successors:
    SU(16): Data Latency=0 Reg=%27
    SU(14): Data Latency=0 Reg=%27
    SU(15): Anti Latency=1
SU(1):   %29:ppr_3b = PHI %16:ppr, %bb.3, %30:ppr, %bb.4
  # preds left       : 0
  # succs left       : 5
  # rdefs left       : 0
  Latency            : 0
  Depth              : 0
  Height             : 9
  Successors:
    SU(13): Data Latency=0 Reg=%29
    SU(11): Data Latency=0 Reg=%29
    SU(10): Data Latency=0 Reg=%29
    SU(6): Data Latency=0 Reg=%29
    SU(16): Anti Latency=1

...

SU(14):   %41:gpr32 = ADDWrr %27:gpr32, %12:gpr32common
  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 1
  Height             : 1
  Predecessors:
    SU(0): Data Latency=0 Reg=%27
    SU(16): Ord  Latency=0 Artificial
  Successors:
    SU(15): Data Latency=1 Reg=%41
SU(15):   %28:gpr32all = COPY %41:gpr32
  # preds left       : 2
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 1
  Depth              : 2
  Height             : 0
  Predecessors:
    SU(14): Data Latency=1 Reg=%41
    SU(0): Anti Latency=1
SU(16):   %30:ppr = WHILELO_PWW_S %27:gpr32, %15:gpr32, implicit-def $nzcv
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 3
  Depth              : 1
  Height             : 1
  Predecessors:
    SU(0): Data Latency=0 Reg=%27
    SU(1): Anti Latency=1
  Successors:
    ExitSU: Data Latency=1 Reg=$nzcv
    SU(14): Ord  Latency=0 Artificial

As you can see:

  • There is an artificial dependency from SU(16) to SU(14), created by CopyToPhiMutation.
  • SU(14) and SU(15) become unpipelineable because of the dependency chain SU(16) -> SU(0) -> SU(15) -> SU(14).

With respect to SU(14), SU(15), and SU(16):

  • The cycles are 19, 28, and 19, respectively, before calling normalizeNonPipelinedInstructions.
  • In normalizeNonPipelinedInstructions, the SUs are visited in instruction order, so the new cycles are determined in the order SU(14), SU(15), SU(16).

The new cycle of SU(14) cannot be less than 19, because it is the maximum over the cycles of its predecessors, and SU(16) is still scheduled at cycle 19 at that point. But SU(14) must be moved to stage 0 (in this case, a cycle from 0 to 9), which means it is not correctly normalized. The same thing happens with SU(15).
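
Plugging this example's numbers into the patch's guard (a minimal sketch in plain C++; the max-over-predecessors rule is simplified and latencies are ignored):

```cpp
#include <algorithm>
#include <cassert>

int main() {
  const int II = 10, FirstCycle = 0;   // values from the debug log
  int CycleSU16 = 19;                  // SU(16) is not normalized yet
  int NewCycleSU14 = std::max(FirstCycle, CycleSU16); // forced to 19
  assert(FirstCycle + II <= NewCycleSU14); // the guard fires: reject
  return 0;
}
```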

In this particular case, traversing the SUs in topological order when normalizing might solve the problem, but I'm not sure it is right for all cases.

@kasuga-fj (Contributor, Author)

Gentle ping

@dpenry (Contributor) commented Mar 10, 2025

There's a bit of confusion of terminology around "dependence chain", but what is going on here does make some sense to me now. SU(16) causes its two predecessors (0, 1) to be unpipelineable. SU(0) causes SU(15) to be unpipelineable because the anti-dependence is traversed backwards, and then SU(14) is made unpipelineable by SU(15).

So, I think what's happened here is that the changes to the data dependence handling have probably confused the issue, because executing in SU order should be correct -- no actual dependence can be violated, given that the original basic block was in SU order and the transitive backwards closure of a non-pipelineable instruction is considered non-pipelineable. Is it possible that there was code removed from normalizeNonPipelinedInstructions which formerly ignored the artificial dependence when computing SU(14)'s scheduling time? Was the anti-dependence traversed backwards previously? Is the artificial dependence correct?

@kasuga-fj (Contributor, Author)

There's a bit of confusion of terminology around dependence chain

Sorry for the confusion; yes, the term "dependency chain" usually seems to refer to barrier dependencies. What I meant is what you wrote.

So, I think what's happened here is that the changes to the data dependence handling has probably confused the issue

I'm not sure if I understand you correctly, but if "the changes to the data dependence handling" refers to #109918, I think this issue is not related to it. I've confirmed that the same problem happens in older versions of LLVM (before that patch was merged).

To answer your questions:

Is it possible that there was code removed from normalizeNonPipelinedInstructions which formerly ignored the artificial dependence when computing SU(14)'s scheduling time?

I don't think so.

Was the anti-dependence traversed backwards previously?

Yes, it's been traversed backwards from the beginning.

if (SU->getInstr()->isPHI())
  for (auto &Dep : SU->Succs)
    if (Dep.getKind() == SDep::Anti)
      Worklist.push_back(Dep.getSUnit());
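
Here is a standalone model of that backwards propagation (plain C++, illustrative only; the seed SU(16) is assumed to be the instruction the target refuses to pipeline, presumably because it produces the loop-control condition, and the edges mirror the dump above):

```cpp
#include <set>
#include <utility>
#include <vector>

int main() {
  // (Pred, Succ) pairs; the PHI's anti successors are folded in as if they
  // were predecessors, matching the backwards traversal in the snippet above.
  std::vector<std::pair<int, int>> Edges = {
      {0, 16}, // SU(0) -> SU(16), data
      {1, 16}, // SU(1) -> SU(16), anti
      {15, 0}, // anti succ of PHI SU(0), followed backwards
      {14, 15} // SU(14) -> SU(15), data
  };
  std::set<int> DoNotPipeline = {16}; // seed: SU(16) itself
  std::vector<int> Worklist = {16};
  while (!Worklist.empty()) {
    int SU = Worklist.back();
    Worklist.pop_back();
    for (auto &[Pred, Succ] : Edges)
      if (Succ == SU && DoNotPipeline.insert(Pred).second)
        Worklist.push_back(Pred);
  }
  // DoNotPipeline is now {0, 1, 14, 15, 16}, matching the
  // "Do not pipeline SU(n)" lines in the debug output.
  return 0;
}
```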

Is the artificial dependence correct?

I think it is correct, because swapping the source and destination instructions doesn't change the original semantics. However, it does create an inconsistency with the SU order. This artificial dependence was created by CopyToPhiMutation. I don't fully understand what it does, but it seems to be a heuristic to shorten the live ranges of some variables. And it adds artificial dependencies so as not to break the topological order, not the SU order.
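
For illustration, a hedged sketch of how a DAG mutation adds such an artificial edge, modeled on other in-tree mutations; this is not the actual CopyToPhiMutation code, and the SU selection is a placeholder:

```cpp
#include "llvm/CodeGen/ScheduleDAGInstrs.h"
#include "llvm/CodeGen/ScheduleDAGMutation.h"

namespace {
// Hypothetical mutation: UseOfPhi and DefFeedingCopy stand for the SUs the
// real heuristic would select (in the example above, SU(16) and SU(14)).
struct ShortenCopyLiveRange : llvm::ScheduleDAGMutation {
  void apply(llvm::ScheduleDAGInstrs *DAG) override {
    llvm::SUnit *UseOfPhi = &DAG->SUnits[0];       // placeholder selection
    llvm::SUnit *DefFeedingCopy = &DAG->SUnits[1]; // placeholder selection
    // Make the PHI's user a predecessor of the def that feeds the COPY, so
    // the COPY is issued after the other uses of the PHI value.
    DAG->addEdge(DefFeedingCopy, llvm::SDep(UseOfPhi, llvm::SDep::Artificial));
  }
};
} // namespace
```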

In conclusion, I agree with you on the following.

executing in SU order should be correct -- no actual dependence can be violated

However, an artificial dependence is not an actual dependence. Ignoring all artificial dependencies in normalizeNonPipelinedInstructions may be a solution, but I don't know if it is correct in all cases.

@dpenry (Contributor) commented Mar 20, 2025

OK. I suspect that ignoring the artificial dependence is OK because it's just trying to force scheduling results that reduce copying, but I can't be 100% sure. I think I'm good with just refusing to pipeline here for now, but it would be good for the comments in the code to expand a little more on what's going on, specifically mentioning that artificial dependences are preventing the "obvious" initial ordering from being used.

@kasuga-fj (Contributor, Author)

OK. I suspect that ignoring the artificial dependence is OK because it's just trying to force scheduling results that reduce copying, but I can't be 100% sure.

I also think that ignoring the artificial dependencies doesn't break anything. However, we should also consider that there may be other Mutations besides CopyToPhiMutation appended by getSMSMutations, and they might add artificial dependencies for other reasons. I believe artificial dependencies should not affect correctness, but I'm not sure.

P.MF->getSubtarget().getSMSMutations(Mutations);
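
For reference, the shape of the hook in question (paraphrased from TargetSubtargetInfo.h; treat the exact signature as an assumption, and SubtargetHookShape is a stand-in name):

```cpp
#include "llvm/CodeGen/ScheduleDAGMutation.h"
#include <memory>
#include <vector>

// Each target can append its own pipeliner DAG mutations here, and any of
// them may introduce artificial edges, just like CopyToPhiMutation does.
struct SubtargetHookShape {
  virtual ~SubtargetHookShape() = default;
  virtual void getSMSMutations(
      std::vector<std::unique_ptr<llvm::ScheduleDAGMutation>> &Mutations)
      const {}
};
```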

it would be good for the comments in the code to expand a little more about what's going on, specifically mentioning that artificial dependences are preventing the "obvious" initial ordering from being used.

Added detailed comments.

@kasuga-fj (Contributor, Author)

Thanks!

@kasuga-fj merged commit 857a04c into llvm:main on Mar 21, 2025
11 checks passed