[Transform] Only use gc runtime allocator for stack-like alloca ops #287

Merged: 17 commits, Sep 4, 2024

lib/gc/Transforms/MemRefToCPURuntime.cpp (21 additions, 0 deletions)

@@ -77,6 +77,7 @@ struct ConvertMemRefToCPURuntime
    auto *ctx = &getContext();
    // Create a local set to store operations that should not be transformed.
    llvm::SmallSet<Operation *, 16> noTransformOps;
    llvm::SmallVector<Operation *, 16> allocStack;

    // Walk through the module to find func::FuncOp instances.
    getOperation()->walk([&](func::FuncOp funcOp) {
@@ -96,6 +97,26 @@ struct ConvertMemRefToCPURuntime
              }
            }
          }
        } else if (isa<memref::AllocOp>(op)) {

Please note that the walk() order may differ from the execution order. And what about this case:

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     dealloc(b)
     dealloc(a)
     return c
}

The optimization still applies.

Another topic: since we are going to handle general MLIR programs that may be hand-written by users, can your pass handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

It looks stack-like, but it is not.

I would suggest stricter checking:

  1. first, check whether an alloc is "leaked" from the scope (already done in your code on the main branch)
  2. for the non-leaked allocations only, check whether the alloc/free order is stack-like within the same Region, ignoring the leaked allocations. Usually, we can ensure the execution order within the same region.
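The suggested two-step check can be sketched as a small standalone model. This is a hypothetical illustration in plain C++ without MLIR types; `Event`, `isStackLike`, and the buffer names are invented for the sketch, and it only models a single linear region (it does not capture scf.if branches):

```cpp
// Hypothetical, simplified model of the suggested check: a sequence of
// alloc/dealloc events in one region is stack-like if, after ignoring
// "leaked" allocations (those with no dealloc in the region), every
// dealloc frees the most recently allocated live buffer.
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Event {
  bool isAlloc;    // true = alloc, false = dealloc
  std::string buf; // buffer name
};

bool isStackLike(const std::vector<Event> &events) {
  // Step 1: collect buffers deallocated in this region; everything else
  // is treated as leaked and exempt from the check.
  std::set<std::string> deallocated;
  for (const Event &e : events)
    if (!e.isAlloc)
      deallocated.insert(e.buf);

  // Step 2: replay the events; every dealloc must match the stack top.
  std::vector<std::string> stack;
  for (const Event &e : events) {
    if (e.isAlloc) {
      if (deallocated.count(e.buf)) // leaked allocs are ignored
        stack.push_back(e.buf);
    } else {
      if (stack.empty() || stack.back() != e.buf)
        return false; // dealloc out of FILO order
      stack.pop_back();
    }
  }
  return true;
}
```

Under this model, the a/b/c example above (where c escapes) is accepted because c is exempt, while freeing a before b would be rejected.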

Contributor

Another topic: since we are going to handle general MLIR programs that may be hand-written by users, can your pass handle this?

func AAA() {
     a=alloc()
     b=alloc()
     c=alloc()
     if(...) {
        dealloc(c)
     } else {
        dealloc(b)
     }
     dealloc(a)
}

Not sure whether it's a valid case, as it will always result in a memory leak?

I think we are targeting a general MLIR compiler, so any user input should be handled. Users can always write code like this. At least it will not produce wrong results, if they can live with the memory leak.

          allocStack.push_back(op);
        } else if (isa<memref::DeallocOp>(op)) {
          // fallback: alloc / dealloc not in FILO fashion
          Value deallocMemref = op->getOperands().front();
          auto aliases = analysis.resolveReverse(deallocMemref);

Contributor

Do we really need the aliases for analyzing deallocMemref vs AllocMemref? They'll be exactly the same, right?

Author

According to createBufferDeallocationPass, deallocMemref will be the same as AllocMemref. But what if we have an IR like the one below and call the memref-to-cpuruntime pass directly? Do you think we can rule out this possibility and drop the alias analysis?

func.func @alloc_dealloc_view() {
  // CHECK-LABEL: func @alloc_dealloc_view()
  // CHECK: %[[m0:.*]] = cpuruntime.alloc() : memref<128xi8>
  %c0 = arith.constant 0: index
  %f0 = arith.constant 0.0: f32
  %alloc = memref.alloc() : memref<128xi8>
  %view = memref.view %alloc[%c0][] : memref<128xi8> to memref<32xf32>
  %subview = memref.subview %view[0][16][1] : memref<32xf32> to memref<16xf32>
  // CHECK: cpuruntime.dealloc
  memref.dealloc %subview : memref<16xf32>
  return
}

Why do we have such a case when it is not stack-like?

Author

This case is from Issue #279. After the bufferDeallocation pass, we get the following IR, whose alloc/dealloc pairs are in FIFO fashion. This case should be resolved by PR #283 with the MergeAlloc pass enabled, so the small temp buffers are merged into a larger buffer. It would still be better to check that alloc/dealloc is stack-like, in case we have non-stack-like allocations in some complex scenarios.

scf.parallel (%arg11) = (%c0_3) to (%c32_4) step (%c16_5) {
      %alloc_18 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xbf16>
      %subview = memref.subview %arg2[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      %subview_19 = memref.subview %expand_shape[0, %arg11, 0] [1, 16, 4096] [1, 1, 1] : memref<1x32x4096xbf16> to memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %subview[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %4 = memref.load %subview_19[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16, strided<[131072, 4096, 1], offset: ?>>
        %5 = arith.extf %4 fastmath<contract> : bf16 to f32
        %6 = arith.extf %3 fastmath<contract> : bf16 to f32
        %7 = arith.addf %6, %5 : f32
        %8 = arith.truncf %7 fastmath<contract> : f32 to bf16
        memref.store %8, %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        scf.reduce 
      }
      %alloc_20 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_18[%arg12, %arg13, %arg14] : memref<1x16x4096xbf16>
        %4 = arith.extf %3 : bf16 to f32
        memref.store %4, %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_18 : memref<1x16x4096xbf16>
      %alloc_21 = memref.alloc() {alignment = 64 : i64} : memref<1x16x4096xf32>
      scf.parallel (%arg12, %arg13, %arg14) = (%c0, %c0, %c0) to (%c1, %c16, %c4096) step (%c1, %c1, %c1) {
        %3 = memref.load %alloc_20[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %4 = memref.load %0[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        %5 = math.powf %3, %4 : f32
        memref.store %5, %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
        scf.reduce 
      }
      memref.dealloc %alloc_20 : memref<1x16x4096xf32>
      %alloc_22 = memref.alloc() {alignment = 64 : i64} : memref<1x16xf32>
      scf.parallel (%arg12, %arg13) = (%c0, %c0) to (%c1, %c16) step (%c1, %c1) {
        memref.store %cst, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        scf.for %arg14 = %c0 to %c4096 step %c1 {
          %3 = memref.load %alloc_21[%arg12, %arg13, %arg14] : memref<1x16x4096xf32>
          %4 = memref.load %alloc_22[%arg12, %arg13] : memref<1x16xf32>
          %5 = arith.addf %3, %4 : f32
          memref.store %5, %alloc_22[%arg12, %arg13] : memref<1x16xf32>
        }
        scf.reduce 
      }

Contributor

But what if we have an IR like the one below and call the memref-to-cpuruntime pass directly? Do you think we can rule out this possibility and drop the alias analysis?

I don't think it's valid to dealloc/free a partial buffer through either a view or a subview of memref.alloc; please check the statement here: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td#L552-L555

Author

Thanks for pointing out this reference. I will remove the alias analysis.

Contributor

Why do we have such a case when it is not stack-like?

Ideally, all the separate memref.alloc ops within the same scope should be merged into one big chunk by the memref-merge pass; this fallback is a workaround until the memref-merge patch is pulled in, or for cases where non-FILO alloc/dealloc scenarios remain.

The alloc/free used to be FILO (stack-like). Is there any recent upstream optimization that changed that?

Ideally, all the separate memref.alloc ops within the same scope should be merged into one big chunk by the memref-merge pass;

The buffer-merge pass is an optimization and can be turned off at -O0 in the future. I suppose we should always make this pass safe to run.

If we are handling dynamic-shaped buffers, buffer-merge won't work and we will still need many small allocations. I would suggest that in our pass pipeline, if the stack-like allocator is on, we try to make sure the BufferDeallocationPipeline maintains a FILO pattern; if the stack-like allocator is off, we can enable non-FILO patterns in the BufferDeallocationPipeline.

          Value topAllocMemref = allocStack.back()->getResults().front();
          if (aliases.find(topAllocMemref) != aliases.end()) {
            allocStack.pop_back();
          } else {
            noTransformOps.insert(op);
            for (int i = allocStack.size() - 1; i >= 0; --i) {
              Operation *curAlloc = allocStack[i];
              if (aliases.find(curAlloc->getResults().front()) !=
                  aliases.end()) {
                noTransformOps.insert(curAlloc);
                allocStack.erase(allocStack.begin() + i);
              }
            }
          }
        }
      });
    });
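
The fallback logic in this hunk can be sketched as a small standalone model. This is a hypothetical illustration in plain C++ without MLIR types; `State`, `visitAlloc`, and `visitDealloc` are invented names, and buffers stand in for the `Operation*` values and alias sets the real pass works with:

```cpp
// Hypothetical, simplified model of the fallback: a dealloc matching the
// top of allocStack keeps the stack allocator; an out-of-order dealloc
// sends its buffer (and, in the real pass, the dealloc op itself) to the
// no-transform set and removes the matching alloc from the stack.
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct State {
  std::vector<std::string> allocStack; // live, still-transformable allocs
  std::set<std::string> noTransform;   // buffers left on the default allocator
};

void visitAlloc(State &s, const std::string &buf) {
  s.allocStack.push_back(buf);
}

void visitDealloc(State &s, const std::string &buf) {
  if (!s.allocStack.empty() && s.allocStack.back() == buf) {
    s.allocStack.pop_back(); // FILO order: stack allocation is safe
    return;
  }
  // Non-FILO dealloc: fall back to the normal allocator for this buffer.
  s.noTransform.insert(buf);
  for (int i = static_cast<int>(s.allocStack.size()) - 1; i >= 0; --i) {
    if (s.allocStack[i] == buf) {
      s.allocStack.erase(s.allocStack.begin() + i);
      break;
    }
  }
}
```

Note that only the offending buffer falls back; deallocs that arrive later in proper FILO order against the remaining stack are still transformed.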