
Commit b34382e by Menooker

[mlir][Memref] Add memref-merge optimization (#44)

1 parent bc0014b · 15 files changed: +2240, -0 lines

docs/memref_schedule.md

Lines changed: 121 additions & 0 deletions

# RFC: Compile-time memref.alloc Scheduling/Merging

This document proposes a compile-time optimization on existing `memref.alloc` to reduce memory usage and improve memory locality.

## Current status of bufferization and memref pass pipeline

Bufferization is the process in MLIR of converting ops with tensor semantics to ops with memref semantics. MLIR currently has two different bufferization passes: one-shot-bufferization and the older/partial bufferization (legacy version).

One-Shot Bufferize is a new tensor bufferization pass designed for IR in destination-passing style and with aggressive in-place bufferization. The older/partial bufferization was built around multiple dialects. The community is trying to gradually deprecate the older bufferization and replace it with one-shot bufferization.

The goal of bufferization is to use as little memory as possible and copy as little memory as possible. As a result, the existing focus is on determining whether to bufferize in-place or out-of-place for the OpOperands and OpResults of individual ops, without giving much consideration to overall memory reuse across operations within a sub-graph (or partition).

The current implementation of the bufferization and memref pass pipeline focuses on copy avoidance and in-place reuse of memory. Consider a computation graph of 4 layers of matmul sharing the same weight:

```mlir
func.func @mlp(%x: tensor<128x128xf32>, %y: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %a0 = tensor.empty() : tensor<128x128xf32>
  %a = linalg.matmul ins(%x, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %b0 = tensor.empty() : tensor<128x128xf32>
  %b = linalg.matmul ins(%a, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%b0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %c0 = tensor.empty() : tensor<128x128xf32>
  %c = linalg.matmul ins(%b, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%c0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %d0 = tensor.empty() : tensor<128x128xf32>
  %d = linalg.matmul ins(%c, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%d0: tensor<128x128xf32>) -> tensor<128x128xf32>
  return %d : tensor<128x128xf32>
}
```

The bufferization pass will create a `memref.alloc` for each of the tensors `a0`, `b0` and `c0`. The bufferization result should look like:

```mlir
func.func @mlp(%x: memref<128x128xf32>, %y: memref<128x128xf32>) -> memref<128x128xf32> {
  %a0 = memref.alloc() : memref<128x128xf32>
  linalg.matmul ins(%x, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%a0: memref<128x128xf32>)
  %b0 = memref.alloc() : memref<128x128xf32>
  linalg.matmul ins(%a0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%b0: memref<128x128xf32>)
  %c0 = memref.alloc() : memref<128x128xf32>
  linalg.matmul ins(%b0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%c0: memref<128x128xf32>)
  %d0 = memref.alloc() : memref<128x128xf32>
  linalg.matmul ins(%c0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%d0: memref<128x128xf32>)
  return %d0 : memref<128x128xf32>
}
```

Without further optimizations, 3 temp buffers will be allocated at runtime for these tensors. However, as we can see in the IR, the buffer `a0` is no longer in use when buffer `c0` is allocated. So `c0` can reuse the memory of `a0`, reducing the memory footprint and improving locality.

An observation of the current bufferization and memref passes is that they do not consider memory buffer planning: reusing buffers/memrefs for a smaller total size and better locality.

## Proposal

This RFC proposes an optimization that consolidates multiple allocations (`memref.alloc` ops) into a single `memref.alloc` op. Each static-shaped `memref.alloc` op will be transformed into a "slice" of the `single allocated buffer`, expressed as a `memref.view` with a compile-time-decided `offset`. This optimization works on `memref` instead of `tensor` ops, so it should be executed after the bufferization pass and before buffer-deallocation.

While merging the memory allocations, the transform should consider the lifetime of each allocated `memref`. By lifetime, we mean the range of time during which a memref allocated from `memref.alloc` is actively used. References to `view`s of a "base" memref contribute to the lifetime of the "base". A later `memref.alloc` may reuse the memory of a previously allocated memref if the lifetimes of the two do not overlap. The transform performs this reuse by setting the `offset` of the later `memref.view` to a position within the memory range occupied by the earlier allocation's `memref.view` on the `single allocated buffer`.

Below is the expected transformation result of the example IR in the above section:

```mlir
func.func @mlp(%x: memref<128x128xf32>, %y: memref<128x128xf32>) -> memref<128x128xf32> {
  %single_buffer = memref.alloc() : memref<131072xi8> // 128*128*sizeof(f32)*2
  %a0 = memref.view %single_buffer[0][] : memref<131072xi8> to memref<128x128xf32> // a0 takes the memory from byte offset 0
  linalg.matmul ins(%x, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%a0: memref<128x128xf32>)
  %b0 = memref.view %single_buffer[65536][] : memref<131072xi8> to memref<128x128xf32> // b0 takes the memory from byte offset 128*128*sizeof(f32)
  linalg.matmul ins(%a0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%b0: memref<128x128xf32>)
  %c0 = memref.view %single_buffer[0][] : memref<131072xi8> to memref<128x128xf32> // c0 takes the memory from byte offset 0
  linalg.matmul ins(%b0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%c0: memref<128x128xf32>)
  %d0 = memref.alloc() : memref<128x128xf32> // d0 is returned, do not merge
  linalg.matmul ins(%c0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%d0: memref<128x128xf32>)
  return %d0 : memref<128x128xf32>
}
```

There is one single allocation `single_buffer` for all temp buffers, and the `alloc` ops for `a0`, `b0` and `c0` are removed. The returned memref `d0` is untouched. The memrefs `a0`, `b0` and `c0` are replaced by `memref.view` ops on `single_buffer`. Since the lifetimes of `a0` and `b0` overlap, the transformation "allocates" different memory ranges on `single_buffer` for them - note that `a0` and `b0` have different offsets `%single_buffer[0]` and `%single_buffer[65536]`, and their memory ranges do not overlap. The lifetime of `c0` does not overlap with that of `a0`, so `c0` can reuse the memory range of `a0` by setting its offset to `%single_buffer[0]`, the same as `a0`'s. The final allocation size of the temp memory buffer is `128*128*sizeof(f32)*2` bytes, instead of three `memref<128x128xf32>` buffers in the original IR.

The transformation should only consider merging a `memref.alloc` if
* the ownership of the memref does not escape from the function. That is, the current function is responsible for allocating and deallocating this memref
* and, the allocated memref is contiguous and has a static shape and an identity layout.

In this RFC, we call such `memref.alloc` ops **mergeable** allocations. A sketch of this mergeability check is shown below.

The memrefs passed as function arguments, or returned by the function, will be untouched by this optimization.
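
For illustration, here is a minimal C++ sketch of the mergeability check. The helper name `isMergeableAlloc` is hypothetical, and the escape check is deliberately conservative; the actual pass may use a more precise ownership analysis.

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Interfaces/CallInterfaces.h"

using namespace mlir;

// Hypothetical helper: returns true if `alloc` is a "mergeable" allocation.
static bool isMergeableAlloc(memref::AllocOp alloc) {
  MemRefType type = alloc.getType();
  // Condition: contiguous, statically shaped, identity layout.
  if (!type.hasStaticShape() || !type.getLayout().isIdentity())
    return false;
  // Condition: ownership must not escape the function. Conservatively
  // reject allocs whose memref is returned or passed into a call.
  for (Operation *user : alloc->getUsers())
    if (isa<func::ReturnOp>(user) || isa<CallOpInterface>(user))
      return false;
  return true;
}
```
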
## Other solutions

Another (not yet existing) approach to resolving the memory reuse issue is to insert `memref.dealloc` as soon as the buffer is no longer used. For example, in the above "matmul" example, a `memref.dealloc` can be inserted after the last use of `a0` at `linalg.matmul ins(%a0, %y...)`. Then, even without the memref merging transformation, a common runtime memory allocator will try to reuse the memory freed by `memref.dealloc(%a0)` when allocating the buffer for `c0`. However, this approach has some disadvantages compared to the compile-time memref merging transformation of this proposal:
1. it depends on the implementation of the runtime memory allocator.
2. the runtime memory allocator does not have a full picture of the future allocation/deallocation patterns of the program. For example, if we change the above example to make the buffer size of `c0` greater than the size of `a0`, the runtime memory allocator will be unlikely to reuse the memory of `a0` for `c0`, because the free memory chunk of `a0` does not fit the allocation of `c0`. In contrast, the proposed optimization of this document has knowledge of the allocation patterns. Thus, it can put the memory chunk for `a0` at the right place in the `single allocation buffer`, so that the allocation of `c0` can fit into it. A worked example follows this list.
3. calling the runtime memory allocator for each buffer introduces more runtime overhead than a single merged allocation after allocation merging.
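
To make point 2 concrete with hypothetical sizes (not taken from the IR above): suppose `a0` and `b0` are 64 KB each and `c0` is 128 KB, with the same lifetimes as before (`a0` overlaps `b0`, `b0` overlaps `c0`, `a0` does not overlap `c0`). A runtime allocator frees `a0`'s 64 KB chunk, but the 128 KB request for `c0` cannot fit into it, so if that chunk cannot be reused or coalesced the allocator's peak footprint is 64 + 64 + 128 = 256 KB. The compile-time planner, knowing all requests in advance, can place `c0` at offset 0, place `a0` inside `c0`'s range (also at offset 0, since their lifetimes are disjoint), and place `b0` at offset 128 KB, for a total of 192 KB.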

However, utilizing a runtime memory allocator can be viewed as a supplementary approach to compile-time allocation merging, for example, to handle memrefs with dynamic shapes. These two memory optimization approaches should coexist and cooperate in the pass pipeline.

## Implementation
87+
*The detail implementation will leverage the exising algorithm used in GC V1.*
88+
89+
The transformation first needs to identify the `alloc scopes`, which are mlir `Block`s
90+
* implementing `AutomaticAllocationScope`
91+
* and is not `scf.for` (allocations in an `scf.for` can be hoisted to parent `AutomaticAllocationScope`)
92+
93+
For example, below is an example IR of a function with nested `scf.forall` ops.
94+
95+
```mlir
func.func @mlp(...) { // <---- alloc scope 1
  scf.for(...) { // <---- NOT an alloc scope!
    // allocations inside will be merged into alloc scope 1 above
  }
  ...
  scf.forall(...) { // <---- alloc scope 2
    ...
    // allocations here will be merged into alloc scope 2
    %buf = memref.alloc() : ...
    scf.forall(...) { // <---- alloc scope 3
    }
  }
}
```

There will be three `alloc scopes`, as marked in the comments above. An `alloc scope` marks the position to insert the `single allocation buffer` after allocation merging. After the transformation, every "mergeable" `memref.alloc` will be merged into the `single allocation buffer` of its nearest ancestor `alloc scope`.
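
As an illustration, below is a minimal C++ sketch of how the `alloc scopes` could be collected; the helper name `getAllocScopes` is hypothetical. Since `Operation::walk` also visits the function op itself, the function body is reported as alloc scope 1.

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Hypothetical helper: collect every op that opens an `alloc scope`, i.e.
// ops implementing AutomaticAllocationScope that are not scf.for.
static SmallVector<Operation *> getAllocScopes(func::FuncOp func) {
  SmallVector<Operation *> scopes;
  func->walk([&](Operation *op) {
    if (op->hasTrait<OpTrait::AutomaticAllocationScope>() &&
        !isa<scf::ForOp>(op))
      scopes.push_back(op);
  });
  return scopes;
}
```
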

The transformation consists of an analysis sub-pass and a mutation sub-pass. For each `alloc scope`, the analysis sub-pass finds the lifetime of each mergeable `memref.alloc` belonging to the `alloc scope`. Given the lifetime of each allocation, a memory planning algorithm is run to find the `single allocation buffer` size of each `alloc scope` and the `offset` of each mergeable allocation within its `single allocation buffer`. Based on the memory planning result, the mutation sub-pass transforms the IR to
1. insert a `memref.alloc` at the front of the `alloc scope` body for its `single allocation buffer`
2. replace each mergeable `memref.alloc` with a `memref.view` on its `alloc scope`'s `single allocation buffer`
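
A minimal sketch of this mutation step, assuming a hypothetical helper `applyPlan` and that the memory planning result maps each mergeable alloc to a byte offset:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Builders.h"
#include "llvm/ADT/DenseMap.h"

using namespace mlir;

// Hypothetical helper: materialize the planning result for one alloc scope.
static void applyPlan(Block *scopeBody, int64_t totalSize,
                      const DenseMap<memref::AllocOp, int64_t> &offsets) {
  Location loc = scopeBody->getParentOp()->getLoc();
  // 1. Insert the `single allocation buffer` at the front of the scope body.
  OpBuilder b(scopeBody, scopeBody->begin());
  auto bufferType = MemRefType::get({totalSize}, b.getI8Type());
  Value buffer = b.create<memref::AllocOp>(loc, bufferType);
  // 2. Replace each mergeable alloc with a view at its planned byte offset.
  for (auto [alloc, offset] : offsets) {
    b.setInsertionPoint(alloc);
    Value shift = b.create<arith::ConstantIndexOp>(loc, offset);
    Value view = b.create<memref::ViewOp>(loc, alloc.getType(), buffer, shift,
                                          /*sizes=*/ValueRange{});
    alloc.getResult().replaceAllUsesWith(view);
    alloc->erase();
  }
}
```
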

Ticks are assigned to each operation in the `func.func` by an increasing counter during a pre-order recursive walk of the IR, giving the "execution tick" of each operation. The lifetime analysis pass assigns two integers to each mergeable allocation as its analysis result: `begin_tick` and `end_tick`, indicating the first and last tick of the use of the allocated memref in the IR. There must be special handling for loop and branch ops (`RegionBranchOpInterface` or `LoopLikeOpInterface`) that reference memrefs allocated in parent scopes, to avoid wrongly reusing buffers that are still needed across loop iterations.

The analysis result for each mergeable allocation is an integer range `[begin_tick, end_tick]`, where `begin_tick <= end_tick`.
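
Below is a minimal sketch of this tick-based lifetime collection; the helper name `collectLifetimes` is hypothetical, and the special handling for `RegionBranchOpInterface`/`LoopLikeOpInterface` as well as aliasing through `memref.view` is omitted for brevity.

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "llvm/ADT/DenseMap.h"

using namespace mlir;

struct Lifetime {
  int64_t beginTick = -1; // tick of the allocation itself
  int64_t endTick = -1;   // tick of the last use seen so far
};

// Hypothetical helper: assign pre-order ticks and record, for each
// allocation, the first and last tick at which its memref is used.
static DenseMap<Value, Lifetime> collectLifetimes(func::FuncOp func) {
  DenseMap<Value, Lifetime> lifetimes;
  int64_t tick = 0;
  func->walk<WalkOrder::PreOrder>([&](Operation *op) {
    ++tick;
    if (auto alloc = dyn_cast<memref::AllocOp>(op)) {
      lifetimes[alloc.getResult()] = Lifetime{tick, tick};
      return;
    }
    // Any use of a tracked memref extends its lifetime to the current tick.
    for (Value operand : op->getOperands()) {
      auto it = lifetimes.find(operand);
      if (it != lifetimes.end())
        it->second.endTick = tick;
    }
  });
  return lifetimes;
}
```
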

The collected ticks for each buffer will then be processed by the memory planning algorithm. It should output the total size of the `single allocation buffer` of each `alloc scope`, and the `offset` of each individual mergeable buffer. The algorithm should also consider the locality of the buffer to use, when multiple candidate buffer locations are available.
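
To make the planning step concrete, here is a minimal sketch of a first-fit-style planner over the collected `[begin_tick, end_tick]` ranges. The names and the greedy strategy are illustrative only; the actual algorithm (leveraged from GC V1) should additionally weigh locality when several candidate locations fit.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Request {
  int64_t beginTick, endTick; // lifetime range from the analysis sub-pass
  int64_t size;               // buffer size in bytes
};
struct Placement {
  int64_t offset, size, beginTick, endTick;
};

// Returns the byte offset of each request within the single allocation
// buffer and sets `totalSize` to the required buffer size.
static std::vector<int64_t> planMemory(const std::vector<Request> &reqs,
                                       int64_t &totalSize) {
  std::vector<Placement> placed;
  std::vector<int64_t> offsets(reqs.size());
  totalSize = 0;
  for (size_t i = 0; i < reqs.size(); ++i) {
    const Request &r = reqs[i];
    // First fit: an offset is valid if [offset, offset+size) does not
    // overlap any placement whose lifetime overlaps r's lifetime.
    int64_t offset = 0;
    bool retry = true;
    while (retry) {
      retry = false;
      for (const Placement &p : placed) {
        bool lifetimeOverlap =
            r.beginTick <= p.endTick && p.beginTick <= r.endTick;
        bool memoryOverlap =
            offset < p.offset + p.size && p.offset < offset + r.size;
        if (lifetimeOverlap && memoryOverlap) {
          offset = p.offset + p.size; // bump past the conflicting chunk
          retry = true;
        }
      }
    }
    placed.push_back({offset, r.size, r.beginTick, r.endTick});
    offsets[i] = offset;
    totalSize = std::max(totalSize, offset + r.size);
  }
  return offsets;
}
```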
