SILOptimizer: a new optimization to hoist destroys of memory locations #26803

Conversation
@swift-ci test
@swift-ci benchmark
Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
Code size: -swiftlibs
How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%. If you see any unexpected regressions, you should consider fixing the regressions before merging. Noise: sometimes the performance results (not code size!) contain false alarms.
Build failed
Build failed
@swift-ci test
Build failed
Build failed
For this review I'll need to assume that general destroy_addr hoisting and splitting is something we will want at -O. Initializing the MemoryLocation data and the global dataflow state needed to simultaneously solve for all memory sublocations in the function is nontrivial. It would be gratuitous to do that just to handle a specific switch_enum pattern that might occur somewhere in that function. But I can instead view this as an eventual replacement for DestroyHoisting in CopyForwarding--that was originally designed as a -Onone peephole, which we later decided not to enable at -Onone for fear of affecting debugging. Our intention (for the past few years) was to just replace CopyForwarding/DestroyHoisting with Opaque Values and CopyPropagation. Assuming that happens, I wasn't sure under what circumstance we would still need destroy_addr hoisting. The problem of switching over a mutable enum is at least one good example of why destroy_addr hoisting will remain useful. I have a lingering concern because it should be possible to mutate a CoW enum payload in place at -Onone. For that, maybe we'll just need to wait for "move" semantics in the language.
This approach is also very different because it aggressively splits destroys. I've always assumed that canonical SIL should have combined destroys so it's easy to discover the lifetime of aggregates.
Setting all that aside, and assuming we just want to hoist and split all destroy_addr instructions at -O, this is a beautiful way to accomplish that. It will be great to have a complete solution that doesn't depend on a particular shape of SIL. The implementation in this PR is very clean and high quality.
I like the way that MemoryLocations and MemoryDataflow can be used as utilities now, each with independent state and lifetime, without being wedged into some inheritance pattern.
On the algorithm:
In the top-level comments, there's no explanation of how aggregates and their sublocations are handled. But that's really essential to why this approach was taken, rather than a more efficient per-variable lifetime discovery. I think of this design as a "destroy splitting" approach as much as destroy hoisting. Can you explain that it splits destroys into many smaller destroy operations, and why that would usually be desirable?
use(pair.first)
use(pair.second)
// any side effect
destroy(pair)
becomes:
use(pair.first)
destroy(pair.first)
use(pair.second)
destroy(pair.second)
Isn't this potentially a code size problem?
Also, this seems contrary to most other SIL analyses where the code tries to recognize aggregate value lifetimes as a single unit. For example, I'm not sure CopyForwarding will even kick in after this hoisting (and destroy splitting) is done.
On analysis invalidation:
This pass mutates the IR while querying the MemoryDataflow and MemoryLocations. For functions that do this, there should be a note explaining why this is guaranteed safe. I'm sure you thought about it, but I didn't try to prove to myself why it's safe. When other developers modify this code or cargo-cult it, I want them to understand that this is normally something you need to think very carefully about. In particular, explain why registerProjection is necessary and sufficient (I still wouldn't be able to explain that).
On performance:
It would be interesting to know why destroy hoisting causes significant regressions, if only to understand whether there is some later optimization that currently depends on the shape of those destroys--maybe that optimization needs to be fixed.
I'm also baffled by the Onone performance gains that were reported. Seeing that, I initially assumed you had enabled this at Onone. But clearly these changes should not affect Onone, so what's going on?
On testing:
I'm guessing you already did this, but I want to point out that store splitting should be fully tested (on SCK at least) without applying the isSplittingEnabler filter.
On compile time:
I think this style of dataflow can become expensive. For compile time, it's the pathological cases that matter, not the typical case. I usually see the cost show up as time needed to (re)initialize all the per-block bitsets. It will just boil down to how many memory locations need to be tracked and how often it gets recomputed. The way our pipeline works, analyses can easily be recomputed thousands of times. I'm not too worried about recomputing lifetimes yet since it's only being used by a single function pass for now. But I'm afraid that running before SROA, and without opaque values, means tracking lots of locations for no reason.
I'm sure, initially, the cost is hidden by other gratuitously costly parts of the pipeline: type checker, ARC, RLE, DSE, LLVM. But can you determine the worst case time in SCK, or maybe count the amount of worst-case allocation? If it shows up at all, we might want to filter the address producers to avoid allocating locations that are obviously going to be SROA'd anyway, or obviously can't be optimized, like address-only types. Alternatively, if you just implement a simple location filter and determine that it then never allocates more than 64 locations, there's nothing at all to worry about.
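To make the filtering idea concrete, here is a hypothetical sketch that caps the number of tracked locations so the per-block bit vectors stay small. The predicates are placeholders standing in for the real checks, not the pass's actual API:

    #include <vector>

    // Placeholder predicates, e.g. "is this an address-only location?" and
    // "will SROA scalarize this allocation anyway?" (illustrative only).
    static bool isAddressOnlyLoc(unsigned) { return false; }
    static bool willBeScalarized(unsigned) { return false; }

    // Cap so that the per-block bit vectors fit in a single machine word.
    static constexpr unsigned MaxTrackedLocations = 64;

    // Collect only the locations worth tracking; give up beyond the cap.
    static bool selectLocations(unsigned numLocs,
                                std::vector<unsigned> &tracked) {
      for (unsigned i = 0; i < numLocs; ++i) {
        if (isAddressOnlyLoc(i) || willBeScalarized(i))
          continue;
        if (tracked.size() == MaxTrackedLocations)
          return false; // too many locations: skip the optimization
        tracked.push_back(i);
      }
      return true;
    }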
Bits usedLocs(locations.getNumLocations());

// Initialize the dataflow, which tells us which destroy_addr instructions are
// reachable through a switch_enum, without a use of the location inbetween.
in between
Please add brief comments explaining how this problem maps to the forward data flow: e.g.
Compute forward reaching split-enable-points. The split-enable-point is the first
program point at which it is useful to hoist the destroy. genSet is initialized
at each memory location's split-enable-point. killSet are the points
where the location is used. If a location is used on some path from
the gen-point to the destroy, then the destroy will not be hoisted
beyond that point. If the location is used on all paths, the destroy
will not be hoisted at all. If it is not used on some path, it will
be hoisted to the latest use points. Hence the use of
solveForwardWithUnion.
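As a reader's aid, a minimal, self-contained sketch of a forward bit-vector dataflow with a union join, in the spirit of what solveForwardWithUnion does. The Block type and every name below are illustrative, not the pass's actual API; gen/kill are assumed to be pre-sized to the number of locations:

    #include <llvm/ADT/BitVector.h>
    #include <vector>

    struct Block {
      std::vector<int> preds;      // indices of predecessor blocks
      llvm::BitVector gen, kill;   // gen: split-enable points; kill: uses
      llvm::BitVector entry, exit; // solved: fact reaches on SOME path
    };

    void solveForwardWithUnion(std::vector<Block> &blocks, unsigned numLocs) {
      for (auto &b : blocks) {
        b.entry.resize(numLocs);
        b.exit.resize(numLocs);
      }
      // Iterate to a fixpoint (reverse post-order would converge fastest).
      bool changed = true;
      while (changed) {
        changed = false;
        for (auto &b : blocks) {
          // Union join: a fact reaches the block entry if it reaches the
          // exit of ANY predecessor.
          llvm::BitVector entry(numLocs);
          for (int p : b.preds)
            entry |= blocks[p].exit;
          // Transfer: a use kills the fact; a split-enable point generates it.
          llvm::BitVector exit(entry);
          exit.reset(b.kill);
          exit |= b.gen;
          if (entry != b.entry || exit != b.exit) {
            b.entry = std::move(entry);
            b.exit = std::move(exit);
            changed = true;
          }
        }
      }
    }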
// This enables the switch-enum optimization (see above).
// TODO: investigate the benchmark regressions and enable store-splitting more
// aggressively.
void DestroyHoisting::splitStores(MemoryDataflow &dataFlow) {
To me, splitting a store means breaking it into multiple narrower stores. This can be called either decomposition or expansion, take your pick.
// destroy_addr %a
// store %s to [init] %a
//
// This is not a good thing in general. If we would do it unconditionally, it
I'm very curious why it's bad to split the stores. I'm guessing you are too and just haven't investigated yet...
}

// Initialize the dataflow for moving destroys up the control flow.
void DestroyHoisting::initDataflow(MemoryDataflow &dataFlow) {
Again, I'd like a brief comment somewhere that maps the problem to the dataflow.
Compute backward reaching destroys. genSet is initialized at the
destroy. killSet are the points where locations are used.
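To make that mapping concrete as well: the backward problem is symmetric to the forward sketch above. Reusing the same illustrative Block type (again, not the pass's actual API), the per-block step would look roughly like:

    // Backward step: which destroys reach this point against the control
    // flow. The join is over successors instead of predecessors.
    void backwardStep(std::vector<Block> &blocks, Block &b,
                      const std::vector<int> &succs, unsigned numLocs) {
      llvm::BitVector exit(numLocs);
      for (int s : succs)
        exit |= blocks[s].entry; // join over successor entries
      b.exit = exit;
      b.entry = exit;
      b.entry.reset(b.kill);     // a use of the location kills the fact
      b.entry |= b.gen;          // the destroy itself generates the fact
    }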
// Handling blocks which eventually end up in an unreachable is surprisingly
// complicated. The conservative approach would be to assume all locations are
// dead at the unreachable. But that would block release hoisting whenever there
// is a unreachable block in the way (which is pretty common for calling
I'm confused. Do you mean that we actually want to hoist real destroy_addr's that are on the unreachable path? So, if you pretend there's a destroy at the exit, will that somehow prevent hoisting of the real destroy? I don't know why that would happen. Could you clarify this in the comments?
if (!dataFlow.getState(singlePred)->exitReachable)
  return false;

// Check if none of the locations are touched in the unreachable-block.
This comment illustrates why I prefer per-variable, path discovery data flow...
// point of \p builder.
SILValue DestroyHoisting::createAddress(unsigned locIdx, SILBuilder &builder) {
  auto *loc = locations.getLocation(locIdx);
  if (loc->parentIdx < 0)
It might be worthwhile to have a MemoryLocationIndex wrapper so we don't need to guess what it means for a location index to be negative.
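For illustration, something along these lines (a hypothetical type, not existing code in this PR):

    #include <cassert>

    // Wrapper making the -1 sentinel explicit instead of a bare int.
    struct MemoryLocationIndex {
      int idx = -1; // -1 encodes "no parent location"

      bool hasParent() const { return idx >= 0; }
      unsigned get() const {
        assert(hasParent() && "no parent location");
        return static_cast<unsigned>(idx);
      }
    };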
// Recursively create (or get) the address of the parent location.
SILValue baseAddr = createAddress(loc->parentIdx, builder);
auto *projInst = cast<SingleValueInstruction>(loc->representativeValue);
if (!isa<BeginAccessInst>(loc->representativeValue) &&
Why can't you handle beginAccess addresses? No matter where the address came from, the code always assumes that all address users are within the access scope.
domTree = DA->get(function);

// Recursively create (or get) the address of the parent location.
SILValue baseAddr = createAddress(loc->parentIdx, builder);
Why is baseAddr created before checking if loc->representativeValue is available?
// The function entry block: insert all destroys which are still active.
insertDestroys(activeDestroys, activeDestroys, toRemove,
               nullptr, block);
} else if (SILBasicBlock *pred = block->getSinglePredecessorBlock()) {
Please add a note that this condition assumes no critical edges are present.
@atrick Thanks for reviewing!

First of all: this optimization does not split or combine destroys. It only moves destroys. I will add a comment explaining this.

Analysis invalidation: I'll add comments.

Onone performance: It's like with all other optimizations which run with -O/-Osize: the -Onone benchmarks mainly call generic code in the (optimized) stdlib. So optimization differences in the stdlib will be reflected in -Onone performance. I investigated some of the performance regressions and it turned out, again, that many of them are false alarms. The code is identical, but different code alignment causes the perf differences.

Testing: yes, I tested the benchmarks with unconditional store splitting. I can also run the validation tests.

Compile time: it's a good idea to get statistics on the max size of the bitfields. I'll do that.
… files and directories. Replacing the TemporaryFile/TemporaryDirectory classes with withTemporaryFile/withTemporaryDirectory functions.

Using classes+deinit for resource management is problematic because the lifetime of classes is not guaranteed to be the lexical scope. Compiler optimizations may shrink the lifetime to the last use. One way to fix this would be to make sure that the TemporaryFile class instances are referenced as long as the managed resource (= the file or directory) is in use. That could be done e.g. with the stdlib's withExtendedLifetime function. But that would be cumbersome.

The better way to fix this is to change the API to a closure-based approach (there are several examples for this in the swift stdlib, e.g. withExtendedLifetime). It not only fixes the lifetime problem, but also makes the lifetime immediately visible in the source code. Also, it avoids the memory allocation for the class instance.

This change is triggered by a new compiler optimization: swiftlang/swift#26803. With this optimization the temporary file in the ManifestLoader.parse() function would break. Although it would be easy to work around the problem just in ManifestLoader.parse(), I decided to do the right fix, which is more future-proof.

Implementation note: I kept TemporaryFile as a struct, because it contains several useful fields. It's passed as an argument to the closure. In case of withTemporaryDirectory, only the directory path is relevant, which is now directly passed to the closure. So there is no need for a TemporaryDirectory struct.
@eeckstein I'm much more comfortable with this approach now that I see it's not splitting destroys. This is really hidden though. I read the code as if …
… more flexible. Add the possibility to solve with a custom join operator.
It causes a use-after-free problem with more aggressive destroy hoisting in the compiler. Also replace tabs with spaces in this file.
Force-pushed from 9bc68b0 to 2d447be
@swift-ci test
Build failed
@swift-ci test macOS
@swift-ci test macOS
Build failed
Build failed
Force-pushed from 2d447be to 0e08976
@swift-ci test
@swift-ci test
Build failed
@swift-ci test macOS
} // end anonymous namespace

SILTransform *swift::createDestroyHoisting() {
AFAICT this function is never called...
Edit: never mind. I see it is used in macros (with Passes.def).
DestroyHoisting moves destroys of memory locations up the control flow as far as possible.

Besides destroy_addr, a "store [assign]" is also considered a destroy, because it is equivalent to a destroy_addr plus a "store [init]".

The main purpose of this optimization is to minimize copy-on-write operations for arrays, etc., especially if such COW containers are used as enum payloads and modified in-place. E.g.

switch e {
case .A(var arr):
    arr.append(x)
    self = .A(arr)
...

In such a case, DestroyHoisting can move the destroy of the self-assignment up before the switch and thus let the array buffer be single-referenced at the time of the append.

When we have ownership SIL throughout the pass pipeline, this optimization will replace the current destroy hoisting optimization in CopyForwarding.

For now, this optimization only runs in the mandatory pipeline (but not for -Onone) where we already have ownership SIL.

https://bugs.swift.org/browse/SR-10605
rdar://problem/50463362