SILCloner and SILInliner rewrite. #19786

atrick · 2018-10-09T02:32:05Z

This can only be reviewed by first reading the new version of
SILCloner.h, SILInliner.h as a whole, along with the SILInlineCloner
class definition in SILInliner.cpp. The diff points out the code that
needs review but can't meaningfully be reviewed by itself.

This is not a from-scratch rewrite. I attempted to make minimal
changes to most existing SILCloner clients/subclasses--they can be
cleaned up separately. Note that I did need to rewrite SILInliner in
this PR because it so deeply violated SILCloner encapsulation.

This is a large PR, but it was already extracted from a much broader
set of changes and I don't see a meaningful way to further subdivide it.

Mostly functionally neutral:

may fix latent bugs.
may reduce useless basic blocks after inlining.

This rewrite encapsulates the cloner's internal state, providing a
clean API for the CRTP subclasses. The subclasses are rewritten to use
the exposed API and extension points. This makes it much easier to
understand, work with, and extend SIL cloners, which are central to
many optimization passes. Basic SIL invariants are now clearly
expressed and enforced. There is no longer a intricate dance between
multiple levels of subclasses operating on underlying low-level data
structures. All of the logic needed to keep the original SIL in a
consistent state is contained within the SILCloner itself. Subclasses
only need to be responsible for their own modifications.

The immediate motiviation is to make CFG updates self-contained so
that SIL remains in a valid state. This will allow the removal of
critical edge splitting hacks and will allow general SIL utilities to
take advantage of the fact that we don't allow critical edges.

This rewrite establishes a simple principal that should be followed
everywhere: aside from the primitive mutation APIs on SIL data types,
each SIL utility is responsibile for leaving SIL in a valid state and
the logic for doing so should exist in one central location.

This includes, for example:

Generating a valid CFG, splitting edges if needed.
Returning a valid instruction iterator if any instructions are removed.
Updating dominance.
Updating SSA (block arguments).

(Dominance info and SSA properties are fundamental to SIL verification).

LoopInfo is also somewhat fundamental to SIL, and should generally be
updated, but it isn't required.

This also fixes some latent bugs related to iterator invalidation in
recursivelyDeleteTriviallyDeadInstructions and SILInliner. Note that
the SILModule deletion callback should be avoided. It can be useful as
a simple cache invalidation mechanism, but it is otherwise bug prone,
too limited to be very useful, and basically bad design. Utilities
that mutate should return a valid instruction iterator and provide
their own deletion callbacks.

Mostly functionally neutral: - may fix latent bugs. - may reduce useless basic blocks after inlining. This rewrite encapsulates the cloner's internal state, providing a clean API for the CRTP subclasses. The subclasses are rewritten to use the exposed API and extension points. This makes it much easier to understand, work with, and extend SIL cloners, which are central to many optimization passes. Basic SIL invariants are now clearly expressed and enforced. There is no longer a intricate dance between multiple levels of subclasses operating on underlying low-level data structures. All of the logic needed to keep the original SIL in a consistent state is contained within the SILCloner itself. Subclasses only need to be responsible for their own modifications. The immediate motiviation is to make CFG updates self-contained so that SIL remains in a valid state. This will allow the removal of critical edge splitting hacks and will allow general SIL utilities to take advantage of the fact that we don't allow critical edges. This rewrite establishes a simple principal that should be followed everywhere: aside from the primitive mutation APIs on SIL data types, each SIL utility is responsibile for leaving SIL in a valid state and the logic for doing so should exist in one central location. This includes, for example: - Generating a valid CFG, splitting edges if needed. - Returning a valid instruction iterator if any instructions are removed. - Updating dominance. - Updating SSA (block arguments). (Dominance info and SSA properties are fundamental to SIL verification). LoopInfo is also somewhat fundamental to SIL, and should generally be updated, but it isn't required. This also fixes some latent bugs related to iterator invalidation in recursivelyDeleteTriviallyDeadInstructions and SILInliner. Note that the SILModule deletion callback should be avoided. It can be useful as a simple cache invalidation mechanism, but it is otherwise bug prone, too limited to be very useful, and basically bad design. Utilities that mutate should return a valid instruction iterator and provide their own deletion callbacks.

atrick · 2018-10-09T02:48:21Z

More lines were added in this PR as a result of many more comments,
wrapper APIs, and boilerplate from moving SILInlineCloner into a .cpp
file (there's less header code now). The implementation
complexity is reduced though, as seen in SILInliner.h:
https://github.com/apple/swift/pull/19786/files#diff-6d61a780d665cd7f48ea545e72acd155L39

And in various SILCLoner extensions:
https://github.com/apple/swift/pull/19786/files#diff-30c46611e101cad3560c1db8ef4820faL776
https://github.com/apple/swift/pull/19786/files#diff-1e2e4ad366d990a0934bd3d8dc405983L1558
https://github.com/apple/swift/pull/19786/files#diff-1e2e4ad366d990a0934bd3d8dc405983L1593
https://github.com/apple/swift/pull/19786/files#diff-1e2e4ad366d990a0934bd3d8dc405983L1770
https://github.com/apple/swift/pull/19786/files#diff-21f77672a667d0107c40f530af4d4860L70

atrick · 2018-10-09T02:48:58Z

@swift-ci test.

atrick · 2018-10-09T02:49:10Z

@swift-ci test source compatibility.

atrick · 2018-10-09T02:49:22Z

@swift-ci benchmark.

rjmccall

I like the core approach of putting the value map directly in the cloner and making SILInliner not directly a cloner. A few minor comments.

rjmccall · 2018-10-09T03:23:06Z

include/swift/SILOptimizer/Utils/Local.h

-      ValueMap[DestPHIArg] = SILValue(BlockArg);
+      // Since we don't call any CFG cloning entry point, we can call
+      // `foldValue` immediately as if cloning has already started. This simply
+      // avoids handling AvailVals during `remap` or defining a custome


typo: custome

And you've copy-pasted this comment a few times.

rjmccall · 2018-10-09T03:25:09Z

include/swift/SILOptimizer/Utils/SILInliner.h

  ///
  /// *NOTE*: This attempts to perform inlining unconditionally and thus asserts
  /// if inlining will fail. All users /must/ check that a function is allowed
-  /// to be inlined using SILInliner::canInlineFunction before calling this
+  /// to be inlined using SILInliner::canInlineApplyAite before calling this


rjmccall · 2018-10-09T03:27:53Z

include/swift/SIL/SILCloner.h

+  /// whenever the visitor that clones an instruction skips doPostProcess().
+  void foldValue(SILValue origValue, SILValue mappedValue) {
+    ValueMap.insert({origValue, mappedValue});
+  }


This doesn't seem like the most intuitive name for this method; whenever I see it, I expect it to be a more involved operation, not just updating some internal state. setMappedValue, maybe?

Should this assert that the insertion took effect? If origValue is already mapped, this call is implicitly a no-op, which sounds like a source of bugs.

Yes, I added an assert that values aren't remapped multiple times.

I'm not attached to the name, but...

the SILCloner public API has no notion of a "map" at all.

during cloning, the visitors have to call one of these two APIs for each original value: doPostProcess or foldValue. foldValue should only be called when there is no new cloned value-producer on which to call doPostProcess. This happens whenever the original value-producer is folded during cloning, typically because of type/constant propagation.

foldValue is definitely more meaningful than doPostProcess but I'd be happy to rename both APIs if I had a better idea.

I added some comments:
81fa786

I see, so it's a more meaningful name when called within the cloner.

I'm not trying to insist that you rename this, but maybe setValueInClone?

The point I'm getting at is that it's important for the cloner implementation not to directly map values unless it has folded away an instruction in the cloned code. Otherwise, it must go through the usual doPostProcess API, which I'm now calling recordClonedInstruction. Hopefully this makes things more clear:

9e440d1

Thanks, that works for me.

rjmccall · 2018-10-09T03:33:01Z

include/swift/SIL/SILCloner.h

+        newF.getBlocks().splice(SILFunction::iterator(insertBeforeBB),
+                                newF.getBlocks(),
+                                SILFunction::iterator(MappedBB));
+      }


There's a SILFunction::createBasicBlockBefore function now.

rjmccall · 2018-10-09T03:34:49Z

test/SILOptimizer/inline_begin_apply.sil

@@ -53,45 +53,42 @@ unwind:
 // CHECK:  [[TEMP:%.*]] = alloc_stack $Indirect<SomeSubclass>
 // CHECK:  [[MK_IND:%.*]] = function_ref @make_indirect
 // CHECK:  apply [[MK_IND]]<SomeSubclass>([[TEMP]])
-// CHECK:  br bb3
+// CHECK:  destroy_addr [[TEMP]] : $*Indirect<SomeSubclass>
+// CHECK:  cond_br %0, bb3, bb5


These changes to test cases are just about visitation order?

No, I tried to ensure the visitation order was the same as before. This is a result of generalizing the fall-through to ReturnBB optimization. So, the inliner does a bit more optimization on the fly now, which just fell out of simplifying the code.

rjmccall · 2018-10-09T03:39:22Z

lib/SILOptimizer/Utils/SILInliner.cpp

-      F.getBlocks().splice(F.getBlocks().end(), F.getBlocks(),
-                           SILFunction::iterator(ReturnToBB));
+    ReturnToBB =
+      callerBB->split(std::next(Apply.getInstruction()->getIterator()));


Is this an intentional decision to not put the return-to block in the same relative position to the inlined code?

This line just splits callerBB in place. SILBasicBlock::split() places ReturnBB right after callerBB. The call to SILCloner::cloneFunctionBody(getCalleeFunction(), callerBB, entryArgs); will insert all inlined code immediately after callerBB and before ReturnBB. So I'm not sure what you're getting at.

Okay, that makes sense.

swift-ci · 2018-10-09T03:47:42Z

Build comment file:

Performance: -O

TEST	OLD	NEW	DELTA	RATIO
Regression
ArrayAppendAsciiSubstring	28165	30592	+8.6%	0.92x
Improvement
Array2D	7505	6908	-8.0%	1.09x
MapReduce	397	368	-7.3%	1.08x
MapReduceAnyCollection	398	369	-7.3%	1.08x (?)

Code size: -O

TEST	OLD	NEW	DELTA	RATIO
Regression
SortStrings.o	102854	104598	+1.7%	0.98x

Performance: -Osize

TEST	OLD	NEW	DELTA	RATIO
Regression
StringAdder	742	818	+10.2%	0.91x
ArrayAppendAsciiSubstring	25853	28019	+8.4%	0.92x
Improvement
StringWalk	2008	1705	-15.1%	1.18x

Code size: -Osize

TEST	OLD	NEW	DELTA	RATIO
Improvement
ArrayAppend.o	39054	37854	-3.1%	1.03x

Performance: -Onone

TEST	OLD	NEW	DELTA	RATIO
Regression
CharIteration_japanese_unicodeScalars_Backwards	319836	428655	+34.0%	0.75x
CharIteration_ascii_unicodeScalars_Backwards	266311	346389	+30.1%	0.77x
CharIteration_tweet_unicodeScalars_Backwards	527884	679852	+28.8%	0.78x
CharIteration_korean_unicodeScalars_Backwards	257577	326399	+26.7%	0.79x
CharIteration_chinese_unicodeScalars_Backwards	201149	254002	+26.3%	0.79x
CharIteration_punctuated_unicodeScalars_Backwards	58381	73652	+26.2%	0.79x
CharIteration_russian_unicodeScalars_Backwards	221732	279530	+26.1%	0.79x
CharIteration_punctuatedJapanese_unicodeScalars_Backwards	46279	58262	+25.9%	0.79x
CharIteration_utf16_unicodeScalars_Backwards	229664	280814	+22.3%	0.82x
Improvement
ArrayAppendAsciiSubstring	148830	76111	-48.9%	1.96x
SubstringFromLongString	24	15	-37.5%	1.60x
ArrayAppendLatin1Substring	206234	175081	-15.1%	1.18x
ArrayAppendUTF16Substring	204399	173939	-14.9%	1.18x
ArrayOfPOD	859	782	-9.0%	1.10x
ArrayOfGenericPOD2	1180	1085	-8.1%	1.09x (?)

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

slavapestov · 2018-10-09T04:17:30Z

FYI the debug source compatibility failure in ‘fluent’ is a known declaration checker but not related to this PR

aschwaighofer · 2018-10-09T17:31:17Z

include/swift/SIL/SILCloner.h

+  // terminator.
+  void visitInstructionsInBlock(SILBasicBlock *BB);
+
+  // Visit a block's terminator. This is called with each block in DFS predorder


typo: s/predorder/preorder/

aschwaighofer · 2018-10-09T17:39:20Z

include/swift/SIL/SILCloner.h

+  clonePhiArgs(startBB);
+
+  // Premap exit blocks to terminate so that visitBlocksDepthFirst terminates
+  // after discovering the cloned region. Mappint an exit block to itself


s/Mappint/Mapping/

aschwaighofer · 2018-10-09T17:56:15Z

include/swift/SILOptimizer/Utils/SILInliner.h

-  friend class SILCloner<SILInliner>;
-  using SuperTy = TypeSubstCloner<SILInliner, SILOptFunctionBuilder>;
-
+class SILInliner {


I believe the #include "...TypeSubstCloner.h" is no longer needed in this file.

Yeah thanks. I did some #include minimization here:
c781d78

…on and recordFoldedValue.

atrick · 2018-10-11T01:05:50Z

@swift-ci test.

atrick · 2018-10-11T01:06:47Z

@swift-ci test.

A recent SILCloner rewrite removed a special case hack for single basic block callee functions: commit c6865c0 Merge: 76e6c41 9e440d1 Author: Andrew Trick <[email protected]> Date: Thu Oct 11 14:23:32 2018 Merge pull request swiftlang#19786 from atrick/silcloner-cleanup SILCloner and SILInliner rewrite. Instead, the new inliner simply merges trivial unconditional branches after inlining the return block. This way, the CFG is always in canonical state after inlining. This is more robust, and avoids interfering with subsequent SIL passes when non-single-block callees are inlined. The problem is that inlining a series of calls within a large block could result in interleaved block splitting and merging operations, which is quadratic in the block size. This showed up when inlining the tens of thousands of array subscript calls emitted for a large array initialization. The first half of the fix is to simply defer block merging until all calls are inlined. We can't expect SimplifyCFG to run immediately after inlining, nor would we want to do that, *especially* for mandatory inlining. This fix instead exposes block merging as a trivial utility. Note: by eliminating some unconditional branches, this change could reduce the number of debug locations emitted. This does not fundamentally change any debug information guarantee, and I was unable to observe any behavior difference in the debugger.

atrick requested review from eeckstein, aschwaighofer, gottesmm, rudkx and rjmccall October 9, 2018 02:32

rjmccall reviewed Oct 9, 2018

View reviewed changes

aschwaighofer reviewed Oct 9, 2018

View reviewed changes

atrick added 3 commits October 9, 2018 20:11

Comments, typos, and cleanup from review.

81fa786

Fix #includes. Arnold's review.

c781d78

Rename SILCloner doPostProcess and foldValue to recordClonedInstructi…

9e440d1

…on and recordFoldedValue.

atrick merged commit c6865c0 into swiftlang:master Oct 11, 2018

SILCloner and SILInliner rewrite. #19786

SILCloner and SILInliner rewrite. #19786

Uh oh!

Conversation

atrick commented Oct 9, 2018

Uh oh!

atrick commented Oct 9, 2018

Uh oh!

atrick commented Oct 9, 2018

Uh oh!

atrick commented Oct 9, 2018

Uh oh!

atrick commented Oct 9, 2018

Uh oh!

rjmccall left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjmccall Oct 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swift-ci commented Oct 9, 2018

Build comment file:

Performance: -O

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Uh oh!

slavapestov commented Oct 9, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

atrick commented Oct 11, 2018

Uh oh!

atrick commented Oct 11, 2018

Uh oh!

Uh oh!

rjmccall Oct 10, 2018 •

edited

Loading