Skip to content

SILCloner and SILInliner rewrite. #19786

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 11, 2018
Merged

SILCloner and SILInliner rewrite. #19786

merged 4 commits into from
Oct 11, 2018

Conversation

atrick
Copy link
Contributor

@atrick atrick commented Oct 9, 2018

This can only be reviewed by first reading the new version of
SILCloner.h, SILInliner.h as a whole, along with the SILInlineCloner
class definition in SILInliner.cpp. The diff points out the code that
needs review but can't meaningfully be reviewed by itself.

This is not a from-scratch rewrite. I attempted to make minimal
changes to most existing SILCloner clients/subclasses--they can be
cleaned up separately. Note that I did need to rewrite SILInliner in
this PR because it so deeply violated SILCloner encapsulation.

This is a large PR, but it was already extracted from a much broader
set of changes and I don't see a meaningful way to further subdivide it.


Mostly functionally neutral:

  • may fix latent bugs.
  • may reduce useless basic blocks after inlining.

This rewrite encapsulates the cloner's internal state, providing a
clean API for the CRTP subclasses. The subclasses are rewritten to use
the exposed API and extension points. This makes it much easier to
understand, work with, and extend SIL cloners, which are central to
many optimization passes. Basic SIL invariants are now clearly
expressed and enforced. There is no longer a intricate dance between
multiple levels of subclasses operating on underlying low-level data
structures. All of the logic needed to keep the original SIL in a
consistent state is contained within the SILCloner itself. Subclasses
only need to be responsible for their own modifications.

The immediate motiviation is to make CFG updates self-contained so
that SIL remains in a valid state. This will allow the removal of
critical edge splitting hacks and will allow general SIL utilities to
take advantage of the fact that we don't allow critical edges.

This rewrite establishes a simple principal that should be followed
everywhere: aside from the primitive mutation APIs on SIL data types,
each SIL utility is responsibile for leaving SIL in a valid state and
the logic for doing so should exist in one central location.

This includes, for example:

  • Generating a valid CFG, splitting edges if needed.
  • Returning a valid instruction iterator if any instructions are removed.
  • Updating dominance.
  • Updating SSA (block arguments).

(Dominance info and SSA properties are fundamental to SIL verification).

LoopInfo is also somewhat fundamental to SIL, and should generally be
updated, but it isn't required.

This also fixes some latent bugs related to iterator invalidation in
recursivelyDeleteTriviallyDeadInstructions and SILInliner. Note that
the SILModule deletion callback should be avoided. It can be useful as
a simple cache invalidation mechanism, but it is otherwise bug prone,
too limited to be very useful, and basically bad design. Utilities
that mutate should return a valid instruction iterator and provide
their own deletion callbacks.

Mostly functionally neutral:
- may fix latent bugs.
- may reduce useless basic blocks after inlining.

This rewrite encapsulates the cloner's internal state, providing a
clean API for the CRTP subclasses. The subclasses are rewritten to use
the exposed API and extension points. This makes it much easier to
understand, work with, and extend SIL cloners, which are central to
many optimization passes. Basic SIL invariants are now clearly
expressed and enforced. There is no longer a intricate dance between
multiple levels of subclasses operating on underlying low-level data
structures. All of the logic needed to keep the original SIL in a
consistent state is contained within the SILCloner itself. Subclasses
only need to be responsible for their own modifications.

The immediate motiviation is to make CFG updates self-contained so
that SIL remains in a valid state. This will allow the removal of
critical edge splitting hacks and will allow general SIL utilities to
take advantage of the fact that we don't allow critical edges.

This rewrite establishes a simple principal that should be followed
everywhere: aside from the primitive mutation APIs on SIL data types,
each SIL utility is responsibile for leaving SIL in a valid state and
the logic for doing so should exist in one central location.

This includes, for example:
- Generating a valid CFG, splitting edges if needed.
- Returning a valid instruction iterator if any instructions are removed.
- Updating dominance.
- Updating SSA (block arguments).

(Dominance info and SSA properties are fundamental to SIL verification).

LoopInfo is also somewhat fundamental to SIL, and should generally be
updated, but it isn't required.

This also fixes some latent bugs related to iterator invalidation in
recursivelyDeleteTriviallyDeadInstructions and SILInliner. Note that
the SILModule deletion callback should be avoided. It can be useful as
a simple cache invalidation mechanism, but it is otherwise bug prone,
too limited to be very useful, and basically bad design. Utilities
that mutate should return a valid instruction iterator and provide
their own deletion callbacks.
@atrick
Copy link
Contributor Author

atrick commented Oct 9, 2018

@atrick
Copy link
Contributor Author

atrick commented Oct 9, 2018

@swift-ci test.

@atrick
Copy link
Contributor Author

atrick commented Oct 9, 2018

@swift-ci test source compatibility.

@atrick
Copy link
Contributor Author

atrick commented Oct 9, 2018

@swift-ci benchmark.

Copy link
Contributor

@rjmccall rjmccall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the core approach of putting the value map directly in the cloner and making SILInliner not directly a cloner. A few minor comments.

ValueMap[DestPHIArg] = SILValue(BlockArg);
// Since we don't call any CFG cloning entry point, we can call
// `foldValue` immediately as if cloning has already started. This simply
// avoids handling AvailVals during `remap` or defining a custome
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: custome

And you've copy-pasted this comment a few times.

///
/// *NOTE*: This attempts to perform inlining unconditionally and thus asserts
/// if inlining will fail. All users /must/ check that a function is allowed
/// to be inlined using SILInliner::canInlineFunction before calling this
/// to be inlined using SILInliner::canInlineApplyAite before calling this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

/// whenever the visitor that clones an instruction skips doPostProcess().
void foldValue(SILValue origValue, SILValue mappedValue) {
ValueMap.insert({origValue, mappedValue});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like the most intuitive name for this method; whenever I see it, I expect it to be a more involved operation, not just updating some internal state. setMappedValue, maybe?

Should this assert that the insertion took effect? If origValue is already mapped, this call is implicitly a no-op, which sounds like a source of bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added an assert that values aren't remapped multiple times.

I'm not attached to the name, but...

  • the SILCloner public API has no notion of a "map" at all.
  • during cloning, the visitors have to call one of these two APIs for each original value: doPostProcess or foldValue. foldValue should only be called when there is no new cloned value-producer on which to call doPostProcess. This happens whenever the original value-producer is folded during cloning, typically because of type/constant propagation.
  • foldValue is definitely more meaningful than doPostProcess but I'd be happy to rename both APIs if I had a better idea.

I added some comments:
81fa786

Copy link
Contributor

@rjmccall rjmccall Oct 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so it's a more meaningful name when called within the cloner.

I'm not trying to insist that you rename this, but maybe setValueInClone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point I'm getting at is that it's important for the cloner implementation not to directly map values unless it has folded away an instruction in the cloned code. Otherwise, it must go through the usual doPostProcess API, which I'm now calling recordClonedInstruction. Hopefully this makes things more clear:

9e440d1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that works for me.

newF.getBlocks().splice(SILFunction::iterator(insertBeforeBB),
newF.getBlocks(),
SILFunction::iterator(MappedBB));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a SILFunction::createBasicBlockBefore function now.

@@ -53,45 +53,42 @@ unwind:
// CHECK: [[TEMP:%.*]] = alloc_stack $Indirect<SomeSubclass>
// CHECK: [[MK_IND:%.*]] = function_ref @make_indirect
// CHECK: apply [[MK_IND]]<SomeSubclass>([[TEMP]])
// CHECK: br bb3
// CHECK: destroy_addr [[TEMP]] : $*Indirect<SomeSubclass>
// CHECK: cond_br %0, bb3, bb5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes to test cases are just about visitation order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I tried to ensure the visitation order was the same as before. This is a result of generalizing the fall-through to ReturnBB optimization. So, the inliner does a bit more optimization on the fly now, which just fell out of simplifying the code.

F.getBlocks().splice(F.getBlocks().end(), F.getBlocks(),
SILFunction::iterator(ReturnToBB));
ReturnToBB =
callerBB->split(std::next(Apply.getInstruction()->getIterator()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an intentional decision to not put the return-to block in the same relative position to the inlined code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line just splits callerBB in place. SILBasicBlock::split() places ReturnBB right after callerBB. The call to SILCloner::cloneFunctionBody(getCalleeFunction(), callerBB, entryArgs); will insert all inlined code immediately after callerBB and before ReturnBB. So I'm not sure what you're getting at.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense.

@swift-ci
Copy link
Contributor

swift-ci commented Oct 9, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
ArrayAppendAsciiSubstring 28165 30592 +8.6% 0.92x
Improvement
Array2D 7505 6908 -8.0% 1.09x
MapReduce 397 368 -7.3% 1.08x
MapReduceAnyCollection 398 369 -7.3% 1.08x (?)

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
SortStrings.o 102854 104598 +1.7% 0.98x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
StringAdder 742 818 +10.2% 0.91x
ArrayAppendAsciiSubstring 25853 28019 +8.4% 0.92x
Improvement
StringWalk 2008 1705 -15.1% 1.18x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
ArrayAppend.o 39054 37854 -3.1% 1.03x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
CharIteration_japanese_unicodeScalars_Backwards 319836 428655 +34.0% 0.75x
CharIteration_ascii_unicodeScalars_Backwards 266311 346389 +30.1% 0.77x
CharIteration_tweet_unicodeScalars_Backwards 527884 679852 +28.8% 0.78x
CharIteration_korean_unicodeScalars_Backwards 257577 326399 +26.7% 0.79x
CharIteration_chinese_unicodeScalars_Backwards 201149 254002 +26.3% 0.79x
CharIteration_punctuated_unicodeScalars_Backwards 58381 73652 +26.2% 0.79x
CharIteration_russian_unicodeScalars_Backwards 221732 279530 +26.1% 0.79x
CharIteration_punctuatedJapanese_unicodeScalars_Backwards 46279 58262 +25.9% 0.79x
CharIteration_utf16_unicodeScalars_Backwards 229664 280814 +22.3% 0.82x
Improvement
ArrayAppendAsciiSubstring 148830 76111 -48.9% 1.96x
SubstringFromLongString 24 15 -37.5% 1.60x
ArrayAppendLatin1Substring 206234 175081 -15.1% 1.18x
ArrayAppendUTF16Substring 204399 173939 -14.9% 1.18x
ArrayOfPOD 859 782 -9.0% 1.10x
ArrayOfGenericPOD2 1180 1085 -8.1% 1.09x (?)
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@slavapestov
Copy link
Contributor

FYI the debug source compatibility failure in ‘fluent’ is a known declaration checker but not related to this PR

// terminator.
void visitInstructionsInBlock(SILBasicBlock *BB);

// Visit a block's terminator. This is called with each block in DFS predorder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: s/predorder/preorder/

clonePhiArgs(startBB);

// Premap exit blocks to terminate so that visitBlocksDepthFirst terminates
// after discovering the cloned region. Mappint an exit block to itself
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Mappint/Mapping/

friend class SILCloner<SILInliner>;
using SuperTy = TypeSubstCloner<SILInliner, SILOptFunctionBuilder>;

class SILInliner {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the #include "...TypeSubstCloner.h" is no longer needed in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah thanks. I did some #include minimization here:
c781d78

@atrick
Copy link
Contributor Author

atrick commented Oct 11, 2018

@swift-ci test.

1 similar comment
@atrick
Copy link
Contributor Author

atrick commented Oct 11, 2018

@swift-ci test.

@atrick atrick merged commit c6865c0 into swiftlang:master Oct 11, 2018
PopFlamingo pushed a commit to PopFlamingo/swift that referenced this pull request Dec 6, 2018
A recent SILCloner rewrite removed a special case hack for single
basic block callee functions:

commit c6865c0
Merge: 76e6c41 9e440d1
Author: Andrew Trick <[email protected]>
Date:   Thu Oct 11 14:23:32 2018

    Merge pull request swiftlang#19786 from atrick/silcloner-cleanup

    SILCloner and SILInliner rewrite.

Instead, the new inliner simply merges trivial unconditional branches
after inlining the return block. This way, the CFG is always in
canonical state after inlining. This is more robust, and avoids
interfering with subsequent SIL passes when non-single-block callees
are inlined.

The problem is that inlining a series of calls within a large block
could result in interleaved block splitting and merging operations,
which is quadratic in the block size. This showed up when inlining the
tens of thousands of array subscript calls emitted for a large array
initialization.

The first half of the fix is to simply defer block merging until all
calls are inlined. We can't expect SimplifyCFG to run immediately
after inlining, nor would we want to do that, *especially* for
mandatory inlining. This fix instead exposes block merging as a
trivial utility.

Note: by eliminating some unconditional branches, this change could
reduce the number of debug locations emitted. This does not
fundamentally change any debug information guarantee, and I was unable
to observe any behavior difference in the debugger.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants