Mark Differentiable related array methods with inlinable for big performance boost #75778

Merged

JaapWijnen
Contributor

@JaapWijnen JaapWijnen commented Aug 8, 2024

This PR marks several methods in ArrayDifferentiation.swift with @inlinable, which gives the optimizer much more opportunity for specialisation. In a few Differentiable example programs using Array and Array.DifferentiableView, this leads to large performance increases and much lower memory usage: the benchmark tables below show improvements of 93% to 99% in both wall-clock time and total malloc.

The main gain comes from being able to specialise Array.DifferentiableView's + operator from its conformance to AdditiveArithmetic. That in turn allows further function calls to be inlined instead of going through the protocol witness table.

Technically this does restrict future changes to the DifferentiableView implementation, since its internals are now marked @usableFromInline. We think this is acceptable for now because Differentiation is not currently shipped on any ABI-stable platform.
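The specialisation argument above can be illustrated with a minimal sketch. `VectorView` and `total` are hypothetical stand-ins invented for this example; they are not the stdlib source, and treating an empty array as zero is an assumption made to keep the sketch short. The sketch shows only the attribute pattern: internal storage marked @usableFromInline so that @inlinable members can refer to it, which is what allows a client module to specialise a generic caller instead of dispatching `+` through the AdditiveArithmetic witness table.

```swift
// Hypothetical stand-in for Array.DifferentiableView; not the stdlib source.
@frozen
public struct VectorView<Element: AdditiveArithmetic>: AdditiveArithmetic {
    // Internal storage, made visible to inlinable code with @usableFromInline.
    @usableFromInline
    var _base: [Element]

    @inlinable
    public init(_ base: [Element]) { self._base = base }

    @inlinable
    public var base: [Element] { _base }

    // Assumption for this sketch: an empty array stands for zero.
    @inlinable
    public static var zero: VectorView { VectorView([]) }

    @inlinable
    public static func + (lhs: VectorView, rhs: VectorView) -> VectorView {
        if lhs._base.isEmpty { return rhs }
        if rhs._base.isEmpty { return lhs }
        precondition(lhs._base.count == rhs._base.count, "Count mismatch")
        return VectorView(zip(lhs._base, rhs._base).map { $0 + $1 })
    }

    @inlinable
    public static func - (lhs: VectorView, rhs: VectorView) -> VectorView {
        if rhs._base.isEmpty { return lhs }
        if lhs._base.isEmpty { return VectorView(rhs._base.map { Element.zero - $0 }) }
        precondition(lhs._base.count == rhs._base.count, "Count mismatch")
        return VectorView(zip(lhs._base, rhs._base).map { $0 - $1 })
    }
}

// Generic caller: unless it is specialised for a concrete T, every `+`
// is dispatched through the AdditiveArithmetic witness table.
@inlinable
public func total<T: AdditiveArithmetic>(_ values: [T]) -> T {
    var result = T.zero
    for value in values { result = result + value }
    return result
}

let sum = total([VectorView([1.0, 2.0]), VectorView([3.0, 4.0])])
print(sum.base)  // prints "[4.0, 6.0]"
```

Within a single module the optimizer can already see these bodies; the attributes matter once the type lives in a separate library module, as Differentiation does.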

Some performance numbers:
The benchmark I'm using is a reimplementation of the original Swift for TensorFlow (S4TF) example found here: https://github.com/tensorflow/swift-models/tree/2fb0b92e1291b730fd1a5cd8a3b107c8e75c7d7a/Examples/Shallow-Water-PDE
The original was implemented on top of the Tensor type that S4TF introduced.
My version uses an Array for storage instead.

For the benchmark I've set the number of iterations to 1, so I'm measuring the wall-clock time and malloc of running one forward pass and one pullback through an array of resolution * resolution cells over duration time steps.
Every time step, the benchmark applies the Laplace operator to every cell of the array.
I ran the benchmark with values of 10 and 20 for both of these variables. You can find the results below. The difference in performance is enormous, ranging from at least 20x in both categories to 60-70x for malloc at resolution 20.
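To make the shape of that workload concrete, here is a minimal, non-differentiable sketch of applying a Laplace operator to every cell of a resolution * resolution grid for a number of time steps. Everything in it is an assumption for illustration: the names `laplace` and `simulate`, the 5-point stencil, the zero-valued boundary, and the placeholder update rule are not taken from the benchmark, which implements the shallow-water example and differentiates through it.

```swift
// Hypothetical helper: 5-point Laplacian on a square grid stored row-major.
// Neighbours outside the grid are treated as zero (an assumed boundary rule).
func laplace(_ u: [Float], resolution n: Int) -> [Float] {
    precondition(u.count == n * n, "grid must hold resolution * resolution cells")
    var out = [Float](repeating: 0, count: n * n)
    for y in 0..<n {
        for x in 0..<n {
            let center = u[y * n + x]
            let left: Float = x > 0 ? u[y * n + x - 1] : 0
            let right: Float = x < n - 1 ? u[y * n + x + 1] : 0
            let up: Float = y > 0 ? u[(y - 1) * n + x] : 0
            let down: Float = y < n - 1 ? u[(y + 1) * n + x] : 0
            out[y * n + x] = left + right + up + down - 4 * center
        }
    }
    return out
}

// Hypothetical driver: one Laplace application per time step.
// The update rule is a placeholder, not the shallow-water update.
func simulate(_ initial: [Float], resolution: Int, duration: Int, dt: Float = 0.1) -> [Float] {
    var u = initial
    for _ in 0..<duration {
        let lap = laplace(u, resolution: resolution)
        for i in u.indices { u[i] += dt * lap[i] }
    }
    return u
}

// A 3 x 3 grid with a single unit impulse in the centre cell.
let impulse: [Float] = [0, 0, 0,
                        0, 1, 0,
                        0, 0, 0]
let stencil = laplace(impulse, resolution: 3)
let evolved = simulate(impulse, resolution: 3, duration: 10)
print(stencil, evolved.count)
```

The work per time step grows with resolution * resolution, which is consistent with the resolution 20 runs in the tables being far slower than the resolution 10 runs.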

----------------------------------------------------------------------------------------------------------------------------
Optimization res: 10, duration: 10 metrics
----------------------------------------------------------------------------------------------------------------------------

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│         Time (wall clock) (ms) *         │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │  323301 │  326631 │  328466 │  331612 │  336331 │  346292 │  346705 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │   15549 │   15704 │   15909 │   16204 │   16368 │   17957 │   23918 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │ -307752 │ -310927 │ -312557 │ -315408 │ -319963 │ -328335 │ -322787 │       0 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      95 │      95 │      95 │      95 │      95 │      95 │      93 │       0 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│           Malloc (total) (K) *           │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │    3640 │    3640 │    3640 │    3640 │    3640 │    3640 │    3640 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │     131 │     131 │     131 │     131 │     131 │     131 │     131 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │   -3509 │   -3509 │   -3509 │   -3509 │   -3509 │   -3509 │   -3509 │       0 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      96 │      96 │      96 │      96 │      96 │      96 │      96 │       0 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

----------------------------------------------------------------------------------------------------------------------------
Optimization res: 10, duration: 20 metrics
----------------------------------------------------------------------------------------------------------------------------

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│         Time (wall clock) (ms) *         │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │  641372 │  646447 │  649593 │  655360 │  662700 │  684196 │  741245 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │   30743 │   31080 │   31293 │   31687 │   32195 │   32997 │   33447 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │ -610629 │ -615367 │ -618300 │ -623673 │ -630505 │ -651199 │ -707798 │       0 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      95 │      95 │      95 │      95 │      95 │      95 │      95 │       0 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│           Malloc (total) (K) *           │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │    7233 │    7233 │    7233 │    7233 │    7233 │    7233 │    7233 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │     261 │     261 │     261 │     261 │     261 │     261 │     261 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │   -6972 │   -6972 │   -6972 │   -6972 │   -6972 │   -6972 │   -6972 │       0 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      96 │      96 │      96 │      96 │      96 │      96 │      96 │       0 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

----------------------------------------------------------------------------------------------------------------------------
Optimization res: 20, duration: 10 metrics
----------------------------------------------------------------------------------------------------------------------------

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│         Time (wall clock) (ms) *         │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │ 5645665 │ 5679088 │ 5708448 │ 5796528 │ 6027215 │ 6033953 │ 6033953 │      18 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │  180295 │  181535 │  182452 │  184287 │  188350 │  209977 │  213960 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │ -54653… │ -54975… │ -55259… │ -56122… │ -58388… │ -58239… │ -58199… │      82 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      97 │      97 │      97 │      97 │      97 │      97 │      96 │      82 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│           Malloc (total) (K) *           │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │      68 │      68 │      68 │      68 │      68 │      68 │      68 │      18 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │       1 │       1 │       1 │       1 │       1 │       1 │       1 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │     -67 │     -67 │     -67 │     -67 │     -67 │     -67 │     -67 │      82 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      99 │      99 │      99 │      99 │      99 │      99 │      99 │      82 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

----------------------------------------------------------------------------------------------------------------------------
Optimization res: 20, duration: 20 metrics
----------------------------------------------------------------------------------------------------------------------------

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│         Time (wall clock) (ms) *         │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │ 112075… │ 112742… │ 113330… │ 113749… │ 115526… │ 115526… │ 115526… │       9 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │  358827 │  363856 │  367002 │  370672 │  374604 │  392167 │  408872 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │ -10848… │ -10910… │ -10966… │ -11004… │ -11178… │ -11160… │ -11143… │      91 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      97 │      97 │      97 │      97 │      97 │      97 │      96 │      91 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│           Malloc (total) (K) *           │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│                  alpha                   │     135 │     135 │     135 │     135 │     135 │     135 │     135 │       9 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│               Current_run                │       2 │       2 │       2 │       2 │       2 │       2 │       2 │     100 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│                    Δ                     │    -133 │    -133 │    -133 │    -133 │    -133 │    -133 │    -133 │      91 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│              Improvement %               │      99 │      99 │      99 │      99 │      99 │      99 │      99 │      91 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛
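As a rough cross-check of the "20x to 60-70x" summary, the p50 columns of the tables above can be turned into speedup factors. This is only a sketch: the `Comparison` type is a hypothetical helper, and the wall-clock row for res 20, duration 20 is left out because its "alpha" values are truncated in the table. The malloc figures at resolution 20 are rounded to whole K, so those ratios are approximate.

```swift
// Hypothetical helper for comparing one "alpha" value with one "Current_run" value.
struct Comparison {
    let label: String
    let before: Double  // p50 of the "alpha" row
    let after: Double   // p50 of the "Current_run" row

    var speedup: Double { before / after }
    var improvementPercent: Double { (before - after) / before * 100 }
}

// p50 values copied from the tables above.
let comparisons = [
    Comparison(label: "time,   res 10, duration 10", before: 328_466, after: 15_909),
    Comparison(label: "time,   res 10, duration 20", before: 649_593, after: 31_293),
    Comparison(label: "time,   res 20, duration 10", before: 5_708_448, after: 182_452),
    Comparison(label: "malloc, res 10, duration 10", before: 3_640, after: 131),
    Comparison(label: "malloc, res 10, duration 20", before: 7_233, after: 261),
    Comparison(label: "malloc, res 20, duration 10", before: 68, after: 1),
    Comparison(label: "malloc, res 20, duration 20", before: 135, after: 2),
]

for c in comparisons {
    let factor = (c.speedup * 10).rounded() / 10
    let percent = c.improvementPercent.rounded()
    print("\(c.label): \(factor)x, \(percent)% improvement")
}
```

The factor and the "Improvement %" row are two views of the same ratio: an improvement of p percent corresponds to a factor of 100 / (100 - p).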

@JaapWijnen JaapWijnen requested a review from a team as a code owner August 8, 2024 13:39
@JaapWijnen
Contributor Author

tagging @asl

@JaapWijnen
Contributor Author

also tagging @rxwei

@asl
Contributor

asl commented Aug 8, 2024

@swift-ci please test

@asl
Contributor

asl commented Aug 8, 2024

preset=buildbot,tools=RA,stdlib=RA
@swift-ci Please test with preset macOS

@asl
Contributor

asl commented Aug 8, 2024

preset=buildbot,tools=RA,stdlib=DA
@swift-ci Please test with preset macOS

@rxwei
Contributor

rxwei commented Aug 8, 2024

How much is the performance boost? Could you quantify that in the PR description?

@JaapWijnen
Contributor Author

@rxwei updated my original post with some benchmark numbers

@asl
Contributor

asl commented Aug 9, 2024

> @rxwei updated my original post with some benchmark numbers

wow!

@JaapWijnen
Contributor Author

Anyone specific I could tag for the other required review? @rxwei

@JaapWijnen
Contributor Author

@swiftlang/standard-librarians
Kindly requesting a review on this if possible!
A note regarding ABI: no users of the stdlib are affected by this change, since we're only touching differentiation-related parts (Array.DifferentiableView).
All differentiation-related changes are also ABI compatible: we're only making additions, since we're marking more methods inlinable and not removing anything.

@asl
Contributor

asl commented Aug 15, 2024

@rxwei @swiftlang/standard-librarians What is the policy for such stdlib changes? Should it be additionally reviewed / approved by someone?

@JaapWijnen
Contributor Author

@rxwei I've been going through all the methods in stdlib/public/Differentiation and there are also some candidates for adding @inlinable in OptionalDifferentiation.swift and FloatingPointDifferentiation.swift.gyb.
Would it be worth including these in this PR, or should I make that a separate one?

Member

@lorentey lorentey left a comment

This looks plausible to me as a stdlib engineer; the implementations exposed look relatively obvious and unlikely to need to change, and making them inlinable will allow client modules to specialize them.

As noted though, this change is only safe to make as long as Differentiation is not expected to be ABI stable in any context.

@@ -21,27 +21,29 @@ extension Array where Element: Differentiable {
/// multiplied with itself `count` times.
@frozen
public struct DifferentiableView {
@usableFromInline
Member

Beware, adding @usableFromInline introduces a newly exported symbol without availability. It would not be safe to make this change in a module that's expected to have stable ABI. Newly built code after this change may not be binary compatible with Differentiation in earlier Swift releases.

(However, as I understand it, while Differentiation is being built with library evolution enabled, it is not distributed as such on any platform where Swift is ABI stable. It is also unclear if the problematic stored property accessor exports are ever actually called in practice for a @frozen structure like this.)

Contributor Author

@JaapWijnen JaapWijnen Oct 14, 2024

Thanks for the review @lorentey !
Just to double check, this is considered unsafe because the struct was already decorated with @frozen? Or is using @usableFromInline without public availability generally ABI unstable? I might not fully grasp the subtlety here but would love to understand better to keep these in mind for future changes.
But indeed Differentiation is not yet distributed to a platform that is ABI stable, so it's not an issue yet!

Contributor Author

@rxwei in your opinion, are we good to go here? As @lorentey points out, we break ABI here, but that's not an issue yet, right? I'll add some comments to the main PR message regarding ABI stability.

An additional question, just for my understanding: is the following change binary incompatible, and if so, why? It seems to me that we're only adding information to the interface, not changing the binary layout of the struct. I don't have a lot of experience here, so I would like to understand the details if possible!

@frozen 
struct Thing {
    var storage: Float
}

->

@frozen 
struct Thing {
    @usableFromInline
    var storage: Float
}

Also tagging @asl

@JaapWijnen
Contributor Author

@asl Can we merge this? Seems to be fully approved! :)

@asl asl merged commit ccfbc38 into swiftlang:main Nov 7, 2024
6 checks passed