[TF] Reimplement unbroadcast using on-host axis calculation for performance. #24907

Merged: rxwei merged 1 commit into swiftlang:tensorflow from efficient-unbroadcast on May 20, 2019

Conversation

@rxwei rxwei (Contributor) commented May 20, 2019

The inefficiency of `unbroadcast(toShape:)`, `unbroadcast(to:)`, and `unbroadcast(like:)` has caused significant performance problems during model training, because these methods dispatch many TensorFlow operations just to compute the axes to reduce along. We were forced to implement them this way in the early GPE era, when neither send/receive nor per-op dispatch was available.

This PR reimplements the unbroadcast operations in terms of host-side logic that computes the axes to reduce along. This significantly reduces TensorFlow operation dispatch overhead. The base implementation changed from `broadcast(toShape:)` to `broadcast(to:)`.

With the new implementation, differentiating broadcasting operators is 37% faster (see the simple test script [here](https://gist.github.com/rxwei/e1488cac5379ba2bc3aff7490e18158f)).

Note:
- Since we now rely less on the TensorFlow runtime, more precondition checks and assertions have been added to the newly implemented `unbroadcast(to:)` method.
- The part of swiftlang#24408 that uses `Raw.broadcastGradientArgs(s0:s1:)` is still necessary for broadcasting binary operations to become faster.

TODO:
- Change the `unbroadcast(toShape:)` tests added by swiftlang#24899 to use `unbroadcast(to:)`, since `unbroadcast(to:)` is now the base implementation.
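For illustration, a minimal sketch of the kind of host-side axis calculation described above, written over plain `[Int]` shape arrays. The function name `reductionAxes(fromShape:toShape:)` and its checks are hypothetical and not the exact code in this PR:

```swift
// Hypothetical sketch: compute, on the host, which axes a gradient of shape
// `broadcasted` must be summed over to recover a value of shape `original`.
func reductionAxes(fromShape broadcasted: [Int], toShape original: [Int]) -> [Int] {
    // Axes introduced by broadcasting are the extra leading dimensions.
    let addedCount = broadcasted.count - original.count
    precondition(addedCount >= 0, "Cannot unbroadcast to a higher-rank shape.")
    var axes = Array(0..<addedCount)
    // Axes expanded from size 1 must also be reduced.
    for (i, dim) in original.enumerated() {
        let broadcastedDim = broadcasted[addedCount + i]
        if dim == 1 && broadcastedDim != 1 {
            axes.append(addedCount + i)
        } else {
            precondition(dim == broadcastedDim, "Shapes are not broadcast-compatible.")
        }
    }
    return axes
}

// Example: unbroadcasting a [2, 3, 4] gradient back to shape [3, 1]
// requires summing over axes [0, 2].
print(reductionAxes(fromShape: [2, 3, 4], toShape: [3, 1]))  // [0, 2]
```

Once the axes are known on the host, a single sum-along-axes op followed by a reshape replaces the chain of TensorFlow ops the old implementation dispatched.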

@rxwei rxwei added the tensorflow This is for "tensorflow" branch PRs. label May 20, 2019
@rxwei rxwei requested review from dan-zheng and bartchr808 May 20, 2019 01:14
@rxwei rxwei changed the title from "[TF] Reimplement unbroadcast using on-host axis calculation." to "[TF] Reimplement unbroadcast using on-host axis calculation for performance." May 20, 2019
@rxwei rxwei force-pushed the efficient-unbroadcast branch from 356e4d4 to 4513fa4 Compare May 20, 2019 01:16
@rxwei rxwei (Contributor, Author) commented May 20, 2019

@swift-ci please test tensorflow

@dan-zheng dan-zheng (Contributor) left a comment

Big 👍 to empirical benchmarking!

@rxwei rxwei merged commit 528fb67 into swiftlang:tensorflow May 20, 2019
@rxwei rxwei deleted the efficient-unbroadcast branch May 20, 2019 02:48