[Cadence] Add scalar cases for binary ops (add, mul, sub, div) on HiFi #9411

Merged: 1 commit merged into main from export-D71495734 on Mar 20, 2025

Conversation

@mcremon-meta (Contributor) commented on Mar 19, 2025

Summary:
As titled. Currently these cases fall through to the unoptimized broadcast call, which is extremely inefficient. A simple loop does much better, and it can be further optimized later if needed.

Differential Revision: D71495734
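
The diff itself isn't reproduced in this thread, but the change described above amounts to a scalar fast path: when one operand of an element-wise binary op (add, mul, sub, div) has a single element, run a plain loop instead of the generic broadcast kernel. Below is a minimal C++ sketch of that dispatch shape; `TensorView`, `mul_scalar_f32`, and `broadcast_fallback` are illustrative names, not the actual ExecuTorch/Cadence API.

```cpp
// Hypothetical sketch (not the actual Cadence HiFi kernel): a scalar fast
// path for an element-wise binary op, avoiding the generic broadcast call
// when one operand has a single element.
#include <cstddef>

// Assumed helper: a minimal view of a contiguous float tensor.
struct TensorView {
  const float* data;
  size_t numel;
};

// Multiply `a` by the scalar `b` into `out`. A plain loop like this is
// trivially vectorizable by the compiler, which is where the large
// cycle-count win comes from.
void mul_scalar_f32(const TensorView& a, float b, float* out) {
  for (size_t i = 0; i < a.numel; ++i) {
    out[i] = a.data[i] * b;
  }
}

// Dispatch: take the scalar fast path when either input is a single
// element, otherwise fall back to the (slow) broadcast implementation.
void mul_f32(const TensorView& a, const TensorView& b, float* out,
             void (*broadcast_fallback)(const TensorView&,
                                        const TensorView&, float*)) {
  if (b.numel == 1) {
    mul_scalar_f32(a, b.data[0], out);
  } else if (a.numel == 1) {
    mul_scalar_f32(b, a.data[0], out);  // mul is commutative
  } else {
    broadcast_fallback(a, b, out);
  }
}
```

For non-commutative ops (sub, div) the scalar-on-the-left and scalar-on-the-right cases would need separate loops, but the dispatch follows the same pattern.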

@mcremon-meta mcremon-meta requested a review from tarun292 as a code owner March 19, 2025 20:27
pytorch-bot (bot) commented on Mar 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9411

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit b8e3d48 with merge base ea43453:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Mar 19, 2025
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D71495734

@mcremon-meta (Contributor, Author) commented:

cc @cad-audio @dijopaul I'll merge this to unblock a couple of internal models that showed pretty bad regressions, but these "scalar" cases can likely be optimized further, so I'll leave it to you to assess that! No particular rush; this is such a simple op that compiler vectorization is apparently doing pretty well (e.g. we've seen mul go from 40M to 123k cycles on one model).

facebook-github-bot pushed a commit that referenced this pull request Mar 19, 2025
Summary:

As titled. Currently those cases will go to the unoptimized broadcast call, which is extremely inefficient. A simple loop will do much better, and can be further optimized later if needed.
Example of gains: the mul op goes from 40M to 123k cycles on the 27M ASR encoder.

Differential Revision: D71495734
@mcremon-meta changed the title from "Add scalar cases for add and mul on HiFi" to "Add scalar cases for binary ops (add, mul, sub, div) on HiFi" on Mar 19, 2025
@mcremon-meta changed the title from "Add scalar cases for binary ops (add, mul, sub, div) on HiFi" to "[CadenceAdd scalar cases for binary ops (add, mul, sub, div) on HiFi" on Mar 19, 2025
@mcremon-meta changed the title from "[CadenceAdd scalar cases for binary ops (add, mul, sub, div) on HiFi" to "[Cadence] Add scalar cases for binary ops (add, mul, sub, div) on HiFi" on Mar 19, 2025
@facebook-github-bot merged commit 87dd81a into main on Mar 20, 2025
79 of 82 checks passed
@facebook-github-bot deleted the export-D71495734 branch on March 20, 2025 at 16:44
oscarandersson8218 pushed a commit to oscarandersson8218/executorch that referenced this pull request Mar 21, 2025
Differential Revision: D71495734

Pull Request resolved: pytorch#9411
DannyYuyang-quic pushed a commit to CodeLinaro/executorch that referenced this pull request Apr 2, 2025
Differential Revision: D71495734

Pull Request resolved: pytorch#9411
Labels
CLA Signed, fb-exported, topic: not user facing
3 participants