[Cadence] Add scalar cases for binary ops (add, mul, sub, div) on HiFi #9411
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9411
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures as of commit b8e3d48 with merge base ea43453.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D71495734
cc @cad-audio @dijopaul I'll merge this to unblock a couple of internal models that showed pretty bad regressions, but these "scalar" cases can likely be optimized further, so I'll leave it to you to assess that. No particular rush; this is such a simple op that compiler vectorization is apparently doing pretty well already (e.g. we've seen one model go from 40M to 123k cycles using mul).
Force-pushed from fb0a6e1 to fed29b4 (Compare)
Summary: As titled. Currently these cases fall through to the unoptimized broadcast call, which is extremely inefficient; a simple loop does much better and can be further optimized later if needed. Example of gains: the mul op goes from 40M to 123k cycles on the 27M ASR encoder. Differential Revision: D71495734
Force-pushed from fed29b4 to cf6497c (Compare)
Force-pushed from cf6497c to c5317e4 (Compare)
Force-pushed from c5317e4 to b8e3d48 (Compare)
Differential Revision: D71495734 Pull Request resolved: pytorch#9411
Summary:
As titled. Currently these cases fall through to the unoptimized broadcast call, which is extremely inefficient; a simple loop does much better and can be further optimized later if needed.
Differential Revision: D71495734
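The change described above amounts to a scalar fast path: when one operand of add/sub/mul/div is a single-element ("scalar") tensor, handle it with a plain loop instead of falling through to the generic broadcast routine. The sketch below is a hypothetical illustration of that idea in plain C++; the names (`BinaryOp`, `binary_op_scalar_rhs`) are invented for this example and are not the actual ExecuTorch or Cadence HiFi kernel API.

```cpp
// Minimal, hypothetical sketch of the scalar fast path described in the summary.
// Names are illustrative only, not the real ExecuTorch/Cadence HiFi kernels.
#include <cstddef>
#include <cstdio>
#include <vector>

enum class BinaryOp { Add, Sub, Mul, Div };

// When the right-hand operand is a single element, a plain loop over the other
// operand avoids the per-element index arithmetic of the generic broadcast path
// and is easy for the compiler to vectorize.
void binary_op_scalar_rhs(BinaryOp op, const float* a, float b, float* out,
                          std::size_t n) {
  switch (op) {
    case BinaryOp::Add:
      for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b;
      break;
    case BinaryOp::Sub:
      for (std::size_t i = 0; i < n; ++i) out[i] = a[i] - b;
      break;
    case BinaryOp::Mul:
      for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b;
      break;
    case BinaryOp::Div:
      for (std::size_t i = 0; i < n; ++i) out[i] = a[i] / b;
      break;
  }
}

int main() {
  std::vector<float> a = {1.f, 2.f, 3.f, 4.f};
  std::vector<float> out(a.size());
  // Multiply by a scalar: the case that previously fell through to broadcast.
  binary_op_scalar_rhs(BinaryOp::Mul, a.data(), 2.f, out.data(), a.size());
  for (float v : out) std::printf("%g ", v);  // prints: 2 4 6 8
  std::printf("\n");
  return 0;
}
```

A full kernel would also cover the mirrored case where the scalar is the left operand of the non-commutative ops (sub, div), and the inner loops could later be swapped for vendor-vectorized HiFi primitives, which is the further optimization the review comment above leaves to the HiFi maintainers to assess.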